$HgGddlZddlZddlmZddlTddlmZddl Z ddl m Z ddl m Z ddl mZddlmZmZdd lmZdd lmZmZmZmZmZddlZd d d didd dddd diddd did dd didZdZGddZdejvreej_ej dde!e"de#e$e$ffdZ%dZ&e'dkr e&dSdS)N) set_trace)*)Path)number_breakdown_from_df)VA_ROOT)USR_SUB) QueryWrapper get_base_url) load_prompt)default_page_settingescape_markdown set_nav_barshow_linebreak_in_md visualizationfontz Gothic A1)sizefamilytickfont )titlerxaxisyaxislegendc|dS|||z}d|cxkrt|kr.ndS||tj|<tjdSdS)Nr)indexlenst session_statererun)tsourcekeyval target_indexs F/home/deftson/nfs-deftson/2024/public_varco_arena/pages/see_results.pynavigater( ss ~776??S(LL!!!!3q66!!!!!! !,  "!c@eZdZdZdedefdZdedeefdZdS) DataCacheci|_dSNcache)selfs r'__init__zDataCache.__init__,s  r)r$datac||j|<dSr-r.)r0r$r2s r'storezDataCache.store/s 3r)returnc6|j|Sr-)r/get)r0r$s r'r7z DataCache.get2sz~~c"""r)N) __name__ __module__ __qualname__r1strdictr4Optionalr7r)r'r+r++skD#s#x~######r)r+ data_cacheresult_file_pathr5c|rRtt|}tjj|}|r|d|dfSi}i}| t|}t j|}dD]#}||jvr| |gd$tj |}i}i}t|d|d<||d<|d  D]-} ||d | k} t| d || <| || <.|jd } |jd } | d | } ||| <||| <||d}tjjt||d}t!t#|jdz ddD]2}|j|t$krt&|j|dzz }n3|t)dt+|d}|D]*}|jjdvrt3j|j+n?#t6$r2}tjdt|iifcYd}~Sd}~wwxYw||fS)z Load data from file, cache it in memory, then remove the file. Returns cached data on subsequent calls. Args: result_file_path: Path to the result JSON file Returns: Tuple of (all_result_dict, df_dict) all_result_dictdf_dictN)tstamplogsT)columnsinplace) is_overallOveralltaskF/)rBrCz4Could not find user experiment directory for cleanupz*_KST_submitted)llm_example_kr mt_examplezError processing data: )r;rrr r?r7pd read_jsonrFdropauindex_test_scenarioruniquepartsr4rangerrr ValueErrorlistglobparentnameshutilrmtree Exceptionerror)r@ cache_key cached_datarBrCdfcolfig_dict_per_taskdf_dict_per_taskrJdf_taskprm_nameexp_namer$ cache_data user_exp_rootisubmits submittedes r'load_and_cache_datarr=sJ-..// &155i@@  J01;y3II IOG#8 #$455 .//B) 9 9"*$$GGSE4G888'++B " ! ,9+M+M+M i (*, Y '6 ))++ 1 1RZ4/0*7E*R*R*R!$')0 &&(-b1H'-b1H****C#4OC +GCL.=QQJ   ' - -c2B.C.CZ P P P!M3/566:BCC  #)!,77$+.>.DQU.K$KME8$ !WXXX=--.?@@AAG$ 4 4 #(,LLLM)"23333  4    H7s1vv77 8 8 8r6MMMMMM  G ##sGH== I9'I4.I94I9c  -td}td|dttjd<ttjd<t d \}}tjd|tjd|t d \}}tjd|tjd|tjd d}|t|nd}t | \tjd<tjd<tj d td tj j ttjd}|tj|dd}tj drOtjtjtj|rdtjvr tjd=tjd|}t|} |dd} tjd|} t.} tddtj | --tj|-} | d}| -}t1|\}}}tjjrXtjd-dtjd|d|tjdt7|dnWtjd-dtjd|d|tjd t7|d!tjd"\}}|5tjd#$5tjd%-dtj| d&tjtAtC|dddn #1swxYwYdddn #1swxYwY|5tjd#$5tj"| j#dwi| d#-d'(dddn #1swxYwYdddn #1swxYwYtj$tjjrtjd)ntjd*t|j%&}tjd+d}tjd,}|d-5tjd.d/0rtO||d+ddddn #1swxYwY|d15td2d3tj ||-d4-fd5d67}dddn #1swxYwY|d"5tjd8d90rtO||d+d1dddn #1swxYwY|tjd+<d}|r||j%|k}tQj)|dtjvrtjdnd:\}}dtjvr|tjd< tQj*|tjd:}tQj+|tjd}tj,||zt|j-}tjd;d}tjd,}|d-5tjd.d<0rtO||d;ddddn #1swxYwY|d15td=d>tj ||-d?d6@}dddn #1swxYwY|d"5tjd8dA0rtO||d;d1dddn #1swxYwY|tjd;<|rGt7|dBd-} |j.| }!tjdCtj/dD|dE-d5ta|-F}"tdGdHdIdJ-K}#|dLkr dM|#dN<dO|#dP<|"j1dwi|#}$|$D]T}%tjdQ|%dRdQtj2tAtC|%dSU dddn #1swxYwYtj2tA||!j3}&tjd"\}}tj4}'tj5}(|5|&dTk})|)r|'n|(}*tjdU|!j6dV|!j7|*tA|!j8|)rdWndXYdddn #1swxYwY|5|&dZk})|)r|'n|(}*tjdU|!j9dV|!j:|*tA|!j;|)rdWndXYdddn #1swxYwYn#tx$r}+d-dl=},|,>tjtjjrd[nd\tj5|+tj2|tj|gd]Yd}+~+nd}+~+wwxYwtjd^tj,td_tj@tj$tjjrtjd`ntjdatjd"\}}tjd"\}}|5tjd#$5tj"| dbj#dwdcd#i| d#-dd(dddn #1swxYwYdddn #1swxYwY|5tjd#$5tj"| dej#dwdcd#i| d#-df(dddn #1swxYwYdddn #1swxYwY|5tjd#$5tj"| dgj#dwi| d#-dh(dddn #1swxYwYdddn #1swxYwY|5 dddn #1swxYwYtjjrtjdintjdjtj/tjjrdkndl5tj2tjjrdmndntjdo|dp|tjd"\}}|5tjd#$5tj"| dqj#dwi| d#-dr(dddn #1swxYwYdddn #1swxYwY|5tjd#$5tj"| dsj#dwi| d#-dt(tj| duAdvBjCdddn #1swxYwYdddn #1swxYwYddddS#1swxYwYdS)xNwide)layoutFsee_results_init)sidebar_placeholdertoggle_hashstrrBrCz@user_submit/llm_example_kr/LLM_example_result/llmbar/result.json)r@zEuser_submit/mt_example/MT_example_result/translation_pair/result.jsonr@zSelect Result:expnamerMrOz Clear Cache alpha2namesrIelo_rating_by_taskrJz Select Task judgenameu ## 결과 ()u##### Judge 모델: u / 평가프롬: u##### 테스트셋 사이즈: u 행z ## Results (z##### Judge Model: z / prompt: z##### Size of Testset: z rowsT)borderz#### Ratings ( elo_rating_elo_rating_by_task)use_container_widthr$u7### 토너먼트 (테스트 시나리오) 별로 보기z'### Tournament Results by Test Scenarioselected_tournament)rNrNru◀prev_tournament)r$rN tournamentzSelect Tournament_tournament_selectctjtjddS)Nr)rselected_match)rr updater7rJsr'zmain..s?b.55$&$4$8$8D9T9T9T$U$U#6r) collapsed)r$ on_changelabel_visibilityu▶next_tournament)rzr prev_matchmatchz Select Match _match_select)r$r next_matchz: z#### Current Test Scenario:z#### Evaluation Prompt (evalprompt: z--rz{inst}z{src}z{out_a}z{out_b})instsrcout_aout_brJtranslation_pairz {source_lang} source_langz {target_lang} target_langz**rolecontentmodel_az#### (z) u✅u❌)iconmodel_bup**Bug: 아래 표를 복사해서 이슈로 남겨주시면 개선에 도움이 됩니다. 감사합니다🙏**u_Bug: Please open issue and attach the table output below to help me out. Thanks in advance.🙏)depthround winner_nodeswinner_resolvedwinnerrrz Sharable linkz /see_results?u### 매치 통계z### Match Stats./fraction_of_model_a_wins_for_all_a_vs_b_matchesautosize0_fraction_of_model_a_wins_for_all_a_vs_b_matches)match_count_of_each_combination_of_models*_match_count_of_each_combination_of_modelsmatch_count_for_each_model_match_count_for_each_modelu%### 참고용 LLM Judge 편향 정보z&### FYI: How biased is your LLM Judge?u펼쳐서 보기zExpand to showuS Varco Arena에서는 position bias의 영향을 최소화하기 위해 모든 모델이 A나 B위치에 번갈아 위치하도록 하였습니다. 그러나 LLM Judge 혹은 Prompt의 성능이 부족하다고 느껴진다면, 아래 알려진 LLM Judge bias가 참고가 될겁니다. * position bias (왼쪽) * length bias (오른쪽) 결과의 왜곡이 LLM Judge의 부족함 떄문이었다는 점을 규명하려면 사용하신 LLM Judge와 Prompt의 binary classification 정확도를 측정해보시길 바랍니다 (Varco Arena를 활용하여 이를 수행해볼 수 있습니다!).a In Varco Arena, to minimize the effect of position bias, all models are alternately positioned in either position A or B. However, if you feel the LLM Judge or Prompt performance is insufficient, the following known LLM Judge biases may be helpful to reference: * position bias (left) * length bias (right) To determine if result distortion was due to LLM Judge limitations, please measure the binary classification accuracy of your LLM Judge and Prompt (You could use Varco Arena for this purpose!). z#### z + prompt = counts_of_match_winners_counts_of_match_winners length_bias _length_biaslength_bias_dfcategoryr>)Dr rr<rr rrrr7r;sidebarrr selectboxr[keysstopsplitstripbuttonrlclearcache_resourcer!DEFAULT_LAYOUT_DICTrkoreanmarkdownintrF containertablewriterr plotly_chart update_layoutdivider idx_inst_srcrWr(rUinit_tournament_dataframedrawmake_legend_strcodehuman_readable_idxlocexpanderr complete_promptinforsuccessrbrhuman_readable_model_a generated_arhuman_readable_model_b generated_bra traceback print_excr get_sharable_linkgroupbydescribeT).rwal_r_ddf_dal_r_d1df_d1most_recent_run result_selecteval_prompt_namerg task_listr{rhdefault_layout_dict figure_dictr|reinterpretationn_models size_testsetcol1col2d default_idxcolstournament_prm_selectdf_now_processeddf_now _alpha2namesbracket_drawingrmmatch_idx_human match_idxrowpromptkwargs prompt_cmplmsgr winnerboxloserboxiswinnerwritemsgrqrrJs. @r'mainrs:.f=== /)+/&&B&'"&&&BY&8z{{{LFD&'..v666Y&&t,,,(;BCCCNGU&'..w777Y&&u---&**+=tDDO.=.Ic/***tO _=== *+ #J%&&&+L++  R / 0 5 5 7 788M   $**3//399;; z''  !!!  0 B, , , /():;MJ&++--..I*956JK' 2=A- .< . .r|Y G GD |  #D)KK(I $ B-Eb-I-I*NHl H )$)))*** Y9YYGWYYZZZ LS5F5FLLLMMMM *4***+++ R)RR@PRRSSS Fc,.?.?FFFGGGAJD$ LL \ & & & L L K0000 1 1 1 H[. / / / H)/.*I*IJJ K K K L L L L L L L L L L L L L L LLLLLLLLLLLLLLLL  \ & & &   O0"0GG3FGG$(000                   JLLL ? MNNNN =>>> R_ # # % %&&A"&&' > >???????????????/DB*+ zBO'<<=)+)E  0 5 5 7 777(77 * * * &,  0 0 0.:B ] +o  g ,];O' ""2="AF GOf, - - -%899A*../?FFK:j))Da C C9U 555CQ -=rBBB C C C C C C C C C C C C C C Ca  "G,w"G"GL...%0 ###               a B B9U 555BQ -=qAAA B B B B B B B B B B B B B B B2AB - . /  5 5d ; ;A >?? &*95 9:::[U:JUUdUUUWW))9EEEF!%#''! F(+===0?}-0?}-"8&"8"B"B6"B"BK*WW $8V$8$8$8999 4_S^5T5T U UVVVVWWWWWWWWWWWWWWWW&,-BCCDDDZ]] dJ 8%2H,4Byy(HK T T T8R T TUUUH,S_==&.9UUE %2H,4Byy(HK T T T8R T TUUUH,S_==&.9UUE            ! ! ! K#*wCCv    HQKKK G) * * * H           0H_G|~~ N NL,J,L,L N NOOOJLLL ( '(((( &'''AJD$AJD$  \ & & &   O EFF)-F1DFF%)MMM                      \ & & &   OV GHV!%8%)GGG                      \ & & &   OG 89G)%)888                                        > ;<<<< <=== 2+;+BX''HX Y Y#Y#Y &  D EJEJELELEL  EGG   EIEE3CEEFFFZ]] d   T***  HK 9:H-)-999                                Y YT*** Y Y>zJJSSUUWXXX  Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y9#Y#Y#Y#Y#Y#Y#Y#Y#Y#Y#Y#Y#Y#Y#Y#Y#Y#YsS40A!S S4S! !S4$S! %S44S8;S8U &U = U  U U U U  U$'U$)YYY"0ZZ"%Z"1)[&&[*-[*?B2m1)a& m&a**m-a*. m9,b1% m1b55m8b59 m)c9- m9c==mc=A9m:Bh! m!h%%m(h%)AmAk% m%k))m,k)-m2Am mmmmm o5(Bo00o5?t&.t t&t t&t t&&t*-t*3v .v7 vv v v vv!v'x =,w5) x 5w9 9x <w9 =x  xxx''x+.x+A?AA}7(,}  }7 }$ $}7'}$ (}7+ AA7}; ;AA>}; ?AAA@-A0A@@ A@-@A@ @A@-@A@ @A@-@! AA@-A@1 @1AA@4A@1 @5AAAAA A AA __main__r-)(pandasrR streamlitripdbrtypingpathlibranalysis_utilsrUrapprr query_compr r $varco_arena.varco_arena_core.promptsr view_utilsr r rrrr_rr(r+r r?rlr=r;TupleDictrrrr8r>r)r'r s@33333311111111<<<<<< r[99 :; / /2== >2== >k::; ########r'''"+)++B Q$Q$(3-Q$5tCTQ$Q$Q$Q$hPYPYPYf  zDFFFFFr)