"""L3Score metric to score the quality of a free-form answer given a question and a ground-truth answer.

The metric is based on the log-probability of the Yes/No token of an LLM judge.

Metric is based on the paper: https://arxiv.org/pdf/2407.09413
"""

import os

import datasets
import evaluate
import numpy as np
import openai
from langchain.chat_models.base import init_chat_model

_CITATION = """\
@article{pramanick2024spiqa,
  title={Spiqa: A dataset for multimodal question answering on scientific papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={arXiv preprint arXiv:2407.09413},
  year={2024}
}
"""

_DESCRIPTION = """\
Implements the L3Score metric to score the quality of a free-form answer given a
question and a ground-truth answer. The metric is based on the log-probability of
the Yes/No token of an LLM judge.

Metric is based on the paper: https://arxiv.org/pdf/2407.09413
"""

_KWARGS_DESCRIPTION = """\
Implements the L3Score metric to score the quality of a free-form answer given a
question and a ground-truth answer.

Args:
    questions: list of questions to score. Each question should be a string.
    predictions: list of predictions to score. Each prediction should be a string.
    references: list of references, one for each prediction. Each reference should be a string.

Returns:
    L3Score: mean L3Score for all (question, prediction, reference) triplets.

Examples:
    Example 1: High certainty the prediction is the same as the ground-truth.
        >>> L3Score = evaluate.load("L3Score")
        >>> L3Score.compute(questions=["What is the capital of France?"], predictions=["Paris"], references=["Paris"], api_key="your-openai-api-key", provider="openai", model="gpt-4o-mini")
        {'L3Score': 0.99...}

    Example 2: High certainty the prediction is not the same as the ground-truth.
        >>> L3Score = evaluate.load("L3Score")
        >>> L3Score.compute(questions=["What is the capital of Germany?"], predictions=["Moscow"], references=["Berlin"], api_key="your-openai-api-key", provider="openai", model="gpt-4o-mini")
        {'L3Score': 0.00...}
"""

# Only providers whose APIs return top-k token logprobs can serve as the judge.
PROVIDER_WITH_TOP_LOGPROBS = ["openai", "deepseek", "xai"]

_PROMPT = (
    "You are given a question, ground-truth answer, and a candidate answer.\n"
    "Question: {question}\n"
    "Ground-truth answer: {gt}\n"
    "Candidate answer: {answer}\n\n"
    "Is the semantic meaning of the ground-truth and candidate answers similar? "
    "Answer in one word - Yes or No."
)

# Token variants counted as "Yes" and "No" when reading the judge's top logprobs.
_SUFFIXES_TO_SCORE = [" yes", " yeah"]
_COMPLEMENT_SUFFIXES = [" no"]


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class L3Score(evaluate.Metric):
    """L3Score metric to score the quality of a free-form answer given a question
    and a ground-truth answer.

    The metric is based on the log-probability of the Yes/No token of an LLM judge.

    Metric is from the paper: https://arxiv.org/pdf/2407.09413
    """

    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "questions": datasets.Value("string"),
                    "predictions": datasets.Value("string"),
                    "references": datasets.Value("string"),
                }
            ),
            homepage="https://github.com/google/spiqa",
            codebase_urls=[
                "https://github.com/google/spiqa/blob/main/metrics/llmlogscore/llmlogscore.py"
            ],
            reference_urls=[
                "https://arxiv.org/pdf/2407.09413",
                "https://github.com/google/spiqa",
                "https://huggingface.co/datasets/google/spiqa",
            ],
        )
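
    # How the score is computed (overview; the full logic lives in
    # _calculate_L3Score and _renormalize_score below): the judge is asked
    # _PROMPT and the top-k logprobs of its first generated token are read.
    # With l_yes and l_no the logprobs of the "Yes"/"No" tokens,
    #     L3Score = P(yes) / (P(yes) + P(no)) = 1 / (1 + exp(l_no - l_yes)).
    # If only one of the two tokens appears in the top-k list, the missing
    # logprob is upper-bounded by the probability mass left outside that list.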

    def _download_and_prepare(self, dl_manager):
        """Optional: download external resources useful to compute the scores"""
        pass

    def _verify_input(self, questions, predictions, references, provider, api_key, model):
        """Verify the input parameters"""
        if provider not in PROVIDER_WITH_TOP_LOGPROBS:
            raise ValueError(
                "Provider must offer top_logprobs to use this metric, pick from {}".format(
                    PROVIDER_WITH_TOP_LOGPROBS
                )
            )

        # Check that the requested judge model exists for the chosen provider.
        try:
            if provider == "openai":
                client = openai.OpenAI(api_key=api_key)
                model_names = [model.id for model in client.models.list()]
                if model not in model_names:
                    raise ValueError(
                        f"Model {model} not found for provider {provider}, "
                        f"available models: {model_names}"
                    )
            elif provider == "deepseek":
                client = openai.OpenAI(api_key=api_key, base_url="https://api.deepseek.com")
                model_names = [model.id for model in client.models.list()]
                if model not in model_names:
                    raise ValueError(
                        f"Model {model} not found for provider {provider}, "
                        f"available models: {model_names}"
                    )
            elif provider == "xai":
                client = openai.OpenAI(api_key=api_key, base_url="https://api.xai.com")
                model_names = [model.id for model in client.models.list()]
                if model not in model_names:
                    raise ValueError(
                        f"Model {model} not found for provider {provider}, "
                        f"available models: {model_names}"
                    )
        except openai.AuthenticationError as e:
            return {"error": "Authentication failed: {}".format(e.body["message"])}

        assert (
            len(questions) == len(predictions) == len(references)
        ), "Questions, predictions and references must have the same length"

    def _get_llm(self, model, api_key):
        """Get the LLM"""
        llm = init_chat_model(model=model, api_key=api_key)
        # Request token logprobs so the Yes/No probabilities of the first
        # generated token can be read back (top 5 alternatives per position).
        llm = llm.bind(logprobs=True, top_logprobs=5)
        return llm

    def _compute(
        self,
        questions,
        predictions,
        references,
        api_key,
        provider="openai",
        model="gpt-4o-mini",
    ):
        """Returns the scores"""
        # Surface verification errors (e.g. failed authentication) to the caller.
        verification = self._verify_input(
            questions, predictions, references, provider, api_key, model
        )
        if verification is not None:
            return verification

        llm = self._get_llm(model, api_key)

        score = 0.0
        count = 0
        for question, prediction, reference in zip(questions, predictions, references):
            message = [
                (
                    "human",
                    _PROMPT.format(question=question, gt=reference, answer=prediction),
                )
            ]
            try:
                response = llm.invoke(message)
            except openai.AuthenticationError as e:
                error_message = e.body["message"]
                return {"error": "Authentication failed: {}".format(error_message)}
            except openai.RateLimitError as e:
                error_message = e.body["message"]
                return {"error": "Rate limit exceeded: {}".format(error_message)}
            except openai.BadRequestError as e:
                error_message = e.body["message"]
                return {"error": "Bad request: {}".format(error_message)}
            except Exception as e:
                return {"error": "An error occurred: {}".format(e)}

            # Score the answer from the top logprobs of the judge's first token.
            top_logprobs = response.response_metadata["logprobs"]["content"][0]["top_logprobs"]
            score += self._calculate_L3Score(top_logprobs)
            count += 1

        if count > 0:
            score /= count

        return {"L3Score": float(score)}

    def _calculate_L3Score(self, top_logprobs):
        """Calculate the L3Score from the top-k logprobs of the judge's first token."""
        normalized_suffixes = [self._normalize(suffix) for suffix in _SUFFIXES_TO_SCORE]
        normalized_complement_suffixes = [
            self._normalize(complement_suffix) for complement_suffix in _COMPLEMENT_SUFFIXES
        ]

        # Sentinel values; each is overwritten if the corresponding token is found.
        suffix_logprob = -np.inf
        complement_logprob = -np.inf
        suffix_index = -1
        complement_suffix_index = -1

        # Look for the "yes"-like and "no"-like tokens in the top-k list.
        for i, token_logprob in enumerate(top_logprobs):
            token = self._normalize(token_logprob["token"])
            if token in normalized_suffixes:
                suffix_logprob = token_logprob["logprob"]
                suffix_index = i
                break

        for i, token_logprob in enumerate(top_logprobs):
            token = self._normalize(token_logprob["token"])
            if token in normalized_complement_suffixes:
                complement_suffix_index = i
                complement_logprob = token_logprob["logprob"]
                break

        # Neither token appears among the top-k alternatives.
        if suffix_index == -1 and complement_suffix_index == -1:
            return 0.0

        # Both tokens appear: renormalize their probabilities directly.
        if suffix_index != -1 and complement_suffix_index != -1:
            return self._renormalize_score(
                yes_score=suffix_logprob, no_score=complement_logprob
            )

        # Exactly one token appears: bound the missing token's probability by the
        # smallest probability that could still be hiding outside the top-k list.
        lowest_logprob = min([token_logprob["logprob"] for token_logprob in top_logprobs])
        lowest_token_prob = np.exp(lowest_logprob)
        sum_probs = sum([np.exp(token_logprob["logprob"]) for token_logprob in top_logprobs])
        remaining_prob = 1 - sum_probs
        min_prob = min(lowest_token_prob, remaining_prob)
        if min_prob < 1e-8:
            min_prob = 1e-8
        reciprocal_logprob = np.log(min_prob)

        if suffix_index != -1:
            exclude_score = suffix_logprob
            include_score = reciprocal_logprob
        elif complement_suffix_index != -1:
            exclude_score = reciprocal_logprob
            include_score = complement_logprob

        return self._renormalize_score(yes_score=exclude_score, no_score=include_score)

    def _renormalize_score(self, yes_score: float, no_score: float) -> float:
        """Renormalize the scores to be between 0 and 1."""
        return 1 / (1 + np.exp(no_score - yes_score))

    def _normalize(self, text: str) -> str:
        """Remove white space and lower case for normalized comparisons."""
        return text.strip().lower()


if __name__ == "__main__":
    # Simple smoke test using a DeepSeek judge; the API key is read from the environment.
    questions = ["What is the capital of France?", "What is the capital of Germany?"]
    predictions = ["Paris", "Moscow"]
    references = ["Paris", "Berlin"]

    L3Score_test = L3Score()
    results = L3Score_test.compute(
        questions=questions,
        predictions=predictions,
        references=references,
        api_key=os.environ["OPENAI_API_KEY"],
        provider="deepseek",
        model="deepseek-coder",
    )
    print(results)
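
# Worked example of the renormalization step (illustrative numbers only, not from a
# real API call): if the judge puts logprob -0.02 on " Yes" and -3.91 on " No", then
#     L3Score = 1 / (1 + exp(-3.91 - (-0.02))) ≈ 0.98,
# i.e. the judge is almost certain the candidate matches the ground truth. Running
# the demo in the __main__ block above requires a valid API key for the chosen provider.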