import numpy as np
import pandas as pd
import torch as ch
from numpy.typing import NDArray
from typing import Dict, Any, Optional, List
import logging
from transformers import AutoTokenizer, AutoModelForCausalLM

from .context_partitioner import BaseContextPartitioner, SimpleContextPartitioner
from .solver import BaseSolver, LassoRegression
from .utils import (
    get_masks_and_logit_probs,
    aggregate_logit_probs,
    split_text,
    highlight_word_indices,
    get_attributions_df,
    char_to_token,
)

DEFAULT_GENERATE_KWARGS = {"max_new_tokens": 512, "do_sample": False}
DEFAULT_PROMPT_TEMPLATE = "Context: {context}\n\nQuery: {query}"


class ContextCiter:
    def __init__(
        self,
        model: Any,
        tokenizer: Any,
        context: str,
        query: str,
        source_type: str = "sentence",
        generate_kwargs: Optional[Dict[str, Any]] = None,
        num_ablations: int = 64,
        ablation_keep_prob: float = 0.5,
        batch_size: int = 1,
        solver: Optional[BaseSolver] = None,
        prompt_template: str = DEFAULT_PROMPT_TEMPLATE,
        partitioner: Optional[BaseContextPartitioner] = None,
    ) -> None:
        """
        Initializes a new instance of the ContextCiter class, which is
        designed to assist in generating contextualized responses using a
        given machine learning model and tokenizer, tailored to specific
        queries and contexts.

        Arguments:
            model (Any):
                The model to apply ContextCite to (a HuggingFace
                ModelForCausalLM).
            tokenizer (Any):
                The tokenizer associated with the provided model.
            context (str):
                The context provided to the model.
            query (str):
                The query to pose to the model.
            source_type (str, optional):
                The type of source to partition the context into. Defaults
                to "sentence", can also be "word".
            generate_kwargs (Optional[Dict[str, Any]], optional):
                Additional keyword arguments to pass to the model's generate
                method.
            num_ablations (int, optional):
                The number of ablations used to train the surrogate model.
                Defaults to 64.
            ablation_keep_prob (float, optional):
                The probability of keeping a source when ablating the
                context. Defaults to 0.5.
            batch_size (int, optional):
                The batch size used when performing inference using ablated
                contexts. Defaults to 1.
            solver (Optional[BaseSolver], optional):
                The solver to use to compute the linear surrogate model.
                Lasso regression is used by default.
            prompt_template (str, optional):
                A template string used to create the prompt from the context
                and query.
            partitioner (Optional[BaseContextPartitioner], optional):
                A custom partitioner to split the context into sources. This
                will override "source_type" if specified.
        """
        self.model = model
        self.tokenizer = tokenizer
        if partitioner is None:
            self.partitioner = SimpleContextPartitioner(
                context, source_type=source_type
            )
        else:
            self.partitioner = partitioner
            if self.partitioner.context != context:
                raise ValueError("Partitioner context does not match provided context.")
        self.query = query
        self.generate_kwargs = generate_kwargs or DEFAULT_GENERATE_KWARGS
        self.num_ablations = num_ablations
        self.ablation_keep_prob = ablation_keep_prob
        self.batch_size = batch_size
        self.solver = solver or LassoRegression()
        self.prompt_template = prompt_template
        self._cache = {}
        self.logger = logging.getLogger("ContextCite")
        self.logger.setLevel(logging.DEBUG)

        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str,
        context: str,
        query: str,
        device: str = "cuda",
        model_kwargs: Dict[str, Any] = {},
        tokenizer_kwargs: Dict[str, Any] = {},
        **kwargs: Dict[str, Any],
    ) -> "ContextCiter":
        """
        Load a ContextCiter instance from a pretrained model.

        Arguments:
            pretrained_model_name_or_path (str):
                The name or path of the pretrained model. This can be a
                local path or a model name on the HuggingFace model hub.
            context (str):
                The context provided to the model. The context and query
                will be used to construct a prompt for the model, using the
                prompt template.
            query (str):
                The query provided to the model. The context and query will
                be used to construct a prompt for the model, using the
                prompt template.
            device (str, optional):
                The device to use. Defaults to "cuda".
            model_kwargs (Dict[str, Any], optional):
                Additional keyword arguments to pass to the model's
                constructor.
            tokenizer_kwargs (Dict[str, Any], optional):
                Additional keyword arguments to pass to the tokenizer's
                constructor.
            **kwargs (Dict[str, Any], optional):
                Additional keyword arguments to pass to the ContextCiter
                constructor.

        Returns:
            ContextCiter:
                A ContextCiter instance initialized with the provided model,
                tokenizer, context, query, and other keyword arguments.
        """
        model = AutoModelForCausalLM.from_pretrained(
            pretrained_model_name_or_path, **model_kwargs
        )
        model.to(device)
        tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path, **tokenizer_kwargs
        )
        tokenizer.padding_side = "left"
        return cls(model, tokenizer, context, query, **kwargs)

    def _get_prompt_ids(
        self,
        mask: Optional[NDArray] = None,
        return_prompt: bool = False,
    ):
        context = self.partitioner.get_context(mask)
        prompt = self.prompt_template.format(context=context, query=self.query)
        messages = [{"role": "user", "content": prompt}]
        chat_prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        chat_prompt_ids = self.tokenizer.encode(chat_prompt, add_special_tokens=False)
        if return_prompt:
            return chat_prompt_ids, chat_prompt
        else:
            return chat_prompt_ids

    @property
    def _response_start(self) -> int:
        prompt_ids = self._get_prompt_ids()
        return len(prompt_ids)

    @property
    def _output(self) -> str:
        if self._cache.get("output") is None:
            prompt_ids, prompt = self._get_prompt_ids(return_prompt=True)
            input_ids = ch.tensor([prompt_ids], device=self.model.device)
            output_ids = self.model.generate(input_ids, **self.generate_kwargs)[0]
            # We take the original prompt because sometimes encoding and
            # decoding changes it slightly.
            raw_output = self.tokenizer.decode(output_ids)
            prompt_length = len(self.tokenizer.decode(prompt_ids))
            self._cache["output"] = prompt + raw_output[prompt_length:]
        return self._cache["output"]

    @property
    def _output_tokens(self):
        return self.tokenizer(self._output, add_special_tokens=False)

    @property
    def _response_ids(self):
        return self._output_tokens["input_ids"][self._response_start :]

    @property
    def response(self) -> str:
        """
        The response generated by the model (excluding the prompt). This
        property is cached.

        Returns:
            str: The response generated by the model.
        """
        output_tokens = self._output_tokens
        char_response_start = output_tokens.token_to_chars(self._response_start).start
        return self._output[char_response_start:]

    def response_with_indices(self, split_by="word", color=True) -> str:
        """
        The response generated by the model, annotated with the starting
        index of each part.

        Arguments:
            split_by (str, optional):
                The method to split the response by. Can be "word" or
                "sentence". Defaults to "word".
            color (bool, optional):
                Whether to color the starting index of each part. Defaults
                to True.

        Returns:
            str: The response with the starting index of each part
            highlighted.
        """
        parts, separators, start_indices = split_text(self.response, split_by)
        separated_str = highlight_word_indices(parts, start_indices, separators, color)
        return separated_str

    @property
    def num_sources(self) -> int:
        """
        The number of sources within the context. I.e., the number of
        sources that the context is partitioned into.

        Returns:
            int: The number of sources in the context.
        """
        return self.partitioner.num_sources

    @property
    def sources(self) -> List[str]:
        """
        The sources within the context. I.e., the context as a list where
        each element is a source.

        Returns:
            List[str]: The sources within the context.
        """
        return self.partitioner.sources

    def _char_range_to_token_range(self, start_index, end_index):
        output_tokens = self._output_tokens
        response_start = self._response_start
        offset = output_tokens.token_to_chars(response_start).start
        ids_start_index = char_to_token(output_tokens, start_index + offset)
        ids_end_index = char_to_token(output_tokens, end_index + offset - 1) + 1
        return ids_start_index - response_start, ids_end_index - response_start

    def _indices_to_token_indices(self, start_index=None, end_index=None):
        if start_index is None or end_index is None:
            start_index = 0
            end_index = len(self.response)
        if not (0 <= start_index < end_index <= len(self.response)):
            raise ValueError(
                f"Invalid selection range ({start_index}, {end_index}). "
                f"Please select any range within (0, {len(self.response)})."
            )
        return self._char_range_to_token_range(start_index, end_index)

    def _compute_masks_and_logit_probs(self) -> None:
        self._cache["reg_masks"], self._cache["reg_logit_probs"] = (
            get_masks_and_logit_probs(
                self.model,
                self.tokenizer,
                self.num_ablations,
                self.num_sources,
                self._get_prompt_ids,
                self._response_ids,
                self.ablation_keep_prob,
                self.batch_size,
            )
        )

    @property
    def _masks(self):
        if self._cache.get("reg_masks") is None:
            self._compute_masks_and_logit_probs()
        return self._cache["reg_masks"]

    @property
    def _logit_probs(self):
        if self._cache.get("reg_logit_probs") is None:
            self._compute_masks_and_logit_probs()
        return self._cache["reg_logit_probs"]

    def _get_attributions_for_ids_range(self, ids_start_idx, ids_end_idx) -> tuple:
        outputs = aggregate_logit_probs(self._logit_probs[:, ids_start_idx:ids_end_idx])
        num_output_tokens = ids_end_idx - ids_start_idx
        weight, bias = self.solver.fit(self._masks, outputs, num_output_tokens)
        return weight, bias

    def get_attributions(
        self,
        start_idx: Optional[int] = None,
        end_idx: Optional[int] = None,
        as_dataframe: bool = False,
        top_k: Optional[int] = None,
        verbose: bool = True,
    ):
        """
        Get the attributions for (part of) the response.

        Arguments:
            start_idx (int, optional):
                Start index of the part to attribute to. If None, defaults
                to the start of the response.
            end_idx (int, optional):
                End index of the part to attribute to. If None, defaults to
                the end of the response.
            as_dataframe (bool, optional):
                If True, return the attributions as a stylized dataframe in
                sorted order. Otherwise, return them as a numpy array where
                the ith element corresponds to the score of the ith source
                within the context. Defaults to False.
            top_k (int, optional):
                Only used if as_dataframe is True. Number of top
                attributions to return. If None, all attributions are
                returned. Defaults to None.
            verbose (bool, optional):
                If True, print the selected part of the response. Defaults
                to True.

        Returns:
            NDArray | Any:
                If as_dataframe is False, return a numpy array where the ith
                element corresponds to the score of the ith source within
                the context. Otherwise, return a stylized dataframe in
                sorted order.
        """
        if self.num_sources == 0:
            self.logger.warning("No sources to attribute to!")
            return np.array([])

        if not as_dataframe and top_k is not None:
            self.logger.warning("top_k is ignored when not using dataframes.")

        ids_start_idx, ids_end_idx = self._indices_to_token_indices(start_idx, end_idx)
        selected_text = self.response[start_idx:end_idx]
        selected_tokens = self._response_ids[ids_start_idx:ids_end_idx]
        decoded_text = self.tokenizer.decode(selected_tokens)
        if selected_text.strip() not in decoded_text.strip():
            self.logger.warning(
                "Decoded selected tokens do not match selected text.\n"
                "If the following look close enough, feel free to ignore:\n"
                "What you selected: %s\nWhat is being attributed: %s",
                selected_text.strip(),
                decoded_text.strip(),
            )

        if verbose:
            print(f"Attributed: {decoded_text.strip()}")

        attributions, _bias = self._get_attributions_for_ids_range(
            ids_start_idx,
            ids_end_idx,
        )
        if as_dataframe:
            return get_attributions_df(attributions, self.partitioner, top_k=top_k)
        else:
            return attributions