import json
import os
import re
from typing import List, Optional, Union, Dict

from sentencepiece import SentencePieceProcessor
from transformers import PreTrainedTokenizer
from transformers.utils import logging, PaddingStrategy
from transformers.tokenization_utils_base import EncodedInput, BatchEncoding


class SPTokenizer:
    def __init__(self, model_path: str):
        # Reload the SentencePiece tokenizer from the given model file.
        assert os.path.isfile(model_path), model_path
        self.sp_model = SentencePieceProcessor(model_file=model_path)

        # BOS / EOS token IDs
        self.n_words: int = self.sp_model.vocab_size()
        self.bos_id: int = self.sp_model.bos_id()
        self.eos_id: int = self.sp_model.eos_id()
        self.pad_id: int = self.sp_model.unk_id()
        assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()

        role_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|observation|>"]
        special_tokens = ["[MASK]", "[gMASK]", "[sMASK]", "sop", "eop"] + role_special_tokens
        self.special_tokens = {}
        self.index_special_tokens = {}
        for token in special_tokens:
            self.special_tokens[token] = self.n_words
            self.index_special_tokens[self.n_words] = token
            self.n_words += 1
        self.role_special_token_expression = "|".join([re.escape(token) for token in role_special_tokens])

    def tokenize(self, s: str, encode_special_tokens=False):
        if encode_special_tokens:
            last_index = 0
            t = []
            for match in re.finditer(self.role_special_token_expression, s):
                if last_index < match.start():
                    t.extend(self.sp_model.EncodeAsPieces(s[last_index:match.start()]))
                t.append(s[match.start():match.end()])
                last_index = match.end()
            if last_index < len(s):
                t.extend(self.sp_model.EncodeAsPieces(s[last_index:]))
            return t
        else:
            return self.sp_model.EncodeAsPieces(s)

    def encode(self, s: str, bos: bool = False, eos: bool = False) -> List[int]:
        assert type(s) is str
        t = self.sp_model.encode(s)
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t

    def decode(self, t: List[int]) -> str:
        text, buffer = "", []
        for token in t:
            if token in self.index_special_tokens:
                if buffer:
                    text += self.sp_model.decode(buffer)
                    buffer = []
                text += self.index_special_tokens[token]
            else:
                buffer.append(token)
        if buffer:
            text += self.sp_model.decode(buffer)
        return text

    def decode_tokens(self, tokens: List[str]) -> str:
        text = self.sp_model.DecodePieces(tokens)
        return text

    def convert_token_to_id(self, token):
        """ Converts a token (str) in an id using the vocab. """
        if token in self.special_tokens:
            return self.special_tokens[token]
        return self.sp_model.PieceToId(token)

    def convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        if index in self.index_special_tokens:
            return self.index_special_tokens[index]
        if index in [self.eos_id, self.bos_id, self.pad_id] or index < 0:
            return ""
        return self.sp_model.IdToPiece(index)


class ChatGLMTokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}

    model_input_names = ["input_ids", "attention_mask", "position_ids"]

    def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces=False,
                 encode_special_tokens=False, **kwargs):
        self.name = "GLMTokenizer"

        self.vocab_file = vocab_file
        self.tokenizer = SPTokenizer(vocab_file)
        self.special_tokens = {
            "<bos>": self.tokenizer.bos_id,
            "<eos>": self.tokenizer.eos_id,
            "<pad>": self.tokenizer.pad_id
        }
        self.encode_special_tokens = encode_special_tokens
        super().__init__(padding_side=padding_side,
                         clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                         encode_special_tokens=encode_special_tokens,
                         **kwargs)

    def get_command(self, token):
        if token in self.special_tokens:
            return self.special_tokens[token]
        assert token in self.tokenizer.special_tokens, f"{token} is not a special token for {self.name}"
        return self.tokenizer.special_tokens[token]

    @property
    def unk_token(self) -> str:
        return "<unk>"

    @property
    def pad_token(self) -> str:
        return "<unk>"

    @property
    def pad_token_id(self):
        return self.get_command("<pad>")

    @property
    def eos_token(self) -> str:
        return "</s>"

    @property
    def eos_token_id(self):
        return self.get_command("<eos>")

    @property
    def vocab_size(self):
        return self.tokenizer.n_words

    def get_vocab(self):
        """ Returns vocab as a dict """
        vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def _tokenize(self, text, **kwargs):
        return self.tokenizer.tokenize(text, encode_special_tokens=self.encode_special_tokens)

    def _convert_token_to_id(self, token):
        """ Converts a token (str) in an id using the vocab. """
        return self.tokenizer.convert_token_to_id(token)

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.tokenizer.convert_id_to_token(index)

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        return self.tokenizer.decode_tokens(tokens)

    def save_vocabulary(self, save_directory, filename_prefix=None):
        """
        Save the vocabulary and special tokens file to a directory.

        Args:
            save_directory (`str`):
                The directory in which to save the vocabulary.
            filename_prefix (`str`, *optional*):
                An optional prefix to add to the names of the saved files.

        Returns:
            `Tuple(str)`: Paths to the files saved.
        """
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(save_directory, self.vocab_files_names["vocab_file"])
        else:
            vocab_file = save_directory

        with open(self.vocab_file, 'rb') as fin:
            proto_str = fin.read()

        with open(vocab_file, "wb") as writer:
            writer.write(proto_str)

        return (vocab_file,)

    def get_prefix_tokens(self):
        prefix_tokens = [self.get_command("[gMASK]"), self.get_command("sop")]
        return prefix_tokens

    def build_single_message(self, role, metadata, message):
        assert role in ["system", "user", "assistant", "observation"], role
        role_tokens = [self.get_command(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n")
        message_tokens = self.tokenizer.encode(message)
        tokens = role_tokens + message_tokens
        return tokens

    def build_chat_input(self, query, history=None, role="user"):
        if history is None:
            history = []
        input_ids = []
        for item in history:
            content = item["content"]
            if item["role"] == "system" and "tools" in item:
                content = content + "\n" + json.dumps(item["tools"], indent=4, ensure_ascii=False)
            input_ids.extend(self.build_single_message(item["role"], item.get("metadata", ""), content))
        input_ids.extend(self.build_single_message(role, "", query))
        input_ids.extend([self.get_command("<|assistant|>")])
        return self.batch_encode_plus([input_ids], return_tensors="pt", is_split_into_words=True)

    def build_inputs_with_special_tokens(
            self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating
        and adding special tokens. A BERT sequence has the following format:

        - single sequence: `[CLS] X [SEP]`
        - pair of sequences: `[CLS] A [SEP] B [SEP]`

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        prefix_tokens = self.get_prefix_tokens()
        token_ids_0 = prefix_tokens + token_ids_0
        if token_ids_1 is not None:
            token_ids_0 = token_ids_0 + token_ids_1 + [self.get_command("<eos>")]
        return token_ids_0

    def _pad(
            self,
            encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
            max_length: Optional[int] = None,
            padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
            pad_to_multiple_of: Optional[int] = None,
            return_attention_mask: Optional[bool] = None,
    ) -> dict:
        """
        Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

        Args:
            encoded_inputs:
                Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
            max_length: maximum length of the returned list and optionally padding length (see below).
                Will truncate by taking into account the special tokens.
            padding_strategy: PaddingStrategy to use for padding.

                - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
                - PaddingStrategy.DO_NOT_PAD: Do not pad
                The tokenizer padding sides are defined in self.padding_side:

                    - 'left': pads on the left of the sequences
                    - 'right': pads on the right of the sequences
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask:
                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
        """
        # This tokenizer only supports left padding.
        assert self.padding_side == "left"

        required_input = encoded_inputs[self.model_input_names[0]]
        seq_length = len(required_input)

        if padding_strategy == PaddingStrategy.LONGEST:
            max_length = len(required_input)

        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

        needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length

        # Initialize attention mask and position ids if not present.
        if "attention_mask" not in encoded_inputs:
            encoded_inputs["attention_mask"] = [1] * seq_length

        if "position_ids" not in encoded_inputs:
            encoded_inputs["position_ids"] = list(range(seq_length))

        if needs_to_be_padded:
            difference = max_length - len(required_input)

            if "attention_mask" in encoded_inputs:
                encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
            if "position_ids" in encoded_inputs:
                encoded_inputs["position_ids"] = [0] * difference + encoded_inputs["position_ids"]
            encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input

        return encoded_inputs
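

# ---------------------------------------------------------------------------
# Minimal usage sketch (an illustrative addition, not part of the upstream
# module). It assumes a checkpoint directory containing the SentencePiece
# file `tokenizer.model`; the path below is a hypothetical placeholder.
if __name__ == "__main__":
    tokenizer = ChatGLMTokenizer(vocab_file="path/to/tokenizer.model")

    # Plain encoding: encode() prepends the [gMASK]/sop prefix ids via
    # build_inputs_with_special_tokens() before the text ids.
    ids = tokenizer.encode("a photo of a cat")
    tokens = tokenizer.tokenize("a photo of a cat")
    print(ids)
    print(tokens)

    # Chat-style prompt ending in <|assistant|>; build_chat_input() returns
    # PyTorch tensors (return_tensors="pt"), so torch must be installed.
    inputs = tokenizer.build_chat_input("Describe this picture.", role="user")
    print(inputs["input_ids"])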