""" Tokenization classes for python tokenizers.
    For fast tokenizers (provided by HuggingFace's tokenizers library) see tokenization_utils_fast.py
"""

import itertools
import logging
import re
import unicodedata
from typing import Dict, List, Optional, Tuple, Union

from .file_utils import add_end_docstrings
from .tokenization_utils_base import (
    ENCODE_KWARGS_DOCSTRING,
    ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING,
    AddedToken,
    BatchEncoding,
    EncodedInput,
    EncodedInputPair,
    PaddingStrategy,
    PreTokenizedInput,
    PreTokenizedInputPair,
    PreTrainedTokenizerBase,
    TensorType,
    TextInput,
    TextInputPair,
    TruncationStrategy,
)


logger = logging.getLogger(__name__)


def _is_whitespace(char):
    """Checks whether `char` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False


def _is_control(char):
    """Checks whether `char` is a control character."""
    # These are technically control characters but we count them as whitespace characters.
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat.startswith("C"):
        return True
    return False


def _is_punctuation(char):
    """Checks whether `char` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation. Characters such as "^", "$", and "`"
    # are not in the Unicode Punctuation class but we treat them as punctuation anyway, for consistency.
    if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False


def _is_end_of_word(text):
    """Checks whether the last character in text is one of a punctuation, control or whitespace character."""
    last_char = text[-1]
    return bool(_is_control(last_char) | _is_punctuation(last_char) | _is_whitespace(last_char))


def _is_start_of_word(text):
    """Checks whether the first character in text is one of a punctuation, control or whitespace character."""
    first_char = text[0]
    return bool(_is_control(first_char) | _is_punctuation(first_char) | _is_whitespace(first_char))
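
# Illustrative values for the helpers above (informal examples, not exercised by the library):
# _is_whitespace(" ") and _is_whitespace("\u00a0") are True (Unicode category "Zs"), _is_whitespace("a") is False;
# _is_control("\x00") is True, while "\t"/"\n"/"\r" are deliberately treated as whitespace instead;
# _is_punctuation("$") is True because ASCII symbols outside the Unicode "P*" categories are
# also counted as punctuation by the ASCII-range check.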


class PreTrainedTokenizer(PreTrainedTokenizerBase):
    """ Base class for all slow tokenizers.

    Handles all the shared methods for tokenization and special tokens as well as methods for
    downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.

    This class also contains the added tokens in a unified way on top of all tokenizers so we don't
    have to handle the specific vocabulary augmentation methods of the various underlying
    dictionary structures (BPE, sentencepiece...).

    Class attributes (overridden by derived classes):

        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary
          file required by the model, and as associated values, the filename for saving the associated file (string).
        - ``pretrained_vocab_files_map``: a python ``dict of dict``, the high-level keys being the ``__init__``
          keyword name of each vocabulary file required by the model, the low-level being the `short-cut-names`
          (string) of the pretrained models with, as associated values, the `url` (string) to the associated
          pretrained vocabulary file.
        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the
          pretrained models, and as associated values, the maximum length of the sequence inputs of this model,
          or None if the model has no maximum input size.
        - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the
          pretrained models, and as associated values, a dictionary of specific arguments to pass to the
          ``__init__`` method of the tokenizer class for this pretrained model when loading the tokenizer with the
          ``from_pretrained()`` method.

    Args:
        - ``model_max_length``: (`Optional`) int: the maximum length in number of tokens for the inputs to the
          transformer model. When the tokenizer is loaded with `from_pretrained`, this will be set to the value
          stored for the associated model in ``max_model_input_sizes`` (see above). Will default to
          VERY_LARGE_INTEGER (`int(1e30)`) if no value is provided and no associated max_length can be found in
          ``max_model_input_sizes``.
        - ``padding_side``: (`Optional`) string: the side on which the model should have padding applied.
          Should be selected between ['right', 'left'].
        - ``model_input_names``: (`Optional`) List[string]: the list of the forward pass inputs accepted by the
          model ("token_type_ids", "attention_mask"...).
        - ``bos_token``: (`Optional`) string: a beginning of sentence token.
          Will be associated to ``self.bos_token`` and ``self.bos_token_id``.
        - ``eos_token``: (`Optional`) string: an end of sentence token.
          Will be associated to ``self.eos_token`` and ``self.eos_token_id``.
        - ``unk_token``: (`Optional`) string: an unknown token.
          Will be associated to ``self.unk_token`` and ``self.unk_token_id``.
        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input
          sequence). Will be associated to ``self.sep_token`` and ``self.sep_token_id``.
        - ``pad_token``: (`Optional`) string: a padding token.
          Will be associated to ``self.pad_token`` and ``self.pad_token_id``.
        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence
          leveraging self-attention along the full depth of the model).
          Will be associated to ``self.cls_token`` and ``self.cls_token_id``.
        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language
          modeling). Will be associated to ``self.mask_token`` and ``self.mask_token_id``.
        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens.
          Adding all special tokens here ensures they won't be split by the tokenization process.
          Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids``.


    .. automethod:: __call__
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        # Added tokens are kept here, on top of the base vocabulary managed by the subclass.
        self.added_tokens_encoder: Dict[str, int] = {}
        self.added_tokens_decoder: Dict[int, str] = {}
        self.unique_no_split_tokens: List[str] = []

    @property
    def is_fast(self) -> bool:
        return False

    @property
    def vocab_size(self) -> int:
        """ Size of the base vocabulary (without the added tokens) """
        raise NotImplementedError

    def get_vocab(self) -> Dict[str, int]:
        """ Returns the vocabulary as a dict of {token: index} pairs. `tokenizer.get_vocab()[token]` is equivalent
            to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the vocab.
        """
        raise NotImplementedError()

    def get_added_vocab(self) -> Dict[str, int]:
        return self.added_tokens_encoder

    def __len__(self):
        """ Size of the full vocabulary with the added tokens """
        return self.vocab_size + len(self.added_tokens_encoder)

    def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
        """ Add a list of new tokens to the tokenizer class. If the new tokens are not in the
            vocabulary, they are added to it with indices starting from the length of the current vocabulary.

        Args:
            new_tokens: string or list of string. Each string is a token to add. Tokens are only added if they are
                not already in the vocabulary (tested by checking if the tokenizer assigns the index of the
                ``unk_token`` to them).

        Returns:
            Number of tokens added to the vocabulary.

        Examples::

            # Let's see how to increase the vocabulary of Bert model and tokenizer
            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
            model = BertModel.from_pretrained('bert-base-uncased')

            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
            print('We have added', num_added_toks, 'tokens')
            # Notice: resize_token_embeddings expects to receive the full size of the new vocabulary,
            # i.e. the length of the tokenizer.
            model.resize_token_embeddings(len(tokenizer))
        """
        new_tokens = [str(tok) for tok in new_tokens]

        tokens_to_add = []
        for token in new_tokens:
            assert isinstance(token, str)
            if not special_tokens and self.init_kwargs.get("do_lower_case", False):
                token = token.lower()
            if (
                token != self.unk_token
                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
                and token not in tokens_to_add
            ):
                tokens_to_add.append(token)
                if self.verbose:
                    logger.info("Adding %s to the vocabulary", token)

        # New tokens get indices right after the current full vocabulary (base vocab + previously added tokens).
        added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(tokens_to_add))
        added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}
        self.added_tokens_encoder.update(added_tok_encoder)
        self.added_tokens_decoder.update(added_tok_decoder)

        # Make sure we don't split on any special tokens (even if they were already in the vocab before)
        if special_tokens:
            self.unique_no_split_tokens = list(set(self.unique_no_split_tokens).union(set(new_tokens)))
        else:
            # Or on the newly added tokens
            self.unique_no_split_tokens = list(set(self.unique_no_split_tokens).union(set(tokens_to_add)))

        return len(tokens_to_add)

    def num_special_tokens_to_add(self, pair: bool = False) -> int:
        """
        Returns the number of added tokens when encoding a sequence with special tokens.

        Note:
            This encodes inputs and checks the number of added tokens, and is therefore not efficient.
            Do not put this inside your training loop.

        Args:
            pair: Returns the number of added tokens in the case of a sequence pair if set to True, returns the
                number of added tokens in the case of a single sequence if set to False.

        Returns:
            Number of tokens added to sequences
        """
        token_ids_0 = []
        token_ids_1 = []
        return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))

    def tokenize(self, text: TextInput, **kwargs) -> List[str]:
        """ Converts a string in a sequence of tokens (string), using the tokenizer.
            Split in words for word-based vocabulary or sub-words for sub-word-based
            vocabularies (BPE/SentencePieces/WordPieces).

            Takes care of added tokens: they are never split by the underlying ``_tokenize`` method.
        """
        # Simple mapping string => AddedToken for special tokens with specific tokenization behaviors
        all_special_tokens_extended = dict(
            (str(t), t) for t in self.all_special_tokens_extended if isinstance(t, AddedToken)
        )

        text, kwargs = self.prepare_for_tokenization(text, **kwargs)

        if kwargs:
            logger.warning(f"Keyword arguments {kwargs} not recognized.")

        if self.init_kwargs.get("do_lower_case", False):
            # Convert non-special tokens to lowercase while leaving the special tokens untouched
            escaped_special_toks = [re.escape(s_tok) for s_tok in self.all_special_tokens]
            pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
            text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)

        def split_on_token(tok, text):
            result = []
            tok_extended = all_special_tokens_extended.get(tok, None)
            split_text = text.split(tok)
            full_word = ""
            for i, sub_text in enumerate(split_text):
                # AddedToken can control whitespace stripping around them.
                # We use them for GPT2 and Roberta to have different behavior depending on the special token
                # Cf. https://github.com/huggingface/transformers/pull/2778
                # and https://github.com/huggingface/transformers/issues/3788
                if isinstance(tok_extended, AddedToken):
                    if tok_extended.single_word:
                        # Try to avoid splitting on token
                        if (
                            i < len(split_text) - 1
                            and not _is_end_of_word(sub_text)
                            and not _is_start_of_word(split_text[i + 1])
                        ):
                            # Don't extract the special token
                            full_word += sub_text + tok
                        elif full_word:
                            full_word += sub_text
                            result += [full_word]
                            full_word = ""
                            continue
                    # Strip white spaces on the right
                    if tok_extended.rstrip and i > 0:
                        # A bit counter-intuitive but we strip the left of the string
                        # since tok_extended.rstrip means the special token is eating all white spaces on its right
                        sub_text = sub_text.lstrip()
                    # Strip white spaces on the left
                    if tok_extended.lstrip and i < len(split_text) - 1:
                        sub_text = sub_text.rstrip()  # Opposite here
                else:
                    # We strip left and right by default
                    if i < len(split_text) - 1:
                        sub_text = sub_text.rstrip()
                    if i > 0:
                        sub_text = sub_text.lstrip()

                if i == 0 and not sub_text:
                    result += [tok]
                elif i == len(split_text) - 1:
                    if sub_text:
                        result += [sub_text]
                else:
                    if sub_text:
                        result += [sub_text]
                    result += [tok]
            return result

        def split_on_tokens(tok_list, text):
            if not text.strip():
                return []
            if not tok_list:
                return self._tokenize(text)

            tokenized_text = []
            text_list = [text]
            for tok in tok_list:
                tokenized_text = []
                for sub_text in text_list:
                    if sub_text not in self.unique_no_split_tokens:
                        tokenized_text += split_on_token(tok, sub_text)
                    else:
                        tokenized_text += [sub_text]
                text_list = tokenized_text

            # Finally run the model-specific tokenization on every piece that is not a no-split token
            return list(
                itertools.chain.from_iterable(
                    (
                        self._tokenize(token) if token not in self.unique_no_split_tokens else [token]
                        for token in tokenized_text
                    )
                )
            )

        no_split_token = self.unique_no_split_tokens
        tokenized_text = split_on_tokens(no_split_token, text)
        return tokenized_text
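
    # For illustration (hypothetical tokens): after `tokenizer.add_tokens(["new_tok"])`,
    # `tokenizer.tokenize("hello new_tok world")` keeps "new_tok" as a single piece and only the
    # surrounding text is handed to the model-specific `_tokenize` below, so a WordPiece model
    # would return something like ["hello", "new_tok", "world"] instead of splitting "new_tok".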

    def _tokenize(self, text, **kwargs):
        """ Converts a string in a sequence of tokens (string), using the tokenizer.
            Split in words for word-based vocabulary or sub-words for sub-word-based
            vocabularies (BPE/SentencePieces/WordPieces).

            Do NOT take care of added tokens.
        """
        raise NotImplementedError

    def convert_tokens_to_ids(self, tokens: Union[str, List[str]]) -> Union[int, List[int]]:
        """ Converts a token string (or a sequence of tokens) in a single integer id
            (or a sequence of ids), using the vocabulary.
        """
        if tokens is None:
            return None

        if isinstance(tokens, str):
            return self._convert_token_to_id_with_added_voc(tokens)

        ids = []
        for token in tokens:
            ids.append(self._convert_token_to_id_with_added_voc(token))
        return ids

    def _convert_token_to_id_with_added_voc(self, token):
        if token is None:
            return None

        if token in self.added_tokens_encoder:
            return self.added_tokens_encoder[token]
        return self._convert_token_to_id(token)

    def _convert_token_to_id(self, token):
        raise NotImplementedError

    def _encode_plus(
        self,
        text: Union[TextInput, PreTokenizedInput, EncodedInput],
        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
        add_special_tokens: bool = True,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_pretokenized: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        **kwargs
    ) -> BatchEncoding:
        def get_input_ids(text):
            if isinstance(text, str):
                tokens = self.tokenize(text, **kwargs)
                return self.convert_tokens_to_ids(tokens)
            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):
                if is_pretokenized:
                    tokens = list(itertools.chain(*(self.tokenize(t, is_pretokenized=True, **kwargs) for t in text)))
                    return self.convert_tokens_to_ids(tokens)
                else:
                    return self.convert_tokens_to_ids(text)
            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):
                return text
            else:
                if is_pretokenized:
                    raise ValueError(
                        f"Input {text} is not valid. Should be a string or a list/tuple of strings when"
                        " `is_pretokenized=True`."
                    )
                else:
                    raise ValueError(
                        f"Input {text} is not valid. Should be a string, a list/tuple of strings or a list/tuple"
                        " of integers."
                    )

        if return_offsets_mapping:
            raise NotImplementedError(
                "return_offset_mapping is not available when using Python tokenizers. "
                "To use this feature, change your tokenizer to one deriving from "
                "transformers.PreTrainedTokenizerFast. "
                "More information on available tokenizers at "
                "https://github.com/huggingface/transformers/pull/2674"
            )

        first_ids = get_input_ids(text)
        second_ids = get_input_ids(text_pair) if text_pair is not None else None

        return self.prepare_for_model(
            first_ids,
            pair_ids=second_ids,
            add_special_tokens=add_special_tokens,
            padding=padding_strategy.value,
            truncation=truncation_strategy.value,
            max_length=max_length,
            stride=stride,
            pad_to_multiple_of=pad_to_multiple_of,
            return_tensors=return_tensors,
            prepend_batch_axis=True,
            return_attention_mask=return_attention_mask,
            return_token_type_ids=return_token_type_ids,
            return_overflowing_tokens=return_overflowing_tokens,
            return_special_tokens_mask=return_special_tokens_mask,
            return_length=return_length,
            verbose=verbose,
        )

    def _batch_encode_plus(
        self,
        batch_text_or_text_pairs: Union[
            List[TextInput],
            List[TextInputPair],
            List[PreTokenizedInput],
            List[PreTokenizedInputPair],
            List[EncodedInput],
            List[EncodedInputPair],
        ],
        add_special_tokens: bool = True,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_pretokenized: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        **kwargs
    ) -> BatchEncoding:
        def get_input_ids(text):
            if isinstance(text, str):
                tokens = self.tokenize(text, **kwargs)
                return self.convert_tokens_to_ids(tokens)
            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):
                if is_pretokenized:
                    tokens = list(itertools.chain(*(self.tokenize(t, is_pretokenized=True, **kwargs) for t in text)))
                    return self.convert_tokens_to_ids(tokens)
                else:
                    return self.convert_tokens_to_ids(text)
            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):
                return text
            else:
                raise ValueError(
                    "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
                )

        if return_offsets_mapping:
            raise NotImplementedError(
                "return_offset_mapping is not available when using Python tokenizers. "
                "To use this feature, change your tokenizer to one deriving from "
                "transformers.PreTrainedTokenizerFast."
            )

        input_ids = []
        for ids_or_pair_ids in batch_text_or_text_pairs:
            if not isinstance(ids_or_pair_ids, (list, tuple)):
                ids, pair_ids = ids_or_pair_ids, None
            elif is_pretokenized and not isinstance(ids_or_pair_ids[0], (list, tuple)):
                ids, pair_ids = ids_or_pair_ids, None
            else:
                ids, pair_ids = ids_or_pair_ids

            first_ids = get_input_ids(ids)
            second_ids = get_input_ids(pair_ids) if pair_ids is not None else None
            input_ids.append((first_ids, second_ids))

        batch_outputs = self._batch_prepare_for_model(
            input_ids,
            add_special_tokens=add_special_tokens,
            padding_strategy=padding_strategy,
            truncation_strategy=truncation_strategy,
            max_length=max_length,
            stride=stride,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
            return_token_type_ids=return_token_type_ids,
            return_overflowing_tokens=return_overflowing_tokens,
            return_special_tokens_mask=return_special_tokens_mask,
            return_length=return_length,
            return_tensors=return_tensors,
            verbose=verbose,
        )

        return BatchEncoding(batch_outputs)

    @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
    def _batch_prepare_for_model(
        self,
        batch_ids_pairs: List[Union[PreTokenizedInputPair, Tuple[List[int], None]]],
        add_special_tokens: bool = True,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
        max_length: Optional[int] = None,
        stride: int = 0,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[str] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_length: bool = False,
        verbose: bool = True,
    ) -> BatchEncoding:
        """ Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the
            model. It adds special tokens, truncates sequences if overflowing while taking into account the special
            tokens and manages a moving window (with user defined stride) for overflowing tokens.

        Args:
            batch_ids_pairs: list of tokenized input ids or input ids pairs
        """
        batch_outputs = {}
        for first_ids, second_ids in batch_ids_pairs:
            outputs = self.prepare_for_model(
                first_ids,
                second_ids,
                add_special_tokens=add_special_tokens,
                padding=PaddingStrategy.DO_NOT_PAD.value,  # we pad in batch afterwards
                truncation=truncation_strategy.value,
                max_length=max_length,
                stride=stride,
                pad_to_multiple_of=None,  # we pad in batch afterwards
                return_attention_mask=False,  # we pad in batch afterwards
                return_token_type_ids=return_token_type_ids,
                return_overflowing_tokens=return_overflowing_tokens,
                return_special_tokens_mask=return_special_tokens_mask,
                return_length=return_length,
                return_tensors=None,  # we convert the whole batch to tensors at the end
                prepend_batch_axis=False,
                verbose=verbose,
            )

            for key, value in outputs.items():
                if key not in batch_outputs:
                    batch_outputs[key] = []
                batch_outputs[key].append(value)

        batch_outputs = self.pad(
            batch_outputs,
            padding=padding_strategy.value,
            max_length=max_length,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
        )

        batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)

        return batch_outputs

    def prepare_for_tokenization(self, text: str, is_pretokenized: bool = False, **kwargs) -> Tuple[str, dict]:
        """ Performs any necessary transformations before tokenization.

            This method should pop the arguments from kwargs and return kwargs as well.
            We test kwargs at the end of the encoding process to be sure all the arguments have been used.
        """
        return (text, kwargs)

    def get_special_tokens_mask(
        self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer ``prepare_for_model`` method.

        Args:
            token_ids_0: list of ids (must not contain special tokens)
            token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
                for sequence pairs
            already_has_special_tokens: (default False) Set to True if the token list is already formatted with
                special tokens for the model

        Returns:
            A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        return [0] * ((len(token_ids_1) if token_ids_1 else 0) + len(token_ids_0))

    def convert_ids_to_tokens(
        self, ids: Union[int, List[int]], skip_special_tokens: bool = False
    ) -> Union[str, List[str]]:
        """ Converts a single index or a sequence of indices (integers) into a token
            (resp. a sequence of tokens (str)), using the vocabulary and added tokens.

            Args:
                skip_special_tokens: Don't decode special tokens (self.all_special_tokens). Default: False
        """
        if isinstance(ids, int):
            if ids in self.added_tokens_decoder:
                return self.added_tokens_decoder[ids]
            else:
                return self._convert_id_to_token(ids)
        tokens = []
        for index in ids:
            index = int(index)
            if skip_special_tokens and index in self.all_special_ids:
                continue
            if index in self.added_tokens_decoder:
                tokens.append(self.added_tokens_decoder[index])
            else:
                tokens.append(self._convert_id_to_token(index))
        return tokens

    def _convert_id_to_token(self, index: int) -> str:
        raise NotImplementedError

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """ Converts a sequence of tokens (string) in a single string.
            The most simple way to do it is ' '.join(self.convert_ids_to_tokens(token_ids))
            but we often want to remove sub-word tokenization artifacts at the same time.
        """
        return " ".join(tokens)

    def decode(
        self, token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True
    ) -> str:
        filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)

        # Added tokens are joined as-is while runs of regular tokens go through
        # convert_tokens_to_string, so sub-word artifacts are cleaned up piecewise.
        sub_texts = []
        current_sub_text = []
        for token in filtered_tokens:
            if skip_special_tokens and token in self.all_special_tokens:
                continue
            if token in self.added_tokens_encoder:
                if current_sub_text:
                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))
                    current_sub_text = []
                sub_texts.append(token)
            else:
                current_sub_text.append(token)
        if current_sub_text:
            sub_texts.append(self.convert_tokens_to_string(current_sub_text))
        text = " ".join(sub_texts)

        if clean_up_tokenization_spaces:
            clean_text = self.clean_up_tokenization(text)
            return clean_text
        else:
            return text

    def save_vocabulary(self, save_directory) -> Tuple[str]:
        """ Save the tokenizer vocabulary to a directory. This method does *NOT* save added tokens
            and special token mappings.

            Please use :func:`~transformers.PreTrainedTokenizer.save_pretrained` to save the full
            Tokenizer state if you want to reload it using the
            :func:`~transformers.PreTrainedTokenizer.from_pretrained` class method.
        """
        raise NotImplementedError
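

# ---------------------------------------------------------------------------
# Minimal usage sketch (not executed on import). This assumes a concrete
# subclass of PreTrainedTokenizer is available -- e.g. BertTokenizer from the
# `transformers` package, as in the `add_tokens` docstring above -- and that
# the 'bert-base-uncased' vocabulary can be downloaded or is cached locally.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    from transformers import BertTokenizer  # assumption: the transformers package is installed

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.is_fast)  # slow (pure Python) tokenizers report False

    # String -> tokens -> ids -> string round trip.
    tokens = tokenizer.tokenize("Hello, world!")
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(tokens, ids)
    print(tokenizer.decode(tokenizer.encode("Hello, world!")))

    # Added tokens extend the vocabulary and are never split by `tokenize`.
    num_added = tokenizer.add_tokens(["new_tok1", "my_new-tok2"])
    print("We have added", num_added, "tokens; new vocabulary size:", len(tokenizer))
    print(tokenizer.tokenize("hello new_tok1 world"))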