"""Tokenization classes for BLT."""

import os
from shutil import copyfile
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple

import sentencepiece as spm

from ...convert_slow_tokenizer import import_protobuf
from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging
from ...utils.import_utils import requires


if TYPE_CHECKING:
    from ...tokenization_utils_base import TextInput

logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}

SPIECE_UNDERLINE = "▁"

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""


@requires(backends=("sentencepiece",))
class BLTTokenizer(PreTrainedTokenizer):
    """
    Construct a BLT tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is
    no padding token in the original model.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be
            this token instead.
        bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<s>"`):
            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier
            token.
        eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        pad_token (`str` or `tokenizers.AddedToken`, *optional*):
            A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by
            attention mechanisms or loss computation.
        sp_model_kwargs (`Dict[str, Any]`, *optional*):
            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other
            things, to set:

            - `enable_sampling`: Enable subword regularization.
            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.

              - `nbest_size = {0,1}`: No sampling is performed.
              - `nbest_size > 1`: samples from the nbest_size results.
              - `nbest_size < 0`: assuming that nbest_size is infinite and samples from all hypotheses (lattice)
                using forward-filtering-and-backward-sampling algorithm.

            - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
              BPE-dropout.

        add_bos_token (`bool`, *optional*, defaults to `True`):
            Whether or not to add a `bos_token` at the start of sequences.
        add_eos_token (`bool`, *optional*, defaults to `False`):
            Whether or not to add an `eos_token` at the end of sequences.
        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
            Whether or not to clean up spaces after decoding; cleanup consists of removing potential artifacts like
            extra spaces.
        use_default_system_prompt (`bool`, *optional*, defaults to `False`):
            Whether or not the default system prompt for BLT should be used.
        spaces_between_special_tokens (`bool`, *optional*, defaults to `False`):
            Whether or not to add spaces between special tokens.
        legacy (`bool`, *optional*):
            Whether or not the `legacy` behavior of the tokenizer should be used. Legacy is before the merge of #24622
            and #25224 which includes fixes to properly handle tokens that appear after special tokens. Make sure to
            also set `from_slow` to `True`. A simple example:

            - `legacy=True`:
            ```python
            >>> from transformers import BLTTokenizerFast

            >>> tokenizer = BLTTokenizerFast.from_pretrained("huggyblt/blt-7b", legacy=True, from_slow=True)
            >>> tokenizer.encode("Hello <s>.")  # 869 is '▁.'
            [1, 15043, 29871, 1, 869]
            ```
            - `legacy=False`:
            ```python
            >>> from transformers import BLTTokenizerFast

            >>> tokenizer = BLTTokenizerFast.from_pretrained("huggyblt/blt-7b", legacy=False, from_slow=True)
            >>> tokenizer.encode("Hello <s>.")  # 29889 is '.'
            [1, 15043, 29871, 1, 29889]
            ```
            Check out the [pull request](https://github.com/huggingface/transformers/pull/24565) for more details.
        add_prefix_space (`bool`, *optional*, defaults to `True`):
            Whether or not to add an initial space to the input. This allows the leading word to be treated just like
            any other word. Again, this should be set with `from_slow=True` to make sure it's taken into account.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token=None,
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        add_bos_token=True,
        add_eos_token=False,
        clean_up_tokenization_spaces=False,
        use_default_system_prompt=False,
        spaces_between_special_tokens=False,
        legacy=None,
        add_prefix_space=True,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        bos_token = AddedToken(bos_token, normalized=False, special=True) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, normalized=False, special=True) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, normalized=False, special=True) if isinstance(unk_token, str) else unk_token
        pad_token = AddedToken(pad_token, normalized=False, special=True) if isinstance(pad_token, str) else pad_token

        if legacy is None:
            logger.warning_once(
                f"You are using the default legacy behaviour of the {self.__class__}. This is"
                " expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes"
                " for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you"
                " understand what it means, and thoroughly read the reason why this was added as explained in"
                " https://github.com/huggingface/transformers/pull/24565 - if you loaded a blt tokenizer from a GGUF"
                " file you can ignore this message"
            )
            legacy = True

        self.legacy = legacy
        self.vocab_file = vocab_file
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
        self.use_default_system_prompt = use_default_system_prompt
        self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
        self.add_prefix_space = add_prefix_space

        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            add_bos_token=add_bos_token,
            add_eos_token=add_eos_token,
            sp_model_kwargs=self.sp_model_kwargs,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            use_default_system_prompt=use_default_system_prompt,
            spaces_between_special_tokens=spaces_between_special_tokens,
            legacy=legacy,
            add_prefix_space=add_prefix_space,
            **kwargs,
        )

    @property
    def unk_token_length(self):
        return len(self.sp_model.encode(str(self.unk_token)))

    def get_spm_processor(self, from_slow=False):
        tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        if self.legacy or from_slow:  # no dependency on protobuf
            tokenizer.Load(self.vocab_file)
            return tokenizer

        with open(self.vocab_file, "rb") as f:
            sp_model = f.read()
            model_pb2 = import_protobuf(f"The new behaviour of {self.__class__.__name__} (with `self.legacy = False`)")
            model = model_pb2.ModelProto.FromString(sp_model)
            normalizer_spec = model_pb2.NormalizerSpec()
            normalizer_spec.add_dummy_prefix = False
            model.normalizer_spec.MergeFrom(normalizer_spec)
            sp_model = model.SerializeToString()
            tokenizer.LoadFromSerializedProto(sp_model)
        return tokenizer

    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None
        state["sp_model_proto"] = self.sp_model.serialized_model_proto()
        return state

    def __setstate__(self, d):
        self.__dict__.update(d)
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.LoadFromSerializedProto(self.sp_model_proto)

    @property
    def vocab_size(self):
        """Returns vocab size"""
        return self.sp_model.get_piece_size()

    def get_vocab(self):
        """Returns vocab as a dict"""
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def tokenize(self, text: "TextInput", **kwargs) -> List[str]:
        """
        Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
        first token is special.
        """
        if self.legacy or len(text) == 0:
            return super().tokenize(text, **kwargs)

        text = text.replace(SPIECE_UNDERLINE, " ")
        if self.add_prefix_space:
            text = SPIECE_UNDERLINE + text

        tokens = super().tokenize(text, **kwargs)

        if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
            tokens = tokens[1:]
        return tokens

    def _tokenize(self, text, **kwargs):
        """
        Returns a tokenized string.

        We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
        SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
        `['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
        `unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
        `self.tokenizer.sp_model.encode("<unk> Hey", out_type = str)[4:]`.
        """
        if self.legacy or not text.startswith((SPIECE_UNDERLINE, " ")):
            return self.sp_model.encode(text, out_type=str)

        # 1. Encode the string prefixed with the unk token, e.g. "<unk> Hey"
        tokens = self.sp_model.encode(self.unk_token + text, out_type=str)
        # 2. Strip the unk token pieces from the front of the result
        return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        token = self.sp_model.IdToPiece(index)
        return token

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        # since we manually add the prefix space, we have to remove it when decoding
        if tokens[0].startswith(SPIECE_UNDERLINE) and self.add_prefix_space:
            tokens[0] = tokens[0][1:]

        current_sub_tokens = []
        out_string = ""
        prev_is_special = False
        for i, token in enumerate(tokens):
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                if not prev_is_special and i != 0 and self.legacy:
                    out_string += " "
                out_string += self.sp_model.decode(current_sub_tokens) + token
                prev_is_special = True
                current_sub_tokens = []
            else:
                if prev_is_special and i == 1 and self.add_prefix_space and not token.startswith(SPIECE_UNDERLINE):
                    out_string += " "
                current_sub_tokens.append(token)
                prev_is_special = False
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string

    def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary and special tokens file to a directory.

        Args:
            save_directory (`str`):
                The directory in which to save the vocabulary.

        Returns:
            `Tuple(str)`: Paths to the files saved.
        """
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
            copyfile(self.vocab_file, out_vocab_file)
        elif not os.path.isfile(self.vocab_file):
            with open(out_vocab_file, "wb") as fi:
                content_spiece_model = self.sp_model.serialized_model_proto()
                fi.write(content_spiece_model)

        return (out_vocab_file,)

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = bos_token_id + token_ids_0 + eos_token_id

        if token_ids_1 is not None:
            output = output + bos_token_id + token_ids_1 + eos_token_id

        return output

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. Returns a list of integers in the
        range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        bos_token_id = [1] if self.add_bos_token else []
        eos_token_id = [1] if self.add_eos_token else []

        if token_ids_1 is None:
            return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
        return (
            bos_token_id
            + ([0] * len(token_ids_0))
            + eos_token_id
            + bos_token_id
            + ([0] * len(token_ids_1))
            + eos_token_id
        )

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Creates a token type ID mask from the two sequences passed, to be used in a sequence-pair classification
        task: 0s for the first sequence (and its special tokens), 1s for the second. If `token_ids_1` is None, only
        the first portion of the mask (0s) is returned.
        """
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)

        if token_ids_1 is not None:
            output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)

        return output


__all__ = ["BLTTokenizer"]
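

# Usage sketch (illustrative only, not part of the module): a minimal example of the
# tokenizer API defined above. It assumes the class is re-exported from the top-level
# `transformers` package and that a local SentencePiece model file named
# "tokenizer.model" exists; both names are assumptions for illustration.
#
#     from transformers import BLTTokenizer
#
#     tokenizer = BLTTokenizer("tokenizer.model")
#     encoding = tokenizer("Hello world")
#     # input_ids starts with the BOS id because add_bos_token defaults to True
#     print(encoding["input_ids"])
#     print(tokenizer.decode(encoding["input_ids"], skip_special_tokens=True))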