"""BLT (Byte Latent Transformer) model configuration"""

from enum import Enum

from ...configuration_utils import PretrainedConfig
from ...utils import logging


logger = logging.get_logger(__name__)


class InitStdFactor(str, Enum):
    DISABLED = "disabled"
    CURRENT_DEPTH = "current_depth"


class PatchingModeEnum(str, Enum):
    entropy = "entropy"
    bpe = "bpe"
    bpe_patcher = "bpe_patcher"
    space = "space"
    static = "static"
    byte = "byte"


class BLTConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`ByteLatentTransformer`]. It is used to
    instantiate a BLT model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 256):
            Vocabulary size of the BLT model. Defines the number of different tokens (bytes) that can be represented.
        max_seqlen (`int`, *optional*, defaults to 1024):
            The maximum sequence length that this model can handle.

        # Main architecture dimensions
        dim (`int`, *optional*, defaults to 512):
            Main dimension of the model.
        n_layers (`int`, *optional*, defaults to 8):
            Number of layers in the main transformer.
        n_heads (`int`, *optional*, defaults to 8):
            Number of attention heads in the main transformer.
        head_dim (`int`, *optional*):
            Dimension of each attention head. If not specified, computed as dim // n_heads.
        n_kv_heads (`int`, *optional*):
            Number of key-value heads for grouped query attention. If not specified, defaults to n_heads.

        # Component-specific dimensions
        dim_global (`int`, *optional*, defaults to 512):
            Dimension of the global transformer component.
        dim_local_decoder (`int`, *optional*, defaults to 512):
            Dimension of the local decoder component.
        dim_local_encoder (`int`, *optional*, defaults to 512):
            Dimension of the local encoder component.
        n_layers_global (`int`, *optional*, defaults to 8):
            Number of layers in the global transformer.
        n_layers_local_decoder (`int`, *optional*, defaults to 8):
            Number of layers in the local decoder.
        n_layers_local_encoder (`int`, *optional*, defaults to 8):
            Number of layers in the local encoder.
        n_heads_global (`int`, *optional*, defaults to 8):
            Number of attention heads in the global transformer.
        n_heads_local_decoder (`int`, *optional*, defaults to 8):
            Number of attention heads in the local decoder.
        n_heads_local_encoder (`int`, *optional*, defaults to 8):
            Number of attention heads in the local encoder.
        n_kv_heads_global (`int`, *optional*):
            Number of key-value heads in the global transformer.

        # Transformer configuration
        norm_eps (`float`, *optional*, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        dropout (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers.
        ffn_dim_multiplier (`float`, *optional*, defaults to 1.0):
            Multiplier for the feedforward network dimension.
        multiple_of (`int`, *optional*, defaults to 256):
            Make feedforward network dimension multiple of this value.

        # Positional encoding
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_use_fp32_in_outer_product (`bool`, *optional*, defaults to False):
            Whether to use fp32 in RoPE outer product computation.

        # Attention configuration
        attn_impl (`str`, *optional*, defaults to "sdpa"):
            Attention implementation to use ("sdpa" or "flex_attention").
        attn_bias_type (`str`, *optional*, defaults to "causal"):
            Type of attention bias to apply.
        local_attention_window_len (`int`, *optional*):
            Window length for local attention.
        use_rope (`bool`, *optional*, defaults to True):
            Whether to use rotary position embeddings.

        # Initialization
        init_base_std (`float`, *optional*):
            Base standard deviation for weight initialization.
        init_std_factor (`str`, *optional*, defaults to "disabled"):
            Factor for adjusting initialization standard deviation.

        # Embedding dimensions
        dim_token_emb (`int`, *optional*):
            Token embedding dimension.
        dim_token (`int`, *optional*):
            Token dimension.

        # Patching configuration
        patch_in_forward (`bool`, *optional*, defaults to False):
            Whether to perform patching during forward pass.
        realtime_patching (`bool`, *optional*, defaults to True):
            Whether to use realtime patching.
        patch_size (`float`, *optional*):
            Size of patches for static patching.
        patching_mode (`str`, *optional*):
            Mode for patching ("entropy", "static", etc.).
        patching_threshold (`float`, *optional*):
            Threshold for entropy-based patching.
        patching_threshold_add (`float`, *optional*):
            Additional threshold parameter for patching.
        monotonicity (`bool`, *optional*, defaults to False):
            Whether to enforce monotonicity in patching.
        patching_batch_size (`int`, *optional*, defaults to 1):
            Batch size for patching operations.
        patching_device (`str`, *optional*, defaults to "cuda"):
            Device to use for patching operations.
        max_patch_length (`int`, *optional*):
            Maximum length of patches.
        entropy_model_checkpoint_dir (`str`, *optional*):
            Directory containing entropy model checkpoint.

        # Cross attention configurations
        cross_attn_encoder (`bool`, *optional*, defaults to False):
            Whether to use cross attention in encoder.
        cross_attn_decoder (`bool`, *optional*, defaults to False):
            Whether to use cross attention in decoder.
        cross_attn_window_encoder (`int`, *optional*):
            Cross attention window for encoder.
        cross_attn_window_decoder (`int`, *optional*):
            Cross attention window for decoder.
        cross_attn_k (`int`, *optional*):
            Number of cross attention components.
        cross_attn_nheads (`int`, *optional*):
            Number of heads for cross attention.
        cross_attn_all_layers_decoder (`bool`, *optional*, defaults to False):
            Whether to apply cross attention to all decoder layers.
        cross_attn_all_layers_encoder (`bool`, *optional*, defaults to False):
            Whether to apply cross attention to all encoder layers.
        cross_attn_use_flex_attention (`bool`, *optional*, defaults to True):
            Whether to use flexible attention for cross attention.
        cross_attn_init_by_pooling (`bool`, *optional*, defaults to False):
            Whether to initialize cross attention by pooling.

        # Encoder configurations
        use_local_encoder_transformer (`bool`, *optional*, defaults to False):
            Whether to use transformer in local encoder.
        max_encoder_seq_length (`int`, *optional*):
            Maximum sequence length for encoder.
        encoder_hash_byte_group_size (`Any`, *optional*):
            Hash byte group size for encoder.
        encoder_hash_byte_group_vocab (`int`, *optional*, defaults to 30000):
            Vocabulary size for hash byte groups.
        encoder_hash_byte_group_nb_functions (`int`, *optional*, defaults to 3):
            Number of hash functions for byte groups.
        encoder_enable_byte_ngrams (`bool`, *optional*, defaults to False):
            Whether to enable byte n-grams in encoder.
        encoder_ngram_to_size_str (`str`, *optional*):
            String defining n-gram sizes.
        downsampling_by_pooling (`str`, *optional*):
            Type of pooling for downsampling.

        # Model behavior
        share_encoder_decoder_emb (`bool`, *optional*, defaults to True):
            Whether to share encoder and decoder embeddings.
        weight_tying (`bool`, *optional*, defaults to False):
            Whether to tie input and output embeddings.

        # Performance optimization
        sequence_parallel (`bool`, *optional*, defaults to False):
            Whether to use sequence parallelism.
        loss_parallel (`bool`, *optional*, defaults to False):
            Whether to use loss parallelism.
        fuse_sequence_parallel (`bool`, *optional*, defaults to False):
            Whether to fuse sequence parallel operations.
        use_fsdp (`bool`, *optional*, defaults to True):
            Whether to use fully sharded data parallel.

        # Parameter mixing
        pm_size (`int`, *optional*, defaults to 0):
            Parameter mixing size.

        # Special tokens
        bos_token_id (`int`, *optional*, defaults to 1):
            The id of the "beginning-of-sequence" token.
        eos_token_id (`int`, *optional*, defaults to 2):
            The id of the "end-of-sequence" token.
        pad_token_id (`int`, *optional*, defaults to -1):
            The id of the padding token.

        # Patcher/Entropy model configuration
        patcher_vocab_size (`int`, *optional*, defaults to 256):
            Vocabulary size for the entropy model used in patching.
        patcher_dim (`int`, *optional*, defaults to 512):
            Hidden dimension for the entropy model.
        patcher_n_layers (`int`, *optional*, defaults to 8):
            Number of layers in the entropy model.
        patcher_n_heads (`int`, *optional*, defaults to 8):
            Number of attention heads in the entropy model.
        patcher_head_dim (`int`, *optional*):
            Dimension of each attention head in the entropy model.
        patcher_n_kv_heads (`int`, *optional*):
            Number of key-value heads in the entropy model.
        patcher_max_seqlen (`int`, *optional*, defaults to 1024):
            Maximum sequence length for the entropy model.
        patcher_norm_eps (`float`, *optional*, defaults to 1e-5):
            Layer normalization epsilon for the entropy model.
        patcher_dropout (`float`, *optional*, defaults to 0.0):
            Dropout probability for the entropy model.
        patcher_sliding_window (`int`, *optional*):
            Sliding window size for the entropy model attention.
        patcher_ffn_dim_multiplier (`float`, *optional*):
            Feedforward dimension multiplier for the entropy model.
        patcher_multiple_of (`int`, *optional*, defaults to 256):
            Make feedforward dimension multiple of this for the entropy model.
        patcher_rope_theta (`float`, *optional*, defaults to 10000.0):
            RoPE theta parameter for the entropy model.
        patcher_rope_use_fp32_in_outer_product (`bool`, *optional*, defaults to False):
            Whether to use fp32 in RoPE outer product for the entropy model.
        patcher_attn_impl (`str`, *optional*, defaults to "sdpa"):
            Attention implementation for the entropy model.
        patcher_attn_bias_type (`str`, *optional*, defaults to "causal"):
            Attention bias type for the entropy model.
        patcher_init_base_std (`float`, *optional*):
            Base initialization standard deviation for the entropy model.
        patcher_init_std_factor (`str`, *optional*, defaults to "disabled"):
            Initialization std factor for the entropy model.
        patcher_dim_token_emb (`int`, *optional*):
            Token embedding dimension for the entropy model.
        patcher_weight_tying (`bool`, *optional*, defaults to False):
            Whether to tie embeddings in the entropy model.
        patcher_bos_token_id (`int`, *optional*, defaults to 1):
            Beginning of sequence token id for the entropy model.
        patcher_eos_token_id (`int`, *optional*, defaults to 2):
            End of sequence token id for the entropy model.

    ```python
    >>> from transformers import ByteLatentTransformer, BLTConfig

    >>> # Initializing a BLT configuration
    >>> configuration = BLTConfig()

    >>> # Initializing a model from the configuration
    >>> model = ByteLatentTransformer(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "blt"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=256,
        max_seqlen=1024,
        dim=512,
        n_layers=8,
        n_heads=8,
        head_dim=None,
        n_kv_heads=None,
        dim_global=512,
        dim_local_decoder=512,
        dim_local_encoder=512,
        n_layers_global=8,
        n_layers_local_decoder=8,
        n_layers_local_encoder=8,
        n_heads_global=8,
        n_heads_local_decoder=8,
        n_heads_local_encoder=8,
        n_kv_heads_global=None,
        norm_eps=1e-5,
        dropout=0.0,
        ffn_dim_multiplier=1.0,
        multiple_of=256,
        rope_theta=10000.0,
        rope_use_fp32_in_outer_product=False,
        attn_impl="sdpa",
        attn_bias_type="causal",
        local_attention_window_len=None,
        use_rope=True,
        init_base_std=None,
        init_std_factor="disabled",
        dim_token_emb=None,
        dim_token=None,
        patch_in_forward=False,
        realtime_patching=True,
        patch_size=None,
        patching_mode=None,
        patching_threshold=None,
        patching_threshold_add=None,
        monotonicity=False,
        patching_batch_size=1,
        patching_device="cuda",
        max_patch_length=None,
        entropy_model_checkpoint_dir=None,
        cross_attn_encoder=False,
        cross_attn_decoder=False,
        cross_attn_window_encoder=None,
        cross_attn_window_decoder=None,
        cross_attn_k=None,
        cross_attn_nheads=None,
        cross_attn_all_layers_decoder=False,
        cross_attn_all_layers_encoder=False,
        cross_attn_use_flex_attention=True,
        cross_attn_init_by_pooling=False,
        use_local_encoder_transformer=False,
        max_encoder_seq_length=None,
        encoder_hash_byte_group_size=None,
        encoder_hash_byte_group_vocab=30000,
        encoder_hash_byte_group_nb_functions=3,
        encoder_enable_byte_ngrams=False,
        encoder_ngram_to_size_str=None,
        downsampling_by_pooling=None,
        share_encoder_decoder_emb=True,
        weight_tying=False,
        sequence_parallel=False,
        loss_parallel=False,
        fuse_sequence_parallel=False,
        use_fsdp=True,
        pm_size=0,
        bos_token_id=1,
        eos_token_id=2,
        pad_token_id=-1,
        patcher_vocab_size=256,
        patcher_dim=512,
        patcher_n_layers=8,
        patcher_n_heads=8,
        patcher_head_dim=None,
        patcher_n_kv_heads=None,
        patcher_max_seqlen=1024,
        patcher_norm_eps=1e-5,
        patcher_dropout=0.0,
        patcher_sliding_window=None,
        patcher_ffn_dim_multiplier=None,
        patcher_multiple_of=256,
        patcher_rope_theta=10000.0,
        patcher_rope_use_fp32_in_outer_product=False,
        patcher_attn_impl="sdpa",
        patcher_attn_bias_type="causal",
        patcher_init_base_std=None,
        patcher_init_std_factor="disabled",
        patcher_dim_token_emb=None,
        patcher_weight_tying=False,
        patcher_bos_token_id=1,
        patcher_eos_token_id=2,
        **kwargs,
    ):
        self.sliding_window = None

        # Core dimensions
        self.vocab_size = vocab_size
        self.max_seqlen = max_seqlen
        self.dim = dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.head_dim = head_dim
        self.n_kv_heads = n_kv_heads

        # Component-specific dimensions
        self.dim_global = dim_global
        self.dim_local_decoder = dim_local_decoder
        self.dim_local_encoder = dim_local_encoder
        self.n_layers_global = n_layers_global
        self.n_layers_local_decoder = n_layers_local_decoder
        self.n_layers_local_encoder = n_layers_local_encoder
        self.n_heads_global = n_heads_global
        self.n_heads_local_decoder = n_heads_local_decoder
        self.n_heads_local_encoder = n_heads_local_encoder
        self.n_kv_heads_global = n_kv_heads_global

        # Transformer configuration
        self.norm_eps = norm_eps
        self.dropout = dropout
        self.ffn_dim_multiplier = ffn_dim_multiplier
        self.multiple_of = multiple_of

        # Positional encoding
        self.rope_theta = rope_theta
        self.rope_use_fp32_in_outer_product = rope_use_fp32_in_outer_product

        # Attention configuration
        self.attn_impl = attn_impl
        self.attn_bias_type = attn_bias_type
        self.local_attention_window_len = local_attention_window_len
        self.use_rope = use_rope

        # Initialization
        self.init_base_std = init_base_std
        self.init_std_factor = InitStdFactor(init_std_factor)

        # Embedding dimensions
        self.dim_token_emb = dim_token_emb
        self.dim_token = dim_token

        # Patching configuration
        self.patch_in_forward = patch_in_forward
        self.realtime_patching = realtime_patching
        self.patch_size = patch_size
        self.patching_mode = patching_mode
        self.patching_threshold = patching_threshold
        self.patching_threshold_add = patching_threshold_add
        self.monotonicity = monotonicity
        self.patching_batch_size = patching_batch_size
        self.patching_device = patching_device
        self.max_patch_length = max_patch_length
        self.entropy_model_checkpoint_dir = entropy_model_checkpoint_dir

        # Cross attention configurations
        self.cross_attn_encoder = cross_attn_encoder
        self.cross_attn_decoder = cross_attn_decoder
        self.cross_attn_window_encoder = cross_attn_window_encoder
        self.cross_attn_window_decoder = cross_attn_window_decoder
        self.cross_attn_k = cross_attn_k
        self.cross_attn_nheads = cross_attn_nheads
        self.cross_attn_all_layers_decoder = cross_attn_all_layers_decoder
        self.cross_attn_all_layers_encoder = cross_attn_all_layers_encoder
        self.cross_attn_use_flex_attention = cross_attn_use_flex_attention
        self.cross_attn_init_by_pooling = cross_attn_init_by_pooling

        # Encoder configurations
        self.use_local_encoder_transformer = use_local_encoder_transformer
        self.max_encoder_seq_length = max_encoder_seq_length
        self.encoder_hash_byte_group_size = encoder_hash_byte_group_size
        self.encoder_hash_byte_group_vocab = encoder_hash_byte_group_vocab
        self.encoder_hash_byte_group_nb_functions = encoder_hash_byte_group_nb_functions
        self.encoder_enable_byte_ngrams = encoder_enable_byte_ngrams
        self.encoder_ngram_to_size_str = encoder_ngram_to_size_str
        self.downsampling_by_pooling = downsampling_by_pooling

        # Model behavior
        self.share_encoder_decoder_emb = share_encoder_decoder_emb
        self.weight_tying = weight_tying

        # Performance optimization
        self.sequence_parallel = sequence_parallel
        self.loss_parallel = loss_parallel
        self.fuse_sequence_parallel = fuse_sequence_parallel
        self.use_fsdp = use_fsdp

        # Parameter mixing
        self.pm_size = pm_size

        # Patcher/Entropy model configuration
        self.patcher_vocab_size = patcher_vocab_size
        self.patcher_dim = patcher_dim
        self.patcher_n_layers = patcher_n_layers
        self.patcher_n_heads = patcher_n_heads
        self.patcher_head_dim = patcher_head_dim
        self.patcher_n_kv_heads = patcher_n_kv_heads
        self.patcher_max_seqlen = patcher_max_seqlen
        self.patcher_norm_eps = patcher_norm_eps
        self.patcher_dropout = patcher_dropout
        self.patcher_sliding_window = patcher_sliding_window
        self.patcher_ffn_dim_multiplier = patcher_ffn_dim_multiplier
        self.patcher_multiple_of = patcher_multiple_of
        self.patcher_rope_theta = patcher_rope_theta
        self.patcher_rope_use_fp32_in_outer_product = patcher_rope_use_fp32_in_outer_product
        self.patcher_attn_impl = patcher_attn_impl
        self.patcher_attn_bias_type = patcher_attn_bias_type
        self.patcher_init_base_std = patcher_init_base_std
        self.patcher_init_std_factor = InitStdFactor(patcher_init_std_factor)
        self.patcher_dim_token_emb = patcher_dim_token_emb
        self.patcher_weight_tying = patcher_weight_tying
        self.patcher_bos_token_id = patcher_bos_token_id
        self.patcher_eos_token_id = patcher_eos_token_id

        # Parse encoder_hash_byte_group_size when it is given as a comma-separated string
        if self.encoder_hash_byte_group_size and type(self.encoder_hash_byte_group_size) == str:
            self.encoder_hash_byte_group_size = [
                int(x) for x in self.encoder_hash_byte_group_size.split(",") if len(x) > 0
            ]

        self.rope_scaling = {"type": "dynamic", "factor": 2.0, "rope_type": "dynamic"}

        # Aliases expected by the rest of the library
        self.num_key_value_heads = n_kv_heads
        self.max_position_embeddings = max_seqlen
        self.hidden_size = dim
        self.num_attention_heads = n_heads

        super().__init__(
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            pad_token_id=pad_token_id,
            **kwargs,
        )

    @property
    def encoder_dim_token_emb(self):
        """Compute encoder token embedding dimension."""
        if self.dim_token_emb is not None:
            return self.dim_token_emb
        if self.use_local_encoder_transformer:
            return self.dim_local_encoder
        patch_size = self.patch_size if self.patch_size is not None else 1
        return self.dim_global // patch_size

    @property
    def encoder_dim_patch_emb(self):
        """Compute encoder patch embedding dimension."""
        if self.cross_attn_encoder:
            return self.dim_local_encoder if self.cross_attn_init_by_pooling else self.dim_global
        return None

    @property
    def global_dim_patch_emb(self):
        """Compute global patch embedding dimension."""
        dim_token_emb = self.encoder_dim_token_emb
        if self.cross_attn_encoder:
            cross_attn_k = self.cross_attn_k if self.cross_attn_k is not None else 1
            return dim_token_emb * cross_attn_k
        if self.downsampling_by_pooling is None or len(self.downsampling_by_pooling) == 0:
            patch_size = self.patch_size if self.patch_size is not None else 1
            return dim_token_emb * patch_size
        return dim_token_emb * sum([pooling in self.downsampling_by_pooling for pooling in ["avg", "min", "max"]])

    @property
    def decoder_dim_token_emb(self):
        """Compute decoder token embedding dimension."""
        if self.share_encoder_decoder_emb:
            return self.encoder_dim_token_emb
        if self.dim_token_emb is not None:
            return self.dim_token_emb
        return self.dim_local_decoder

    def get_init_std_factor(self, depth: int) -> float:
        """
        Calculate the initialization standard deviation scaling factor for a given layer depth.

        Args:
            depth: Current layer depth (0-indexed)

        Returns:
            Scaling factor to divide the base initialization std by
        """
        if self.init_std_factor == InitStdFactor.CURRENT_DEPTH:
            return (2 * (depth + 1)) ** 0.5
        return 1.0


__all__ = ["BLTConfig"]
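

# Usage sketch (illustrative only, not part of the library API): exercises the reconstructed
# behaviour above -- comma-separated hash-group parsing in __init__, the derived embedding
# dimension properties, and the per-depth init-std factor. The concrete values below are
# assumptions chosen for the example, not recommended settings.
if __name__ == "__main__":
    config = BLTConfig(
        dim_global=768,
        dim_local_encoder=512,
        patch_size=4,
        cross_attn_encoder=True,
        cross_attn_k=2,
        cross_attn_init_by_pooling=True,
        encoder_hash_byte_group_size="3,4,5",
        init_std_factor="current_depth",
    )
    print(config.encoder_hash_byte_group_size)  # [3, 4, 5] -- parsed from the string form
    print(config.encoder_dim_token_emb)         # 192 -> dim_global (768) // patch_size (4)
    print(config.encoder_dim_patch_emb)         # 512 -> dim_local_encoder (pooling-initialized cross attention)
    print(config.global_dim_patch_emb)          # 384 -> encoder_dim_token_emb (192) * cross_attn_k (2)
    print(config.get_init_std_factor(depth=3))  # sqrt(2 * 4) ~= 2.83 with init_std_factor="current_depth"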