"""Flax GPT-Neo model with a multiple-choice classification head."""

from typing import Optional, Tuple

import jax
import jax.numpy as jnp
import flax.linen as nn
from flax.core.frozen_dict import FrozenDict, unfreeze

from transformers import GPT2Tokenizer, GPTNeoConfig
from transformers.file_utils import add_start_docstrings_to_model_forward
from transformers.modeling_flax_utils import FlaxPreTrainedModel
from transformers.models.gpt_neo.modeling_flax_gpt_neo import FlaxGPTNeoModule

tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B", pad_token="<|endoftext|>")

# Number of answer choices per example. The exact value was not recoverable
# from the source; 4 (the Cosmos QA choice count) is an assumption.
num_choice = 4

GPT_NEO_START_DOCSTRING = r"""
    This model inherits from :class:`~transformers.FlaxPreTrainedModel`. Check the superclass documentation for the
    generic methods the library implements for all its model (such as downloading or saving, resizing the input
    embeddings, pruning heads etc.)

    This model is also a Flax Linen `flax.nn.Module
    <https://flax.readthedocs.io/en/latest/_autosummary/flax.nn.module.html>`__ subclass. Use it as a regular Flax
    Module and refer to the Flax documentation for all matter related to general usage and behavior.

    Finally, this model supports inherent JAX features such as:

    - `Just-In-Time (JIT) compilation <https://jax.readthedocs.io/en/latest/jax.html#just-in-time-compilation-jit>`__
    - `Automatic Differentiation <https://jax.readthedocs.io/en/latest/jax.html#automatic-differentiation>`__
    - `Vectorization <https://jax.readthedocs.io/en/latest/jax.html#vectorization-vmap>`__
    - `Parallelization <https://jax.readthedocs.io/en/latest/jax.html#parallelization-pmap>`__

    Parameters:
        config (:class:`~transformers.GPTNeoConfig`): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the :meth:`~transformers.FlaxPreTrainedModel.from_pretrained` method to load the
            model weights.
"""

GPT_NEO_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (:obj:`numpy.ndarray` of shape :obj:`(batch_size, input_ids_length)`):
            :obj:`input_ids_length` = ``sequence_length``. Indices of input sequence tokens in the vocabulary.

            Indices can be obtained using :class:`~transformers.GPTNeoTokenizer`. See
            :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
            details.

            `What are input IDs? <../glossary.html#input-ids>`__
        attention_mask (:obj:`numpy.ndarray` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            `What are attention masks? <../glossary.html#attention-mask>`__
        position_ids (:obj:`numpy.ndarray` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
            config.max_position_embeddings - 1]``.
        past_key_values (:obj:`Dict[str, np.ndarray]`, `optional`, returned by ``init_cache`` or when passing previous ``past_key_values``):
            Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for
            fast auto-regressive decoding. Pre-computed key and value hidden-states are of shape `[batch_size,
            max_length]`.
        output_attentions (:obj:`bool`, `optional`):
            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
            tensors for more detail.
        output_hidden_states (:obj:`bool`, `optional`):
            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
            more detail.
        return_dict (:obj:`bool`, `optional`):
            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
"""


class FlaxGPTNeoPreTrainedModel(FlaxPreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = GPTNeoConfig
    base_model_prefix = "transformer"
    module_class: nn.Module = None

    def __init__(
        self,
        config: GPTNeoConfig,
        input_shape: Tuple = (1, 1),
        seed: int = 0,
        dtype: jnp.dtype = jnp.float32,
        **kwargs,
    ):
        module = self.module_class(config=config, dtype=dtype, **kwargs)
        super().__init__(config, module, input_shape=input_shape, seed=seed, dtype=dtype)

    def init_weights(self, rng: jax.random.PRNGKey, input_shape: Tuple) -> FrozenDict:
        # Dummy inputs, used only to trace the module and initialize its parameters.
        input_ids = jnp.zeros(input_shape, dtype="i4")
        attention_mask = jnp.ones_like(input_ids)
        position_ids = jnp.broadcast_to(jnp.arange(jnp.atleast_2d(input_ids).shape[-1]), input_shape)
        params_rng, dropout_rng = jax.random.split(rng)
        rngs = {"params": params_rng, "dropout": dropout_rng}

        return self.module.init(rngs, input_ids, attention_mask, position_ids, return_dict=False)["params"]

    def init_cache(self, batch_size, max_length):
        r"""
        Args:
            batch_size (:obj:`int`):
                batch_size used for fast auto-regressive decoding. Defines the batch size of the initialized cache.
            max_length (:obj:`int`):
                maximum possible length for auto-regressive decoding. Defines the sequence length of the initialized
                cache.
        """
        # Dummy inputs, used only to retrieve the shape of the cache variables.
        input_ids = jnp.ones((batch_size, max_length))
        attention_mask = jnp.ones_like(input_ids)
        position_ids = jnp.broadcast_to(jnp.arange(jnp.atleast_2d(input_ids).shape[-1]), input_ids.shape)

        init_variables = self.module.init(
            jax.random.PRNGKey(0), input_ids, attention_mask, position_ids, return_dict=False, init_cache=True
        )
        return init_variables["cache"]

    @add_start_docstrings_to_model_forward(GPT_NEO_INPUTS_DOCSTRING)
    def __call__(
        self,
        input_ids,
        attention_mask=None,
        position_ids=None,
        params: dict = None,
        past_key_values: dict = None,
        dropout_rng: jax.random.PRNGKey = None,
        train: bool = False,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.return_dict

        if position_ids is None:
            if past_key_values is not None:
                raise ValueError("Make sure to provide `position_ids` when passing `past_key_values`.")
            position_ids = jnp.broadcast_to(jnp.arange(jnp.atleast_2d(input_ids).shape[-1]), input_ids.shape)

        if attention_mask is None:
            attention_mask = jnp.ones_like(input_ids)

        # Handle any PRNG if needed.
        rngs = {}
        if dropout_rng is not None:
            rngs["dropout"] = dropout_rng

        inputs = {"params": params or self.params}

        # If past_key_values are passed, the cache is already initialized; it has to be marked as mutable so that
        # the attention modules can update it during the forward pass.
        if past_key_values:
            inputs["cache"] = past_key_values
            mutable = ["cache"]
        else:
            mutable = False

        outputs = self.module.apply(
            inputs,
            jnp.array(input_ids, dtype="i4"),
            jnp.array(attention_mask, dtype="i4"),
            jnp.array(position_ids, dtype="i4"),
            not train,
            False,
            output_attentions,
            output_hidden_states,
            return_dict,
            rngs=rngs,
            mutable=mutable,
        )

        # Add the updated cache to the model output.
        if past_key_values is not None and return_dict:
            outputs, past_key_values = outputs
            outputs["past_key_values"] = unfreeze(past_key_values["cache"])
            return outputs
        elif past_key_values is not None and not return_dict:
            outputs, past_key_values = outputs
            outputs = outputs[:1] + (unfreeze(past_key_values["cache"]),) + outputs[1:]

        return outputs


class FlaxGPTNeoForMultipleChoiceModule(nn.Module):
    config: GPTNeoConfig
    dtype: jnp.dtype = jnp.float32

    def setup(self):
        self.transformer = FlaxGPTNeoModule(config=self.config, dtype=self.dtype)
        # The dropout rate was not recoverable from the source; 0.1 is an assumption.
        self.dropout = nn.Dropout(rate=0.1)
        self.classifier = nn.Dense(num_choice, dtype=self.dtype)

    def __call__(
        self,
        input_ids,
        attention_mask,
        position_ids,
        deterministic: bool = True,
        init_cache: bool = False,
        output_attentions: bool = False,
        output_hidden_states: bool = False,
        return_dict: bool = True,
    ):
        # The parameter order above mirrors FlaxGPTNeoModule.__call__, so the positional arguments passed down by
        # FlaxGPTNeoPreTrainedModel.__call__ land on the parameters they are meant for.
        batch_size = input_ids.shape[0]

        # The dropout rng is drawn from a fixed seed, as in the original source.
        rng = jax.random.PRNGKey(0)
        _, dropout_rng = jax.random.split(rng)

        # Flatten the choice dimension: (batch_size, num_choice, seq_len) -> (batch_size * num_choice, seq_len).
        input_ids = input_ids.reshape(-1, input_ids.shape[-1])
        attention_mask = attention_mask.reshape(-1, attention_mask.shape[-1])
        position_ids = position_ids.reshape(-1, position_ids.shape[-1])

        outputs = self.transformer(
            input_ids, attention_mask, position_ids, deterministic=deterministic, return_dict=return_dict
        )

        # Mean-pool over the sequence dimension, then regroup the choices per example, so the classifier sees one
        # (num_choice * hidden_size)-wide feature vector per example.
        hidden_states = outputs[0]
        hidden_states = jnp.mean(hidden_states, axis=1)
        hidden_states = hidden_states.reshape(batch_size, -1)

        dropout_output = self.dropout(hidden_states, deterministic=deterministic, rng=dropout_rng)
        logits = self.classifier(dropout_output)
        reshaped_logits = logits.reshape(-1, num_choice)

        if not return_dict:
            return (reshaped_logits,) + outputs[1:]
        return reshaped_logits


class FlaxGPTNeoForMultipleChoice(FlaxGPTNeoPreTrainedModel):
    module_class = FlaxGPTNeoForMultipleChoiceModule
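
# ---------------------------------------------------------------------------
# Minimal smoke test: a sketch, not part of the original module. It builds a
# tiny random-weight GPT-Neo (so nothing beyond the tokenizer is downloaded);
# the config sizes, sequence length, and the context/question/choices are all
# illustrative assumptions.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    config = GPTNeoConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=64,
        num_layers=2,
        num_heads=4,
        attention_types=[[["global", "local"], 1]],
        max_position_embeddings=128,
    )
    # input_shape fixes the classifier's input width to num_choice * hidden_size,
    # so every batch must carry exactly num_choice choices per example.
    model = FlaxGPTNeoForMultipleChoice(config, input_shape=(1, num_choice, 32))

    context = "The sky was clear, so we packed our boots and left early."
    question = "What did we most likely do next?"
    choices = ["Went hiking.", "Watched television.", "Went back to sleep.", "Stayed indoors."]

    encoded = tokenizer(
        [f"{context} {question} {choice}" for choice in choices],
        padding="max_length",
        max_length=32,
        truncation=True,
        return_tensors="np",
    )
    input_ids = encoded["input_ids"][None, ...]            # (1, num_choice, seq_len)
    attention_mask = encoded["attention_mask"][None, ...]  # (1, num_choice, seq_len)

    logits = model(input_ids, attention_mask=attention_mask)  # (1, num_choice)
    print("choice logits:", logits)
    print("predicted choice:", int(jnp.argmax(logits, axis=-1)[0]))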