# coding=utf-8
""" PyTorch LXMERT model. """


import math
import os
import warnings
from dataclasses import dataclass
from typing import Optional, Tuple

import torch
from torch import nn
from torch.nn import CrossEntropyLoss, SmoothL1Loss

from transformers.activations import ACT2FN, gelu
from transformers.file_utils import (
    ModelOutput,
    add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    replace_return_docstrings,
)
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import logging
from transformers.configuration_lxmert import LxmertConfig


logger = logging.get_logger(__name__)

_CONFIG_FOR_DOC = "LxmertConfig"
_TOKENIZER_FOR_DOC = "LxmertTokenizer"

LXMERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
    "unc-nlp/lxmert-base-uncased",
]


class GeLU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return gelu(x)


@dataclass
class LxmertModelOutput(ModelOutput):
    """
    Lxmert's outputs that contain the last hidden states, pooled outputs, and attention probabilities for the
    language, visual, and cross-modality encoders. (note: the visual encoder in Lxmert is referred to as the
    "relationship" encoder)

    Args:
        language_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
            Sequence of hidden states at the output of the last layer of the language encoder.
        vision_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
            Sequence of hidden states at the output of the last layer of the visual encoder.
        pooled_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`):
            Last-layer hidden state of the first token of the sequence (classification, CLS, token), further
            processed by a Linear layer and a Tanh activation function.
        language_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        vision_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        language_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        vision_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        cross_encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
    """

    language_output: Optional[torch.FloatTensor] = None
    vision_output: Optional[torch.FloatTensor] = None
    pooled_output: Optional[torch.FloatTensor] = None
    language_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    vision_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    language_attentions: Optional[Tuple[torch.FloatTensor]] = None
    vision_attentions: Optional[Tuple[torch.FloatTensor]] = None
    cross_encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None


@dataclass
class LxmertForQuestionAnsweringOutput(ModelOutput):
    """
    Output type of :class:`~transformers.LxmertForQuestionAnswering`.

    Args:
        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
            Total loss as the sum of the masked language modeling loss and the next sequence prediction
            (classification) loss.
        question_answering_score (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, n_qa_answers)`, `optional`):
            Prediction scores of the question answering objective (classification).
        language_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        vision_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        language_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        vision_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        cross_encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
    """

    loss: Optional[torch.FloatTensor] = None
    question_answering_score: Optional[torch.FloatTensor] = None
    language_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    vision_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    language_attentions: Optional[Tuple[torch.FloatTensor]] = None
    vision_attentions: Optional[Tuple[torch.FloatTensor]] = None
    cross_encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None


@dataclass
class LxmertForPreTrainingOutput(ModelOutput):
    """
    Output type of :class:`~transformers.LxmertForPreTraining`.
    Args:
        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
            Total loss as the sum of the masked language modeling loss and the next sequence prediction
            (classification) loss.
        prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        cross_relationship_score (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
            Prediction scores of the textual matching objective (classification) head (scores of True/False
            continuation before SoftMax).
        question_answering_score (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, n_qa_answers)`):
            Prediction scores of the question answering objective (classification).
        language_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        vision_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        language_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        vision_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        cross_encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
    """

    loss: Optional[torch.FloatTensor] = None
    prediction_logits: Optional[torch.FloatTensor] = None
    cross_relationship_score: Optional[torch.FloatTensor] = None
    question_answering_score: Optional[torch.FloatTensor] = None
    language_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    vision_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    language_attentions: Optional[Tuple[torch.FloatTensor]] = None
    vision_attentions: Optional[Tuple[torch.FloatTensor]] = None
    cross_encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None


def load_tf_weights_in_lxmert(model, config, tf_checkpoint_path):
    """Load tf checkpoints in a pytorch model."""
    try:
        import re

        import numpy as np
        import tensorflow as tf
    except ImportError:
        logger.error(
            "Loading a TensorFlow model in PyTorch requires TensorFlow to be installed. "
            "Please see https://www.tensorflow.org/install/ for installation instructions."
        )
        raise
    tf_path = os.path.abspath(tf_checkpoint_path)
    logger.info("Converting TensorFlow checkpoint from {}".format(tf_path))
    # Load weights from the TF checkpoint
    init_vars = tf.train.list_variables(tf_path)
    names = []
    arrays = []
    for name, shape in init_vars:
        logger.info("Loading TF weight {} with shape {}".format(name, shape))
        array = tf.train.load_variable(tf_path, name)
        names.append(name)
        arrays.append(array)

    for name, array in zip(names, arrays):
        name = name.split("/")
        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to compute m and v,
        # which are not required when using the pretrained model.
        if any(
            n in ["adam_v", "adam_m", "AdamWeightDecayOptimizer", "AdamWeightDecayOptimizer_1", "global_step"]
            for n in name
        ):
            logger.info("Skipping {}".format("/".join(name)))
            continue
        pointer = model
        for m_name in name:
            if re.fullmatch(r"[A-Za-z]+_\d+", m_name):
                scope_names = re.split(r"_(\d+)", m_name)
            else:
                scope_names = [m_name]
            if scope_names[0] == "kernel" or scope_names[0] == "gamma":
                pointer = getattr(pointer, "weight")
            elif scope_names[0] == "output_bias" or scope_names[0] == "beta":
                pointer = getattr(pointer, "bias")
            elif scope_names[0] == "output_weights":
                pointer = getattr(pointer, "weight")
            elif scope_names[0] == "squad":
                pointer = getattr(pointer, "classifier")
            else:
                try:
                    pointer = getattr(pointer, scope_names[0])
                except AttributeError:
                    logger.info("Skipping {}".format("/".join(name)))
                    continue
            if len(scope_names) >= 2:
                num = int(scope_names[1])
                pointer = pointer[num]
        if m_name[-11:] == "_embeddings":
            pointer = getattr(pointer, "weight")
        elif m_name == "kernel":
            array = np.transpose(array)
        try:
            assert pointer.shape == array.shape
        except AssertionError as e:
            e.args += (pointer.shape, array.shape)
            raise
        logger.info("Initialize PyTorch weight {}".format(name))
        pointer.data = torch.from_numpy(array)
    return model
class LxmertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size, padding_idx=0)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size, padding_idx=0)

        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):
        if input_ids is not None:
            input_shape = input_ids.size()
            device = input_ids.device
        else:
            input_shape = inputs_embeds.size()[:-1]
            device = inputs_embeds.device
        seq_length = input_shape[1]

        position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
        position_ids = position_ids.unsqueeze(0).expand(input_shape)

        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)

        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings


class LxmertAttention(nn.Module):
    def __init__(self, config, ctx_dim=None, save_cams=False):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention heads (%d)"
                % (config.hidden_size, config.num_attention_heads)
            )
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.head_size = self.num_attention_heads * self.attention_head_size

        # The visual (context) stream may have a different dimensionality than the text stream.
        if ctx_dim is None:
            ctx_dim = config.hidden_size
        self.query = nn.Linear(config.hidden_size, self.head_size)
        self.key = nn.Linear(ctx_dim, self.head_size)
        self.value = nn.Linear(ctx_dim, self.head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.save_cams = save_cams

        # Buffers used to expose attention maps and their gradients (e.g. for Grad-CAM).
        self.attn = None
        self.attn_gradients = None

    def get_attn(self):
        ret = self.attn
        self.attn = None
        return ret

    def save_attn(self, attn):
        # Accumulate the maps if several forward passes happen before they are read out.
        if self.attn is not None:
            self.attn = self.attn + [attn]
        else:
            self.attn = attn

    def save_attn_gradients(self, attn_gradients):
        if self.attn_gradients is not None:
            self.attn_gradients = self.attn_gradients + [attn_gradients]
        else:
            self.attn_gradients = attn_gradients

    def get_attn_gradients(self):
        ret = self.attn_gradients
        self.attn_gradients = None
        return ret

    def reset(self):
        self.attn = None
        self.attn_gradients = None

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(self, hidden_states, context, attention_mask=None, output_attentions=False):
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(context)
        mixed_value_layer = self.value(context)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Dot product between "query" and "key" gives the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        if self.save_cams:
            # Keep the attention map and register a hook to capture its gradient
            # during the backward pass (used for Grad-CAM style explanations).
            self.save_attn(attention_probs)
            attention_probs.register_hook(self.save_attn_gradients)

        # This effectively drops entire tokens to attend to, as in the original Transformer.
        attention_probs = self.dropout(attention_probs)

        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        return outputs


class LxmertAttentionOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states


class LxmertCrossAttentionLayer(nn.Module):
    def __init__(self, config, save_cams=False):
        super().__init__()
        self.att = LxmertAttention(config, save_cams=save_cams)
        self.output = LxmertAttentionOutput(config)

    def forward(self, input_tensor, ctx_tensor, ctx_att_mask=None, output_attentions=False):
        output = self.att(input_tensor, ctx_tensor, ctx_att_mask, output_attentions=output_attentions)
        if output_attentions:
            attention_probs = output[1]
        attention_output = self.output(output[0], input_tensor)
        outputs = (attention_output, attention_probs) if output_attentions else (attention_output,)
        return outputs


class LxmertSelfAttentionLayer(nn.Module):
    def __init__(self, config, save_cams=False):
        super().__init__()
        self.self = LxmertAttention(config, save_cams=save_cams)
        self.output = LxmertAttentionOutput(config)

    def forward(self, input_tensor, attention_mask, output_attentions=False):
        # Self attention attends to itself, so keys and queries both come from input_tensor.
        output = self.self(input_tensor, input_tensor, attention_mask, output_attentions=output_attentions)
        if output_attentions:
            attention_probs = output[1]
        attention_output = self.output(output[0], input_tensor)
        outputs = (attention_output, attention_probs) if output_attentions else (attention_output,)
        return outputs
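# `LxmertAttention` above is the single attention primitive reused by the language,
# visual, and cross-modality encoders: queries come from `hidden_states`, keys and
# values from `context`. The smoke test below is an illustrative addition, not part
# of the original module (`_demo_attention_shapes` is a hypothetical helper), and
# assumes the default `LxmertConfig` (hidden_size=768, num_attention_heads=12).
def _demo_attention_shapes():
    config = LxmertConfig()
    attention = LxmertAttention(config)
    lang = torch.randn(2, 20, config.hidden_size)  # (batch, seq_len, hidden)
    visn = torch.randn(2, 36, config.hidden_size)  # (batch, num_objects, hidden)
    # Cross attention: the output keeps the query sequence length, (2, 20, 768).
    (context_layer,) = attention(lang, visn, attention_mask=None)
    assert context_layer.shape == (2, 20, config.hidden_size)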
rrrrrsrcs$eZdZfddZddZZS)LxmertIntermediatecs,tt|j|j|_t|j|_ dSr) rrrrrintermediate_sizerr hidden_actintermediate_act_fnrrrrrs zLxmertIntermediate.__init__cCs||}||}|Sr)rrrrrrrrs  zLxmertIntermediate.forwardr rrrrrs rcs$eZdZfddZddZZS) LxmertOutputcs@tt|j|j|_tj|jdd|_t|j |_ dSr) rrrrrrrrrrrrrrrrs zLxmertOutput.__init__cCs&||}||}|||}|Srrrrrrrs  zLxmertOutput.forwardr rrrrrs rcs(eZdZdfdd ZdddZZS) LxmertLayerFcs0tt||d|_t||_t||_dSr)rrr attentionr intermediaterrrrrrrs  zLxmertLayer.__init__NcCsD|j|||d}|d}||}|||}|f|dd}|S)NrrrK)rrr)rrrrrrintermediate_output layer_outputrrrrs   zLxmertLayer.forward)F)NFr rrrrrsrcsBeZdZd fdd Zd ddZddZdd Zdd d ZZS) LxmertXLayerFcsXtt||d|_t||_t||_t||_t ||_ t||_ t ||_ dSr) rrrvisual_attentionr lang_self_att visn_self_attr lang_interr lang_output visn_inter visn_outputrrrrrs      zLxmertXLayer.__init__cCs,|j||||d}|j|||dd}||fS)N)rrF)r)r lang_inputlang_attention_mask visual_inputvisual_attention_maskoutput_x_attentionslang_att_outputvisual_att_outputrrr cross_atts zLxmertXLayer.cross_attcCs0|j||dd}|j||dd}|d|dfS)NFrr)rr)rrrrrrrrrrself_attszLxmertXLayer.self_attcCs4||}||}|||}|||}||fSr)rrrr)rrrlang_inter_outputvisual_inter_outputr visual_outputrrr output_fc s     zLxmertXLayer.output_fcc Csj|j|||||d\}}|dd}||d||d|\}}|||\} } |rb| | |dfS| | fS)N)rrrrrrKr)rrr) r lang_featsr visual_featsrrrrrrrrrrr+s.   zLxmertXLayer.forward)F)F)F) r!r"r#rrrrrr$rrrrrs rcs$eZdZfddZddZZS)LxmertVisualFeatureEncodercslt|j}|j}t||j|_tj|jdd|_ t||j|_ tj|jdd|_ t |j |_dSr)rrvisual_feat_dimvisual_pos_dimrrrvisn_fcrvisn_layer_normbox_fcbox_layer_normrrr)rrlfeat_dimpos_dimrrrrPs z#LxmertVisualFeatureEncoder.__init__cCsB||}||}||}||}||d}||}|SNrJ)r r r r r)rr visual_posryrrrrr_s      z"LxmertVisualFeatureEncoder.forwardr rrrrrOs rcs(eZdZdfdd ZdddZZS) LxmertEncoderFcstt|_|_j|_j|_j |_ t fddt |jD|_t fddt |jD|_t fddt |j D|_ dS)Ncsg|]}tdqSrrr>_rlrrr ysz*LxmertEncoder.__init__..csg|] }tqSr)rr)rlrrrzscsg|]}tdqSrrrrrrr{s)rrrr rll_layers num_l_layersx_layers num_x_layersr_layers num_r_layersr ModuleListrangelayerrrrrrks  " zLxmertEncoder.__init__NcCsdd}d}|s|jjrdnd} |s(|jjr,dnd} |s<|jjr@dnd} |||}|jD]:} | |||d} | d}||f}| dk rV| | df} qV|jD]:} | |||d}|d}||f}| dk r| |df} q|jD]P} | |||||d}|dd\}}||f}||f}| dk r| |df} q||r8| ndf}||rJ| ndf}|||r^| ndfS)NrrrrKrJ)rlrr r!rr)rrrrrrrr*r)r,r+r- layer_module l_outputs v_outputs x_outputsvisual_encoder_outputslang_encoder_outputsrrrr}sR            zLxmertEncoder.forward)F)NNr rrrrrjsrcs$eZdZfddZddZZS) LxmertPoolercs.tt|t|j|j|_t|_dSr) rr(rrrrrTanh activationrrrrrszLxmertPooler.__init__cCs(|dddf}||}||}|S)Nr)rr*)rrfirst_token_tensorr(rrrrs  zLxmertPooler.forwardr rrrrr(s r(cs$eZdZfddZddZZS)LxmertPredictionHeadTransformcsBtt|t|j|j|_t|j|_ tj |jdd|_ dSr) rr,rrrrrrrtransform_act_fnrrrrrrs z&LxmertPredictionHeadTransform.__init__cCs"||}||}||}|Sr)rr-rrrrrrs   z%LxmertPredictionHeadTransform.forwardr rrrrr,s r,cs$eZdZfddZddZZS)LxmertLMPredictionHeadcsZtt|t||_tj|d|ddd|_||j_ t t |d|_ dS)NrKrFrF)rr.rr, transformrrrdecoderrC Parameterr/rrFrrllxmert_model_embedding_weightsrrrrs zLxmertLMPredictionHead.__init__cCs||}|||j}|Sr)r0r1rFrrrrrs zLxmertLMPredictionHead.forwardr rrrrr.s r.cs$eZdZfddZddZZS)LxmertVisualAnswerHeadc sNt|j}tt||dttj|dddt|d||_dS)NrJr~r) rrrr Sequentialrrrlogit_fc)rrl num_labelshid_dimrrrrs zLxmertVisualAnswerHead.__init__cCs ||Sr)r7rrrrrszLxmertVisualAnswerHead.forwardr rrrrr5s r5cs$eZdZfddZddZZS)LxmertVisualObjHeadcstt_i}jr.djd|d<jrDdjd|d<jr`djfjd|d<|_ t fddj D_ dS) Nr)rfryobjattrrfeatcs&i|]}|tjj|dqS)ry)rrr visual_losses)r>rrlrrr sz0LxmertVisualObjHead.__init__..) 
class LxmertVisualObjHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = LxmertPredictionHeadTransform(config)

        # Decide which visual losses are in use
        visual_losses = {}
        if config.visual_obj_loss:
            visual_losses["obj"] = {"shape": (-1,), "num": config.num_object_labels}
        if config.visual_attr_loss:
            visual_losses["attr"] = {"shape": (-1,), "num": config.num_attr_labels}
        if config.visual_feat_loss:
            visual_losses["feat"] = {"shape": (-1, config.visual_feat_dim), "num": config.visual_feat_dim}
        self.visual_losses = visual_losses

        # One decoder per visual loss
        self.decoder_dict = nn.ModuleDict(
            {key: nn.Linear(config.hidden_size, self.visual_losses[key]["num"]) for key in self.visual_losses}
        )

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        output = {}
        for key in self.visual_losses:
            output[key] = self.decoder_dict[key](hidden_states)
        return output


class LxmertPreTrainingHeads(nn.Module):
    def __init__(self, config, lxmert_model_embedding_weights):
        super(LxmertPreTrainingHeads, self).__init__()
        self.predictions = LxmertLMPredictionHead(config, lxmert_model_embedding_weights)
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.predictions(sequence_output)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_score


class LxmertPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = LxmertConfig
    load_tf_weights = load_tf_weights_in_lxmert
    base_model_prefix = "lxmert"

    def _init_weights(self, module):
        """Initialize the weights"""
        if isinstance(module, (nn.Linear, nn.Embedding)):
            # Slightly different from the TF version, which uses truncated_normal for initialization.
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()


LXMERT_START_DOCSTRING = r"""

    The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers
    <https://arxiv.org/abs/1908.07490>`__ by Hao Tan and Mohit Bansal. It's a vision and language transformer model,
    pretrained on a variety of multi-modal datasets comprising GQA, VQAv2.0, MSCOCO captions, and Visual Genome, using
    a combination of masked language modeling, region-of-interest feature regression, cross-entropy loss for question
    answering attribute prediction, and object tag prediction.

    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the
    generic methods the library implements for all its models (such as downloading or saving, resizing the input
    embeddings, pruning heads etc.)

    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
    general usage and behavior.

    Parameters:
        config (:class:`~transformers.LxmertConfig`): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the
            model weights.
"""
LXMERT_INPUTS_DOCSTRING = r"""

    Args:
        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
            Indices of input sequence tokens in the vocabulary.

            Indices can be obtained using :class:`~transformers.LxmertTokenizer`. See
            :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
            details.

            `What are input IDs? <../glossary.html#input-ids>`__
        visual_feats (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_visual_features, visual_feat_dim)`):
            This input represents visual features. These are ROI-pooled object features from bounding boxes obtained
            with a Faster R-CNN model. These are currently not provided by the transformers library.
        visual_pos (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_visual_features, visual_pos_dim)`):
            This input represents spatial features corresponding to their relative (via index) visual features. The
            pre-trained LXMERT model expects these spatial features to be normalized bounding boxes on a scale of 0 to
            1. These are currently not provided by the transformers library.
        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            `What are attention masks? <../glossary.html#attention-mask>`__
        visual_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            `What are attention masks? <../glossary.html#attention-mask>`__
        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
            Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
            1]``:

            - 0 corresponds to a `sentence A` token,
            - 1 corresponds to a `sentence B` token.

            `What are token type IDs? <../glossary.html#token-type-ids>`__
        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
            representation. This is useful if you want more control over how to convert :obj:`input_ids` indices into
            associated vectors than the model's internal embedding lookup matrix.
        output_attentions (:obj:`bool`, `optional`):
            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
            tensors for more detail.
        output_hidden_states (:obj:`bool`, `optional`):
            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
            more detail.
        return_dict (:obj:`bool`, `optional`):
            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
"""


@add_start_docstrings(
    "The bare Lxmert Model transformer outputting raw hidden-states without any specific head on top.",
    LXMERT_START_DOCSTRING,
)
class LxmertModel(LxmertPreTrainedModel):
    def __init__(self, config, save_cams=False):
        super().__init__(config)
        self.embeddings = LxmertEmbeddings(config)
        self.encoder = LxmertEncoder(config, save_cams=save_cams)
        self.pooler = LxmertPooler(config)
        self.init_weights()

    def get_input_embeddings(self):
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, new_embeddings):
        self.embeddings.word_embeddings = new_embeddings

    @add_start_docstrings_to_model_forward(LXMERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint="unc-nlp/lxmert-base-uncased",
        output_type=LxmertModelOutput,
        config_class=_CONFIG_FOR_DOC,
    )
    def forward(
        self,
        input_ids=None,
        visual_feats=None,
        visual_pos=None,
        attention_mask=None,
        visual_attention_mask=None,
        token_type_ids=None,
        inputs_embeds=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        assert visual_feats is not None, "`visual_feats` cannot be `None`"
        assert visual_pos is not None, "`visual_pos` cannot be `None`"

        device = input_ids.device if input_ids is not None else inputs_embeds.device

        if attention_mask is None:
            attention_mask = torch.ones(input_shape, device=device)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        # Extend the 2D mask to [batch_size, 1, 1, seq_length] so it is broadcastable
        # over attention heads, then convert 0 positions to a large negative bias.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

        # Process the visual attention mask the same way, if provided.
        if visual_attention_mask is not None:
            extended_visual_attention_mask = visual_attention_mask.unsqueeze(1).unsqueeze(2)
            extended_visual_attention_mask = extended_visual_attention_mask.to(dtype=self.dtype)
            extended_visual_attention_mask = (1.0 - extended_visual_attention_mask) * -10000.0
        else:
            extended_visual_attention_mask = None

        # Positional word embeddings
        embedding_output = self.embeddings(input_ids, token_type_ids, inputs_embeds)

        # Run the LXMERT encoder
        encoder_outputs = self.encoder(
            embedding_output,
            extended_attention_mask,
            visual_feats=visual_feats,
            visual_pos=visual_pos,
            visual_attention_mask=extended_visual_attention_mask,
            output_attentions=output_attentions,
        )

        visual_encoder_outputs, lang_encoder_outputs = encoder_outputs[:2]
        vision_hidden_states = visual_encoder_outputs[0]
        language_hidden_states = lang_encoder_outputs[0]

        all_attentions = ()
        if output_attentions:
            language_attentions = lang_encoder_outputs[1]
            vision_attentions = visual_encoder_outputs[1]
            cross_encoder_attentions = encoder_outputs[2]
            all_attentions = (
                language_attentions,
                vision_attentions,
                cross_encoder_attentions,
            )

        hidden_states = (language_hidden_states, vision_hidden_states) if output_hidden_states else ()

        visual_output = vision_hidden_states[-1]
        lang_output = language_hidden_states[-1]
        pooled_output = self.pooler(lang_output)

        if not return_dict:
            return (lang_output, visual_output, pooled_output) + hidden_states + all_attentions

        return LxmertModelOutput(
            pooled_output=pooled_output,
            language_output=lang_output,
            vision_output=visual_output,
            language_hidden_states=language_hidden_states if output_hidden_states else None,
            vision_hidden_states=vision_hidden_states if output_hidden_states else None,
            language_attentions=language_attentions if output_attentions else None,
            vision_attentions=vision_attentions if output_attentions else None,
            cross_encoder_attentions=cross_encoder_attentions if output_attentions else None,
        )
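# End-to-end usage of the bare backbone. A hedged sketch with random tensors standing
# in for Faster R-CNN outputs (`_demo_lxmert_model` and the feature shapes are
# illustrative assumptions; `visual_feat_dim` defaults to 2048 and `visual_pos_dim`
# to 4 in `LxmertConfig`):
def _demo_lxmert_model():
    config = LxmertConfig()
    model = LxmertModel(config)
    input_ids = torch.randint(0, config.vocab_size, (1, 12))
    visual_feats = torch.randn(1, 36, config.visual_feat_dim)  # ROI features
    visual_pos = torch.rand(1, 36, config.visual_pos_dim)      # normalized boxes
    outputs = model(input_ids, visual_feats, visual_pos, return_dict=True)
    # (1, 12, 768) text stream, (1, 36, 768) visual stream, (1, 768) pooled CLS
    print(outputs.language_output.shape, outputs.vision_output.shape, outputs.pooled_output.shape)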
@add_start_docstrings(
    """Lxmert Model with a specified pretraining head on top.""",
    LXMERT_START_DOCSTRING,
)
class LxmertForPreTraining(LxmertPreTrainedModel):
    def __init__(self, config, save_cams=False):
        super().__init__(config)
        # Configuration
        self.config = config
        self.num_qa_labels = config.num_qa_labels
        self.visual_loss_normalizer = config.visual_loss_normalizer

        # Use of pretraining tasks
        self.task_mask_lm = config.task_mask_lm
        self.task_obj_predict = config.task_obj_predict
        self.task_matched = config.task_matched
        self.task_qa = config.task_qa

        # LXMERT backbone
        self.lxmert = LxmertModel(config, save_cams=save_cams)

        # Pre-training heads
        self.cls = LxmertPreTrainingHeads(config, self.lxmert.embeddings.word_embeddings.weight)
        if self.task_obj_predict:
            self.obj_predict_head = LxmertVisualObjHead(config)
        if self.task_qa:
            self.answer_head = LxmertVisualAnswerHead(config, self.num_qa_labels)

        # Weight initialization
        self.init_weights()

        # Loss functions
        self.loss_fcts = {
            "l2": SmoothL1Loss(reduction="none"),
            "visual_ce": CrossEntropyLoss(reduction="none"),
            "ce": CrossEntropyLoss(),
        }

        visual_losses = {}
        if config.visual_obj_loss:
            visual_losses["obj"] = {
                "shape": (-1,),
                "num": config.num_object_labels,
                "loss": "visual_ce",
            }
        if config.visual_attr_loss:
            visual_losses["attr"] = {
                "shape": (-1,),
                "num": config.num_attr_labels,
                "loss": "visual_ce",
            }
        if config.visual_feat_loss:
            visual_losses["feat"] = {
                "shape": (-1, config.visual_feat_dim),
                "num": config.visual_feat_dim,
                "loss": "l2",
            }
        self.visual_losses = visual_losses

    def resize_num_qa_labels(self, num_labels):
        """
        Build a resized question answering linear layer Module from a provided new linear layer. Increasing the size
        will add newly initialized weights. Reducing the size will remove weights from the end.

        Args:
            num_labels (:obj:`int`, `optional`):
                New number of labels in the linear layer weight matrix. Increasing the size will add newly initialized
                weights at the end. Reducing the size will remove weights from the end. If not provided or
                :obj:`None`, just returns a pointer to the qa labels :obj:`torch.nn.Linear` module of the model
                without doing anything.

        Return:
            :obj:`torch.nn.Linear`: Pointer to the resized Linear layer or the old Linear layer
        """

        cur_qa_logit_layer = self.get_qa_logit_layer()
        if num_labels is None or cur_qa_logit_layer is None:
            return
        new_qa_logit_layer = self._resize_qa_labels(num_labels)
        self.config.num_qa_labels = num_labels
        self.num_qa_labels = num_labels

        return new_qa_logit_layer

    def _resize_qa_labels(self, num_labels):
        cur_qa_logit_layer = self.get_qa_logit_layer()
        new_qa_logit_layer = self._get_resized_qa_labels(cur_qa_logit_layer, num_labels)
        self._set_qa_logit_layer(new_qa_logit_layer)
        return self.get_qa_logit_layer()

    def get_qa_logit_layer(self) -> nn.Module:
        """
        Returns the linear layer that produces question answering logits.

        Returns:
            :obj:`nn.Module`: A torch module mapping the question answering prediction hidden states, or :obj:`None`
            if LXMERT does not have a visual answering head.
        """
        if hasattr(self, "answer_head"):
            return self.answer_head.logit_fc[-1]

    def _set_qa_logit_layer(self, qa_logit_layer):
        self.answer_head.logit_fc[-1] = qa_logit_layer

    def _get_resized_qa_labels(self, cur_qa_logit_layer, num_labels):
        if num_labels is None:
            return cur_qa_logit_layer

        cur_qa_labels, hidden_dim = cur_qa_logit_layer.weight.size()
        if cur_qa_labels == num_labels:
            return cur_qa_logit_layer

        # Build a new linear output layer
        if getattr(cur_qa_logit_layer, "bias", None) is not None:
            new_qa_logit_layer = nn.Linear(hidden_dim, num_labels)
        else:
            new_qa_logit_layer = nn.Linear(hidden_dim, num_labels, bias=False)

        new_qa_logit_layer.to(cur_qa_logit_layer.weight.device)

        # Initialize all new labels
        self._init_weights(new_qa_logit_layer)

        # Copy the labels from the previous weights
        num_labels_to_copy = min(cur_qa_labels, num_labels)
        new_qa_logit_layer.weight.data[:num_labels_to_copy, :] = cur_qa_logit_layer.weight.data[
            :num_labels_to_copy, :
        ]
        if getattr(cur_qa_logit_layer, "bias", None) is not None:
            new_qa_logit_layer.bias.data[:num_labels_to_copy] = cur_qa_logit_layer.bias.data[:num_labels_to_copy]

        return new_qa_logit_layer

    @add_start_docstrings_to_model_forward(LXMERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @replace_return_docstrings(output_type=LxmertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids=None,
        visual_feats=None,
        visual_pos=None,
        attention_mask=None,
        visual_attention_mask=None,
        token_type_ids=None,
        inputs_embeds=None,
        labels=None,
        obj_labels=None,
        matched_label=None,
        ans=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        **kwargs,
    ):
        r"""
        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
            Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
            config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are ignored
            (masked); the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
        obj_labels (``Dict[Str: Tuple[Torch.FloatTensor, Torch.FloatTensor]]``, `optional`):
            each key is named after one of the visual losses, and each element of the tuple is of the shape
            ``(batch_size, num_features)`` and ``(batch_size, num_features, visual_feature_dim)`` for the label id
            and the label score respectively
        matched_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
            Labels for computing whether or not the text input matches the image (classification) loss. Input should
            be a sequence pair (see :obj:`input_ids` docstring). Indices should be in ``[0, 1]``:

            - 0 indicates that the sentence does not match the image,
            - 1 indicates that the sentence does match the image.
        ans (``Torch.Tensor`` of shape ``(batch_size)``, `optional`):
            A one-hot representation of the correct answer

        Returns:
        """
        if "masked_lm_labels" in kwargs:
            warnings.warn(
                "The `masked_lm_labels` argument is deprecated and will be removed in a future version, use `labels` instead.",
                FutureWarning,
            )
            labels = kwargs.pop("masked_lm_labels")

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        device = input_ids.device if input_ids is not None else inputs_embeds.device
        lxmert_output = self.lxmert(
            input_ids=input_ids,
            visual_feats=visual_feats,
            visual_pos=visual_pos,
            token_type_ids=token_type_ids,
            attention_mask=attention_mask,
            visual_attention_mask=visual_attention_mask,
            inputs_embeds=inputs_embeds,
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
            return_dict=return_dict,
        )

        lang_output, visual_output, pooled_output = (
            lxmert_output[0],
            lxmert_output[1],
            lxmert_output[2],
        )
        lang_prediction_scores, cross_relationship_score = self.cls(lang_output, pooled_output)
        if self.task_qa:
            answer_score = self.answer_head(pooled_output)
        else:
            answer_score = pooled_output[0][0]

        total_loss = (
            None
            if (labels is None and matched_label is None and obj_labels is None and ans is None)
            else torch.tensor(0.0, device=device)
        )
        if labels is not None and self.task_mask_lm:
            masked_lm_loss = self.loss_fcts["ce"](
                lang_prediction_scores.view(-1, self.config.vocab_size),
                labels.view(-1),
            )
            total_loss += masked_lm_loss
        if matched_label is not None and self.task_matched:
            matched_loss = self.loss_fcts["ce"](cross_relationship_score.view(-1, 2), matched_label.view(-1))
            total_loss += matched_loss
        if obj_labels is not None and self.task_obj_predict:
            total_visual_loss = torch.tensor(0.0, device=device)
            visual_prediction_scores_dict = self.obj_predict_head(visual_output)
            for key, key_info in self.visual_losses.items():
                label, mask_conf = obj_labels[key]
                output_dim = key_info["num"]
                loss_fct_name = key_info["loss"]
                label_shape = key_info["shape"]
                weight = self.visual_loss_normalizer
                visual_loss_fct = self.loss_fcts[loss_fct_name]
                visual_prediction_scores = visual_prediction_scores_dict[key]
                visual_loss = visual_loss_fct(
                    visual_prediction_scores.view(-1, output_dim),
                    label.view(*label_shape),
                )
                if visual_loss.dim() > 1:  # Regression losses
                    visual_loss = visual_loss.mean(1)
                visual_loss = (visual_loss * mask_conf.view(-1)).mean() * weight
                total_visual_loss += visual_loss
            total_loss += total_visual_loss
        if ans is not None and self.task_qa:
            answer_loss = self.loss_fcts["ce"](answer_score.view(-1, self.num_qa_labels), ans.view(-1))
            total_loss += answer_loss

        if not return_dict:
            output = (
                lang_prediction_scores,
                cross_relationship_score,
                answer_score,
            ) + lxmert_output[3:]
            return ((total_loss,) + output) if total_loss is not None else output

        return LxmertForPreTrainingOutput(
            loss=total_loss,
            prediction_logits=lang_prediction_scores,
            cross_relationship_score=cross_relationship_score,
            question_answering_score=answer_score,
            language_hidden_states=lxmert_output.language_hidden_states,
            vision_hidden_states=lxmert_output.vision_hidden_states,
            language_attentions=lxmert_output.language_attentions,
            vision_attentions=lxmert_output.vision_attentions,
            cross_encoder_attentions=lxmert_output.cross_encoder_attentions,
        )
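# `resize_num_qa_labels` swaps the last Linear of the answer head for a larger or
# smaller one, copying the overlapping rows, much like resizing token embeddings.
# A small sketch (`_demo_resize_qa_labels` is a hypothetical helper, assuming
# `task_qa` is enabled, which is the `LxmertConfig` default):
def _demo_resize_qa_labels():
    config = LxmertConfig(num_qa_labels=10)
    model = LxmertForPreTraining(config)
    assert model.get_qa_logit_layer().out_features == 10
    model.resize_num_qa_labels(25)  # grows: the 15 new rows are freshly initialized
    assert model.get_qa_logit_layer().out_features == 25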
@add_start_docstrings(
    """Lxmert Model with a visual-answering head on top for downstream QA tasks""",
    LXMERT_START_DOCSTRING,
)
class LxmertForQuestionAnswering(LxmertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # Configuration
        self.config = config
        self.num_qa_labels = config.num_qa_labels

        # LXMERT backbone
        self.lxmert = LxmertModel(config)

        self.answer_head = LxmertVisualAnswerHead(config, self.num_qa_labels)

        # Weight initialization
        self.init_weights()

        # Loss function
        self.loss = CrossEntropyLoss()

    def resize_num_qa_labels(self, num_labels):
        """
        Build a resized question answering linear layer Module from a provided new linear layer. Increasing the size
        will add newly initialized weights. Reducing the size will remove weights from the end.

        Args:
            num_labels (:obj:`int`, `optional`):
                New number of labels in the linear layer weight matrix. Increasing the size will add newly initialized
                weights at the end. Reducing the size will remove weights from the end. If not provided or
                :obj:`None`, just returns a pointer to the qa labels :obj:`torch.nn.Linear` module of the model
                without doing anything.

        Return:
            :obj:`torch.nn.Linear`: Pointer to the resized Linear layer or the old Linear layer
        """

        cur_qa_logit_layer = self.get_qa_logit_layer()
        if num_labels is None or cur_qa_logit_layer is None:
            return
        new_qa_logit_layer = self._resize_qa_labels(num_labels)
        self.config.num_qa_labels = num_labels
        self.num_qa_labels = num_labels

        return new_qa_logit_layer

    def _resize_qa_labels(self, num_labels):
        cur_qa_logit_layer = self.get_qa_logit_layer()
        new_qa_logit_layer = self._get_resized_qa_labels(cur_qa_logit_layer, num_labels)
        self._set_qa_logit_layer(new_qa_logit_layer)
        return self.get_qa_logit_layer()

    def get_qa_logit_layer(self) -> nn.Module:
        """
        Returns the linear layer that produces question answering logits.

        Returns:
            :obj:`nn.Module`: A torch module mapping the question answering prediction hidden states.
            :obj:`None`: A NoneType object if LXMERT does not have the visual answering head.
        """
        if hasattr(self, "answer_head"):
            return self.answer_head.logit_fc[-1]

    def _set_qa_logit_layer(self, qa_logit_layer):
        self.answer_head.logit_fc[-1] = qa_logit_layer

    def _get_resized_qa_labels(self, cur_qa_logit_layer, num_labels):
        if num_labels is None:
            return cur_qa_logit_layer

        cur_qa_labels, hidden_dim = cur_qa_logit_layer.weight.size()
        if cur_qa_labels == num_labels:
            return cur_qa_logit_layer

        # Build a new linear output layer
        if getattr(cur_qa_logit_layer, "bias", None) is not None:
            new_qa_logit_layer = nn.Linear(hidden_dim, num_labels)
        else:
            new_qa_logit_layer = nn.Linear(hidden_dim, num_labels, bias=False)

        new_qa_logit_layer.to(cur_qa_logit_layer.weight.device)

        # Initialize all new labels
        self._init_weights(new_qa_logit_layer)

        # Copy the labels from the previous weights
        num_labels_to_copy = min(cur_qa_labels, num_labels)
        new_qa_logit_layer.weight.data[:num_labels_to_copy, :] = cur_qa_logit_layer.weight.data[
            :num_labels_to_copy, :
        ]
        if getattr(cur_qa_logit_layer, "bias", None) is not None:
            new_qa_logit_layer.bias.data[:num_labels_to_copy] = cur_qa_logit_layer.bias.data[:num_labels_to_copy]

        return new_qa_logit_layer

    @add_start_docstrings_to_model_forward(LXMERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint="unc-nlp/lxmert-base-uncased",
        output_type=LxmertForQuestionAnsweringOutput,
        config_class=_CONFIG_FOR_DOC,
    )
    def forward(
        self,
        input_ids=None,
        visual_feats=None,
        visual_pos=None,
        attention_mask=None,
        visual_attention_mask=None,
        token_type_ids=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        labels (``Torch.Tensor`` of shape ``(batch_size)``, `optional`):
            A one-hot representation of the correct answer

        Returns:
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        lxmert_output = self.lxmert(
            input_ids=input_ids,
            visual_feats=visual_feats,
            visual_pos=visual_pos,
            token_type_ids=token_type_ids,
            attention_mask=attention_mask,
            visual_attention_mask=visual_attention_mask,
            inputs_embeds=inputs_embeds,
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
            return_dict=return_dict,
        )

        pooled_output = lxmert_output[2]
        answer_score = self.answer_head(pooled_output)
        loss = None
        if labels is not None:
            loss = self.loss(answer_score.view(-1, self.num_qa_labels), labels.view(-1))

        if not return_dict:
            output = (answer_score,) + lxmert_output[3:]
            return (loss,) + output if loss is not None else output

        return LxmertForQuestionAnsweringOutput(
            loss=loss,
            question_answering_score=answer_score,
            language_hidden_states=lxmert_output.language_hidden_states,
            vision_hidden_states=lxmert_output.vision_hidden_states,
            language_attentions=lxmert_output.language_attentions,
            vision_attentions=lxmert_output.vision_attentions,
            cross_encoder_attentions=lxmert_output.cross_encoder_attentions,
        )
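# Optional smoke test for the QA head, runnable as a script. The inputs are random
# placeholders (an illustrative assumption, not shipped with the module); real
# `visual_feats`/`visual_pos` would come from a Faster R-CNN detector.
if __name__ == "__main__":
    config = LxmertConfig(num_qa_labels=2)
    model = LxmertForQuestionAnswering(config)
    input_ids = torch.randint(0, config.vocab_size, (1, 8))
    visual_feats = torch.randn(1, 36, config.visual_feat_dim)
    visual_pos = torch.rand(1, 36, config.visual_pos_dim)
    labels = torch.tensor([1])
    out = model(input_ids, visual_feats, visual_pos, labels=labels, return_dict=True)
    print("qa logits:", out.question_answering_score.shape, "loss:", float(out.loss))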