# coding=utf-8
""" PyTorch LXMERT model. """

import copy
import math
import os
import warnings
from dataclasses import dataclass
from typing import Optional, Tuple

import torch
from torch import nn
from torch.nn import CrossEntropyLoss, SmoothL1Loss

# Relevance-propagation-aware building blocks (Linear, LayerNorm, Dropout, Add,
# Clone, Mul, MatMul, Softmax, IndexSelect, ReLU, Tanh, GELU, ...), each of
# which exposes a `relprop` method in addition to `forward`.
from lxmert.lxmert.src.layers import *
from transformers.file_utils import (
    ModelOutput,
    add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    replace_return_docstrings,
)
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import logging
from transformers.configuration_lxmert import LxmertConfig

logger = logging.get_logger(__name__)

_CONFIG_FOR_DOC = "LxmertConfig"
_TOKENIZER_FOR_DOC = "LxmertTokenizer"

LXMERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
    "unc-nlp/lxmert-base-uncased",
]

# Activation names map to the relprop-aware wrapper classes, not plain torch.nn modules.
ACT2FN = {"relu": ReLU, "tanh": Tanh, "gelu": GELU}


@dataclass
class LxmertModelOutput(ModelOutput):
    """
    Lxmert's outputs that contain the last hidden states, pooled outputs, and attention probabilities for the
    language, visual, and, cross-modality encoders. (note: the visual encoder in Lxmert is referred to as the
    "relation-ship" encoder")

    Args:
        language_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
            Sequence of hidden-states at the output of the last layer of the language encoder.
        vision_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
            Sequence of hidden-states at the output of the last layer of the visual encoder.
        pooled_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`):
            Last layer hidden-state of the first token of the sequence (classification, CLS, token) further processed
            by a Linear layer and a Tanh activation function.
        language_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        vision_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        language_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        vision_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        cross_encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`.
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
    """

    language_output: Optional[torch.FloatTensor] = None
    vision_output: Optional[torch.FloatTensor] = None
    pooled_output: Optional[torch.FloatTensor] = None
    language_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    vision_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    language_attentions: Optional[Tuple[torch.FloatTensor]] = None
    vision_attentions: Optional[Tuple[torch.FloatTensor]] = None
    cross_encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None


@dataclass
class LxmertForQuestionAnsweringOutput(ModelOutput):
    """
    Output type of :class:`~transformers.LxmertForQuestionAnswering`.

    Args:
        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
            Total loss as the sum of the masked language modeling loss and the next sequence prediction
            (classification) loss.
        question_answering_score: (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, n_qa_answers)`, `optional`):
            Prediction scores of question answering objective (classification).
        language_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        vision_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        language_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        vision_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        cross_encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
    """

    loss: Optional[torch.FloatTensor] = None
    question_answering_score: Optional[torch.FloatTensor] = None
    language_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    vision_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    language_attentions: Optional[Tuple[torch.FloatTensor]] = None
    vision_attentions: Optional[Tuple[torch.FloatTensor]] = None
    cross_encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None


@dataclass
class LxmertForPreTrainingOutput(ModelOutput):
    """
    Output type of :class:`~transformers.LxmertForPreTraining`.

    Args:
        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
            Total loss as the sum of the masked language modeling loss and the next sequence prediction
            (classification) loss.
        prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        cross_relationship_score: (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
            Prediction scores of the textual matching objective (classification) head (scores of True/False
            continuation before SoftMax).
        question_answering_score: (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, n_qa_answers)`):
            Prediction scores of question answering objective (classification).
        language_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        vision_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for input features + one for the output of each cross-modality
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
        language_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        vision_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
        cross_encoder_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
            sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
            weighted average in the self-attention heads.
    """

    loss: Optional[torch.FloatTensor] = None
    prediction_logits: Optional[torch.FloatTensor] = None
    cross_relationship_score: Optional[torch.FloatTensor] = None
    question_answering_score: Optional[torch.FloatTensor] = None
    language_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    vision_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    language_attentions: Optional[Tuple[torch.FloatTensor]] = None
    vision_attentions: Optional[Tuple[torch.FloatTensor]] = None
    cross_encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None


def load_tf_weights_in_lxmert(model, config, tf_checkpoint_path):
    """Load tf checkpoints in a pytorch model."""
    try:
        import re

        import numpy as np
        import tensorflow as tf
    except ImportError:
        logger.error(
            "Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. "
            "Please see https://www.tensorflow.org/install/ for installation instructions."
        )
        raise
    tf_path = os.path.abspath(tf_checkpoint_path)
    logger.info("Converting TensorFlow checkpoint from {}".format(tf_path))
    # Load weights from TF model
    init_vars = tf.train.list_variables(tf_path)
    names = []
    arrays = []
    for name, shape in init_vars:
        logger.info("Loading TF weight {} with shape {}".format(name, shape))
        array = tf.train.load_variable(tf_path, name)
        names.append(name)
        arrays.append(array)

    for name, array in zip(names, arrays):
        name = name.split("/")
        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculate m and v,
        # which are not required for using a pretrained model
        if any(
            n in ["adam_v", "adam_m", "AdamWeightDecayOptimizer", "AdamWeightDecayOptimizer_1", "global_step"]
            for n in name
        ):
            logger.info("Skipping {}".format("/".join(name)))
            continue
        pointer = model
        for m_name in name:
            if re.fullmatch(r"[A-Za-z]+_\d+", m_name):
                scope_names = re.split(r"_(\d+)", m_name)
            else:
                scope_names = [m_name]
            if scope_names[0] == "kernel" or scope_names[0] == "gamma":
                pointer = getattr(pointer, "weight")
            elif scope_names[0] == "output_bias" or scope_names[0] == "beta":
                pointer = getattr(pointer, "bias")
            elif scope_names[0] == "output_weights":
                pointer = getattr(pointer, "weight")
            elif scope_names[0] == "squad":
                pointer = getattr(pointer, "classifier")
            else:
                try:
                    pointer = getattr(pointer, scope_names[0])
                except AttributeError:
                    logger.info("Skipping {}".format("/".join(name)))
                    continue
            if len(scope_names) >= 2:
                num = int(scope_names[1])
                pointer = pointer[num]
        if m_name[-11:] == "_embeddings":
            pointer = getattr(pointer, "weight")
        elif m_name == "kernel":
            array = np.transpose(array)
        try:
            assert pointer.shape == array.shape
        except AssertionError as e:
            e.args += (pointer.shape, array.shape)
            raise
        logger.info("Initialize PyTorch weight {}".format(name))
        pointer.data = torch.from_numpy(array)
    return model


class LxmertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size, padding_idx=0)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size, padding_idx=0)

        self.LayerNorm = LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = Dropout(config.hidden_dropout_prob)

        # Explicit Add modules so that the residual-style sums can be unrolled by relprop.
        self.add1 = Add()
        self.add2 = Add()

    def forward(self, input_ids, token_type_ids=None, inputs_embeds=None):
        if input_ids is not None:
            input_shape = input_ids.size()
            device = input_ids.device
        else:
            input_shape = inputs_embeds.size()[:-1]
            device = inputs_embeds.device
        seq_length = input_shape[1]

        position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
        position_ids = position_ids.unsqueeze(0).expand(input_shape)

        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)

        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = self.add1([token_type_embeddings, position_embeddings])
        embeddings = self.add2([embeddings, inputs_embeds])
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

    def relprop(self, cam, **kwargs):
        cam = self.dropout.relprop(cam, **kwargs)
        cam = self.LayerNorm.relprop(cam, **kwargs)
        cam = self.add2.relprop(cam, **kwargs)
        return cam


class LxmertAttention(nn.Module):
    def __init__(self, config, ctx_dim=None):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention heads (%d)"
                % (config.hidden_size, config.num_attention_heads)
            )
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.head_size = self.num_attention_heads * self.attention_head_size

        # visual_dim = 2048
        if ctx_dim is None:
            ctx_dim = config.hidden_size
        self.query = Linear(config.hidden_size, self.head_size)
        self.key = Linear(ctx_dim, self.head_size)
        self.value = Linear(ctx_dim, self.head_size)

        self.dropout = Dropout(config.attention_probs_dropout_prob)

        self.matmul1 = MatMul()
        self.matmul2 = MatMul()
        self.softmax = Softmax(dim=-1)
        self.add = Add()
        self.mul = Mul()
        self.head_mask = None
        self.attention_mask = None
        self.clone = Clone()

        # Populated during forward/backward so explanation methods can read them back.
        self.attn = None
        self.attn_gradients = None
        self.attn_cam = None

    def get_attn(self):
        return self.attn

    def save_attn(self, attn):
        self.attn = attn

    def get_attn_cam(self):
        return self.attn_cam

    def save_attn_cam(self, cam):
        self.attn_cam = cam

    def save_attn_gradients(self, attn_gradients):
        self.attn_gradients = attn_gradients

    def get_attn_gradients(self):
        return self.attn_gradients

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def transpose_for_scores_relprop(self, x):
        return x.permute(0, 2, 1, 3).flatten(2)

    def forward(self, hidden_states, context, attention_mask=None, output_attentions=False):
        key, value = self.clone(context, 2)
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(key)
        mixed_value_layer = self.value(value)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = self.matmul1([query_layer, key_layer.transpose(-1, -2)])
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Apply the attention mask (precomputed for all layers in the LxmertModel forward() function)
        if attention_mask is not None:
            self.attention_mask = attention_mask
            attention_scores = self.add([attention_scores, attention_mask])

        # Normalize the attention scores to probabilities.
        attention_probs = self.softmax(attention_scores)

        self.save_attn(attention_probs)
        attention_probs.register_hook(self.save_attn_gradients)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        context_layer = self.matmul2([attention_probs, value_layer])
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        return outputs

    def relprop(self, cam, **kwargs):
        # Propagate relevance back through the attention computation, assuming
        # the forward pass ran with output_attentions=False.
        cam = self.transpose_for_scores(cam)

        # [attention_probs, value_layer]
        (cam1, cam2) = self.matmul2.relprop(cam, **kwargs)
        cam1 /= 2
        cam2 /= 2

        self.save_attn_cam(cam1)

        cam1 = self.dropout.relprop(cam1, **kwargs)
        cam1 = self.softmax.relprop(cam1, **kwargs)

        if self.attention_mask is not None:
            # [attention_scores, attention_mask]
            (cam1, _) = self.add.relprop(cam1, **kwargs)

        # [query_layer, key_layer.transpose(-1, -2)]
        (cam1_1, cam1_2) = self.matmul1.relprop(cam1, **kwargs)
        cam1_1 /= 2
        cam1_2 /= 2

        # query
        cam1_1 = self.transpose_for_scores_relprop(cam1_1)
        cam1_1 = self.query.relprop(cam1_1, **kwargs)

        # key
        cam1_2 = self.transpose_for_scores_relprop(cam1_2.transpose(-1, -2))
        cam1_2 = self.key.relprop(cam1_2, **kwargs)

        # value
        cam2 = self.transpose_for_scores_relprop(cam2)
        cam2 = self.value.relprop(cam2, **kwargs)

        cam = self.clone.relprop((cam1_2, cam2), **kwargs)

        # Relevance for (hidden_states, context).
        return cam1_1, cam
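
# Illustrative helper (not part of the original file): the save_attn /
# save_attn_gradients hooks above are what relevance-based explanation methods
# read back after a forward and a backward pass. A minimal sketch of combining
# them, assuming `att` is an LxmertAttention module whose forward has already
# run under grad and whose output has been backpropagated:
def gradient_weighted_attention(att):
    """Return clamp(attn * grad, min=0) averaged over heads (Chefer et al.-style relevance)."""
    cam = att.get_attn() * att.get_attn_gradients()
    return cam.clamp(min=0).mean(dim=1)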

class LxmertAttentionOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = Dropout(config.hidden_dropout_prob)
        self.add = Add()

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.add([hidden_states, input_tensor])
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states

    def relprop(self, cam, **kwargs):
        cam = self.LayerNorm.relprop(cam, **kwargs)
        (cam1, cam2) = self.add.relprop(cam, **kwargs)
        cam1 = self.dropout.relprop(cam1, **kwargs)
        cam1 = self.dense.relprop(cam1, **kwargs)
        return (cam1, cam2)


class LxmertCrossAttentionLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.att = LxmertAttention(config)
        self.output = LxmertAttentionOutput(config)
        self.clone = Clone()

    def forward(self, input_tensor, ctx_tensor, ctx_att_mask=None, output_attentions=False):
        inp1, inp2 = self.clone(input_tensor, 2)
        output = self.att(inp1, ctx_tensor, ctx_att_mask, output_attentions=output_attentions)
        if output_attentions:
            attention_probs = output[1]
        attention_output = self.output(output[0], inp2)
        outputs = (attention_output, attention_probs) if output_attentions else (attention_output,)
        return outputs

    def relprop(self, cam, **kwargs):
        cam_output, cam_inp2 = self.output.relprop(cam, **kwargs)
        cam_inp1, cam_ctx = self.att.relprop(cam_output, **kwargs)
        cam_inp = self.clone.relprop((cam_inp1, cam_inp2), **kwargs)
        return (cam_inp, cam_ctx)


class LxmertSelfAttentionLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self = LxmertAttention(config)
        self.output = LxmertAttentionOutput(config)
        self.clone = Clone()

    def forward(self, input_tensor, attention_mask, output_attentions=False):
        inp1, inp2, inp3 = self.clone(input_tensor, 3)
        # Self attention attends to itself, thus keys and queries are the same (input_tensor).
        output = self.self(inp1, inp2, attention_mask, output_attentions=output_attentions)
        if output_attentions:
            attention_probs = output[1]
        attention_output = self.output(output[0], inp3)
        outputs = (attention_output, attention_probs) if output_attentions else (attention_output,)
        return outputs

    def relprop(self, cam, **kwargs):
        cam_output, cam_inp3 = self.output.relprop(cam, **kwargs)
        cam_inp1, cam_inp2 = self.self.relprop(cam_output, **kwargs)
        cam_inp = self.clone.relprop((cam_inp1, cam_inp2, cam_inp3), **kwargs)
        return cam_inp


class LxmertIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = Linear(config.hidden_size, config.intermediate_size)
        self.intermediate_act_fn = ACT2FN[config.hidden_act]()

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states

    def relprop(self, cam, **kwargs):
        cam = self.intermediate_act_fn.relprop(cam, **kwargs)
        cam = self.dense.relprop(cam, **kwargs)
        return cam


class LxmertOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = Dropout(config.hidden_dropout_prob)
        self.add = Add()

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.add([hidden_states, input_tensor])
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states

    def relprop(self, cam, **kwargs):
        cam = self.LayerNorm.relprop(cam, **kwargs)
        (cam1, cam2) = self.add.relprop(cam, **kwargs)
        cam1 = self.dropout.relprop(cam1, **kwargs)
        cam1 = self.dense.relprop(cam1, **kwargs)
        return (cam1, cam2)


class LxmertLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = LxmertSelfAttentionLayer(config)
        self.intermediate = LxmertIntermediate(config)
        self.output = LxmertOutput(config)
        self.clone = Clone()

    def forward(self, hidden_states, attention_mask=None, output_attentions=False):
        outputs = self.attention(hidden_states, attention_mask, output_attentions=output_attentions)
        attention_output = outputs[0]
        ao1, ao2 = self.clone(attention_output, 2)
        intermediate_output = self.intermediate(ao1)
        layer_output = self.output(intermediate_output, ao2)
        outputs = (layer_output,) + outputs[1:]  # add attentions if we output them
        return outputs

    def relprop(self, cam, **kwargs):
        (cam1, cam2) = self.output.relprop(cam, **kwargs)
        cam1 = self.intermediate.relprop(cam1, **kwargs)
        cam = self.clone.relprop((cam1, cam2), **kwargs)
        cam = self.attention.relprop(cam, **kwargs)
        return cam


class LxmertXLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        # The cross-attention layer
        self.visual_attention = LxmertCrossAttentionLayer(config)

        # Self-attention layers
        self.lang_self_att = LxmertSelfAttentionLayer(config)
        self.visn_self_att = LxmertSelfAttentionLayer(config)

        # Intermediate and output layers (FFNs)
        self.lang_inter = LxmertIntermediate(config)
        self.lang_output = LxmertOutput(config)
        self.visn_inter = LxmertIntermediate(config)
        self.visn_output = LxmertOutput(config)

        self.clone1 = Clone()
        self.clone2 = Clone()
        self.clone3 = Clone()
        self.clone4 = Clone()

    def cross_att(
        self,
        lang_input,
        lang_attention_mask,
        visual_input,
        visual_attention_mask,
        output_x_attentions=False,
    ):
        lang_input1, lang_input2 = self.clone1(lang_input, 2)
        visual_input1, visual_input2 = self.clone2(visual_input, 2)
        # The vision->language direction uses a deep copy of the shared cross-attention
        # weights so that each direction can be unrolled separately by relprop.
        if not hasattr(self, "visual_attention_copy"):
            self.visual_attention_copy = copy.deepcopy(self.visual_attention)
        # Cross attention
        lang_att_output = self.visual_attention(
            lang_input1,
            visual_input1,
            ctx_att_mask=visual_attention_mask,
            output_attentions=output_x_attentions,
        )
        visual_att_output = self.visual_attention_copy(
            visual_input2,
            lang_input2,
            ctx_att_mask=lang_attention_mask,
            output_attentions=False,
        )
        return lang_att_output, visual_att_output

    def relprop_cross(self, cam, **kwargs):
        cam_lang, cam_vis = cam
        cam_vis2, cam_lang2 = self.visual_attention_copy.relprop(cam_vis, **kwargs)
        cam_lang1, cam_vis1 = self.visual_attention.relprop(cam_lang, **kwargs)
        cam_lang = self.clone1.relprop((cam_lang1, cam_lang2), **kwargs)
        cam_vis = self.clone2.relprop((cam_vis1, cam_vis2), **kwargs)
        return cam_lang, cam_vis

    def self_att(self, lang_input, lang_attention_mask, visual_input, visual_attention_mask):
        # Self attention
        lang_att_output = self.lang_self_att(lang_input, lang_attention_mask, output_attentions=False)
        visual_att_output = self.visn_self_att(visual_input, visual_attention_mask, output_attentions=False)
        return lang_att_output[0], visual_att_output[0]

    def relprop_self(self, cam, **kwargs):
        cam_lang, cam_vis = cam
        cam_vis = self.visn_self_att.relprop(cam_vis, **kwargs)
        cam_lang = self.lang_self_att.relprop(cam_lang, **kwargs)
        return cam_lang, cam_vis

    def output_fc(self, lang_input, visual_input):
        lang_input1, lang_input2 = self.clone3(lang_input, 2)
        visual_input1, visual_input2 = self.clone4(visual_input, 2)
        # Fully connected layers
        lang_inter_output = self.lang_inter(lang_input1)
        visual_inter_output = self.visn_inter(visual_input1)

        # Layer output
        lang_output = self.lang_output(lang_inter_output, lang_input2)
        visual_output = self.visn_output(visual_inter_output, visual_input2)
        return lang_output, visual_output

    def relprop_output(self, cam, **kwargs):
        cam_lang, cam_vis = cam
        cam_vis_inter, cam_vis2 = self.visn_output.relprop(cam_vis, **kwargs)
        cam_lang_inter, cam_lang2 = self.lang_output.relprop(cam_lang, **kwargs)
        cam_vis1 = self.visn_inter.relprop(cam_vis_inter, **kwargs)
        cam_lang1 = self.lang_inter.relprop(cam_lang_inter, **kwargs)
        cam_lang = self.clone3.relprop((cam_lang1, cam_lang2), **kwargs)
        cam_vis = self.clone4.relprop((cam_vis1, cam_vis2), **kwargs)
        return cam_lang, cam_vis

    def forward(self, lang_feats, lang_attention_mask, visual_feats, visual_attention_mask, output_attentions=False):
        lang_att_output, visual_att_output = self.cross_att(
            lang_input=lang_feats,
            lang_attention_mask=lang_attention_mask,
            visual_input=visual_feats,
            visual_attention_mask=visual_attention_mask,
            output_x_attentions=output_attentions,
        )
        attention_probs = lang_att_output[1:]

        lang_att_output, visual_att_output = self.self_att(
            lang_att_output[0],
            lang_attention_mask,
            visual_att_output[0],
            visual_attention_mask,
        )

        lang_output, visual_output = self.output_fc(lang_att_output, visual_att_output)
        return (
            (lang_output, visual_output, attention_probs[0])
            if output_attentions
            else (lang_output, visual_output)
        )
    def relprop(self, cam, **kwargs):
        # Unroll the layer in reverse order: output FFNs -> self attention -> cross attention.
        cam_lang, cam_vis = cam
        cam_lang, cam_vis = self.relprop_output((cam_lang, cam_vis), **kwargs)
        cam_lang, cam_vis = self.relprop_self((cam_lang, cam_vis), **kwargs)
        cam_lang, cam_vis = self.relprop_cross((cam_lang, cam_vis), **kwargs)
        return cam_lang, cam_vis


class LxmertVisualFeatureEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        feat_dim = config.visual_feat_dim
        pos_dim = config.visual_pos_dim

        # Object feature encoding
        self.visn_fc = Linear(feat_dim, config.hidden_size)
        self.visn_layer_norm = LayerNorm(config.hidden_size, eps=1e-12)

        # Box position encoding
        self.box_fc = Linear(pos_dim, config.hidden_size)
        self.box_layer_norm = LayerNorm(config.hidden_size, eps=1e-12)

        self.dropout = Dropout(config.hidden_dropout_prob)

    def forward(self, visual_feats, visual_pos):
        x = self.visn_fc(visual_feats)
        x = self.visn_layer_norm(x)
        y = self.box_fc(visual_pos)
        y = self.box_layer_norm(y)
        output = (x + y) / 2

        output = self.dropout(output)
        return output

    def relprop(self, cam, **kwargs):
        # Relevance is propagated through the object-feature branch only.
        cam = self.dropout.relprop(cam, **kwargs)
        cam = self.visn_layer_norm.relprop(cam, **kwargs)
        cam = self.visn_fc.relprop(cam, **kwargs)
        return cam


class LxmertEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()

        # Obj-level image embedding layer
        self.visn_fc = LxmertVisualFeatureEncoder(config)
        self.config = config

        # Number of layers
        self.num_l_layers = config.l_layers
        self.num_x_layers = config.x_layers
        self.num_r_layers = config.r_layers

        # Layers
        # Using self.layer instead of self.l_layers to support loading BERT weights.
        self.layer = nn.ModuleList([LxmertLayer(config) for _ in range(self.num_l_layers)])
        self.x_layers = nn.ModuleList([LxmertXLayer(config) for _ in range(self.num_x_layers)])
        self.r_layers = nn.ModuleList([LxmertLayer(config) for _ in range(self.num_r_layers)])

    def forward(
        self,
        lang_feats,
        lang_attention_mask,
        visual_feats,
        visual_pos,
        visual_attention_mask=None,
        output_attentions=None,
    ):

        vision_hidden_states = ()
        language_hidden_states = ()
        vision_attentions = () if output_attentions or self.config.output_attentions else None
        language_attentions = () if output_attentions or self.config.output_attentions else None
        cross_encoder_attentions = () if output_attentions or self.config.output_attentions else None

        visual_feats = self.visn_fc(visual_feats, visual_pos)

        # Run language layers
        for layer_module in self.layer:
            l_outputs = layer_module(lang_feats, lang_attention_mask, output_attentions=output_attentions)
            lang_feats = l_outputs[0]
            language_hidden_states = language_hidden_states + (lang_feats,)
            if language_attentions is not None:
                language_attentions = language_attentions + (l_outputs[1],)

        # Run relational layers
        for layer_module in self.r_layers:
            v_outputs = layer_module(visual_feats, visual_attention_mask, output_attentions=output_attentions)
            visual_feats = v_outputs[0]
            vision_hidden_states = vision_hidden_states + (visual_feats,)
            if vision_attentions is not None:
                vision_attentions = vision_attentions + (v_outputs[1],)

        # Run cross-modality layers
        for layer_module in self.x_layers:
            x_outputs = layer_module(
                lang_feats,
                lang_attention_mask,
                visual_feats,
                visual_attention_mask,
                output_attentions=output_attentions,
            )
            lang_feats, visual_feats = x_outputs[:2]
            vision_hidden_states = vision_hidden_states + (visual_feats,)
            language_hidden_states = language_hidden_states + (lang_feats,)
            if cross_encoder_attentions is not None:
                cross_encoder_attentions = cross_encoder_attentions + (x_outputs[2],)

        visual_encoder_outputs = (
            vision_hidden_states,
            vision_attentions if output_attentions else None,
        )
        lang_encoder_outputs = (
            language_hidden_states,
            language_attentions if output_attentions else None,
        )
        return (
            visual_encoder_outputs,
            lang_encoder_outputs,
            cross_encoder_attentions if output_attentions else None,
        )

    def relprop(self, cam, **kwargs):
        cam_lang, cam_vis = cam
        for layer_module in reversed(self.x_layers):
            cam_lang, cam_vis = layer_module.relprop((cam_lang, cam_vis), **kwargs)

        for layer_module in reversed(self.r_layers):
            cam_vis = layer_module.relprop(cam_vis, **kwargs)

        for layer_module in reversed(self.layer):
            cam_lang = layer_module.relprop(cam_lang, **kwargs)
        return cam_lang, cam_vis


class LxmertPooler(nn.Module):
    def __init__(self, config):
        super(LxmertPooler, self).__init__()
        self.dense = Linear(config.hidden_size, config.hidden_size)
        self.activation = Tanh()

        self.pool = IndexSelect()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding to the first token.
        first_token_tensor = self.pool(hidden_states, 1, torch.tensor(0, device=hidden_states.device))
        first_token_tensor = first_token_tensor.squeeze(1)
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

    def relprop(self, cam, **kwargs):
        cam = self.activation.relprop(cam, **kwargs)
        cam = self.dense.relprop(cam, **kwargs)
        cam = cam.unsqueeze(1)
        cam = self.pool.relprop(cam, **kwargs)
        return cam


class LxmertPredictionHeadTransform(nn.Module):
    def __init__(self, config):
        super(LxmertPredictionHeadTransform, self).__init__()
        self.dense = Linear(config.hidden_size, config.hidden_size)
        self.transform_act_fn = ACT2FN[config.hidden_act]()
        self.LayerNorm = LayerNorm(config.hidden_size, eps=1e-12)

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.transform_act_fn(hidden_states)
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states

    def relprop(self, cam, **kwargs):
        cam = self.LayerNorm.relprop(cam, **kwargs)
        cam = self.transform_act_fn.relprop(cam, **kwargs)
        cam = self.dense.relprop(cam, **kwargs)
        return cam


class LxmertLMPredictionHead(nn.Module):
    def __init__(self, config, lxmert_model_embedding_weights):
        super(LxmertLMPredictionHead, self).__init__()
        self.transform = LxmertPredictionHeadTransform(config)

        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = Linear(
            lxmert_model_embedding_weights.size(1),
            lxmert_model_embedding_weights.size(0),
            bias=False,
        )
        self.decoder.weight = lxmert_model_embedding_weights
        self.bias = nn.Parameter(torch.zeros(lxmert_model_embedding_weights.size(0)))

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states) + self.bias
        return hidden_states

    def relprop(self, cam, **kwargs):
        cam = self.decoder.relprop(cam, **kwargs)
        cam = self.transform.relprop(cam, **kwargs)
        return cam


class LxmertVisualAnswerHead(nn.Module):
    def __init__(self, config, num_labels):
        super().__init__()
        hid_dim = config.hidden_size
        self.logit_fc = nn.Sequential(
            Linear(hid_dim, hid_dim * 2),
            GELU(),
            LayerNorm(hid_dim * 2, eps=1e-12),
            Linear(hid_dim * 2, num_labels),
        )

    def forward(self, hidden_states):
        return self.logit_fc(hidden_states)

    def relprop(self, cam, **kwargs):
        for m in reversed(self.logit_fc._modules.values()):
            cam = m.relprop(cam, **kwargs)
        return cam

class LxmertVisualObjHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = LxmertPredictionHeadTransform(config)
        # Decide which visual losses to use
        visual_losses = {}
        if config.visual_obj_loss:
            visual_losses["obj"] = {"shape": (-1,), "num": config.num_object_labels}
        if config.visual_attr_loss:
            visual_losses["attr"] = {"shape": (-1,), "num": config.num_attr_labels}
        if config.visual_feat_loss:
            visual_losses["feat"] = {"shape": (-1, config.visual_feat_dim), "num": config.visual_feat_dim}
        self.visual_losses = visual_losses

        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder_dict = nn.ModuleDict(
            {key: Linear(config.hidden_size, self.visual_losses[key]["num"]) for key in self.visual_losses}
        )

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        output = {}
        for key in self.visual_losses:
            output[key] = self.decoder_dict[key](hidden_states)
        return output

    def relprop(self, cam, **kwargs):
        return self.transform.relprop(cam, **kwargs)


class LxmertPreTrainingHeads(nn.Module):
    def __init__(self, config, lxmert_model_embedding_weights):
        super(LxmertPreTrainingHeads, self).__init__()
        self.predictions = LxmertLMPredictionHead(config, lxmert_model_embedding_weights)
        self.seq_relationship = Linear(config.hidden_size, 2)

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.predictions(sequence_output)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_score

    def relprop(self, cam, **kwargs):
        cam_seq, cam_pooled = cam
        cam_seq = self.predictions.relprop(cam_seq, **kwargs)
        cam_pooled = self.seq_relationship.relprop(cam_pooled, **kwargs)
        return cam_seq, cam_pooled


class LxmertPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = LxmertConfig
    load_tf_weights = load_tf_weights_in_lxmert
    base_model_prefix = "lxmert"

    def _init_weights(self, module):
        """ Initialize the weights """
        if isinstance(module, (nn.Linear, nn.Embedding)):
            # Slightly different from the TF version, which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()


LXMERT_START_DOCSTRING = r"""

    The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers
    <https://arxiv.org/abs/1908.07490>`__ by Hao Tan and Mohit Bansal. It's a vision and language transformer model,
    pretrained on a variety of multi-modal datasets comprising of GQA, VQAv2.0, MSCOCO captions, and Visual genome,
    using a combination of masked language modeling, region of interest feature regression, cross entropy loss for
    question answering attribute prediction, and object tag prediction.

    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the
    generic methods the library implements for all its models (such as downloading or saving, resizing the input
    embeddings, pruning heads etc.)

    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
    general usage and behavior.

    Parameters:
        config (:class:`~transformers.LxmertConfig`): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the
            model weights.
"""

LXMERT_INPUTS_DOCSTRING = r"""

    Args:
        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
            Indices of input sequence tokens in the vocabulary.

            Indices can be obtained using :class:`~transformers.LxmertTokenizer`. See
            :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
            details.

            `What are input IDs? <../glossary.html#input-ids>`__
        visual_feats: (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_visual_features, visual_feat_dim)`):
            This input represents visual features. They are ROI-pooled object features from bounding boxes using a
            faster-RCNN model. These are currently not provided by the transformers library.
        visual_pos: (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_visual_features, visual_pos_dim)`):
            This input represents spatial features corresponding to their relative (via index) visual features. The
            pre-trained LXMERT model expects these spatial features to be normalized bounding boxes on a scale of 0 to
            1. These are currently not provided by the transformers library.
        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
            Mask to avoid performing attention on padding token indices.
            Mask values selected in ``[0, 1]``:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            `What are attention masks? <../glossary.html#attention-mask>`__
        visual_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            `What are attention masks? <../glossary.html#attention-mask>`__
        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
            Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
            1]``:

            - 0 corresponds to a `sentence A` token,
            - 1 corresponds to a `sentence B` token.

            `What are token type IDs? <../glossary.html#token-type-ids>`__
        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
            representation. This is useful if you want more control over how to convert :obj:`input_ids` indices into
            associated vectors than the model's internal embedding lookup matrix.
        output_attentions (:obj:`bool`, `optional`):
            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
            tensors for more detail.
        output_hidden_states (:obj:`bool`, `optional`):
            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
            more detail.
        return_dict (:obj:`bool`, `optional`):
            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
"""


@add_start_docstrings(
    "The bare Lxmert Model transformer outputting raw hidden-states without any specific head on top.",
    LXMERT_START_DOCSTRING,
)
class LxmertModel(LxmertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.embeddings = LxmertEmbeddings(config)
        self.encoder = LxmertEncoder(config)
        self.pooler = LxmertPooler(config)
        self.init_weights()

    def get_input_embeddings(self):
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, new_embeddings):
        self.embeddings.word_embeddings = new_embeddings

    @add_start_docstrings_to_model_forward(LXMERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint="unc-nlp/lxmert-base-uncased",
        output_type=LxmertModelOutput,
        config_class=_CONFIG_FOR_DOC,
    )
    def forward(
        self,
        input_ids=None,
        visual_feats=None,
        visual_pos=None,
        attention_mask=None,
        visual_attention_mask=None,
        token_type_ids=None,
        inputs_embeds=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        assert visual_feats is not None, "`visual_feats` cannot be `None`"
        assert visual_pos is not None, "`visual_pos` cannot be `None`"

        device = input_ids.device if input_ids is not None else inputs_embeds.device

        if attention_mask is None:
            attention_mask = torch.ones(input_shape, device=device)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        # We create a 3D attention mask from a 2D tensor mask. Since attention_mask is
        # 1.0 for positions we want to attend and 0.0 for masked positions, this
        # operation will create a tensor which is 0.0 for positions we want to attend
        # and -10000.0 for masked positions.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

        # Process the visual attention mask
        if visual_attention_mask is not None:
            extended_visual_attention_mask = visual_attention_mask.unsqueeze(1).unsqueeze(2)
            extended_visual_attention_mask = extended_visual_attention_mask.to(dtype=self.dtype)
            extended_visual_attention_mask = (1.0 - extended_visual_attention_mask) * -10000.0
        else:
            extended_visual_attention_mask = None

        # Positional Word Embeddings
        embedding_output = self.embeddings(input_ids, token_type_ids, inputs_embeds)

        # Run Lxmert encoder
        encoder_outputs = self.encoder(
            embedding_output,
            extended_attention_mask,
            visual_feats=visual_feats,
            visual_pos=visual_pos,
            visual_attention_mask=extended_visual_attention_mask,
            output_attentions=output_attentions,
        )

        visual_encoder_outputs, lang_encoder_outputs = encoder_outputs[:2]
        vision_hidden_states = visual_encoder_outputs[0]
        language_hidden_states = lang_encoder_outputs[0]

        all_attentions = ()
        if output_attentions:
            language_attentions = lang_encoder_outputs[1]
            vision_attentions = visual_encoder_outputs[1]
            cross_encoder_attentions = encoder_outputs[2]
            all_attentions = (
                language_attentions,
                vision_attentions,
                cross_encoder_attentions,
            )

        hidden_states = (language_hidden_states, vision_hidden_states) if output_hidden_states else ()

        visual_output = vision_hidden_states[-1]
        lang_output = language_hidden_states[-1]
        pooled_output = self.pooler(lang_output)

        if not return_dict:
            return (lang_output, visual_output, pooled_output) + hidden_states + all_attentions

        return LxmertModelOutput(
            pooled_output=pooled_output,
            language_output=lang_output,
            vision_output=visual_output,
            language_hidden_states=language_hidden_states if output_hidden_states else None,
            vision_hidden_states=vision_hidden_states if output_hidden_states else None,
            language_attentions=language_attentions if output_attentions else None,
            vision_attentions=vision_attentions if output_attentions else None,
            cross_encoder_attentions=cross_encoder_attentions if output_attentions else None,
        )

    def relprop(self, cam, **kwargs):
        cam_lang, cam_vis = cam
        cam_lang = self.pooler.relprop(cam_lang, **kwargs)
        cam_lang, cam_vis = self.encoder.relprop((cam_lang, cam_vis), **kwargs)
        return cam_lang, cam_vis
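
# Usage sketch (an assumption, not part of the original module): run a forward
# pass with randomly generated visual features. Shapes follow the LxmertConfig
# defaults (visual_feat_dim=2048, visual_pos_dim=4); see the __main__ guard at
# the bottom of the file. Real use would feed Faster R-CNN region features.
def _demo_lxmert_forward():
    config = LxmertConfig()
    model = LxmertModel(config).eval()
    input_ids = torch.randint(0, config.vocab_size, (1, 8))
    visual_feats = torch.randn(1, 10, config.visual_feat_dim)
    visual_pos = torch.rand(1, 10, config.visual_pos_dim)
    # No torch.no_grad() here: the attention modules register gradient hooks,
    # which require tensors that track gradients.
    out = model(
        input_ids=input_ids,
        visual_feats=visual_feats,
        visual_pos=visual_pos,
        return_dict=True,
    )
    print(out.language_output.shape, out.vision_output.shape, out.pooled_output.shape)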

@add_start_docstrings(
    """Lxmert Model with a specified pretraining head on top. """,
    LXMERT_START_DOCSTRING,
)
class LxmertForPreTraining(LxmertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # Configuration
        self.config = config
        self.num_qa_labels = config.num_qa_labels
        self.visual_loss_normalizer = config.visual_loss_normalizer

        # Use of pretraining tasks
        self.task_mask_lm = config.task_mask_lm
        self.task_obj_predict = config.task_obj_predict
        self.task_matched = config.task_matched
        self.task_qa = config.task_qa

        # Lxmert backbone
        self.lxmert = LxmertModel(config)

        # Pre-training heads
        self.cls = LxmertPreTrainingHeads(config, self.lxmert.embeddings.word_embeddings.weight)
        if self.task_obj_predict:
            self.obj_predict_head = LxmertVisualObjHead(config)
        if self.task_qa:
            self.answer_head = LxmertVisualAnswerHead(config, self.num_qa_labels)

        # Weight initialization
        self.init_weights()

        # Loss functions
        self.loss_fcts = {
            "l2": SmoothL1Loss(reduction="none"),
            "visual_ce": CrossEntropyLoss(reduction="none"),
            "ce": CrossEntropyLoss(),
        }

        visual_losses = {}
        if config.visual_obj_loss:
            visual_losses["obj"] = {
                "shape": (-1,),
                "num": config.num_object_labels,
                "loss": "visual_ce",
            }
        if config.visual_attr_loss:
            visual_losses["attr"] = {
                "shape": (-1,),
                "num": config.num_attr_labels,
                "loss": "visual_ce",
            }
        if config.visual_feat_loss:
            visual_losses["feat"] = {
                "shape": (-1, config.visual_feat_dim),
                "num": config.visual_feat_dim,
                "loss": "l2",
            }
        self.visual_losses = visual_losses

    def resize_num_qa_labels(self, num_labels):
        """
        Build a resized question answering linear layer Module from a provided new linear layer. Increasing the size
        will add newly initialized weights. Reducing the size will remove weights from the end.

        Args:
            num_labels (:obj:`int`, `optional`):
                New number of labels in the linear layer weight matrix. Increasing the size will add newly initialized
                weights at the end. Reducing the size will remove weights from the end. If not provided or
                :obj:`None`, just returns a pointer to the qa labels :obj:`torch.nn.Linear` module of the model
                without doing anything.

        Return:
            :obj:`torch.nn.Linear`: Pointer to the resized Linear layer or the old Linear layer
        """
        cur_qa_logit_layer = self.get_qa_logit_layer()
        if num_labels is None or cur_qa_logit_layer is None:
            return
        new_qa_logit_layer = self._resize_qa_labels(num_labels)
        self.config.num_qa_labels = num_labels
        self.num_qa_labels = num_labels
        return new_qa_logit_layer

    def _resize_qa_labels(self, num_labels):
        cur_qa_logit_layer = self.get_qa_logit_layer()
        new_qa_logit_layer = self._get_resized_qa_labels(cur_qa_logit_layer, num_labels)
        self._set_qa_logit_layer(new_qa_logit_layer)
        return self.get_qa_logit_layer()

    def get_qa_logit_layer(self) -> nn.Module:
        """
        Returns the linear layer that produces question answering logits.

        Returns:
            :obj:`nn.Module`: A torch module mapping the question answering prediction hidden states or :obj:`None`
            if lxmert does not have a visual answering head.
        """
        if hasattr(self, "answer_head"):
            return self.answer_head.logit_fc[-1]

    def _set_qa_logit_layer(self, qa_logit_layer):
        self.answer_head.logit_fc[-1] = qa_logit_layer

    def _get_resized_qa_labels(self, cur_qa_logit_layer, num_labels):
        if num_labels is None:
            return cur_qa_logit_layer

        cur_qa_labels, hidden_dim = cur_qa_logit_layer.weight.size()
        if cur_qa_labels == num_labels:
            return cur_qa_logit_layer

        # Build new linear output
        if getattr(cur_qa_logit_layer, "bias", None) is not None:
            new_qa_logit_layer = nn.Linear(hidden_dim, num_labels)
        else:
            new_qa_logit_layer = nn.Linear(hidden_dim, num_labels, bias=False)

        new_qa_logit_layer.to(cur_qa_logit_layer.weight.device)

        # Initialize all new labels
        self._init_weights(new_qa_logit_layer)

        # Copy labels from the previous weights
        num_labels_to_copy = min(cur_qa_labels, num_labels)
        new_qa_logit_layer.weight.data[:num_labels_to_copy, :] = cur_qa_logit_layer.weight.data[:num_labels_to_copy, :]
        if getattr(cur_qa_logit_layer, "bias", None) is not None:
            new_qa_logit_layer.bias.data[:num_labels_to_copy] = cur_qa_logit_layer.bias.data[:num_labels_to_copy]

        return new_qa_logit_layer

    @add_start_docstrings_to_model_forward(LXMERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @replace_return_docstrings(output_type=LxmertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids=None,
        visual_feats=None,
        visual_pos=None,
        attention_mask=None,
        visual_attention_mask=None,
        token_type_ids=None,
        inputs_embeds=None,
        labels=None,
        obj_labels=None,
        matched_label=None,
        ans=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        **kwargs,
    ):
        r"""
        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
            Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
            config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are ignored
            (masked); the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
        obj_labels: (``Dict[Str: Tuple[Torch.FloatTensor, Torch.FloatTensor]]``, `optional`):
            each key is named after each one of the visual losses and each element of the tuple is of the shape
            ``(batch_size, num_features)`` and ``(batch_size, num_features, visual_feature_dim)`` for each the label
            id and the label score respectively
        matched_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
            Labels for computing whether or not the text input matches the image (classification) loss. Input
            should be a sequence pair (see :obj:`input_ids` docstring). Indices should be in ``[0, 1]``:

            - 0 indicates that the sentence does not match the image,
            - 1 indicates that the sentence does match the image.
        ans: (``Torch.Tensor`` of shape ``(batch_size)``, `optional`):
            a one-hot representation of the correct answer

        Returns:
        """
        if "masked_lm_labels" in kwargs:
            warnings.warn(
                "The `masked_lm_labels` argument is deprecated and will be removed in a future version, use `labels` instead.",
                FutureWarning,
            )
            labels = kwargs.pop("masked_lm_labels")

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        device = input_ids.device if input_ids is not None else inputs_embeds.device
        lxmert_output = self.lxmert(
            input_ids=input_ids,
            visual_feats=visual_feats,
            visual_pos=visual_pos,
            token_type_ids=token_type_ids,
            attention_mask=attention_mask,
            visual_attention_mask=visual_attention_mask,
            inputs_embeds=inputs_embeds,
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
            return_dict=return_dict,
        )

        lang_output, visual_output, pooled_output = (
            lxmert_output[0],
            lxmert_output[1],
            lxmert_output[2],
        )
        lang_prediction_scores, cross_relationship_score = self.cls(lang_output, pooled_output)
        if self.task_qa:
            answer_score = self.answer_head(pooled_output)
        else:
            answer_score = pooled_output[0][0]

        total_loss = (
            None
            if (labels is None and matched_label is None and obj_labels is None and ans is None)
            else torch.tensor(0.0, device=device)
        )
        if labels is not None and self.task_mask_lm:
            masked_lm_loss = self.loss_fcts["ce"](
                lang_prediction_scores.view(-1, self.config.vocab_size),
                labels.view(-1),
            )
            total_loss += masked_lm_loss
        if matched_label is not None and self.task_matched:
            matched_loss = self.loss_fcts["ce"](cross_relationship_score.view(-1, 2), matched_label.view(-1))
            total_loss += matched_loss
        if obj_labels is not None and self.task_obj_predict:
            total_visual_loss = torch.tensor(0.0, device=device)
            visual_prediction_scores_dict = self.obj_predict_head(visual_output)
            for key, key_info in self.visual_losses.items():
                label, mask_conf = obj_labels[key]
                output_dim = key_info["num"]
                loss_fct_name = key_info["loss"]
                label_shape = key_info["shape"]
                weight = self.visual_loss_normalizer
                visual_loss_fct = self.loss_fcts[loss_fct_name]
                visual_prediction_scores = visual_prediction_scores_dict[key]
                visual_loss = visual_loss_fct(
                    visual_prediction_scores.view(-1, output_dim),
                    label.view(*label_shape),
                )
                if visual_loss.dim() > 1:  # Regression Losses
                    visual_loss = visual_loss.mean(1)
                visual_loss = (visual_loss * mask_conf.view(-1)).mean() * weight
                total_visual_loss += visual_loss
            total_loss += total_visual_loss
        if ans is not None and self.task_qa:
            answer_loss = self.loss_fcts["ce"](answer_score.view(-1, self.num_qa_labels), ans.view(-1))
            total_loss += answer_loss

        if not return_dict:
            output = (
                lang_prediction_scores,
                cross_relationship_score,
                answer_score,
            ) + lxmert_output[3:]
            return ((total_loss,) + output) if total_loss is not None else output

        return LxmertForPreTrainingOutput(
            loss=total_loss,
            prediction_logits=lang_prediction_scores,
            cross_relationship_score=cross_relationship_score,
            question_answering_score=answer_score,
            language_hidden_states=lxmert_output.language_hidden_states,
            vision_hidden_states=lxmert_output.vision_hidden_states,
            language_attentions=lxmert_output.language_attentions,
            vision_attentions=lxmert_output.vision_attentions,
            cross_encoder_attentions=lxmert_output.cross_encoder_attentions,
        )


@add_start_docstrings(
    """Lxmert Model with a visual-answering head on top for downstream QA tasks""",
    LXMERT_START_DOCSTRING,
)
class LxmertForQuestionAnswering(LxmertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # Configuration
        self.config = config
        self.num_qa_labels = config.num_qa_labels
        self.visual_loss_normalizer = config.visual_loss_normalizer

        # Lxmert backbone
        self.lxmert = LxmertModel(config)

        self.answer_head = LxmertVisualAnswerHead(config, self.num_qa_labels)

        # Weight initialization
        self.init_weights()

        # Loss function
        self.loss = CrossEntropyLoss()

    def resize_num_qa_labels(self, num_labels):
        """
        Build a resized question answering linear layer Module from a provided new linear layer. Increasing the size
        will add newly initialized weights. Reducing the size will remove weights from the end.

        Args:
            num_labels (:obj:`int`, `optional`):
                New number of labels in the linear layer weight matrix. Increasing the size will add newly initialized
                weights at the end. Reducing the size will remove weights from the end. If not provided or
                :obj:`None`, just returns a pointer to the qa labels :obj:`torch.nn.Linear` module of the model
                without doing anything.

        Return:
            :obj:`torch.nn.Linear`: Pointer to the resized Linear layer or the old Linear layer
        """
        cur_qa_logit_layer = self.get_qa_logit_layer()
        if num_labels is None or cur_qa_logit_layer is None:
            return
        new_qa_logit_layer = self._resize_qa_labels(num_labels)
        self.config.num_qa_labels = num_labels
        self.num_qa_labels = num_labels
        return new_qa_logit_layer

    def _resize_qa_labels(self, num_labels):
        cur_qa_logit_layer = self.get_qa_logit_layer()
        new_qa_logit_layer = self._get_resized_qa_labels(cur_qa_logit_layer, num_labels)
        self._set_qa_logit_layer(new_qa_logit_layer)
        return self.get_qa_logit_layer()

    def get_qa_logit_layer(self) -> nn.Module:
        """
        Returns the linear layer that produces question answering logits

        Returns:
            :obj:`nn.Module`: A torch module mapping the question answering prediction hidden states.
            :obj:`None`: A NoneType object if Lxmert does not have the visual answering head.
        """
        if hasattr(self, "answer_head"):
            return self.answer_head.logit_fc[-1]

    def _set_qa_logit_layer(self, qa_logit_layer):
        self.answer_head.logit_fc[-1] = qa_logit_layer

    def _get_resized_qa_labels(self, cur_qa_logit_layer, num_labels):
        if num_labels is None:
            return cur_qa_logit_layer

        cur_qa_labels, hidden_dim = cur_qa_logit_layer.weight.size()
        if cur_qa_labels == num_labels:
            return cur_qa_logit_layer

        # Build new linear output
        if getattr(cur_qa_logit_layer, "bias", None) is not None:
            new_qa_logit_layer = nn.Linear(hidden_dim, num_labels)
        else:
            new_qa_logit_layer = nn.Linear(hidden_dim, num_labels, bias=False)

        new_qa_logit_layer.to(cur_qa_logit_layer.weight.device)

        # Initialize all new labels
        self._init_weights(new_qa_logit_layer)

        # Copy labels from the previous weights
        num_labels_to_copy = min(cur_qa_labels, num_labels)
        new_qa_logit_layer.weight.data[:num_labels_to_copy, :] = cur_qa_logit_layer.weight.data[:num_labels_to_copy, :]
        if getattr(cur_qa_logit_layer, "bias", None) is not None:
            new_qa_logit_layer.bias.data[:num_labels_to_copy] = cur_qa_logit_layer.bias.data[:num_labels_to_copy]

        return new_qa_logit_layer

    @add_start_docstrings_to_model_forward(LXMERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint="unc-nlp/lxmert-base-uncased",
        output_type=LxmertForQuestionAnsweringOutput,
        config_class=_CONFIG_FOR_DOC,
    )
    def forward(
        self,
        input_ids=None,
        visual_feats=None,
        visual_pos=None,
        attention_mask=None,
        visual_attention_mask=None,
        token_type_ids=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        labels: (``Torch.Tensor`` of shape ``(batch_size)``, `optional`):
            A one-hot representation of the correct answer

        Returns:
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        lxmert_output = self.lxmert(
            input_ids=input_ids,
            visual_feats=visual_feats,
            visual_pos=visual_pos,
            token_type_ids=token_type_ids,
            attention_mask=attention_mask,
            visual_attention_mask=visual_attention_mask,
            inputs_embeds=inputs_embeds,
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
            return_dict=return_dict,
        )

        pooled_output = lxmert_output[2]
        answer_score = self.answer_head(pooled_output)
        loss = None
        if labels is not None:
            loss = self.loss(answer_score.view(-1, self.num_qa_labels), labels.view(-1))

        if not return_dict:
            output = (answer_score,) + lxmert_output[3:]
            return (loss,) + output if loss is not None else output

        # Remember the visual-branch shape so relprop can seed it with zeros.
        self.vis_shape = lxmert_output.vision_output.shape

        return LxmertForQuestionAnsweringOutput(
            loss=loss,
            question_answering_score=answer_score,
            language_hidden_states=lxmert_output.language_hidden_states,
            vision_hidden_states=lxmert_output.vision_hidden_states,
            language_attentions=lxmert_output.language_attentions,
            vision_attentions=lxmert_output.vision_attentions,
            cross_encoder_attentions=lxmert_output.cross_encoder_attentions,
        )

    def relprop(self, cam, **kwargs):
        cam_lang = self.answer_head.relprop(cam, **kwargs)
        cam_vis = torch.zeros(self.vis_shape).to(cam_lang.device)
        cam_lang, cam_vis = self.lxmert.relprop((cam_lang, cam_vis), **kwargs)
        return cam_lang, cam_vis
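
if __name__ == "__main__":
    # Minimal smoke test (assumption, not part of the original file): exercise a
    # randomly initialized LxmertModel once. Real use would load the
    # "unc-nlp/lxmert-base-uncased" weights via LxmertModel.from_pretrained and
    # feed Faster R-CNN region features and boxes.
    _demo_lxmert_forward()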