"""
Donut
Copyright (c) 2022-present NAVER Corp.
MIT License
"""
import math
import os
import re
from typing import Any, List, Optional, Union

import numpy as np
import PIL
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import ImageOps
from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
from timm.models.swin_transformer import SwinTransformer
from torchvision import transforms
from torchvision.transforms.functional import resize, rotate
from transformers import MBartConfig, MBartForCausalLM, XLMRobertaTokenizer
from transformers.file_utils import ModelOutput
from transformers.modeling_utils import PretrainedConfig, PreTrainedModel


class SwinEncoder(nn.Module):
    r"""
    Donut encoder based on SwinTransformer
    Set the initial weights and configuration with a pretrained SwinTransformer and then
    modify the detailed configurations as a Donut Encoder

    Args:
        input_size: Input image size (width, height)
        align_long_axis: Whether to rotate the image if its height is greater than its width
        window_size: Window size (= patch size) of the SwinTransformer
        encoder_layer: Number of layers of the SwinTransformer encoder
        name_or_path: Name of a pretrained model, either registered on huggingface.co or saved locally;
                      otherwise, `swin_base_patch4_window12_384` will be used (via `timm`)
    """

    def __init__(
        self,
        input_size: List[int],
        align_long_axis: bool,
        window_size: int,
        encoder_layer: List[int],
        name_or_path: Union[str, bytes, os.PathLike] = None,
    ):
        super().__init__()
        self.input_size = input_size
        self.align_long_axis = align_long_axis
        self.window_size = window_size
        self.encoder_layer = encoder_layer

        self.to_tensor = transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD),
            ]
        )

        self.model = SwinTransformer(
            img_size=self.input_size,
            depths=self.encoder_layer,
            window_size=self.window_size,
            patch_size=4,
            embed_dim=128,
            num_heads=[4, 8, 16, 32],
            num_classes=0,
        )

        # weight init with a pretrained swin
        if not name_or_path:
            swin_state_dict = timm.create_model("swin_base_patch4_window12_384", pretrained=True).state_dict()
            new_swin_state_dict = self.model.state_dict()
            for x in new_swin_state_dict:
                if x.endswith("relative_position_index") or x.endswith("attn_mask"):
                    pass
                elif (
                    x.endswith("relative_position_bias_table")
                    and self.model.layers[0].blocks[0].attn.window_size[0] != 12
                ):
                    # the pretrained bias table is for window size 12; interpolate it to the new window size
                    pos_bias = swin_state_dict[x].unsqueeze(0)[0]
                    old_len = int(math.sqrt(len(pos_bias)))
                    new_len = int(2 * window_size - 1)
                    pos_bias = pos_bias.reshape(1, old_len, old_len, -1).permute(0, 3, 1, 2)
                    pos_bias = F.interpolate(pos_bias, size=(new_len, new_len), mode="bicubic", align_corners=False)
                    new_swin_state_dict[x] = pos_bias.permute(0, 2, 3, 1).reshape(1, new_len ** 2, -1).squeeze(0)
                else:
                    new_swin_state_dict[x] = swin_state_dict[x]
            self.model.load_state_dict(new_swin_state_dict)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (batch_size, num_channels, height, width)
        """
        x = self.model.patch_embed(x)
        x = self.model.pos_drop(x)
        x = self.model.layers(x)
        return x

    def prepare_input(self, img: PIL.Image.Image, random_padding: bool = False) -> torch.Tensor:
        """
        Convert a PIL Image to a tensor of the specified input_size by applying, in order:
            - resize
            - rotate (if align_long_axis is True and the image's long axis does not match the canvas)
            - pad
        """
        img = img.convert("RGB")
        if self.align_long_axis and (
            (self.input_size[0] > self.input_size[1] and img.width > img.height)
            or (self.input_size[0] < self.input_size[1] and img.width < img.height)
        ):
            img = rotate(img, angle=-90, expand=True)
        img = resize(img, min(self.input_size))
        img.thumbnail((self.input_size[1], self.input_size[0]))
        delta_width = self.input_size[1] - img.width
        delta_height = self.input_size[0] - img.height
        if random_padding:
            pad_width = np.random.randint(low=0, high=delta_width + 1)
            pad_height = np.random.randint(low=0, high=delta_height + 1)
        else:
            pad_width = delta_width // 2
            pad_height = delta_height // 2
        padding = (
            pad_width,
            pad_height,
            delta_width - pad_width,
            delta_height - pad_height,
        )
        return self.to_tensor(ImageOps.expand(img, padding))


class BARTDecoder(nn.Module):
    """
    Donut decoder based on multilingual BART
    Set the initial weights and configuration with a pretrained multilingual BART model
    and modify the detailed configurations as a Donut decoder

    Args:
        decoder_layer: Number of layers of the BARTDecoder
        max_position_embeddings: The maximum sequence length to be trained
        name_or_path: Name of a pretrained model, either registered on huggingface.co or saved locally;
                      otherwise, `hyunwoongko/asian-bart-ecjk` will be used (via `transformers`)
    """

    def __init__(
        self, decoder_layer: int, max_position_embeddings: int, name_or_path: Union[str, bytes, os.PathLike] = None
    ):
        super().__init__()
        self.decoder_layer = decoder_layer
        self.max_position_embeddings = max_position_embeddings

        self.tokenizer = XLMRobertaTokenizer.from_pretrained(
            "hyunwoongko/asian-bart-ecjk" if not name_or_path else name_or_path
        )

        self.model = MBartForCausalLM(
            config=MBartConfig(
                is_decoder=True,
                is_encoder_decoder=False,
                add_cross_attention=True,
                decoder_layers=self.decoder_layer,
                max_position_embeddings=self.max_position_embeddings,
                vocab_size=len(self.tokenizer),
                scale_embedding=True,
                add_final_layer_norm=True,
            )
        )
        self.model.forward = self.forward  # to get cross attentions and utilize the `generate` function

        self.model.config.is_encoder_decoder = True  # to get cross-attention
        self.add_special_tokens(["<sep/>"])  # <sep/> is used for representing a list in a JSON
        self.model.model.decoder.embed_tokens.padding_idx = self.tokenizer.pad_token_id
        self.model.prepare_inputs_for_generation = self.prepare_inputs_for_inference

        # weight init with asian-bart
        if not name_or_path:
            bart_state_dict = MBartForCausalLM.from_pretrained("hyunwoongko/asian-bart-ecjk").state_dict()
            new_bart_state_dict = self.model.state_dict()
            for x in new_bart_state_dict:
                if x.endswith("embed_positions.weight") and self.max_position_embeddings != 1024:
                    new_bart_state_dict[x] = torch.nn.Parameter(
                        self.resize_bart_abs_pos_emb(
                            bart_state_dict[x],
                            self.max_position_embeddings + 2,  # MBart learned positional embeddings are offset by 2
                        )
                    )
                elif x.endswith("embed_tokens.weight") or x.endswith("lm_head.weight"):
                    new_bart_state_dict[x] = bart_state_dict[x][: len(self.tokenizer), :]
                else:
                    new_bart_state_dict[x] = bart_state_dict[x]
            self.model.load_state_dict(new_bart_state_dict)

    def add_special_tokens(self, list_of_tokens: List[str]):
        """
        Add special tokens to the tokenizer and resize the token embeddings
        """
        newly_added_num = self.tokenizer.add_special_tokens({"additional_special_tokens": sorted(set(list_of_tokens))})
        if newly_added_num > 0:
            self.model.resize_token_embeddings(len(self.tokenizer))

    def prepare_inputs_for_inference(
        self,
        input_ids: torch.Tensor,
        encoder_outputs: torch.Tensor,
        past=None,
        use_cache: bool = None,
        attention_mask: torch.Tensor = None,
    ):
        """
        Args:
            input_ids: (batch_size, sequence_length)

        Returns:
            input_ids: (batch_size, sequence_length)
            attention_mask: (batch_size, sequence_length)
            encoder_hidden_states: (batch_size, sequence_length, embedding_dim)
        """
        attention_mask = input_ids.ne(self.tokenizer.pad_token_id).long()
        if past is not None:
            input_ids = input_ids[:, -1:]
        output = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "past_key_values": past,
            "use_cache": use_cache,
            "encoder_hidden_states": encoder_outputs.last_hidden_state,
        }
        return output

    def forward(
        self,
        input_ids,
        attention_mask: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        past_key_values: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        use_cache: bool = None,
        output_attentions: Optional[torch.Tensor] = None,
        output_hidden_states: Optional[torch.Tensor] = None,
        return_dict: bool = None,
    ):
        """
        A forward function to get cross attentions and utilize the `generate` function

        Source:
        https://github.com/huggingface/transformers/blob/v4.11.3/src/transformers/models/mbart/modeling_mbart.py#L1669-L1810

        Args:
            input_ids: (batch_size, sequence_length)
            attention_mask: (batch_size, sequence_length)
            encoder_hidden_states: (batch_size, sequence_length, hidden_size)

        Returns:
            loss: (1, )
            logits: (batch_size, sequence_length, hidden_dim)
            hidden_states: (batch_size, sequence_length, hidden_size)
            decoder_attentions: (batch_size, num_heads, sequence_length, sequence_length)
            cross_attentions: (batch_size, num_heads, sequence_length, sequence_length)
        """
        output_attentions = (
            output_attentions if output_attentions is not None else self.model.config.output_attentions
        )
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.model.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.model.config.use_return_dict
        outputs = self.model.model.decoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            past_key_values=past_key_values,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        logits = self.model.lm_head(outputs[0])

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
            loss = loss_fct(logits.view(-1, self.model.config.vocab_size), labels.view(-1))

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return ModelOutput(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            decoder_attentions=outputs.attentions,
            cross_attentions=outputs.cross_attentions,
        )
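A minimal sketch, not part of the original file, isolating the centered-padding arithmetic that `SwinEncoder.prepare_input` performs before calling `ImageOps.expand`. Note the axis conventions: `input_size` is (height, width) as in Donut's config, while the image size is given as (width, height); the helper name `centered_padding` is ours.

```python
def centered_padding(input_size, img_size):
    """Return a PIL-style (left, top, right, bottom) border that centers an
    img_size=(width, height) image on an input_size=(height, width) canvas.
    Mirrors the non-random branch of SwinEncoder.prepare_input."""
    delta_width = input_size[1] - img_size[0]
    delta_height = input_size[0] - img_size[1]
    # integer halves go on the left/top; any odd remainder lands on right/bottom
    pad_width = delta_width // 2
    pad_height = delta_height // 2
    return (
        pad_width,
        pad_height,
        delta_width - pad_width,
        delta_height - pad_height,
    )
```

With `random_padding=True` the training path instead draws `pad_width` and `pad_height` uniformly from `[0, delta]`, which augments the data with shifted placements while keeping the canvas size fixed.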