# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

# References:
#   https://github.com/facebookresearch/dino/blob/main/vision_transformer.py
#   https://github.com/rwightman/pytorch-image-models/tree/master/timm/models/vision_transformer.py

from functools import partial
import math
import logging
from typing import Sequence, Tuple, Union, Callable

import torch
import torch.nn as nn
import torch.utils.checkpoint
from torch.nn.init import trunc_normal_

from .dinov2_layers import Mlp, PatchEmbed, SwiGLUFFNFused, MemEffAttention, NestedTensorBlock as Block


logger = logging.getLogger("dinov2")


def named_apply(fn: Callable, module: nn.Module, name="", depth_first=True, include_root=False) -> nn.Module:
    if not depth_first and include_root:
        fn(module=module, name=name)
    for child_name, child_module in module.named_children():
        child_name = ".".join((name, child_name)) if name else child_name
        named_apply(fn=fn, module=child_module, name=child_name, depth_first=depth_first, include_root=True)
    if depth_first and include_root:
        fn(module=module, name=name)
    return module


class BlockChunk(nn.ModuleList):
    def forward(self, x):
        for b in self:
            x = b(x)
        return x


class DinoVisionTransformer(nn.Module):
    def __init__(
        self,
        img_size=224,
        patch_size=16,
        in_chans=3,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4.0,
        qkv_bias=True,
        ffn_bias=True,
        proj_bias=True,
        drop_path_rate=0.0,
        drop_path_uniform=False,
        init_values=None,  # for layerscale: None or 0 => no layerscale
        embed_layer=PatchEmbed,
        act_layer=nn.GELU,
        block_fn=Block,
        ffn_layer="mlp",
        block_chunks=1,
        num_register_tokens=0,
        interpolate_antialias=False,
        interpolate_offset=0.1,
    ):
        """
        Args:
            img_size (int, tuple): input image size
            patch_size (int, tuple): patch size
            in_chans (int): number of input channels
            embed_dim (int): embedding dimension
            depth (int): depth of transformer
            num_heads (int): number of attention heads
            mlp_ratio (int): ratio of mlp hidden dim to embedding dim
            qkv_bias (bool): enable bias for qkv if True
            proj_bias (bool): enable bias for proj in attn if True
            ffn_bias (bool): enable bias for ffn if True
            drop_path_rate (float): stochastic depth rate
            drop_path_uniform (bool): apply uniform drop rate across blocks
            init_values (float): layer-scale init values
            embed_layer (nn.Module): patch embedding layer
            act_layer (nn.Module): MLP activation layer
            block_fn (nn.Module): transformer block class
            ffn_layer (str): "mlp", "swiglu", "swiglufused" or "identity"
            block_chunks (int): split block sequence into block_chunks units for FSDP wrap
            num_register_tokens (int): number of extra cls tokens (so-called "registers")
            interpolate_antialias (bool): flag to apply anti-aliasing when interpolating positional embeddings
            interpolate_offset (float): work-around offset to apply when interpolating positional embeddings
        """
        super().__init__()
        norm_layer = partial(nn.LayerNorm, eps=1e-6)

        self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models
        self.num_tokens = 1
        self.n_blocks = depth
        self.num_heads = num_heads
        self.patch_size = patch_size
        self.num_register_tokens = num_register_tokens
        self.interpolate_antialias = interpolate_antialias
        self.interpolate_offset = interpolate_offset

        self.patch_embed = embed_layer(img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.num_tokens, embed_dim))
        assert num_register_tokens >= 0
        self.register_tokens = (
            nn.Parameter(torch.zeros(1, num_register_tokens, embed_dim)) if num_register_tokens else None
        )

        if drop_path_uniform is True:
            dpr = [drop_path_rate] * depth
        else:
            dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule

        if ffn_layer == "mlp":
            logger.info("using MLP layer as FFN")
            ffn_layer = Mlp
        elif ffn_layer == "swiglufused" or ffn_layer == "swiglu":
            logger.info("using SwiGLU layer as FFN")
            ffn_layer = SwiGLUFFNFused
        elif ffn_layer == "identity":
            logger.info("using Identity layer as FFN")

            def f(*args, **kwargs):
                return nn.Identity()

            ffn_layer = f
        else:
            raise NotImplementedError

        blocks_list = [
            block_fn(
                dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                proj_bias=proj_bias,
                ffn_bias=ffn_bias,
                drop_path=dpr[i],
                norm_layer=norm_layer,
                act_layer=act_layer,
                ffn_layer=ffn_layer,
                init_values=init_values,
            )
            for i in range(depth)
        ]
        if block_chunks > 0:
            self.chunked_blocks = True
            chunked_blocks = []
            chunksize = depth // block_chunks
            for i in range(0, depth, chunksize):
                # this is to keep the block index consistent if we chunk the block list
                chunked_blocks.append([nn.Identity()] * i + blocks_list[i : i + chunksize])
            self.blocks = nn.ModuleList([BlockChunk(p) for p in chunked_blocks])
        else:
            self.chunked_blocks = False
            self.blocks = nn.ModuleList(blocks_list)

        self.norm = norm_layer(embed_dim)
        self.head = nn.Identity()

        self.mask_token = nn.Parameter(torch.zeros(1, embed_dim))

        self.init_weights()

    def init_weights(self):
        trunc_normal_(self.pos_embed, std=0.02)
        nn.init.normal_(self.cls_token, std=1e-6)
        if self.register_tokens is not None:
            nn.init.normal_(self.register_tokens, std=1e-6)
        named_apply(init_weights_vit_timm, self)

    def interpolate_pos_encoding(self, x, w, h):
        previous_dtype = x.dtype
        npatch = x.shape[1] - 1
        N = self.pos_embed.shape[1] - 1
        if npatch == N and w == h:
            return self.pos_embed
        pos_embed = self.pos_embed.float()
        class_pos_embed = pos_embed[:, 0]
        patch_pos_embed = pos_embed[:, 1:]
        dim = x.shape[-1]
        w0 = w // self.patch_size
        h0 = h // self.patch_size
        # we add a small number to avoid floating point error in the interpolation
        # see discussion at https://github.com/facebookresearch/dino/issues/8
        w0, h0 = w0 + self.interpolate_offset, h0 + self.interpolate_offset

        sqrt_N = math.sqrt(N)
        sx, sy = float(w0) / sqrt_N, float(h0) / sqrt_N
        patch_pos_embed = nn.functional.interpolate(
            patch_pos_embed.reshape(1, int(sqrt_N), int(sqrt_N), dim).permute(0, 3, 1, 2),
            scale_factor=(sx, sy),
            mode="bicubic",
            antialias=self.interpolate_antialias,
        )

        assert int(w0) == patch_pos_embed.shape[-2]
        assert int(h0) == patch_pos_embed.shape[-1]
        patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
        return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1).to(previous_dtype)

    def prepare_tokens_with_masks(self, x, masks=None):
        B, nc, w, h = x.shape
        x = self.patch_embed(x)
        if masks is not None:
            x = torch.where(masks.unsqueeze(-1), self.mask_token.to(x.dtype).unsqueeze(0), x)

        x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
        x = x + self.interpolate_pos_encoding(x, w, h)

        if self.register_tokens is not None:
            x = torch.cat(
                (
                    x[:, :1],
                    self.register_tokens.expand(x.shape[0], -1, -1),
                    x[:, 1:],
                ),
                dim=1,
            )

        return x

    def forward_features_list(self, x_list, masks_list):
        x = [self.prepare_tokens_with_masks(x, masks) for x, masks in zip(x_list, masks_list)]
        for blk in self.blocks:
            x = blk(x)

        all_x = x
        output = []
        for x, masks in zip(all_x, masks_list):
            x_norm = self.norm(x)
            output.append(
                {
                    "x_norm_clstoken": x_norm[:, 0],
                    "x_norm_regtokens": x_norm[:, 1 : self.num_register_tokens + 1],
                    "x_norm_patchtokens": x_norm[:, self.num_register_tokens + 1 :],
                    "x_prenorm": x,
                    "masks": masks,
                }
            )
        return output

    def forward_features(self, x, masks=None):
        if isinstance(x, list):
            return self.forward_features_list(x, masks)

        x = self.prepare_tokens_with_masks(x, masks)

        for blk in self.blocks:
            x = blk(x)

        x_norm = self.norm(x)
        return {
            "x_norm_clstoken": x_norm[:, 0],
            "x_norm_regtokens": x_norm[:, 1 : self.num_register_tokens + 1],
            "x_norm_patchtokens": x_norm[:, self.num_register_tokens + 1 :],
            "x_prenorm": x,
            "masks": masks,
        }

    def _get_intermediate_layers_not_chunked(self, x, n=1):
        x = self.prepare_tokens_with_masks(x)
        # If n is an int, take the n last blocks. If it's a list, take the blocks it indexes.
        output, total_block_len = [], len(self.blocks)
        blocks_to_take = range(total_block_len - n, total_block_len) if isinstance(n, int) else n
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in blocks_to_take:
                output.append(x)
        assert len(output) == len(blocks_to_take), f"only {len(output)} / {len(blocks_to_take)} blocks found"
        return output

    def _get_intermediate_layers_chunked(self, x, n=1):
        x = self.prepare_tokens_with_masks(x)
        output, i, total_block_len = [], 0, len(self.blocks[-1])
        # If n is an int, take the n last blocks. If it's a list, take the blocks it indexes.
        blocks_to_take = range(total_block_len - n, total_block_len) if isinstance(n, int) else n
        for block_chunk in self.blocks:
            for blk in block_chunk[i:]:  # Passing the nn.Identity() padding
                x = blk(x)
                if i in blocks_to_take:
                    output.append(x)
                i += 1
        assert len(output) == len(blocks_to_take), f"only {len(output)} / {len(blocks_to_take)} blocks found"
        return output

    def get_intermediate_layers(
        self,
        x: torch.Tensor,
        n: Union[int, Sequence] = 1,  # Layers or n last layers to take
        reshape: bool = False,
        return_class_token: bool = False,
        norm=True,
    ) -> Tuple[Union[torch.Tensor, Tuple[torch.Tensor]]]:
        if self.chunked_blocks:
            outputs = self._get_intermediate_layers_chunked(x, n)
        else:
            outputs = self._get_intermediate_layers_not_chunked(x, n)
        if norm:
            outputs = [self.norm(out) for out in outputs]
        class_tokens = [out[:, 0] for out in outputs]
        outputs = [out[:, 1 + self.num_register_tokens :] for out in outputs]
        if reshape:
            B, _, w, h = x.shape
            outputs = [
                out.reshape(B, w // self.patch_size, h // self.patch_size, -1).permute(0, 3, 1, 2).contiguous()
                for out in outputs
            ]
        if return_class_token:
            return tuple(zip(outputs, class_tokens))
        return tuple(outputs)

    def forward(self, *args, is_training=False, **kwargs):
        ret = self.forward_features(*args, **kwargs)
        if is_training:
            return ret
        else:
            return self.head(ret["x_norm_clstoken"])


def init_weights_vit_timm(module: nn.Module, name: str = ""):
    """ViT weight initialization, original timm impl (for reproducibility)"""
    if isinstance(module, nn.Linear):
        trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


def vit_small(patch_size=16, num_register_tokens=0, **kwargs):
    model = DinoVisionTransformer(
        patch_size=patch_size,
        embed_dim=384,
        depth=12,
        num_heads=6,
        mlp_ratio=4,
        block_fn=partial(Block, attn_class=MemEffAttention),
        num_register_tokens=num_register_tokens,
        **kwargs,
    )
    return model


def vit_base(patch_size=16, num_register_tokens=0, **kwargs):
    model = DinoVisionTransformer(
        patch_size=patch_size,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4,
        block_fn=partial(Block, attn_class=MemEffAttention),
        num_register_tokens=num_register_tokens,
        **kwargs,
    )
    return model


def vit_large(patch_size=16, num_register_tokens=0, **kwargs):
    model = DinoVisionTransformer(
        patch_size=patch_size,
        embed_dim=1024,
        depth=24,
        num_heads=16,
        mlp_ratio=4,
        block_fn=partial(Block, attn_class=MemEffAttention),
        num_register_tokens=num_register_tokens,
        **kwargs,
    )
    return model


def vit_giant2(patch_size=16, num_register_tokens=0, **kwargs):
    """
    Close to ViT-giant, with embed-dim 1536 and 24 heads => embed-dim per head 64
    """
    model = DinoVisionTransformer(
        patch_size=patch_size,
        embed_dim=1536,
        depth=40,
        num_heads=24,
        mlp_ratio=4,
        block_fn=partial(Block, attn_class=MemEffAttention),
        num_register_tokens=num_register_tokens,
        **kwargs,
    )
    return model


def DINOv2(model_name):
    model_zoo = {
        "vits": vit_small,
        "vitb": vit_base,
        "vitl": vit_large,
        "vitg": vit_giant2,
    }

    return model_zoo[model_name](
        img_size=518,
        patch_size=14,
        init_values=1.0,
        ffn_layer="mlp" if model_name != "vitg" else "swiglufused",
        block_chunks=0,
        num_register_tokens=0,
        interpolate_antialias=False,
        interpolate_offset=0.1,
    )
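

# ---------------------------------------------------------------------------
# Usage sketch (illustrative addition, not part of the upstream DINOv2 code).
# It shows how the DINOv2 factory above is typically driven to get dense patch
# features; the random input and the choice of n=4 intermediate layers are
# assumptions for demonstration only. Run as a module so the relative import
# resolves, e.g. `python -m depth_anything_v2.dinov2`.
if __name__ == "__main__":
    model = DINOv2("vits").eval()
    # Spatial dims must be multiples of the 14-pixel patch size (518 = 37 * 14).
    x = torch.randn(1, 3, 518, 518)
    with torch.no_grad():
        # Features from the last 4 blocks, reshaped to (B, C, H/14, W/14).
        feats = model.get_intermediate_layers(x, n=4, reshape=True)
    for f in feats:
        print(f.shape)  # torch.Size([1, 384, 37, 37]) for "vits"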