U aW @sTdZddlmZddlZddlmZddlmmZddl Zddl m Z ddlm Z ddl mZmZddlmZdd lmZdd lmZmZmZmZdd lmZdd lmZmZdJddZeddedddedddddeddedddedddddeddedddeddedded dd! ZGd"d#d#ej Z!Gd$d%d%ej Z"Gd&d'd'ej Z#Gd(d)d)ej Z$d*d+Z%edKee&e&fe'd-d.d/Z(Gd0d1d1ej Z)dLd2d3Z*edMd4d5Z+edNd6d7Z,edOd8d9Z-edPd:d;Z.edQdd?Z0edSd@dAZ1edTdBdCZ2edUdDdEZ3edVdFdGZ4edWdHdIZ5dS)Xa CrossViT Model @inproceedings{ chen2021crossvit, title={{CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification}}, author={Chun-Fu (Richard) Chen and Quanfu Fan and Rameswar Panda}, booktitle={International Conference on Computer Vision (ICCV)}, year={2021} } Paper link: https://arxiv.org/abs/2103.14899 Original code: https://github.com/IBM/CrossViT/blob/main/models/crossvit.py NOTE: model names have been renamed from originals to represent actual input res all *_224 -> *_240 and *_384 -> *_408 Modifications and additions for timm hacked together by / Copyright 2021, Ross Wightman )TupleN)partial)ListIMAGENET_DEFAULT_MEANIMAGENET_DEFAULT_STD)register_notrace_function)build_model_with_cfg)DropPath to_2tuple trunc_normal__assert)register_model)MlpBlockc Ks|ddddttdddd |S)N)rg?T)zpatch_embed.0.projzpatch_embed.1.proj)zhead.0zhead.1) url num_classes input_sizeZ pool_sizecrop_pctmeanstdZfixed_input_size first_conv classifierr)rkwargsrU/home/chou/anaconda3/envs/pytorch/lib/python3.8/site-packages/timm/models/crossvit.py_cfg,sr!zQhttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_15_224.pth)rzXhttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_15_dagger_224.pth)zpatch_embed.0.proj.0zpatch_embed.1.proj.0)rrzXhttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_15_dagger_384.pth)rr"?)rrrrzQhttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_18_224.pthzXhttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_18_dagger_224.pthzXhttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_18_dagger_384.pthzPhttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_9_224.pthzWhttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_9_dagger_224.pthzShttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_base_224.pthzThttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_small_224.pthzShttps://github.com/IBM/CrossViT/releases/download/weights-0.1/crossvit_tiny_224.pth) crossvit_15_240crossvit_15_dagger_240crossvit_15_dagger_408crossvit_18_240crossvit_18_dagger_240crossvit_18_dagger_408crossvit_9_240crossvit_9_dagger_240crossvit_base_240crossvit_small_240crossvit_tiny_240cs*eZdZdZd fdd Zd d ZZS) PatchEmbedz Image to Patch Embedding rFc sPtt|}t|}|d|d|d|d}||_||_||_|r8|ddkrttj||dddddtj dd tj|d|d ddddtj dd tj|d |dddd|_ nr|dd krLttj||dddddtj dd tj|d|d dd ddtj dd tj|d |dd dd|_ ntj||||d |_ dS) Nrr r) kernel_sizestridepaddingT)Zinplacer1)r6r7) super__init__r img_size patch_size num_patchesnn SequentialZConv2dZReLUproj)selfr<r=in_chans embed_dim multi_convr> __class__rr r;\s2       zPatchEmbed.__init__c Cs|j\}}}}t||jdkd|d|d|jdd|jdd t||jdkd|d|d|jdd|jdd ||ddd}|S)NrzInput image size (*z) doesn't match model (rz).r9)shaperr<rAflatten transpose)rBxBCHWrrr forwardxs((zPatchEmbed.forward)r0r1rr2F)__name__ __module__ __qualname____doc__r;rQ __classcell__rrrFr r/Xsr/cs&eZdZd fdd ZddZZS) CrossAttentionFNcst||_||}|p"|d|_tj|||d|_tj|||d|_tj|||d|_t ||_ t|||_ t ||_ dS)Ng)bias) r:r; num_headsscaler?LinearwqwkwvDropout attn_droprA proj_drop)rBdimr[qkv_biasqk_scalerbrcZhead_dimrFrr r;s  zCrossAttention.__init__c Cs|j\}}}||dddddf|d|j||jdddd}|||||j||jdddd}|||||j||jdddd}||dd|j}|j dd}| |}||dd|d|}| |}| |}|S) Nrr.r9rrd) rIr^Zreshaper[Zpermuter_r`rKr\ZsoftmaxrbrArc) rBrLrMNrNqkvattnrrr rQs <**    zCrossAttention.forward)rXFNrYrY)rRrSrTr;rQrVrrrFr rWsrWcs:eZdZddddddejejffdd ZddZZS) CrossAttentionBlock@FNrYc sHt| ||_t||||||d|_|dkr:t|nt|_dS)N)r[rerfrbrcrY) r:r;norm1rWrnr r?Identity drop_path) rBrdr[ mlp_ratiorerfdroprbrs act_layer norm_layerrFrr r;s  zCrossAttentionBlock.__init__cCs0|dddddf||||}|S)Nrr.)rsrnrq)rBrLrrr rQs,zCrossAttentionBlock.forward) rRrSrTr?GELU LayerNormr;rQrVrrrFr ros   rocsJeZdZddddejejffdd Zeej eej dddZ Z S)MultiScaleBlockFrYc srtt|} | |_t|_t| D]f} g}t|| D]2}|t || || || |||| || dq>t|dkr*|jtj |q*t|jdkrd|_t|_ t| D]j} || || d| krdrt g}n,| || | t || || d| g}|j tj |qt|_t| D]} | d| }||}|ddkr|jt||||| |||| d| dnTg}t|dD]0}|t||||| |||| d| dq|jtj |q6t|_t| D]x} || d| || kr$dr$t g}n4| || d| | t || d| || g}|jtj |qdS)N)rdr[rtrerurbrsrwrrFrh)r:r;len num_branchesr? ModuleListblocksrangeappendrr@projsrrr]fusionro revert_projs)rBrdpatchesdepthr[rtrerurbrsrvrwr|dtmpiZd_Znh_rFrr r;s        ,        zMultiScaleBlock.__init__)rLreturnc Cs(g}t|jD]\}}||||qtjttjg}t|jD],\}}||||dddddfqHg}tt |j |j D]\}\}} tj ||||d|j dddddffdd} || } | | dddddf} tj | ||dddddffdd} || q|S)Nrr.ri) enumerater~rtorchjitZannotaterTensorrziprrcatr|) rBrLZouts_brblockZproj_cls_tokenrAZoutsrZ revert_projrZreverted_proj_cls_tokenrrr rQs&6( zMultiScaleBlock.forward) rRrSrTr?rxryr;rrrrQrVrrrFr rzs 6rzcCsddt||DS)NcSs(g|] \}}|d||d|qS)rrr).0rprrr sz(_compute_num_patches..)r)r<rrrr _compute_num_patchessrF)ss crop_scalecCs|jdd\}}||dks*||dkr|r|d|kr|d|krtt||ddtt||dd}}|dddd|||d|||df}ntjjj||ddd}|S) a~ Pulled out of CrossViT.forward_features to bury conditional logic in a leaf node for FX tracing. Args: x (Tensor): input image ss (tuple[int, int]): height and width to scale to crop_scale (bool): whether to crop instead of interpolate to achieve the desired scale. Defaults to False Returns: Tensor: the "scaled" image batch tensor rgNrr@ZbicubicF)sizemodeZ align_corners)rIintroundrr? functionalZ interpolate)rLrrrOrPZcuZclrrr scale_images 22rcseZdZdZdddddddd d d d d d eejd dddffdd ZddZe j j ddZ ddZ dddZddZddZZS) CrossViTzI Vision Transformer with support for patch or hybrid CNN input stage r0)r#r#)rXr1rr))rrrrr)r3)rrrpTrYgư>)ZepsFc sLt_t|_t|}fdd|D_|_tj|}t|_ _ d_ t _tj D]X}td|t tdd|||td|t tdd|qvtj|D]$\}}}jt|||||dqt j| d_td d|D}d dtd| |D}d}t _t|D]b\}}t|dd |d }||||}t|||| | | | |d }||7}j|qRt fd dtj D_t fddtj D_ tj D]6}t!t"d|ddt!t"d|ddq#j$dS)Ncs$g|]tfddjDqS)csg|]}t|qSr)r)rZsjsirr r,sz0CrossViT.__init__...)tupler<)rrBrr r,sz%CrossViT.__init__..r pos_embed_r cls_token_)r<r=rCrDrE)rcSsg|]}t|ddqS)rgN)sumrrLrrr r?scSsg|] }|qSr)itemrrrr r@srh)r[rtrerurbrsrwcsg|]}|qSrrrr)rDrwrr rLscs,g|]$}dkr t|ntqSr)r?r]rrr)rDrrr rMs{Gz?r)%r:r;rr r<img_size_scaledrrr{r|rDZ num_featuresr?r} patch_embedrsetattr Parameterrzerosrrr/rapos_droprZlinspacer~rmaxrznormheadr getattrapply _init_weights)rBr< img_scaler=rCrrDrr[rtreZ drop_rateZattn_drop_rateZdrop_path_raterwrErr>rZim_srrZ total_depthZdprZdpr_ptridxZ block_cfgZ curr_depthZdpr_blkrF)rDrwrrBr r;!s`      .( " zCrossViT.__init__cCsrt|tjrBt|jddt|tjrn|jdk rntj|jdn,t|tjrntj|jdtj|jddS)Nrrrr#) isinstancer?r]r ZweightrZinitZ constant_ry)rBmrrr rWs  zCrossViT._init_weightscCsZt}t|jD]D}|d|t|d|d}|dk r|jr|d|q|S)Nrr)setrr|addrZ requires_grad)rBoutrperrr no_weight_decay`szCrossViT.no_weight_decaycCs|jS)N)rrrrr get_classifierjszCrossViT.get_classifierrcs,_tfddtjD_dS)Ncs.g|]&}dkr"tj|ntqSr)r?r]rDrrrrrBrr rpsz-CrossViT.reset_classifier..)rr?r}rr|r)rBrZ global_poolrrr reset_classifierms  zCrossViT.reset_classifierc s|jd}gt|jD]\}}|}|j|}t|||j}||}|dkrR|jn|j}||dd}t j ||fdd}|dkr|j n|j }||}| |}|qt|jD]\}} | qfddt|jDddDS)Nrrhrricsg|]\}}||qSrr)rrrxsrr rsz-CrossViT.forward_features..cSsg|]}|dddfqS)Nrr)rZxorrr rs)rIrrrrrZ cls_token_0Z cls_token_1expandrrZ pos_embed_0Z pos_embed_1rrr~r) rBrLrMrrZx_rZ cls_tokens pos_embedrrrr forward_featuresss$     zCrossViT.forward_featurescsP||fddt|jD}t|jdtjsLtjtj|dddd}|S)Ncsg|]\}}||qSrr)rrrrrr rsz$CrossViT.forward..rri) rrrrr?rrrrstack)rBrLZ ce_logitsrrr rQs  zCrossViT.forward)r)rRrSrTrUrr?ryr;rrrignorerrrrrQrVrrrFr rs2 6   rcKs:|ddrtddd}tt||ft||d|S)NZ features_onlyz.pretrained_filter_fn)Z default_cfgr)get RuntimeErrorr r default_cfgs)variant pretrainedrrrrr _create_crossvits  rc Ks^tfdddgddgdddgdddgdddggd d gdddgd |}tfd |d |}|S) Nr#g?r3r1`rrr4rrrr=rDrr[rtr.rrdictrrrZ model_argsmodelrrr r.sr.c Ks^tfdddgddgdddgdddgdddggd d gdddgd |}tfd |d |}|S) Nrr3r1rrrr4rrrr-rrrrrr r-sr-c Ks^tfdddgddgdddgdddgdddggddgdddgd |}tfd |d |}|S) Nrr3r1rr2rr4rrr,rrrrrr r,sr,c Ks^tfdddgddgdddgdddgdddggd d gdddgd |}tfd |d |}|S) Nrr3r1rrrr4rr*rrrrrr r*sr*c Ks^tfdddgddgdddgdddgdddggd d gd d dgd |}tfd |d |}|S)Nrr3r1rrrrrrrr$rrrrrr r$sr$c Ks^tfdddgddgdddgdddgdddggd d gd d dgd |}tfd |d |}|S)Nrr3r1r0rrrr5rrr'rrrrrr r'sr'c Ks`tfdddgddgdddgdddgdddggd d gdddgd d |}tfd |d |}|S)Nrr3r1rrrrrr4Trr=rDrr[rtrEr+rrrrrr r+sr+c Ks`tfdddgddgdddgdddgdddggd d gd d dgd d |}tfd |d|}|S)Nrr3r1rrrrrrrTrr%rrrrrr r%sr%c Ks`tfdddgddgdddgdddgdddggd d gd d dgd d |}tfd |d|}|S)Nr#g?r3r1rrrrrrrTrr&rrrrrr r&sr&c Ks`tfdddgddgdddgdddgdddggd d gd d dgd d |}tfd |d|}|S)Nrr3r1r0rrrrr5rTrr(rrrrrr r(sr(c Ks`tfdddgddgdddgdddgdddggd d gd d dgd d |}tfd |d|}|S)Nrr3r1r0rrrrr5rTrr)rrrrrr r)sr))r)F)F)F)F)F)F)F)F)F)F)F)F)F)6rUtypingrrZtorch.nnr?Ztorch.nn.functionalrFZ torch.hub functoolsrrZ timm.datarrZ fx_featuresr Zhelpersr Zlayersr r r rregistryrZvision_transformerrrr!rModuler/rWrorzrrboolrrrr.r-r,r*r$r'r+r%r&r(r)rrrr s        !+"Nu