o gZ&@s^dZddlmZmZddlmZddlmZddlm Z m Z m Z m Z m Z mZmZddlZddlZddlmZddlmmZddlmZddlmZmZdd lmZmZm Z dd l!m"Z"m#Z#d e e gee fd e e ge ffd dZ$d e e gee fd e e ge ffddZ%Gddde Z&eGdddZ'Gdddej(eZ)Gddde)eZ*dS)aF base_vision.py Abstract class definition of a Vision Backbone (Visual Featurizer), with full annotations of class methods, utility functions, and initialization logic. We also define the generic TimmViTBackbone class here, providing a default interface for loading any TIMM Vision Transformer model for feature extraction. )ABCabstractmethod) dataclass)partial)AnyCallableDictOptionalProtocolTupleUnionN)Image)BlockVisionTransformer)_module_wrap_policy _or_policytransformer_auto_wrap_policy)ComposeResizefnreturncdtdtdtffdd }|S)Nargskwargsrcs.|i|}t|tst|tr|dS|S)Nr) isinstancetuplelistrrresultrL/mnt/bn/huangmengqi-lf-nas-25bad429/Release/RealCustom/models/base_vision.pywrapper)s zunpack_tuple..wrapperrrr"r rr! unpack_tuple(r%cr)Nrrrcs|i|}|SNr rrr r!r"0szreturn_tuple..wrapperr#r$r rr! return_tuple/r&r(c @s6eZdZdededeejeeejfffddZ dS)ImageTransformimgrrcKdSr'r )selfr*rr r r!__call__9szImageTransform.__call__N) __name__ __module__ __qualname__r strr torchTensorrr-r r r r!r)8s.r)c@s2eZdZUeeeefed<dedefddZdS) LetterboxPadpadding_fill_valueimagercCsX|jt|j\}}}t||dt||d}}||||f}tj|||jddS)zVGiven a PIL.Image, pad to square by adding a symmetric border around the height/width.constant)fill padding_mode)sizemaxintTVFpadr5)r,r6whZmax_whZhorizontal_padZ vertical_padpaddingr r r!r-As" zLetterboxPad.__call__N)r.r/r0r r=__annotations__r r-r r r r!r4=s r4c seZdZddedededdffdd Zdefd d Zede fd d Z ed e j de j fddZ eedeeeeffddZeedefddZeedefddZeede jfddZZS)VisionBackbonevision_backbone_idimage_resize_strategydefault_image_sizerNcs,t||_||_||_d|_d|_dSr')super__init__ identifierrGrH featurizerimage_transform)r,rFrGrH __class__r r!rJKs  zVisionBackbone.__init__cC|jSr')rMr,r r r!get_image_transformUsz"VisionBackbone.get_image_transformcCr+r'r rQr r r!get_fsdp_wrapping_policyXsz'VisionBackbone.get_fsdp_wrapping_policy pixel_valuescCst)ziRun a forward pass through the featurizer given a set of processed images, returning patch/grid features.)NotImplementedErrorr,rTr r r!forward[szVisionBackbone.forwardcCr+r'r rQr r r!default_image_resolution`z'VisionBackbone.default_image_resolutioncCr+r'r rQr r r! embed_dimdrYzVisionBackbone.embed_dimcCr+r'r rQr r r! num_patcheshrYzVisionBackbone.num_patchescCr+r'r rQr r r!half_precision_dtypelrYz#VisionBackbone.half_precision_dtype)rE)r.r/r0r1r=rJr)rRrrrSr2r3rWpropertyr rXrZr[dtyper\ __classcell__r r rNr!rDJs&  rDc seZdZ  ddededededeeddf fd d Zdefd d Zd e e j e ee j ffde j fddZ edeeeeffddZedefddZedefddZede jfddZZS)TimmViTBackbonerENrFtimm_path_or_urlrGrHoverride_act_layerrc s<tj|||d||_||_tj|_|jdur%tj|jdd|j d|_ ntj|jdd|j |jd|_ |j t t |j jt|j jdhd|j _t|j tsUJdtj|j |_d |j |j f|jd <tjjdi|jd d i}d |jvsd|jvrt|tsJdt|jd}tsJtt|j |jdg|jdd}|jdkrt|tsJdt|jd}tsJ|j |j f}tt||jdg|jdd|_dS|jdkr||_dS|jdkrt|tsJdd|jvsJdtdd|jdD} tt| g|j|_dSt d|jd)N)rHTr) pretrained num_classesimg_size)rcrdre act_layerr7)nzFeaturizer is not a TIMM VisionTransformer; if you would like to support a new visual representation, file an issue or implement the requisite logic (see `cobra/models/backbones/vision/base_vision.py`)! input_size is_trainingFsiglipZin1kz%Unexpected `default_image_transform`!) interpolationz resize-naivez resize-cropZ letterboxmeanz1TIMM `data_cfg` missing image normalization mean!cSsg|]}t|dqS))r=).0xr r r! sz,TimmViTBackbone.__init__..zImage Resize Strategy `z` is not supported!r )!rIrJrarbr2bfloat16r^timm create_modelrHrLevalr%rget_intermediate_layerslenblocksrWrrdataresolve_model_data_configdata_cfgcreate_transformr transformsrrlrGrMrr4 ValueError) r,rFrarGrHrbZdefault_image_transformZresize_transform target_sizer9rNr r!rJssh            zTimmViTBackbone.__init__cCs,ttthd}ttthd}tt||gdS)zWReturn a simple FSDP policy that wraps each ViT block and then the _entire_ featurizer.)module_classes)transformer_layer_cls)policies)rrrrrr)r,vit_wrap_policytransformer_block_policyr r r!rSsz(TimmViTBackbone.get_fsdp_wrapping_policyrTcCs ||S)z\Runs transformed image/pixel tensor through vision backbone, returning _all_ patch features.)rLrVr r r!rW zTimmViTBackbone.forwardcCs |jdS)Nri)r|rQr r r!rXrz(TimmViTBackbone.default_image_resolutioncCs|jjSr')rLrZrQr r r!rZszTimmViTBackbone.embed_dimcCs |jjjSr')rL patch_embedr[rQr r r!r[rzTimmViTBackbone.num_patchescCrPr')r^rQr r r!r\sz$TimmViTBackbone.half_precision_dtype)rEN)r.r/r0r1r=r rJrrSr r2r3rrWr]r rXrZr[r^r\r_r r rNr!r`rs4X(r`)+__doc__abcrr dataclassesr functoolsrtypingrrrr r r r rtr2torch.nnnn!torchvision.transforms.functionalr~ functionalr> PIL.Imager timm.models.vision_transformerrrtorch.distributed.fsdp.wraprrrtorchvision.transformsrrr%r(r)r4ModulerDr`r r r r!s(  $  **  (