from typing import List, Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageEncoder(nn.Module):
    def __init__(
        self,
        trunk: nn.Module,
        neck: nn.Module,
        scalp: int = 0,
    ):
        super().__init__()
        self.trunk = trunk
        self.neck = neck
        self.scalp = scalp
        assert (
            self.trunk.channel_list == self.neck.backbone_channel_list
        ), f"Channel dims of trunk and neck do not match. Trunk: {self.trunk.channel_list}, neck: {self.neck.backbone_channel_list}"

    def forward(self, sample: torch.Tensor):
        # Forward through the backbone trunk, then the FPN neck
        features, pos = self.neck(self.trunk(sample))
        if self.scalp > 0:
            # Discard the lowest-resolution features
            features, pos = features[: -self.scalp], pos[: -self.scalp]

        src = features[-1]
        output = {
            "vision_features": src,
            "vision_pos_enc": pos,
            "backbone_fpn": features,
        }
        return output


class FpnNeck(nn.Module):
    """
    A modified variant of Feature Pyramid Network (FPN) neck
    (we remove output conv and also do bicubic interpolation similar to ViT
    pos embed interpolation)
    """

    def __init__(
        self,
        position_encoding: nn.Module,
        d_model: int,
        backbone_channel_list: List[int],
        kernel_size: int = 1,
        stride: int = 1,
        padding: int = 0,
        fpn_interp_model: str = "bilinear",
        fuse_type: str = "sum",
        fpn_top_down_levels: Optional[List[int]] = None,
    ):
        """Initialize the neck
        :param trunk: the backbone
        :param position_encoding: the positional encoding to use
        :param d_model: the dimension of the model
        :param neck_norm: the normalization to use
        """
        super().__init__()
        self.position_encoding = position_encoding
        self.convs = nn.ModuleList()
        self.backbone_channel_list = backbone_channel_list
        self.d_model = d_model
        # One 1x1 (by default) lateral conv per backbone level, projecting each
        # level's channel dim to d_model
        for dim in backbone_channel_list:
            current = nn.Sequential()
            current.add_module(
                "conv",
                nn.Conv2d(
                    in_channels=dim,
                    out_channels=d_model,
                    kernel_size=kernel_size,
                    stride=stride,
                    padding=padding,
                ),
            )
            self.convs.append(current)
        self.fpn_interp_model = fpn_interp_model
        assert fuse_type in ["sum", "avg"]
        self.fuse_type = fuse_type

        # Levels that receive top-down features in their outputs; the default
        # (None) applies top-down propagation to all levels
        if fpn_top_down_levels is None:
            fpn_top_down_levels = range(len(self.convs))
        self.fpn_top_down_levels = list(fpn_top_down_levels)

    def forward(self, xs: List[torch.Tensor]):
        out = [None] * len(self.convs)
        pos = [None] * len(self.convs)
        assert len(xs) == len(self.convs)
        prev_features = None
        # Forward in top-down order (from low to high resolution)
        n = len(self.convs) - 1
        for i in range(n, -1, -1):
            x = xs[i]
            lateral_features = self.convs[n - i](x)
            if i in self.fpn_top_down_levels and prev_features is not None:
                # Upsample the coarser level and fuse it with the lateral features
                top_down_features = F.interpolate(
                    prev_features.to(dtype=torch.float32),
                    scale_factor=2.0,
                    mode=self.fpn_interp_model,
                    align_corners=(
                        None if self.fpn_interp_model == "nearest" else False
                    ),
                    antialias=False,
                )
                prev_features = lateral_features + top_down_features
                if self.fuse_type == "avg":
                    prev_features /= 2
            else:
                prev_features = lateral_features
            x_out = prev_features
            out[i] = x_out
            pos[i] = self.position_encoding(x_out).to(x_out.dtype)

        return out, pos