from typing import List, Optional, Tuple, Type

import torch
from torch import nn

import pdb  # noqa: F401  (imported in this fork but not used below)
from fvcore.nn import FlopCountAnalysis  # noqa: F401  (imported but not used below)

from sam2.modeling.sam2_utils import LayerNorm2d, MLP


class MaskDecoder(nn.Module):
    def __init__(
        self,
        *,
        transformer_dim: int,
        transformer: nn.Module,
        num_multimask_outputs: int = 3,
        activation: Type[nn.Module] = nn.GELU,
        iou_head_depth: int = 3,
        iou_head_hidden_dim: int = 256,
        use_high_res_features: bool = False,
        iou_prediction_use_sigmoid: bool = False,
        dynamic_multimask_via_stability: bool = False,
        dynamic_multimask_stability_delta: float = 0.05,
        dynamic_multimask_stability_thresh: float = 0.98,
        pred_obj_scores: bool = False,
        pred_obj_scores_mlp: bool = False,
        use_multimask_token_for_obj_ptr: bool = False,
    ) -> None:
        """
        Predicts masks given an image and prompt embeddings, using a
        transformer architecture.

        Arguments:
          transformer_dim (int): the channel dimension of the transformer
          transformer (nn.Module): the transformer used to predict masks
          num_multimask_outputs (int): the number of masks to predict
            when disambiguating masks
          activation (nn.Module): the type of activation to use when
            upscaling masks
          iou_head_depth (int): the depth of the MLP used to predict
            mask quality
          iou_head_hidden_dim (int): the hidden dimension of the MLP
            used to predict mask quality
        """
        super().__init__()
        self.transformer_dim = transformer_dim
        self.transformer = transformer

        self.num_multimask_outputs = num_multimask_outputs

        # Output tokens: one IoU token plus one mask token per predicted mask.
        self.iou_token = nn.Embedding(1, transformer_dim)
        self.num_mask_tokens = num_multimask_outputs + 1
        self.mask_tokens = nn.Embedding(self.num_mask_tokens, transformer_dim)

        self.pred_obj_scores = pred_obj_scores
        if self.pred_obj_scores:
            self.obj_score_token = nn.Embedding(1, transformer_dim)
        self.use_multimask_token_for_obj_ptr = use_multimask_token_for_obj_ptr

        # Upscales the transformer's image features by 4x before mask prediction.
        self.output_upscaling = nn.Sequential(
            nn.ConvTranspose2d(
                transformer_dim, transformer_dim // 4, kernel_size=2, stride=2
            ),
            LayerNorm2d(transformer_dim // 4),
            activation(),
            nn.ConvTranspose2d(
                transformer_dim // 4, transformer_dim // 8, kernel_size=2, stride=2
            ),
            activation(),
        )
        self.use_high_res_features = use_high_res_features
        if use_high_res_features:
            self.conv_s0 = nn.Conv2d(
                transformer_dim, transformer_dim // 8, kernel_size=1, stride=1
            )
            self.conv_s1 = nn.Conv2d(
                transformer_dim, transformer_dim // 4, kernel_size=1, stride=1
            )

        # One hypernetwork MLP per mask token; each produces the weights used to
        # combine the upscaled image embedding into a mask.
        self.output_hypernetworks_mlps = nn.ModuleList(
            [
                MLP(transformer_dim, transformer_dim, transformer_dim // 8, 3)
                for i in range(self.num_mask_tokens)
            ]
        )

        self.iou_prediction_head = MLP(
            transformer_dim,
            iou_head_hidden_dim,
            self.num_mask_tokens,
            iou_head_depth,
            sigmoid_output=iou_prediction_use_sigmoid,
        )
        if self.pred_obj_scores:
            self.pred_obj_score_head = nn.Linear(transformer_dim, 1)
            if pred_obj_scores_mlp:
                self.pred_obj_score_head = MLP(transformer_dim, transformer_dim, 1, 3)

        # When outputting a single mask, optionally fall back to the best
        # multimask output if the single-mask token gives a low stability score.
        self.dynamic_multimask_via_stability = dynamic_multimask_via_stability
        self.dynamic_multimask_stability_delta = dynamic_multimask_stability_delta
        self.dynamic_multimask_stability_thresh = dynamic_multimask_stability_thresh

    def forward(
        self,
        image_embeddings: torch.Tensor,
        image_pe: torch.Tensor,
        sparse_prompt_embeddings: torch.Tensor,
        dense_prompt_embeddings: torch.Tensor,
        multimask_output: bool,
        repeat_image: bool,
        high_res_features: Optional[List[torch.Tensor]] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Predict masks given image and prompt embeddings.

        Arguments:
          image_embeddings (torch.Tensor): the embeddings from the image encoder
          image_pe (torch.Tensor): positional encoding with the shape of image_embeddings
          sparse_prompt_embeddings (torch.Tensor): the embeddings of the points and boxes
          dense_prompt_embeddings (torch.Tensor): the embeddings of the mask inputs
          multimask_output (bool): whether to return multiple masks or a single mask

        Returns:
          torch.Tensor: batched predicted masks
          torch.Tensor: batched predictions of mask quality
          torch.Tensor: batched SAM token for mask output
          torch.Tensor: batched object score logits
        """
        masks, iou_pred, mask_tokens_out, object_score_logits = self.predict_masks(
            image_embeddings=image_embeddings,
            image_pe=image_pe,
            sparse_prompt_embeddings=sparse_prompt_embeddings,
            dense_prompt_embeddings=dense_prompt_embeddings,
            repeat_image=repeat_image,
            high_res_features=high_res_features,
        )

        # Select the correct mask(s) for output.
        if multimask_output:
            masks = masks[:, 1:, :, :]
            iou_pred = iou_pred[:, 1:]
        elif self.dynamic_multimask_via_stability and not self.training:
            masks, iou_pred = self._dynamic_multimask_via_stability(masks, iou_pred)
        else:
            masks = masks[:, 0:1, :, :]
            iou_pred = iou_pred[:, 0:1]

        if multimask_output and self.use_multimask_token_for_obj_ptr:
            sam_tokens_out = mask_tokens_out[:, 1:]  # [b, 3, c]
        else:
            # Otherwise always use the single-mask output token as the SAM token.
            sam_tokens_out = mask_tokens_out[:, 0:1]  # [b, 1, c]

        return masks, iou_pred, sam_tokens_out, object_score_logits

    def predict_masks(
        self,
        image_embeddings: torch.Tensor,
        image_pe: torch.Tensor,
        sparse_prompt_embeddings: torch.Tensor,
        dense_prompt_embeddings: torch.Tensor,
        repeat_image: bool,
        high_res_features: Optional[List[torch.Tensor]] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        """Predicts masks. See 'forward' for more details."""
        # Concatenate output tokens (optionally prefixed by the object-score token).
        s = 0
        if self.pred_obj_scores:
            output_tokens = torch.cat(
                [
                    self.obj_score_token.weight,
                    self.iou_token.weight,
                    self.mask_tokens.weight,
                ],
                dim=0,
            )
            s = 1
        else:
            output_tokens = torch.cat(
                [self.iou_token.weight, self.mask_tokens.weight], dim=0
            )
        output_tokens = output_tokens.unsqueeze(0).expand(
            sparse_prompt_embeddings.size(0), -1, -1
        )
        tokens = torch.cat((output_tokens, sparse_prompt_embeddings), dim=1)

        # Expand per-image data in the batch direction to be per-prompt.
        if repeat_image:
            src = torch.repeat_interleave(image_embeddings, tokens.shape[0], dim=0)
        else:
            assert image_embeddings.shape[0] == tokens.shape[0]
            src = image_embeddings
        src = src + dense_prompt_embeddings
        assert (
            image_pe.size(0) == 1
        ), "image_pe should have size 1 in batch dim (from `get_dense_pe()`)"
        pos_src = torch.repeat_interleave(image_pe, tokens.shape[0], dim=0)
        b, c, h, w = src.shape

        # Run the transformer.
        hs, src = self.transformer(src, pos_src, tokens)
        iou_token_out = hs[:, s, :]
        mask_tokens_out = hs[:, s + 1 : (s + 1 + self.num_mask_tokens), :]

        # Upscale mask embeddings and predict masks using the mask tokens.
        src = src.transpose(1, 2).view(b, c, h, w)
        if not self.use_high_res_features:
            upscaled_embedding = self.output_upscaling(src)
        else:
            # Fuse high-resolution features from the image encoder into the upscaling path.
            dc1, ln1, act1, dc2, act2 = self.output_upscaling
            feat_s0, feat_s1 = high_res_features
            upscaled_embedding = act1(ln1(dc1(src) + feat_s1))
            upscaled_embedding = act2(dc2(upscaled_embedding) + feat_s0)

        hyper_in_list: List[torch.Tensor] = []
        for i in range(self.num_mask_tokens):
            hyper_in_list.append(
                self.output_hypernetworks_mlps[i](mask_tokens_out[:, i, :])
            )
        hyper_in = torch.stack(hyper_in_list, dim=1)
        b, c, h, w = upscaled_embedding.shape
        masks = (hyper_in @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w)

        # Generate mask quality predictions.
        iou_pred = self.iou_prediction_head(iou_token_out)
        if self.pred_obj_scores:
            assert s == 1
            object_score_logits = self.pred_obj_score_head(hs[:, 0, :])
        else:
            # Default object score of +10.0, i.e. assume an object is present (sigmoid(10) ~= 1).
            object_score_logits = 10.0 * iou_pred.new_ones(iou_pred.shape[0], 1)

        return masks, iou_pred, mask_tokens_out, object_score_logits

    def _get_stability_scores(self, mask_logits):
        """
        Compute stability scores of the mask logits based on the IoU between upper and
        lower thresholds.
        """
        mask_logits = mask_logits.flatten(-2)
        stability_delta = self.dynamic_multimask_stability_delta
        area_i = torch.sum(mask_logits > stability_delta, dim=-1).float()
        area_u = torch.sum(mask_logits > -stability_delta, dim=-1).float()
        stability_scores = torch.where(area_u > 0, area_i / area_u, 1.0)
        return stability_scores

    def _dynamic_multimask_via_stability(self, all_mask_logits, all_iou_scores):
        """
        When outputting a single mask, if the stability score from the current single-mask
        output (based on output token 0) falls below a threshold, we instead select from
        multi-mask outputs (based on output tokens 1~3) the mask with the highest predicted
        IoU score. This is intended to ensure a valid mask for both clicking and tracking.
        """
        # The best mask from the multimask output tokens (1~3).
        multimask_logits = all_mask_logits[:, 1:, :, :]
        multimask_iou_scores = all_iou_scores[:, 1:]
        best_scores_inds = torch.argmax(multimask_iou_scores, dim=-1)
        batch_inds = torch.arange(
            multimask_iou_scores.size(0), device=all_iou_scores.device
        )
        best_multimask_logits = multimask_logits[batch_inds, best_scores_inds]
        best_multimask_logits = best_multimask_logits.unsqueeze(1)
        best_multimask_iou_scores = multimask_iou_scores[batch_inds, best_scores_inds]
        best_multimask_iou_scores = best_multimask_iou_scores.unsqueeze(1)

        # The mask from the single-mask output token 0 and its stability score.
        singlemask_logits = all_mask_logits[:, 0:1, :, :]
        singlemask_iou_scores = all_iou_scores[:, 0:1]
        stability_scores = self._get_stability_scores(singlemask_logits)
        is_stable = stability_scores >= self.dynamic_multimask_stability_thresh

        # Dynamically fall back to the best multimask output upon low stability scores.
        mask_logits_out = torch.where(
            is_stable[..., None, None].expand_as(singlemask_logits),
            singlemask_logits,
            best_multimask_logits,
        )
        iou_scores_out = torch.where(
            is_stable.expand_as(singlemask_iou_scores),
            singlemask_iou_scores,
            best_multimask_iou_scores,
        )
        return mask_logits_out, iou_scores_out
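

# ---------------------------------------------------------------------------
# Illustrative usage sketch (not part of the original module).
#
# The block below exercises the decoder end to end, assuming the repo's own
# dependencies (sam2.modeling.sam2_utils, fvcore) are importable. The
# `_DummyTwoWayTransformer` is a hypothetical stand-in introduced purely for
# illustration: it only mimics the interface the decoder expects from the real
# two-way transformer (token queries of shape [B, N_tokens, C] and image keys
# of shape [B, H*W, C]); the actual model wires in a trained TwoWayTransformer
# and image-encoder features. The tensor sizes (256-dim embeddings on a 64x64
# feature grid, two point prompts) are likewise assumptions chosen only so the
# shapes line up.
# ---------------------------------------------------------------------------
if __name__ == "__main__":

    class _DummyTwoWayTransformer(nn.Module):
        """Stand-in matching the transformer interface the decoder expects."""

        def forward(self, src, pos_src, tokens):
            queries = tokens  # [B, N_tokens, C], as if refined by attention
            keys = (src + pos_src).flatten(2).permute(0, 2, 1)  # [B, H*W, C]
            return queries, keys

    dim, h, w = 256, 64, 64
    decoder = MaskDecoder(
        transformer_dim=dim,
        transformer=_DummyTwoWayTransformer(),
        num_multimask_outputs=3,
    )
    masks, iou_pred, sam_tokens, obj_logits = decoder(
        image_embeddings=torch.randn(1, dim, h, w),
        image_pe=torch.randn(1, dim, h, w),
        sparse_prompt_embeddings=torch.randn(1, 2, dim),  # e.g. two point prompts
        dense_prompt_embeddings=torch.randn(1, dim, h, w),
        multimask_output=True,
        repeat_image=False,
    )
    # Masks come out at 4x the feature resolution; with multimask_output=True the
    # single-mask channel is dropped, leaving num_multimask_outputs masks.
    print(masks.shape)       # torch.Size([1, 3, 256, 256])
    print(iou_pred.shape)    # torch.Size([1, 3])
    print(sam_tokens.shape)  # torch.Size([1, 1, 256])
    print(obj_logits.shape)  # torch.Size([1, 1])

    # The stability score compares mask areas thresholded at +/- delta logits;
    # values near 1.0 mean the mask barely changes under thresholding.
    stability = decoder._get_stability_scores(masks)
    print(stability.shape)   # torch.Size([1, 3])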