# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
import os
from time import time
from typing import List, Optional, Tuple, Union

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.cuda.amp as amp
import torch.nn.functional as F
from xfuser.core.distributed import (get_sequence_parallel_rank,
                                     get_sequence_parallel_world_size,
                                     get_sp_group)
from xfuser.core.long_ctx_attention import xFuserLongContextAttention

from ..modules.model import sinusoidal_embedding_1d


def pad_freqs(original_tensor, target_len):
    """Pads a frequency table along dim 0 with ones up to `target_len` rows."""
    seq_len, s1, s2 = original_tensor.shape
    pad_size = target_len - seq_len
    padding_tensor = torch.ones(
        pad_size,
        s1,
        s2,
        dtype=original_tensor.dtype,
        device=original_tensor.device)
    padded_tensor = torch.cat([original_tensor, padding_tensor], dim=0)
    return padded_tensor


@amp.autocast(enabled=False)
def rope_apply(x, grid_sizes, freqs):
    """
    x:          [B, L, N, C].
    grid_sizes: [B, 3].
    freqs:      [M, C // 2].
    """
    s, n, c = x.size(1), x.size(2), x.size(3) // 2

    # split freqs into the temporal, height and width sub-tables
    freqs = freqs.split([c - 2 * (c // 3), c // 3, c // 3], dim=1)

    # loop over samples
    output = []
    for i, (f, h, w) in enumerate(grid_sizes.tolist()):
        seq_len = f * h * w

        # precompute multipliers
        x_i = torch.view_as_complex(x[i, :s].to(torch.float64).reshape(
            s, n, -1, 2))
        freqs_i = torch.cat([
            freqs[0][:f].view(f, 1, 1, -1).expand(f, h, w, -1),
            freqs[1][:h].view(1, h, 1, -1).expand(f, h, w, -1),
            freqs[2][:w].view(1, 1, w, -1).expand(f, h, w, -1)
        ],
                            dim=-1).reshape(seq_len, 1, -1)

        # pad the table to the full (gathered) sequence length, then take the
        # slice that belongs to this sequence-parallel rank
        sp_size = get_sequence_parallel_world_size()
        sp_rank = get_sequence_parallel_rank()
        freqs_i = pad_freqs(freqs_i, s * sp_size)
        s_per_rank = s
        freqs_i_rank = freqs_i[(sp_rank * s_per_rank):((sp_rank + 1) *
                                                       s_per_rank), :, :]

        # apply rotary embedding
        x_i = torch.view_as_real(x_i * freqs_i_rank).flatten(2)
        x_i = torch.cat([x_i, x[i, s:]])

        # append to collection
        output.append(x_i)
    return torch.stack(output).float()
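

# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the original module): `pad_freqs` pads the
# rotary table with ones so that every sequence-parallel rank can slice an
# equally sized chunk; a multiplier of one is an identity rotation for the
# padded positions. `_demo_pad_freqs` is a hypothetical name added purely to
# demonstrate that behaviour; nothing in the model code calls it.
# ---------------------------------------------------------------------------
def _demo_pad_freqs():
    freqs = torch.full((6, 1, 4), 0.5)  # dummy table for 6 tokens
    padded = pad_freqs(freqs, 8)        # pad to 8 rows (e.g. 2 ranks x 4)
    assert padded.shape == (8, 1, 4)
    # the appended rows are all ones, i.e. identity rotations
    assert torch.equal(padded[6:], torch.ones(2, 1, 4))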


def visualize_attention_for_word(attention_map: torch.Tensor,
                                 word_indices: list[int] | slice,
                                 grid_sizes: torch.Tensor,
                                 save_dir: str,
                                 batch_index: int = 0,
                                 head_index: int | None = None,
                                 aggregation_method: str = 'mean',
                                 colormap: str = 'viridis',
                                 file_prefix: str = 'attention_viz'):
    """
    Visualizes cross-attention weights for specific context tokens across
    video frames. Can visualize a single head or the average across all heads.

    Args:
        attention_map (torch.Tensor): The attention weights tensor with shape
            [Batch_size, Head_num, x_tokens, context_tokens].
        word_indices (list[int] | slice): A list of indices or a slice object
            representing the positions of the target word(s) in the
            context_tokens dimension.
        grid_sizes (torch.Tensor): Tensor of shape [Batch_size, 3] containing
            the original grid dimensions (F, H_patch, W_patch) for each item
            in the batch before flattening x_tokens.
        save_dir (str): The directory path where the visualization images
            will be saved.
        batch_index (int, optional): The index of the batch item to
            visualize. Defaults to 0.
        head_index (int | None, optional): The index of the attention head to
            visualize. If None, the average attention across all heads is
            visualized. Defaults to None (average).
        aggregation_method (str, optional): How to aggregate attention scores
            if multiple word_indices are provided ('mean', 'sum', 'max').
            Defaults to 'mean'.
        colormap (str, optional): Matplotlib colormap name. Defaults to
            'viridis'.
        file_prefix (str, optional): Prefix for the saved image filenames.
            Defaults to "attention_viz".

    Returns:
        None. Saves image files to the specified directory.
    """
    # --- validation ---
    if not isinstance(attention_map, torch.Tensor):
        raise TypeError("attention_map must be a torch.Tensor")
    if not isinstance(grid_sizes, torch.Tensor):
        raise TypeError("grid_sizes must be a torch.Tensor")
    if attention_map.dim() != 4:
        raise ValueError(
            f"attention_map must have 4 dimensions [B, H, Lx, Lctx], "
            f"got {attention_map.dim()}")
    if grid_sizes.dim() != 2 or grid_sizes.shape[1] != 3:
        raise ValueError(
            f"grid_sizes must have shape [B, 3], got {grid_sizes.shape}")

    B, H, Lx, Lctx = attention_map.shape
    if not (0 <= batch_index < B):
        raise IndexError(
            f"batch_index {batch_index} out of range for batch size {B}")
    if head_index is not None and not (0 <= head_index < H):
        raise IndexError(
            f"head_index {head_index} out of range for head count {H}")

    os.makedirs(save_dir, exist_ok=True)

    # --- select a single head or average across heads -> [Lx, Lctx] ---
    if head_index is None:
        attn_map_processed = attention_map[batch_index].mean(dim=0)
        head_info_str = "Average of All Heads"
        head_info_file = "avg"
    else:
        attn_map_processed = attention_map[batch_index, head_index]
        head_info_str = f"Head {head_index}"
        head_info_file = f"h{head_index}"

    # --- gather the attention paid to the target word(s) -> [Lx, n_words] ---
    word_attn_scores = attn_map_processed[:, word_indices]
    num_words = word_attn_scores.shape[-1]
    if num_words == 0:
        print(f"Warning: word_indices {word_indices} selected no context "
              f"tokens; nothing to visualize.")
        return

    if isinstance(word_indices, slice):
        start = word_indices.start if word_indices.start is not None else 'start'
        stop = word_indices.stop if word_indices.stop is not None else 'end'
        word_indices_str = f"{start}-{stop}"
    elif word_indices:
        word_indices_str = "-".join(map(str, word_indices))
    else:
        word_indices_str = "none"

    # --- aggregate across the selected words -> [Lx] ---
    if num_words > 1:
        if aggregation_method == 'mean':
            aggregated_scores = word_attn_scores.mean(dim=-1)
        elif aggregation_method == 'sum':
            aggregated_scores = word_attn_scores.sum(dim=-1)
        elif aggregation_method == 'max':
            aggregated_scores = word_attn_scores.max(dim=-1).values
        else:
            raise ValueError(
                f"Unknown aggregation_method: {aggregation_method}")
    else:
        aggregated_scores = word_attn_scores.squeeze(-1)

    # --- drop padding and reshape to the patch grid [F, H_patch, W_patch] ---
    f, h, w = map(int, grid_sizes[batch_index].tolist())
    actual_num_tokens = f * h * w
    if actual_num_tokens < Lx:
        print(f"Info: Calculated actual tokens ({actual_num_tokens}) < "
              f"attention map token dimension ({Lx}). Assuming padding in "
              f"attention map.")
        scores_unpadded = aggregated_scores[:actual_num_tokens]
    else:
        scores_unpadded = aggregated_scores

    try:
        attention_video_grid = scores_unpadded.reshape(f, h, w)
    except RuntimeError as e:
        print(f"Error reshaping scores: Need {actual_num_tokens} elements, "
              f"but got {scores_unpadded.numel()}")
        print(f"Target shape: ({f}, {h}, {w})")
        raise e

    grid_np = attention_video_grid.float().cpu().detach().numpy()
    # one global color scale so that frames are comparable
    vmin, vmax = np.min(grid_np), np.max(grid_np)

    # --- save one heatmap per frame ---
    for frame_idx in range(f):
        filename = (f"{file_prefix}_b{batch_index}_{head_info_file}"
                    f"_w{word_indices_str}_frame{frame_idx:03d}.png")
        filepath = os.path.join(save_dir, filename)
        try:
            plt.figure()
            plt.imshow(
                grid_np[frame_idx],
                cmap=colormap,
                vmin=vmin,
                vmax=vmax,
                interpolation='nearest')
            plt.colorbar(label=f"Aggregated Attention ({aggregation_method})")
            plt.title(f"{head_info_str} - Frame {frame_idx}/{f}\n"
                      f"Batch {batch_index}, Word Idx {word_indices_str}")
            plt.xlabel("Width Patch Index")
            plt.ylabel("Height Patch Index")
            plt.xticks(np.arange(w))
            plt.yticks(np.arange(h))
            plt.grid(
                which='both', color='grey', linewidth=0.5, linestyle='--')
            plt.savefig(filepath, bbox_inches='tight')
        except Exception as e:
            print(f"Error saving figure {filepath}: {e}")
        finally:
            plt.close()
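

# ---------------------------------------------------------------------------
# Usage sketch (assumption, not part of the original module): one plausible
# way to call `visualize_attention_for_word` on a dummy attention map. The
# tensor sizes, the `/tmp/attn_viz` directory and the `_demo_` name are
# illustrative choices only.
# ---------------------------------------------------------------------------
def _demo_visualize_attention_for_word():
    # 1 batch item, 3 heads, a 2x4x4 patch grid (32 video tokens) attending
    # over 8 context tokens; plot the average-head attention on tokens 1-2.
    attention_map = torch.rand(1, 3, 32, 8)
    grid_sizes = torch.tensor([[2, 4, 4]])
    visualize_attention_for_word(
        attention_map,
        word_indices=[1, 2],
        grid_sizes=grid_sizes,
        save_dir="/tmp/attn_viz",  # assumed scratch directory
        head_index=None,           # average across all heads
        aggregation_method='mean')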


@torch.no_grad()
def generate_attention_mask(
        attention_map: torch.Tensor,
        grid_sizes: torch.Tensor,
        target_x_shape: Tuple[int, int, int, int],
        threshold: float,
        target_word_indices: Union[List[int], slice],
        batch_index: int = 0,
        head_index: Optional[int] = None,
        word_aggregation_method: str = 'mean',
        upsample_mode_spatial: str = 'nearest',
        upsample_mode_temporal: str = 'nearest',
        output_dtype: torch.dtype = torch.bool) -> torch.Tensor:
    """
    Generates a binary mask from an attention map based on attention towards
    target words.

    The mask identifies regions in the video (x) that attend strongly to the
    specified context words, exceeding a given threshold. The mask has the
    same dimensions as x.

    Args:
        attention_map (torch.Tensor): Attention weights [B, Head_num, Lx,
            Lctx]. Lx = flattened video tokens (patches), Lctx = context
            tokens (words).
        target_word_indices (Union[List[int], slice]): Indices or slice for
            the target word(s) in the Lctx dimension.
        grid_sizes (torch.Tensor): Patch grid dimensions [B, 3] ->
            (F, H_patch, W_patch) for each batch item, corresponding to Lx.
            F, H_patch, W_patch should be integers.
        target_x_shape (Tuple[int, int, int, int]): The desired output shape
            [C, T, H, W], matching the original video tensor x.
        threshold (float): Value between 0 and 1. Attention scores >=
            threshold become 1 (True), otherwise 0 (False).
        batch_index (int, optional): Batch item to process. Defaults to 0.
        head_index (Optional[int], optional): Specific head to use. If None,
            average attention across all heads. Defaults to None.
        word_aggregation_method (str, optional): How to aggregate scores if
            multiple target_word_indices are given ('mean', 'sum', 'max').
            Defaults to 'mean'.
        upsample_mode_spatial (str, optional): PyTorch interpolate mode for
            H, W dimensions. Defaults to 'nearest'.
        upsample_mode_temporal (str, optional): PyTorch interpolate mode for
            T dimension. Defaults to 'nearest'.
        output_dtype (torch.dtype, optional): Data type of the output mask.
            Defaults to torch.bool.

    Returns:
        torch.Tensor: A binary mask tensor of shape target_x_shape
        [C, T, H, W].

    Raises:
        TypeError: If inputs are not torch.Tensors.
        ValueError: If tensor dimensions or indices are invalid, or if
            aggregation/upsample modes are unknown.
        IndexError: If batch_index or head_index are out of bounds.
    """
    # --- validation ---
    if not isinstance(attention_map, torch.Tensor):
        raise TypeError("attention_map must be a torch.Tensor")
    if not isinstance(grid_sizes, torch.Tensor):
        raise TypeError("grid_sizes must be a torch.Tensor")
    if attention_map.dim() != 4:
        raise ValueError(f"attention_map must be [B, H, Lx, Lctx], "
                         f"got {attention_map.dim()} dims")
    if grid_sizes.dim() != 2 or grid_sizes.shape[1] != 3:
        raise ValueError(f"grid_sizes must be [B, 3], got {grid_sizes.shape}")
    if len(target_x_shape) != 4:
        raise ValueError(f"target_x_shape must be [C, T, H, W], "
                         f"got length {len(target_x_shape)}")

    B, H, Lx, Lctx = attention_map.shape
    C_out, T_out, H_out, W_out = target_x_shape

    if not (0 <= batch_index < B):
        raise IndexError(
            f"batch_index {batch_index} out of range for batch size {B}")
    if head_index is not None and not (0 <= head_index < H):
        raise IndexError(
            f"head_index {head_index} out of range for head count {H}")
    if word_aggregation_method not in ('mean', 'sum', 'max'):
        raise ValueError(
            f"Unknown word_aggregation_method: {word_aggregation_method}")
    if upsample_mode_spatial not in ('nearest', 'bilinear'):
        raise ValueError(
            f"Unknown upsample_mode_spatial: {upsample_mode_spatial}")
    if upsample_mode_temporal not in ('nearest', 'linear'):
        raise ValueError(
            f"Unknown upsample_mode_temporal: {upsample_mode_temporal}")

    # --- head selection -> [Lx, Lctx] ---
    if head_index is not None:
        scores = attention_map[batch_index, head_index]
    else:
        scores = attention_map[batch_index].mean(dim=0)

    # --- word selection -> [Lx, n_words] ---
    if isinstance(target_word_indices, slice):
        indices = range(*target_word_indices.indices(Lctx))
        num_words = len(indices)
        word_scores = scores[:, target_word_indices]
    elif isinstance(target_word_indices, list):
        valid_indices = [idx for idx in target_word_indices if 0 <= idx < Lctx]
        if not valid_indices:
            num_words = 0
            word_scores = torch.empty(
                (Lx, 0), dtype=scores.dtype, device=scores.device)
        else:
            word_scores = scores[:, valid_indices]
            num_words = len(valid_indices)
    else:
        raise TypeError(f"target_word_indices must be list or slice, "
                        f"got {type(target_word_indices)}")

    # --- aggregation across words -> [Lx] ---
    if num_words > 1:
        if word_aggregation_method == 'mean':
            aggregated_scores = word_scores.mean(dim=-1)
        elif word_aggregation_method == 'sum':
            aggregated_scores = word_scores.sum(dim=-1)
        else:  # 'max'
            aggregated_scores = word_scores.max(dim=-1).values
    elif num_words == 1:
        aggregated_scores = word_scores.squeeze(-1)
    else:
        # no valid word selected: return an all-False mask
        return torch.zeros(
            target_x_shape, dtype=output_dtype, device=attention_map.device)

    # --- drop padding, reshape to the patch grid [F, H_patch, W_patch] ---
    f_patch, h_patch, w_patch = map(int, grid_sizes[batch_index].tolist())
    actual_num_tokens = f_patch * h_patch * w_patch
    if actual_num_tokens == 0:
        return torch.zeros(
            target_x_shape, dtype=output_dtype, device=attention_map.device)
    if actual_num_tokens > Lx:
        padding_size = actual_num_tokens - aggregated_scores.numel()
        scores_unpadded = F.pad(
            aggregated_scores, (0, padding_size), 'constant', 0)
    elif actual_num_tokens < Lx:
        scores_unpadded = aggregated_scores[:actual_num_tokens]
    else:
        scores_unpadded = aggregated_scores

    try:
        attention_patch_grid = scores_unpadded.reshape(
            f_patch, h_patch, w_patch)
    except RuntimeError as e:
        # the reshape fails when Lx is inconsistent with grid_sizes
        raise e

    # --- upsample the patch grid to the full video resolution ---
    # both requested modes are realised through a single 3-D interpolation
    grid_for_upsample = attention_patch_grid.unsqueeze(0).unsqueeze(0).float()
    target_size_3d = (T_out, H_out, W_out)
    if upsample_mode_spatial == 'bilinear' or upsample_mode_temporal == 'linear':
        mode_3d = 'trilinear'
        align_corners_3d = False
    else:
        mode_3d = 'nearest'
        align_corners_3d = None
    upsampled_scores_grid = F.interpolate(
        grid_for_upsample,
        size=target_size_3d,
        mode=mode_3d,
        align_corners=align_corners_3d)

    # --- threshold and broadcast over channels ---
    upsampled_scores = upsampled_scores_grid.squeeze(0).squeeze(0)  # [T, H, W]
    binary_mask_thw = upsampled_scores >= threshold
    final_mask = binary_mask_thw.unsqueeze(0).expand(
        C_out, T_out, H_out, W_out)
    return final_mask.to(dtype=output_dtype)


def usp_dit_forward(
    self,
    x,
    t,
    context,
    seq_len,
    clip_fea=None,
    y=None,
    words_indices=None,
    block_id=-1,
):
    """
    x:       A list of videos each with shape [C, T, H, W].
    t:       [B].
    context: A list of text embeddings each with shape [L, C].
    """
    if self.model_type == 'i2v':
        assert clip_fea is not None and y is not None
    # params
    device = self.patch_embedding.weight.device
    if self.freqs.device != device:
        self.freqs = self.freqs.to(device)

    if y is not None:
        x = [torch.cat([u, v], dim=0) for u, v in zip(x, y)]

    # embeddings
    x = [self.patch_embedding(u.unsqueeze(0)) for u in x]
    grid_sizes = torch.stack(
        [torch.tensor(u.shape[2:], dtype=torch.long) for u in x])
    x = [u.flatten(2).transpose(1, 2) for u in x]
    seq_lens = torch.tensor([u.size(1) for u in x], dtype=torch.long)
    assert seq_lens.max() <= seq_len
    x = torch.cat([
        torch.cat([u, u.new_zeros(1, seq_len - u.size(1), u.size(2))],
                  dim=1) for u in x
    ])

    # time embeddings
    with amp.autocast(dtype=torch.float32):
        e = self.time_embedding(
            sinusoidal_embedding_1d(self.freq_dim, t).float())
        e0 = self.time_projection(e).unflatten(1, (6, self.dim))
        assert e.dtype == torch.float32 and e0.dtype == torch.float32

    # context
    context_lens = None
    context = self.text_embedding(
        torch.stack([
            torch.cat([u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
            for u in context
        ]))

    if clip_fea is not None:
        context_clip = self.img_emb(clip_fea)  # bs x 257 x dim
        context = torch.concat([context_clip, context], dim=1)

    # arguments
    kwargs = dict(
        e=e0,
        seq_lens=seq_lens,
        grid_sizes=grid_sizes,
        freqs=self.freqs,
        context=context,
        context_lens=context_lens,
        collect_attn_map=False)

    # Context Parallel: each rank works on its chunk of the token dimension
    x = torch.chunk(
        x, get_sequence_parallel_world_size(),
        dim=1)[get_sequence_parallel_rank()]

    attn_map = None
    binary_mask = None
    for i, block in enumerate(self.blocks):
        # only the selected block returns its cross-attention map
        kwargs['collect_attn_map'] = False
        if i == block_id:
            kwargs['collect_attn_map'] = True
            x, attn_map = block(x, **kwargs)
        else:
            x = block(x, **kwargs)

    # head
    x = self.head(x, e)

    # Context Parallel
    x = get_sp_group().all_gather(x, dim=1)

    # unpatchify
    x = self.unpatchify(x, grid_sizes)

    if block_id != -1 and attn_map is not None:
        # gather the video-token dimension that was sharded across ranks
        attn_map = get_sp_group().all_gather(attn_map, dim=2)
        binary_mask = generate_attention_mask(
            attn_map,
            grid_sizes,
            x[0].shape,
            0.5,  # threshold: the original constant is unrecoverable from
                  # the corrupted source; 0.5 is an assumed value
            words_indices)
    return [u.float() for u in x], binary_mask


def usp_attn_forward(self,
                     x,
                     seq_lens,
                     grid_sizes,
                     freqs,
                     dtype=torch.bfloat16):
    b, s, n, d = *x.shape[:2], self.num_heads, self.head_dim
    half_dtypes = (torch.float16, torch.bfloat16)

    def half(x):
        return x if x.dtype in half_dtypes else x.to(dtype)

    # query, key, value projections (with q/k normalization)
    def qkv_fn(x):
        q = self.norm_q(self.q(x)).view(b, s, n, d)
        k = self.norm_k(self.k(x)).view(b, s, n, d)
        v = self.v(x).view(b, s, n, d)
        return q, k, v

    q, k, v = qkv_fn(x)
    q = rope_apply(q, grid_sizes, freqs)
    k = rope_apply(k, grid_sizes, freqs)

    x = xFuserLongContextAttention()(
        None,
        query=half(q),
        key=half(k),
        value=half(v),
        window_size=self.window_size)

    # TODO: padding after attention.
    x = x.flatten(2)
    x = self.o(x)
    return x
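

# ---------------------------------------------------------------------------
# Usage sketch (assumption, not part of the original module): thresholding a
# dummy attention map into a video-shaped mask with `generate_attention_mask`.
# In Wan2.1 the two forwards above are normally bound onto the model with
# `types.MethodType` by the code that sets up sequence parallelism; the
# shapes, threshold and `_demo_` name below are illustrative choices only.
# ---------------------------------------------------------------------------
def _demo_generate_attention_mask():
    attention_map = torch.rand(1, 3, 32, 8)  # [B, H, Lx, Lctx]
    grid_sizes = torch.tensor([[2, 4, 4]])   # 2 frames of 4x4 patches = 32
    mask = generate_attention_mask(
        attention_map,
        grid_sizes,
        target_x_shape=(16, 4, 8, 8),  # 2x4x4 patches -> 4x8x8 video, C=16
        threshold=0.5,
        target_word_indices=[1, 2])
    assert mask.shape == (16, 4, 8, 8)
    assert mask.dtype == torch.bool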