import math
from typing import List, Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
from flash_attn import flash_attn_varlen_func
from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input

from .components import RMSNorm


def modulate(x, scale):
    return x * (1 + scale.unsqueeze(1))


class TimestepEmbedder(nn.Module):
    """
    Embeds scalar timesteps into vector representations.
    """

    def __init__(self, hidden_size, frequency_embedding_size=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(frequency_embedding_size, hidden_size, bias=True),
            nn.SiLU(),
            nn.Linear(hidden_size, hidden_size, bias=True),
        )
        nn.init.normal_(self.mlp[0].weight, std=0.02)
        nn.init.zeros_(self.mlp[0].bias)
        nn.init.normal_(self.mlp[2].weight, std=0.02)
        nn.init.zeros_(self.mlp[2].bias)

        self.frequency_embedding_size = frequency_embedding_size

    @staticmethod
    def timestep_embedding(t, dim, max_period=10000):
        """
        Create sinusoidal timestep embeddings.

        :param t: a 1-D Tensor of N indices, one per batch element. These may be fractional.
        :param dim: the dimension of the output.
        :param max_period: controls the minimum frequency of the embeddings.
        :return: an (N, D) Tensor of positional embeddings.
        """
        half = dim // 2
        freqs = torch.exp(
            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
        ).to(device=t.device)
        args = t[:, None].float() * freqs[None]
        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        if dim % 2:
            embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
        return embedding

    def forward(self, t):
        t_freq = self.timestep_embedding(t, self.frequency_embedding_size)
        t_emb = self.mlp(t_freq.to(self.mlp[0].weight.dtype))
        return t_emb


class JointAttention(nn.Module):
    """Multi-head attention module."""

    def __init__(
        self,
        dim: int,
        n_heads: int,
        n_kv_heads: Optional[int],
        qk_norm: bool,
    ):
        """
        Initialize the Attention module.

        Args:
            dim (int): Number of input dimensions.
            n_heads (int): Number of heads.
            n_kv_heads (Optional[int]): Number of kv heads, if using GQA.
        """
        super().__init__()
        self.n_kv_heads = n_heads if n_kv_heads is None else n_kv_heads
        self.n_local_heads = n_heads
        self.n_local_kv_heads = self.n_kv_heads
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = dim // n_heads

        self.qkv = nn.Linear(
            dim,
            (n_heads + self.n_kv_heads + self.n_kv_heads) * self.head_dim,
            bias=False,
        )
        nn.init.xavier_uniform_(self.qkv.weight)

        self.out = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        nn.init.xavier_uniform_(self.out.weight)

        if qk_norm:
            self.q_norm = RMSNorm(self.head_dim)
            self.k_norm = RMSNorm(self.head_dim)
        else:
            self.q_norm = self.k_norm = nn.Identity()

    @staticmethod
    def apply_rotary_emb(
        x_in: torch.Tensor,
        freqs_cis: torch.Tensor,
    ) -> torch.Tensor:
        """
        Apply rotary embeddings to input tensors using the given frequency tensor.

        This function applies rotary embeddings to the given query or key tensor
        using the provided frequency tensor 'freqs_cis'. The input tensor is
        reshaped as complex numbers, and the frequency tensor is reshaped for
        broadcasting compatibility. The result contains rotary embeddings and is
        returned as a real tensor.

        Args:
            x_in (torch.Tensor): Query or key tensor to apply rotary embeddings to.
            freqs_cis (torch.Tensor): Precomputed frequency tensor for complex exponentials.

        Returns:
            torch.Tensor: The input tensor with rotary embeddings applied.
        """
        with torch.cuda.amp.autocast(enabled=False):
            # (bsz, seqlen, n_heads, head_dim) -> complex (bsz, seqlen, n_heads, head_dim // 2)
            x = torch.view_as_complex(x_in.float().reshape(*x_in.shape[:-1], -1, 2))
            freqs_cis = freqs_cis.unsqueeze(2)
            x_out = torch.view_as_real(x * freqs_cis).flatten(3)
            return x_out.type_as(x_in)

    # Unpadding helper for variable-length flash attention.
    def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
        def _get_unpad_data(attention_mask):
            seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
            indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
            max_seqlen_in_batch = seqlens_in_batch.max().item()
            cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
            return indices, cu_seqlens, max_seqlen_in_batch

        indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
        batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape

        key_layer = index_first_axis(
            key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
        )
        value_layer = index_first_axis(
            value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
        )
        if query_length == kv_seq_len:
            query_layer = index_first_axis(
                query_layer.reshape(batch_size * kv_seq_len, self.n_local_heads, head_dim), indices_k
            )
            cu_seqlens_q = cu_seqlens_k
            max_seqlen_in_batch_q = max_seqlen_in_batch_k
            indices_q = indices_k
        elif query_length == 1:
            max_seqlen_in_batch_q = 1
            cu_seqlens_q = torch.arange(batch_size + 1, dtype=torch.int32, device=query_layer.device)
            indices_q = cu_seqlens_q[:-1]
            query_layer = query_layer.squeeze(1)
        else:
            # The -query_length: slice assumes left padding.
            attention_mask = attention_mask[:, -query_length:]
            query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)

        return (
            query_layer,
            key_layer,
            value_layer,
            indices_q,
            (cu_seqlens_q, cu_seqlens_k),
            (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
        )

    def forward(
        self,
        x: torch.Tensor,
        x_mask: torch.Tensor,
        freqs_cis: torch.Tensor,
    ) -> torch.Tensor:
        """
        Args:
            x (torch.Tensor): Input token features of shape (bsz, seqlen, dim).
            x_mask (torch.Tensor): Padding mask of shape (bsz, seqlen); nonzero marks valid tokens.
            freqs_cis (torch.Tensor): Precomputed rotary frequencies.

        Returns:
            torch.Tensor: Attention output of shape (bsz, seqlen, dim).
        """
        bsz, seqlen, _ = x.shape
        dtype = x.dtype

        xq, xk, xv = torch.split(
            self.qkv(x),
            [
                self.n_local_heads * self.head_dim,
                self.n_local_kv_heads * self.head_dim,
                self.n_local_kv_heads * self.head_dim,
            ],
            dim=-1,
        )
        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

        xq = self.q_norm(xq)
        xk = self.k_norm(xk)
        xq = JointAttention.apply_rotary_emb(xq, freqs_cis=freqs_cis)
        xk = JointAttention.apply_rotary_emb(xk, freqs_cis=freqs_cis)
        xq, xk = xq.to(dtype), xk.to(dtype)

        softmax_scale = math.sqrt(1 / self.head_dim)

        if dtype in [torch.float16, torch.bfloat16]:
            # Variable-length flash attention over the unpadded sequences.
            (
                query_states,
                key_states,
                value_states,
                indices_q,
                cu_seq_lens,
                max_seq_lens,
            ) = self._upad_input(xq, xk, xv, x_mask, seqlen)

            cu_seqlens_q, cu_seqlens_k = cu_seq_lens
            max_seqlen_q, max_seqlen_k = max_seq_lens

            attn_output_unpad = flash_attn_varlen_func(
                query_states,
                key_states,
                value_states,
                cu_seqlens_q=cu_seqlens_q,
                cu_seqlens_k=cu_seqlens_k,
                max_seqlen_q=max_seqlen_q,
                max_seqlen_k=max_seqlen_k,
                dropout_p=0.0,
                causal=False,
                softmax_scale=softmax_scale,
            )
            output = pad_input(attn_output_unpad, indices_q, bsz, seqlen)
        else:
            # Fallback path: repeat kv heads for GQA and use PyTorch SDPA with a padding mask.
            n_rep = self.n_local_heads // self.n_local_kv_heads
            if n_rep >= 1:
                xk = xk.unsqueeze(3).repeat(1, 1, 1, n_rep, 1).flatten(2, 3)
                xv = xv.unsqueeze(3).repeat(1, 1, 1, n_rep, 1).flatten(2, 3)
            output = (
                F.scaled_dot_product_attention(
                    xq.permute(0, 2, 1, 3),
                    xk.permute(0, 2, 1, 3),
                    xv.permute(0, 2, 1, 3),
                    attn_mask=x_mask.bool().view(bsz, 1, 1, seqlen).expand(-1, self.n_local_heads, seqlen, -1),
                    scale=softmax_scale,
                )
                .permute(0, 2, 1, 3)
                .to(dtype)
            )

        output = output.flatten(-2)
        return self.out(output)


class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
        ffn_dim_multiplier: Optional[float],
    ):
        """
        Initialize the FeedForward module.

        Args:
            dim (int): Input dimension.
            hidden_dim (int): Hidden dimension of the feedforward layer.
            multiple_of (int): Value to ensure the hidden dimension is a multiple of this value.
            ffn_dim_multiplier (float, optional): Custom multiplier for the hidden dimension. Defaults to None.
        """
        super().__init__()
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        nn.init.xavier_uniform_(self.w1.weight)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        nn.init.xavier_uniform_(self.w2.weight)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        nn.init.xavier_uniform_(self.w3.weight)

    def _forward_silu_gating(self, x1, x3):
        return F.silu(x1) * x3

    def forward(self, x):
        return self.w2(self._forward_silu_gating(self.w1(x), self.w3(x)))


class JointTransformerBlock(nn.Module):
    def __init__(
        self,
        layer_id: int,
        dim: int,
        n_heads: int,
        n_kv_heads: int,
        multiple_of: int,
        ffn_dim_multiplier: float,
        norm_eps: float,
        qk_norm: bool,
        modulation: bool = True,
    ) -> None:
        """
        Initialize a TransformerBlock.

        Args:
            layer_id (int): Identifier for the layer.
            dim (int): Embedding dimension of the input features.
            n_heads (int): Number of attention heads.
            n_kv_heads (Optional[int]): Number of attention heads in key and value
                features (if using GQA), or None for the same as query.
            multiple_of (int): Rounding multiple for the feed-forward hidden dimension.
            ffn_dim_multiplier (float): Custom multiplier for the feed-forward hidden dimension.
            norm_eps (float): Epsilon used by the RMSNorm layers.
        """
        super().__init__()
        self.dim = dim
        self.head_dim = dim // n_heads
        self.attention = JointAttention(dim, n_heads, n_kv_heads, qk_norm)
        self.feed_forward = FeedForward(
            dim=dim,
            hidden_dim=4 * dim,
            multiple_of=multiple_of,
            ffn_dim_multiplier=ffn_dim_multiplier,
        )
        self.layer_id = layer_id
        self.attention_norm1 = RMSNorm(dim, eps=norm_eps)
        self.ffn_norm1 = RMSNorm(dim, eps=norm_eps)

        self.attention_norm2 = RMSNorm(dim, eps=norm_eps)
        self.ffn_norm2 = RMSNorm(dim, eps=norm_eps)

        self.modulation = modulation
        if modulation:
            self.adaLN_modulation = nn.Sequential(
                nn.SiLU(),
                nn.Linear(min(dim, 1024), 4 * dim, bias=True),
            )
            nn.init.zeros_(self.adaLN_modulation[1].weight)
            nn.init.zeros_(self.adaLN_modulation[1].bias)

    def forward(
        self,
        x: torch.Tensor,
        x_mask: torch.Tensor,
        freqs_cis: torch.Tensor,
        adaln_input: Optional[torch.Tensor] = None,
    ):
        """
        Perform a forward pass through the TransformerBlock.

        Args:
            x (torch.Tensor): Input tensor.
            freqs_cis (torch.Tensor): Precomputed cosine and sine frequencies.

        Returns:
            torch.Tensor: Output tensor after applying attention and feedforward layers.
        """
        if self.modulation:
            assert adaln_input is not None
            scale_msa, gate_msa, scale_mlp, gate_mlp = self.adaLN_modulation(adaln_input).chunk(4, dim=1)

            x = x + gate_msa.unsqueeze(1).tanh() * self.attention_norm2(
                self.attention(
                    modulate(self.attention_norm1(x), scale_msa),
                    x_mask,
                    freqs_cis,
                )
            )
            x = x + gate_mlp.unsqueeze(1).tanh() * self.ffn_norm2(
                self.feed_forward(modulate(self.ffn_norm1(x), scale_mlp))
            )
        else:
            assert adaln_input is None
            x = x + self.attention_norm2(
                self.attention(
                    self.attention_norm1(x),
                    x_mask,
                    freqs_cis,
                )
            )
            x = x + self.ffn_norm2(self.feed_forward(self.ffn_norm1(x)))
        return x


class FinalLayer(nn.Module):
    """
    The final layer of NextDiT.
    """

    def __init__(self, hidden_size, patch_size, out_channels):
        super().__init__()
        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
        self.linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels, bias=True)
        nn.init.zeros_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(min(hidden_size, 1024), hidden_size, bias=True),
        )
        nn.init.zeros_(self.adaLN_modulation[1].weight)
        nn.init.zeros_(self.adaLN_modulation[1].bias)

    def forward(self, x, c):
        scale = self.adaLN_modulation(c)
        x = modulate(self.norm_final(x), scale)
        x = self.linear(x)
        return x


class RopeEmbedder:
    def __init__(
        self,
        theta: float = 10000.0,
        # NOTE: the default axes below are reconstructed (the original constants
        # are not legible in the corrupted source); NextDiT passes them explicitly.
        axes_dims: List[int] = (16, 56, 56),
        axes_lens: List[int] = (1, 512, 512),
    ):
        super().__init__()
        self.theta = theta
        self.axes_dims = axes_dims
        self.axes_lens = axes_lens
        self.freqs_cis = NextDiT.precompute_freqs_cis(self.axes_dims, self.axes_lens, theta=self.theta)

    def __call__(self, ids: torch.Tensor):
        self.freqs_cis = [freqs_cis.to(ids.device) for freqs_cis in self.freqs_cis]
        result = []
        for i in range(len(self.axes_dims)):
            # Gather the per-axis frequencies indexed by the integer position ids.
            index = ids[:, :, i : i + 1].repeat(1, 1, self.freqs_cis[i].shape[-1]).to(torch.int64)
            result.append(
                torch.gather(self.freqs_cis[i].unsqueeze(0).repeat(index.shape[0], 1, 1), dim=1, index=index)
            )
        return torch.cat(result, dim=-1)


class NextDiT(nn.Module):
    """
    Diffusion model with a Transformer backbone.
    """

    def __init__(
        self,
        # NOTE: the numeric defaults below are reconstructed and may differ from
        # the released checkpoints; the factory functions at the bottom of this
        # file pass their hyperparameters explicitly.
        patch_size: int = 2,
        in_channels: int = 4,
        dim: int = 4096,
        n_layers: int = 32,
        n_refiner_layers: int = 2,
        n_heads: int = 32,
        n_kv_heads: Optional[int] = None,
        multiple_of: int = 256,
        ffn_dim_multiplier: Optional[float] = None,
        norm_eps: float = 1e-5,
        qk_norm: bool = False,
        cap_feat_dim: int = 5120,
        axes_dims: List[int] = (16, 56, 56),
        axes_lens: List[int] = (1, 512, 512),
    ) -> None:
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = in_channels
        self.patch_size = patch_size

        self.x_embedder = nn.Linear(
            in_features=patch_size * patch_size * in_channels,
            out_features=dim,
            bias=True,
        )
        nn.init.xavier_uniform_(self.x_embedder.weight)
        nn.init.constant_(self.x_embedder.bias, 0.0)

        self.noise_refiner = nn.ModuleList(
            [
                JointTransformerBlock(
                    layer_id,
                    dim,
                    n_heads,
                    n_kv_heads,
                    multiple_of,
                    ffn_dim_multiplier,
                    norm_eps,
                    qk_norm,
                    modulation=True,
                )
                for layer_id in range(n_refiner_layers)
            ]
        )
        self.context_refiner = nn.ModuleList(
            [
                JointTransformerBlock(
                    layer_id,
                    dim,
                    n_heads,
                    n_kv_heads,
                    multiple_of,
                    ffn_dim_multiplier,
                    norm_eps,
                    qk_norm,
                    modulation=False,
                )
                for layer_id in range(n_refiner_layers)
            ]
        )

        self.t_embedder = TimestepEmbedder(min(dim, 1024))
        self.cap_embedder = nn.Sequential(
            RMSNorm(cap_feat_dim, eps=norm_eps),
            nn.Linear(cap_feat_dim, dim, bias=True),
        )
        nn.init.trunc_normal_(self.cap_embedder[1].weight, std=0.02)
        nn.init.zeros_(self.cap_embedder[1].bias)

        self.layers = nn.ModuleList(
            [
                JointTransformerBlock(
                    layer_id,
                    dim,
                    n_heads,
                    n_kv_heads,
                    multiple_of,
                    ffn_dim_multiplier,
                    norm_eps,
                    qk_norm,
                )
                for layer_id in range(n_layers)
            ]
        )
        self.norm_final = RMSNorm(dim, eps=norm_eps)
        self.final_layer = FinalLayer(dim, patch_size, self.out_channels)

        assert (dim // n_heads) == sum(axes_dims)
        self.axes_dims = axes_dims
        self.axes_lens = axes_lens
        self.rope_embedder = RopeEmbedder(axes_dims=axes_dims, axes_lens=axes_lens)
        self.dim = dim
        self.n_heads = n_heads

    def unpatchify(
        self,
        x: torch.Tensor,
        img_size: List[Tuple[int, int]],
        cap_size: List[int],
        return_tensor: bool = False,
    ) -> List[torch.Tensor]:
        """
        x: (N, T, patch_size**2 * C)
        imgs: (N, H, W, C)
        """
        pH = pW = self.patch_size
        imgs = []
        for i in range(x.size(0)):
            H, W = img_size[i]
            begin = cap_size[i]
            end = begin + (H // pH) * (W // pW)
            imgs.append(
                x[i][begin:end]
                .view(H // pH, W // pW, pH, pW, self.out_channels)
                .permute(4, 0, 2, 1, 3)
                .flatten(3, 4)
                .flatten(1, 2)
            )

        if return_tensor:
            imgs = torch.stack(imgs, dim=0)
        return imgs

    def patchify_and_embed(
        self,
        x: List[torch.Tensor] | torch.Tensor,
        cap_feats: torch.Tensor,
        cap_mask: torch.Tensor,
        t: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor, List[Tuple[int, int]], List[int], torch.Tensor]:
        bsz = len(x)
        pH = pW = self.patch_size
        device = x[0].device

        l_effective_cap_len = cap_mask.sum(dim=1).tolist()
        img_sizes = [(img.size(1), img.size(2)) for img in x]
        l_effective_img_len = [(H // pH) * (W // pW) for (H, W) in img_sizes]

        max_seq_len = max(cap_len + img_len for cap_len, img_len in zip(l_effective_cap_len, l_effective_img_len))
        max_cap_len = max(l_effective_cap_len)
        max_img_len = max(l_effective_img_len)

        # Joint (caption, image) position ids: axis 0 indexes caption tokens,
        # axes 1 and 2 index image rows and columns.
        position_ids = torch.zeros(bsz, max_seq_len, 3, dtype=torch.int32, device=device)

        for i in range(bsz):
            cap_len = l_effective_cap_len[i]
            img_len = l_effective_img_len[i]
            H, W = img_sizes[i]
            H_tokens, W_tokens = H // pH, W // pW
            assert H_tokens * W_tokens == img_len

            position_ids[i, :cap_len, 0] = torch.arange(cap_len, dtype=torch.int32, device=device)
            position_ids[i, cap_len : cap_len + img_len, 0] = cap_len
            row_ids = (
                torch.arange(H_tokens, dtype=torch.int32, device=device).view(-1, 1).repeat(1, W_tokens).flatten()
            )
            col_ids = (
                torch.arange(W_tokens, dtype=torch.int32, device=device).view(1, -1).repeat(H_tokens, 1).flatten()
            )
            position_ids[i, cap_len : cap_len + img_len, 1] = row_ids
            position_ids[i, cap_len : cap_len + img_len, 2] = col_ids

        freqs_cis = self.rope_embedder(position_ids)

        # Split the joint rotary frequencies back into caption and image parts.
        cap_freqs_cis_shape = list(freqs_cis.shape)
        cap_freqs_cis_shape[1] = cap_feats.shape[1]
        cap_freqs_cis = torch.zeros(*cap_freqs_cis_shape, device=device, dtype=freqs_cis.dtype)

        img_freqs_cis_shape = list(freqs_cis.shape)
        img_freqs_cis_shape[1] = max_img_len
        img_freqs_cis = torch.zeros(*img_freqs_cis_shape, device=device, dtype=freqs_cis.dtype)

        for i in range(bsz):
            cap_len = l_effective_cap_len[i]
            img_len = l_effective_img_len[i]
            cap_freqs_cis[i, :cap_len] = freqs_cis[i, :cap_len]
            img_freqs_cis[i, :img_len] = freqs_cis[i, cap_len : cap_len + img_len]

        # Refine the caption tokens (no timestep modulation).
        for layer in self.context_refiner:
            cap_feats = layer(cap_feats, cap_mask, cap_freqs_cis)

        # Patchify the images and refine the noised image tokens (with timestep modulation).
        flat_x = []
        for i in range(bsz):
            img = x[i]
            C, H, W = img.size()
            img = img.view(C, H // pH, pH, W // pW, pW).permute(1, 3, 2, 4, 0).flatten(2).flatten(0, 1)
            flat_x.append(img)
        x = flat_x
        padded_img_embed = torch.zeros(bsz, max_img_len, x[0].shape[-1], device=device, dtype=x[0].dtype)
        padded_img_mask = torch.zeros(bsz, max_img_len, dtype=torch.bool, device=device)
        for i in range(bsz):
            padded_img_embed[i, : l_effective_img_len[i]] = x[i]
            padded_img_mask[i, : l_effective_img_len[i]] = True

        padded_img_embed = self.x_embedder(padded_img_embed)
        for layer in self.noise_refiner:
            padded_img_embed = layer(padded_img_embed, padded_img_mask, img_freqs_cis, t)

        # Concatenate refined caption and image tokens into one padded joint sequence.
        mask = torch.zeros(bsz, max_seq_len, dtype=torch.bool, device=device)
        padded_full_embed = torch.zeros(bsz, max_seq_len, self.dim, device=device, dtype=x[0].dtype)
        for i in range(bsz):
            cap_len = l_effective_cap_len[i]
            img_len = l_effective_img_len[i]

            mask[i, : cap_len + img_len] = True
            padded_full_embed[i, :cap_len] = cap_feats[i, :cap_len]
            padded_full_embed[i, cap_len : cap_len + img_len] = padded_img_embed[i, :img_len]

        return padded_full_embed, mask, img_sizes, l_effective_cap_len, freqs_cis

    def forward(self, x, t, cap_feats, cap_mask) -> torch.Tensor:
        """
        Forward pass of NextDiT.
        t: (N,) tensor of diffusion timesteps
        cap_feats: (N, L, D) tensor of caption features
        cap_mask: (N, L) caption padding mask
        """
        t = self.t_embedder(t)  # (N, D)
        adaln_input = t

        cap_feats = self.cap_embedder(cap_feats)  # (N, L, D)

        x_is_tensor = isinstance(x, torch.Tensor)
        x, mask, img_size, cap_size, freqs_cis = self.patchify_and_embed(x, cap_feats, cap_mask, t)
        freqs_cis = freqs_cis.to(x.device)

        for layer in self.layers:
            x = layer(x, mask, freqs_cis, adaln_input)

        x = self.final_layer(x, adaln_input)
        x = self.unpatchify(x, img_size, cap_size, return_tensor=x_is_tensor)

        return x

    def forward_with_cfg(
        self,
        x,
        t,
        cap_feats,
        cap_mask,
        cfg_scale,
        cfg_trunc=1,
        renorm_cfg=1,
    ):
        """
        Forward pass of NextDiT, but also batches the unconditional forward pass
        for classifier-free guidance.
        """
        half = x[: len(x) // 2]
        if t[0] < cfg_trunc:
            # Run conditional and unconditional halves in one batch and mix them.
            combined = torch.cat([half, half], dim=0)
            model_out = self(combined, t, cap_feats, cap_mask)
            eps, rest = model_out[:, : self.in_channels], model_out[:, self.in_channels :]
            cond_eps, uncond_eps = torch.split(eps, len(eps) // 2, dim=0)
            half_eps = uncond_eps + cfg_scale * (cond_eps - uncond_eps)
            # Optionally cap the guided prediction so its norm does not exceed
            # renorm_cfg times the norm of the conditional prediction.
            if float(renorm_cfg) > 0.0:
                ori_pos_norm = torch.linalg.vector_norm(
                    cond_eps, dim=tuple(range(1, len(cond_eps.shape))), keepdim=True
                )
                max_new_norm = ori_pos_norm * float(renorm_cfg)
                new_pos_norm = torch.linalg.vector_norm(
                    half_eps, dim=tuple(range(1, len(half_eps.shape))), keepdim=True
                )
                if new_pos_norm >= max_new_norm:
                    half_eps = half_eps * (max_new_norm / new_pos_norm)
        else:
            # Past the truncation timestep, skip the unconditional branch entirely.
            combined = half
            model_out = self(combined, t[: len(x) // 2], cap_feats[: len(x) // 2], cap_mask[: len(x) // 2])
            eps, rest = model_out[:, : self.in_channels], model_out[:, self.in_channels :]
            half_eps = eps

        output = torch.cat([half_eps, half_eps], dim=0)
        return output

    @staticmethod
    def precompute_freqs_cis(
        dim: List[int],
        end: List[int],
        theta: float = 10000.0,
    ):
        """
        Precompute the frequency tensor for complex exponentials (cis) with given dimensions.

        This function calculates a frequency tensor with complex exponentials using the given
        dimensions 'dim' and the end indices 'end'. The 'theta' parameter scales the
        frequencies. The returned tensors contain complex values in complex64 data type.

        Args:
            dim (list): Dimension of the frequency tensor, per axis.
            end (list): End index for precomputing frequencies, per axis.
            theta (float, optional): Scaling factor for frequency computation. Defaults to 10000.0.

        Returns:
            List[torch.Tensor]: Precomputed frequency tensors with complex exponentials, one per axis.
        """
        freqs_cis = []
        for d, e in zip(dim, end):
            freqs = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float64, device="cpu") / d))
            timestep = torch.arange(e, device=freqs.device, dtype=torch.float64)
            freqs = torch.outer(timestep, freqs).float()
            freqs_cis_i = torch.polar(torch.ones_like(freqs), freqs).to(torch.complex64)  # complex64
            freqs_cis.append(freqs_cis_i)

        return freqs_cis

    def parameter_count(self) -> int:
        total_params = 0

        def _recursive_count_params(module):
            nonlocal total_params
            for param in module.parameters(recurse=False):
                total_params += param.numel()
            for submodule in module.children():
                _recursive_count_params(submodule)

        _recursive_count_params(self)
        return total_params

    def get_fsdp_wrap_module_list(self) -> List[nn.Module]:
        return list(self.layers)

    def get_checkpointing_wrap_module_list(self) -> List[nn.Module]:
        return list(self.layers)


def NextDiT_2B_GQA_patch2_Adaln_Refiner(**kwargs):
    # Hyperparameters follow the published Lumina-Image-2.0 2B configuration;
    # the exact constants are not legible in the corrupted source.
    return NextDiT(
        patch_size=2,
        dim=2304,
        n_layers=26,
        n_heads=24,
        n_kv_heads=8,
        axes_dims=[32, 32, 32],
        axes_lens=[300, 512, 512],
        **kwargs,
    )


def NextDiT_3B_GQA_patch2_Adaln_Refiner(**kwargs):
    # The width/depth constants for this variant were not recoverable; pass dim,
    # n_layers, n_heads, n_kv_heads, axes_dims and axes_lens explicitly.
    return NextDiT(patch_size=2, **kwargs)


def NextDiT_4B_GQA_patch2_Adaln_Refiner(**kwargs):
    # See the note in NextDiT_3B_GQA_patch2_Adaln_Refiner.
    return NextDiT(patch_size=2, **kwargs)


def NextDiT_7B_GQA_patch2_Adaln_Refiner(**kwargs):
    # See the note in NextDiT_3B_GQA_patch2_Adaln_Refiner.
    return NextDiT(patch_size=2, **kwargs)
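

# ---------------------------------------------------------------------------
# Usage sketch (illustrative, not part of the original file). It assumes this
# module lives in a package as models/model.py, that flash_attn is installed,
# and that the checkpoint uses 16 latent channels with 2304-dim caption
# features and qk_norm enabled -- adjust these assumptions to your setup.
#
#   import torch
#   from models.model import NextDiT_2B_GQA_patch2_Adaln_Refiner
#
#   model = NextDiT_2B_GQA_patch2_Adaln_Refiner(
#       in_channels=16, qk_norm=True, cap_feat_dim=2304
#   ).cuda().bfloat16()
#
#   x = torch.randn(2, 16, 128, 128, device="cuda", dtype=torch.bfloat16)      # latents (N, C, H, W)
#   t = torch.rand(2, device="cuda", dtype=torch.bfloat16)                     # timesteps in [0, 1)
#   cap_feats = torch.randn(2, 77, 2304, device="cuda", dtype=torch.bfloat16)  # text-encoder states
#   cap_mask = torch.ones(2, 77, dtype=torch.int32, device="cuda")             # caption padding mask
#
#   out = model(x, t, cap_feats, cap_mask)                                     # (N, C, H, W) prediction
# ---------------------------------------------------------------------------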