import torch
import torch.nn as nn
from typing import Any, Dict, List, Optional, Tuple, Union

from accelerate.utils import set_module_tensor_to_device
from diffusers.models.modeling_outputs import Transformer2DModelOutput
from diffusers.models.normalization import AdaLayerNormContinuous
from diffusers.models.embeddings import (
    CombinedTimestepGuidanceTextProjEmbeddings,
    CombinedTimestepTextProjEmbeddings,
    FluxPosEmbed,
)
from diffusers.models.transformers.transformer_flux import (
    FluxTransformer2DModel,
    FluxTransformerBlock,
    FluxSingleTransformerBlock,
)
from diffusers.configuration_utils import register_to_config
from diffusers.utils import (
    USE_PEFT_BACKEND,
    is_torch_version,
    logging,
    scale_lora_layers,
    unscale_lora_layers,
)

logger = logging.get_logger(__name__)


class CustomFluxTransformer2DModel(FluxTransformer2DModel):
    """
    The Transformer model introduced in Flux, extended with a learned per-layer
    positional embedding and box-based token cropping so that several image
    layers can be denoised in one sequence.

    Reference: https://blackforestlabs.ai/announcing-black-forest-labs/

    Parameters:
        patch_size (`int`): Patch size to turn the input data into small patches.
        in_channels (`int`, *optional*, defaults to 64): The number of channels in the input.
        num_layers (`int`, *optional*, defaults to 19): The number of layers of MMDiT blocks to use.
        num_single_layers (`int`, *optional*, defaults to 38): The number of layers of single DiT blocks to use.
        attention_head_dim (`int`, *optional*, defaults to 128): The number of channels in each head.
        num_attention_heads (`int`, *optional*, defaults to 24): The number of heads to use for multi-head attention.
        joint_attention_dim (`int`, *optional*): The number of `encoder_hidden_states` dimensions to use.
        pooled_projection_dim (`int`): Number of dimensions to use when projecting the `pooled_projections`.
        guidance_embeds (`bool`, defaults to False): Whether to use guidance embeddings.
        max_layer_num (`int`, defaults to 10): Maximum number of image layers; sets the size of the
            learned layer positional embedding `layer_pe`.
    """

    @register_to_config
    def __init__(
        self,
        patch_size: int = 1,
        in_channels: int = 64,
        num_layers: int = 19,
        num_single_layers: int = 38,
        attention_head_dim: int = 128,
        num_attention_heads: int = 24,
        joint_attention_dim: int = 4096,
        pooled_projection_dim: int = 768,
        guidance_embeds: bool = False,
        axes_dims_rope: Tuple[int] = (16, 56, 56),
        max_layer_num: int = 10,
    ):
        super(FluxTransformer2DModel, self).__init__()
        self.out_channels = in_channels
        self.inner_dim = self.config.num_attention_heads * self.config.attention_head_dim

        self.pos_embed = FluxPosEmbed(theta=10000, axes_dim=axes_dims_rope)

        text_time_guidance_cls = (
            CombinedTimestepGuidanceTextProjEmbeddings if guidance_embeds else CombinedTimestepTextProjEmbeddings
        )
        self.time_text_embed = text_time_guidance_cls(
            embedding_dim=self.inner_dim, pooled_projection_dim=self.config.pooled_projection_dim
        )

        self.context_embedder = nn.Linear(self.config.joint_attention_dim, self.inner_dim)
        self.x_embedder = torch.nn.Linear(self.config.in_channels, self.inner_dim)

        self.transformer_blocks = nn.ModuleList(
            [
                FluxTransformerBlock(
                    dim=self.inner_dim,
                    num_attention_heads=self.config.num_attention_heads,
                    attention_head_dim=self.config.attention_head_dim,
                )
                for i in range(self.config.num_layers)
            ]
        )

        self.single_transformer_blocks = nn.ModuleList(
            [
                FluxSingleTransformerBlock(
                    dim=self.inner_dim,
                    num_attention_heads=self.config.num_attention_heads,
                    attention_head_dim=self.config.attention_head_dim,
                )
                for i in range(self.config.num_single_layers)
            ]
        )

        self.norm_out = AdaLayerNormContinuous(self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6)
        self.proj_out = nn.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True)

        self.gradient_checkpointing = False

        # Learned positional embedding that distinguishes the image layers;
        # broadcast over the spatial dimensions of each layer.
        self.max_layer_num = max_layer_num
        self.layer_pe = nn.Parameter(torch.empty(1, self.max_layer_num, 1, 1, self.inner_dim))
        nn.init.trunc_normal_(self.layer_pe, mean=0.0, std=0.02, a=-2.0, b=2.0)

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        model = super().from_pretrained(*args, **kwargs)
        # `layer_pe` is not part of the original Flux checkpoint, so move it to
        # the device of the loaded pretrained weights.
        for name, para in model.named_parameters():
            if name != "layer_pe":
                device = para.device
                break
        model.layer_pe.data = model.layer_pe.data.to(device)
        return model

    def crop_each_layer(self, hidden_states, list_layer_box):
        """
        hidden_states: [1, n_layers, h, w, inner_dim]
        list_layer_box: List, length=n_layers, each element is a Tuple of 4 elements (x1, y1, x2, y2)
        """
        token_list = []
        for layer_idx in range(hidden_states.shape[1]):
            if list_layer_box[layer_idx] is None:
                continue
            else:
                x1, y1, x2, y2 = list_layer_box[layer_idx]
                # Boxes are given in pixel space; the divisor (16 = 8x VAE
                # downsampling x 2x2 patching) maps them to token coordinates.
                x1, y1, x2, y2 = x1 // 16, y1 // 16, x2 // 16, y2 // 16
                layer_token = hidden_states[:, layer_idx, y1:y2, x1:x2, :]
                bs, h, w, c = layer_token.shape
                layer_token = layer_token.reshape(bs, -1, c)
                token_list.append(layer_token)
        result = torch.cat(token_list, dim=1)
        return result

    def fill_in_processed_tokens(self, hidden_states, full_hidden_states, list_layer_box):
        """
        hidden_states: [1, h1xw1 + h2xw2 + ... + hlxwl, inner_dim]
        full_hidden_states: [1, n_layers, h, w, inner_dim]
        list_layer_box: List, length=n_layers, each element is a Tuple of 4 elements (x1, y1, x2, y2)
        """
        used_token_len = 0
        bs = hidden_states.shape[0]
        for layer_idx in range(full_hidden_states.shape[1]):
            if list_layer_box[layer_idx] is None:
                continue
            else:
                x1, y1, x2, y2 = list_layer_box[layer_idx]
                x1, y1, x2, y2 = x1 // 16, y1 // 16, x2 // 16, y2 // 16
                full_hidden_states[:, layer_idx, y1:y2, x1:x2, :] = hidden_states[
                    :, used_token_len : used_token_len + (y2 - y1) * (x2 - x1), :
                ].reshape(bs, y2 - y1, x2 - x1, -1)
                used_token_len = used_token_len + (y2 - y1) * (x2 - x1)
        return full_hidden_states

    def forward(
        self,
        hidden_states: torch.Tensor,
        list_layer_box: List[Tuple] = None,
        encoder_hidden_states: torch.Tensor = None,
        pooled_projections: torch.Tensor = None,
        timestep: torch.LongTensor = None,
        img_ids: torch.Tensor = None,
        txt_ids: torch.Tensor = None,
        guidance: torch.Tensor = None,
        joint_attention_kwargs: Optional[Dict[str, Any]] = None,
        return_dict: bool = True,
    ) -> Union[torch.FloatTensor, Transformer2DModelOutput]:
        """
        The [`CustomFluxTransformer2DModel`] forward method.

        Args:
            hidden_states (`torch.FloatTensor` of shape `(batch_size, n_layers, channel, height, width)`):
                Input latents, one per image layer.
            list_layer_box (`List[Tuple]`, *optional*):
                Per-layer bounding boxes `(x1, y1, x2, y2)` in pixel space; `None` entries mark unused layers.
            encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_len, embed_dims)`):
                Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
            pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`):
                Embeddings projected from the embeddings of input conditions.
            timestep (`torch.LongTensor`):
                Used to indicate denoising step.
            joint_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
                tuple.

        Returns:
            If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
            `tuple` where the first element is the sample tensor.
        """
        if joint_attention_kwargs is not None:
            joint_attention_kwargs = joint_attention_kwargs.copy()
            lora_scale = joint_attention_kwargs.pop("scale", 1.0)
        else:
            lora_scale = 1.0

        if USE_PEFT_BACKEND:
            # weight the lora layers by setting `lora_scale` for each PEFT layer
            scale_lora_layers(self, lora_scale)
        else:
            if joint_attention_kwargs is not None and joint_attention_kwargs.get("scale", None) is not None:
                logger.warning(
                    "Passing `scale` via `joint_attention_kwargs` when not using the PEFT backend is ineffective."
                )

        bs, n_layers, channel_latent, height, width = hidden_states.shape

        # Pack each layer's latent into 2x2 patches: (bs, n_layers, h/2, w/2, 4*c).
        hidden_states = hidden_states.view(bs, n_layers, channel_latent, height // 2, 2, width // 2, 2)
        hidden_states = hidden_states.permute(0, 1, 3, 5, 2, 4, 6)
        hidden_states = hidden_states.reshape(bs, n_layers, height // 2, width // 2, channel_latent * 4)

        hidden_states = self.x_embedder(hidden_states)

        full_hidden_states = torch.zeros_like(hidden_states)
        layer_pe = self.layer_pe.to(hidden_states.dtype)
        hidden_states = hidden_states + layer_pe[:, :n_layers, ...]
        # Keep only the tokens inside each layer's box, concatenated into one sequence.
        hidden_states = self.crop_each_layer(hidden_states, list_layer_box)

        timestep = timestep.to(hidden_states.dtype) * 1000
        if guidance is not None:
            guidance = guidance.to(hidden_states.dtype) * 1000
        else:
            guidance = None
        temb = (
            self.time_text_embed(timestep, pooled_projections)
            if guidance is None
            else self.time_text_embed(timestep, guidance, pooled_projections)
        )
        encoder_hidden_states = self.context_embedder(encoder_hidden_states)

        if txt_ids.ndim == 3:
            logger.warning(
                "Passing `txt_ids` 3d torch.Tensor is deprecated."
                "Please remove the batch dimension and pass it as a 2d torch Tensor"
            )
            txt_ids = txt_ids[0]
        if img_ids.ndim == 3:
            logger.warning(
                "Passing `img_ids` 3d torch.Tensor is deprecated."
                "Please remove the batch dimension and pass it as a 2d torch Tensor"
            )
            img_ids = img_ids[0]
        ids = torch.cat((txt_ids, img_ids), dim=0)
        image_rotary_emb = self.pos_embed(ids)

        for index_block, block in enumerate(self.transformer_blocks):
            if self.training and self.gradient_checkpointing:

                def create_custom_forward(module, return_dict=None):
                    def custom_forward(*inputs):
                        if return_dict is not None:
                            return module(*inputs, return_dict=return_dict)
                        else:
                            return module(*inputs)

                    return custom_forward

                ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
                encoder_hidden_states, hidden_states = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(block),
                    hidden_states,
                    encoder_hidden_states,
                    temb,
                    image_rotary_emb,
                    **ckpt_kwargs,
                )
            else:
                encoder_hidden_states, hidden_states = block(
                    hidden_states=hidden_states,
                    encoder_hidden_states=encoder_hidden_states,
                    temb=temb,
                    image_rotary_emb=image_rotary_emb,
                )

        hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)

        for index_block, block in enumerate(self.single_transformer_blocks):
            if self.training and self.gradient_checkpointing:

                def create_custom_forward(module, return_dict=None):
                    def custom_forward(*inputs):
                        if return_dict is not None:
                            return module(*inputs, return_dict=return_dict)
                        else:
                            return module(*inputs)

                    return custom_forward

                ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
                hidden_states = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(block),
                    hidden_states,
                    temb,
                    image_rotary_emb,
                    **ckpt_kwargs,
                )
            else:
                hidden_states = block(
                    hidden_states=hidden_states,
                    temb=temb,
                    image_rotary_emb=image_rotary_emb,
                )

        hidden_states = hidden_states[:, encoder_hidden_states.shape[1] :, ...]

        # Scatter the processed tokens back into the zero-initialized full layer grid.
        hidden_states = self.fill_in_processed_tokens(hidden_states, full_hidden_states, list_layer_box)
        hidden_states = hidden_states.reshape(bs, -1, self.inner_dim)

        hidden_states = self.norm_out(hidden_states, temb)
        output = self.proj_out(hidden_states)

        # Unpack the 2x2 patches back to per-layer latents (inverse of the packing above).
        output = output.view(bs, n_layers, height // 2, width // 2, channel_latent, 2, 2)
        output = output.permute(0, 1, 4, 2, 5, 3, 6)
        output = output.reshape(bs, n_layers, channel_latent, height, width)

        if USE_PEFT_BACKEND:
            # remove `lora_scale` from each PEFT layer
            unscale_lora_layers(self, lora_scale)

        if not return_dict:
            return (output,)

        return Transformer2DModelOutput(sample=output)
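# The crop/fill round trip used by `forward` can be sketched without torch.
# This is a minimal illustration only: `crop_tokens` / `fill_tokens` are
# hypothetical names mirroring `crop_each_layer` / `fill_in_processed_tokens`,
# operating on nested lists instead of tensors and skipping the //16 box scaling.

```python
def crop_tokens(grid, boxes):
    """grid: [n_layers][h][w] values; boxes: per-layer (x1, y1, x2, y2) or None.

    Flattens every in-box value, layer by layer, row-major, into one list,
    like crop_each_layer concatenating box tokens into a single sequence."""
    tokens = []
    for layer, box in zip(grid, boxes):
        if box is None:
            continue
        x1, y1, x2, y2 = box
        for y in range(y1, y2):
            for x in range(x1, x2):
                tokens.append(layer[y][x])
    return tokens


def fill_tokens(tokens, grid, boxes):
    """Scatter the flat token list back into a zeroed copy of the grid,
    consuming tokens in the same order crop_tokens produced them."""
    out = [[[0 for _ in row] for row in layer] for layer in grid]
    used = 0
    for idx, box in enumerate(boxes):
        if box is None:
            continue
        x1, y1, x2, y2 = box
        for y in range(y1, y2):
            for x in range(x1, x2):
                out[idx][y][x] = tokens[used]
                used += 1
    return out
```

Filling after cropping restores exactly the in-box regions and leaves everything else (including fully unused layers) at zero, which is why `forward` pre-allocates `full_hidden_states` with `torch.zeros_like`.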
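# The 2x2 packing in `forward` (view -> permute(0,1,3,5,2,4,6) -> reshape, and its
# inverse with permute(0,1,4,2,5,3,6)) can likewise be sketched on plain nested
# lists. `patchify_2x2` / `unpatchify_2x2` are illustrative helpers, not model
# code; the channel ordering c*4 + dy*2 + dx is my reading of the permutes above.

```python
def patchify_2x2(img):
    """img: [C][H][W] -> [H//2][W//2][4*C] token grid of 2x2 patches."""
    C, H, W = len(img), len(img[0]), len(img[0][0])
    tok = [[[0] * (4 * C) for _ in range(W // 2)] for _ in range(H // 2)]
    for c in range(C):
        for y in range(H // 2):
            for x in range(W // 2):
                for dy in range(2):
                    for dx in range(2):
                        tok[y][x][c * 4 + dy * 2 + dx] = img[c][2 * y + dy][2 * x + dx]
    return tok


def unpatchify_2x2(tok, C):
    """Inverse of patchify_2x2: [H//2][W//2][4*C] -> [C][H][W]."""
    H2, W2 = len(tok), len(tok[0])
    img = [[[0] * (2 * W2) for _ in range(2 * H2)] for _ in range(C)]
    for c in range(C):
        for y in range(H2):
            for x in range(W2):
                for dy in range(2):
                    for dx in range(2):
                        img[c][2 * y + dy][2 * x + dx] = tok[y][x][c * 4 + dy * 2 + dx]
    return img
```

With `channel_latent = 16` this packing yields 64 channels per token, matching the model's `in_channels` default fed to `x_embedder`.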