r"""
---
title: Latent Diffusion Models
summary: >
 Annotated PyTorch implementation/tutorial of latent diffusion models from paper
 High-Resolution Image Synthesis with Latent Diffusion Models
---

# Latent Diffusion Models

Latent diffusion models use an auto-encoder to map between image space and
latent space. The diffusion model works on the latent space, which makes it
a lot easier to train.
It is based on the paper
[High-Resolution Image Synthesis with Latent Diffusion Models](https://papers.labml.ai/paper/2112.10752).

They use a pre-trained auto-encoder and train the diffusion U-Net on the
latent space of the pre-trained auto-encoder.

For a simpler diffusion implementation refer to our
[DDPM implementation](../ddpm/index.html).
We use the same notations for $\alpha_t$, $\beta_t$ schedules, etc.
"""
from typing import Tuple, Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

from architecture.unet import UNetModel


def gather(consts: torch.Tensor, t: torch.Tensor):
    """Gather consts for $t$ and reshape to feature map shape"""
    c = consts.gather(-1, t)
    return c.reshape(-1, 1, 1, 1)


class LatentDiffusion(nn.Module):
    r"""
    ## Latent diffusion model

    This wraps the [U-Net](model/unet.html) noise predictor (with
    [attention](model/unet_attention.html)) together with the diffusion
    $\beta$ schedule. The [AutoEncoder](model/autoencoder.html) that maps
    between data space and latent space lives outside this class.
    """
    eps_model: UNetModel

    def __init__(self, unet_model: UNetModel,
                 latent_scaling_factor: float,
                 n_steps: int,
                 linear_start: float,
                 linear_end: float,
                 debug_mode: bool = False):
        r"""
        :param unet_model: is the [U-Net](model/unet.html) that predicts noise
         $\epsilon_\text{cond}(x_t, c)$ in latent space
        :param latent_scaling_factor: is the scaling factor for the latent space. The encodings of
         the autoencoder are scaled by this before feeding into the U-Net.
        :param n_steps: is the number of diffusion steps $T$.
        :param linear_start: is the start of the $\beta$ schedule.
        :param linear_end: is the end of the $\beta$ schedule.
        :param debug_mode: if `True`, print intermediate information in `loss`.
        """
        super().__init__()
        self.eps_model = unet_model
        self.latent_scaling_factor = latent_scaling_factor
        self.n_steps = n_steps
        self.debug_mode = debug_mode

        # $\beta$ schedule, linear in $\sqrt{\beta_t}$
        beta = torch.linspace(linear_start ** 0.5, linear_end ** 0.5, n_steps,
                              dtype=torch.float64) ** 2
        # $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^t \alpha_s$
        alpha = 1. - beta
        alpha_bar = torch.cumprod(alpha, dim=0)
        self.beta = nn.Parameter(beta.to(torch.float32), requires_grad=False)
        self.alpha = nn.Parameter(alpha.to(torch.float32), requires_grad=False)
        self.alpha_bar = nn.Parameter(alpha_bar.to(torch.float32), requires_grad=False)
        # $\bar\alpha_{t-1}$, with $\bar\alpha_0 = 1$
        self.alpha_bar_prev = nn.Parameter(
            torch.cat([self.alpha_bar.new_tensor([1.]), self.alpha_bar[:-1]]),
            requires_grad=False)
        # DDIM $\sigma_t$ with $\eta = 1$, i.e. the square root of the DDPM
        # posterior variance $\tilde\beta_t = \beta_t (1 - \bar\alpha_{t-1}) / (1 - \bar\alpha_t)$
        self.sigma_ddim = nn.Parameter(
            torch.sqrt((1. - self.alpha_bar_prev) / (1. - self.alpha_bar) * self.beta),
            requires_grad=False)
        self.sigma2 = self.sigma_ddim ** 2

    @property
    def device(self):
        """
        ### Get model device
        """
        return next(iter(self.eps_model.parameters())).device

    def forward(self, x: torch.Tensor, t: torch.Tensor):
        r"""
        ### Predict noise

        Predict noise given the latent representation $x_t$ and time step $t$;
        the conditioning context $c$, when present, is concatenated to $x_t$
        along the channel dimension.

        $$\epsilon_\text{cond}(x_t, c)$$
        """
        return self.eps_model(x, t)

    def q_xt_x0(self, x0: torch.Tensor, t: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        r"""
        #### Get $q(x_t|x_0)$ distribution

        $$q(x_t|x_0) = \mathcal{N}\Big(x_t; \sqrt{\bar\alpha_t} x_0, (1 - \bar\alpha_t)\mathbf{I}\Big)$$
        """
        mean = gather(self.alpha_bar, t) ** 0.5 * x0
        var = 1 - gather(self.alpha_bar, t)
        return mean, var

    def q_sample(self, x0: torch.Tensor, t: torch.Tensor, eps: Optional[torch.Tensor] = None):
        """
        #### Sample from $q(x_t|x_0)$
        """
        if eps is None:
            eps = torch.randn_like(x0)
        mean, var = self.q_xt_x0(x0, t)
        return mean + (var ** 0.5) * eps

    def p_sample(self, xt: torch.Tensor, t: torch.Tensor):
        r"""
        #### Sample from $\textcolor{lightgreen}{p_\theta}(x_{t-1}|x_t)$

        This is a DDIM-style update:

        $$x_{t-1} = \sqrt{\bar\alpha_{t-1}} \hat{x}_0
                  + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\, \epsilon_\theta(x_t, t)
                  + \sigma_t \epsilon$$
        """
        # Predict noise $\epsilon_\theta(x_t, t)$
        eps_theta = self.eps_model(xt, t)
        alpha_bar = gather(self.alpha_bar, t)
        alpha_bar_prev = gather(self.alpha_bar_prev, t)
        sigma_ddim = gather(self.sigma_ddim, t)
        # Estimate $\hat{x}_0 = (x_t - \sqrt{1 - \bar\alpha_t} \epsilon_\theta) / \sqrt{\bar\alpha_t}$
        predicted_x0 = (xt - (1 - alpha_bar) ** 0.5 * eps_theta) / (alpha_bar ** 0.5)
        # Direction pointing to $x_t$
        direction_to_xt = (1 - alpha_bar_prev - sigma_ddim ** 2) ** 0.5 * eps_theta
        # Random noise
        eps = torch.randn(xt.shape, device=xt.device)
        x_tm_1 = alpha_bar_prev ** 0.5 * predicted_x0 + direction_to_xt + sigma_ddim * eps
        return x_tm_1

    def loss(self, x0: torch.Tensor, noise: Optional[torch.Tensor] = None):
        r"""
        #### Simplified Loss

        $$L_\text{simple}(\theta) = \mathbb{E}_{t, x_0, \epsilon}
          \Big\|\epsilon - \epsilon_\theta(x_t, t)\Big\|^2$$
        """
        batch_size = x0.shape[0]
        # Sample a random time step for each example in the batch
        t = torch.randint(0, self.n_steps, (batch_size,), device=x0.device, dtype=torch.long)
        if x0.size(1) == self.eps_model.out_channels:
            # Root level: every channel of `x0` is generated, with no conditioning.
            if self.debug_mode:
                print("In the mode of root level:")
            if noise is None:
                x0 = x0.to(torch.float32)
                noise = torch.randn_like(x0)
            xt = self.q_sample(x0, t, eps=noise)
            eps_theta = self.eps_model(xt, t)
            loss = F.mse_loss(noise, eps_theta)
        else:
            # Non-root level: the first two channels are the target; the remaining
            # channels are background conditioning, concatenated (un-noised) to the
            # noised target before feeding the U-Net.
            if self.debug_mode:
                print("In the mode of non-root level:")
            if noise is None:
                noise = torch.randn_like(x0[:, 0: 2])
            xt = self.q_sample(x0[:, 0: 2], t, eps=noise)
            background_cond = x0[:, 2:]
            xt = torch.cat([xt, background_cond], 1)
            eps_theta = self.eps_model(xt, t)
            loss = F.mse_loss(noise, eps_theta)
        if self.debug_mode:
            print("loss:", loss)
        return loss
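

# ---------------------------------------------------------------------------
# A minimal usage sketch (not part of the original module). It exercises the
# diffusion math with a stand-in noise predictor, since constructing the real
# `architecture.unet.UNetModel` needs project-specific configuration.
# `DummyEps` is hypothetical: it only mimics what `LatentDiffusion` relies on,
# namely an `nn.Module` with an `out_channels` attribute that maps
# (x, t) -> a tensor shaped like x. All hyper-parameters are illustrative.
if __name__ == '__main__':
    class DummyEps(nn.Module):
        def __init__(self, out_channels: int = 2):
            super().__init__()
            self.out_channels = out_channels
            self.proj = nn.Conv2d(out_channels, out_channels, kernel_size=1)

        def forward(self, x, t):
            # Ignore `t`; return a tensor with `out_channels` channels
            return self.proj(x[:, :self.out_channels])

    ldm = LatentDiffusion(DummyEps(), latent_scaling_factor=1.0,
                          n_steps=100, linear_start=1e-4, linear_end=2e-2)

    # Training step: `x0` has exactly `out_channels` channels -> root level
    x0 = torch.randn(4, 2, 16, 16)
    print('training loss:', ldm.loss(x0).item())

    # Sampling: start from pure noise, apply `p_sample` from t = T-1 down to 0
    x = torch.randn(4, 2, 16, 16)
    for step in reversed(range(ldm.n_steps)):
        t = x.new_full((4,), step, dtype=torch.long)
        x = ldm.p_sample(x, t)
    print('sampled latent shape:', tuple(x.shape))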