@g].ddlZddlmZddlZddlZddlZddlZddl m Z ddl m cm Z ddlmZmZmZddlmZddlmZddlmZmZmZmZmZddlmZmZddlm Z m!Z!ddl"m#Z#m$Z$dd l%m&Z&dd l'm(Z(m)Z)m*Z*m+Z+m,Z,dd l-m.Z.m/Z/m0Z0dd l1m2Z2ddl3Z3dd l4m5Z5ddl6m7Z7m8Z8ddlZddlZddl9m:Z:m;Z;mZ>m?Z?m@Z@Gdde jAZBGdde jAZCGdde jAZDGdde jAZEGdde jAZFGdde jAZGGddej jAZHGdde jAZIGd d!e jAZJGd"d#e jAZKGd$d%e jAZLGd&d'e jAZMGd(d)e jAZNGd*d+e jAZOGd,d-e jAZPGd.d/e jAZQGd0d1e jAZRd2ZSd;d4ZTdr?r@ConvTranspose2drBrCs rGr<zLearnedUpSample.__init__Is $ ?f $ $ DIII _ . .*66vV\ek}CMSTTTDIII _ & &*66vV\ek|}HIJJJDIIIgjnjyyzz zrHc,||SrJrKrLs rGrNzLearnedUpSample.forwardWrOrHrPrUs@rGrWrWHsL { { { { {rHrWc$eZdZfdZdZxZS) DownSamplecVt||_dSrJr;r<r=rDr=rFs rGr<zDownSample.__init__[$ $rHcT|jdkr|S|jdkrtj|dS|jdkrZ|jddzdkr1t j||ddgd }tj|dStd |jz) Nr*r+r/r7r0r.rddimr:)r=F avg_pool2dshapetorchcat unsqueezerBrLs rGrNzDownSample.forward_s ?f $ $H _ . .<6** * _ & &wr{Q!##Iq!G*"6"6r":":;DDD<1%% %ilpl{{|| |rHrPrUs@rGr^r^ZsN%%%%% } } } } } } }rHr^c$eZdZfdZdZxZS)UpSamplecVt||_dSrJr`ras rGr<zUpSample.__init__mrbrHc|jdkr|S|jdkrtj|ddS|jdkrtj|ddStd|jz) Nr*r+r/nearest scale_factormoder7r0rZ)r=rh interpolaterBrLs rGrNzUpSample.forwardqst ?f $ $H _ . .=iHHH H _ & &=CCC Cgjnjyyzz zrHrPrUs@rGrorolsN%%%%%{{{{{{{rHrocZeZdZejdddffd ZdZdZdZdZ xZ S) ResBlk皙?Fr*ct||_||_t ||_t |||_||k|_| ||dSrJ) r;r<actv normalizer^ downsampler(downsample_res learned_sc_build_weights)rDrEdim_outr{r|r}rFs rGr<zResBlk.__init__}sp  "$Z00/ FCC G+ FG,,,,,rHc ttj||ddd|_ttj||ddd|_|jr6tj|d|_tj|d|_|j r.ttj||dddd|_ dSdSNr-r.TaffinerFbias) rr>rAconv1conv2r|InstanceNorm2dnorm1norm2rconv1x1rDrErs rGrzResBlk._build_weightss"29VVQ1#E#EFF "29VWaA#F#FGG > @*6$???DJ*6$???DJ ? Z(67Aq!RW)X)X)XYYDLLL Z ZrHcv|jr||}|jr||}|SrJrrr}rLs rG _shortcutzResBlk._shortcuts; ?  QA ? #""ArHcH|jr||}||}||}||}|jr||}||}||}|SrJ)r|rr{rr~rrrLs rG _residualzResBlk._residuals >  1 A IIaLL JJqMM    " " >  1 A IIaLL JJqMMrHc||||z}|tjdz SNr0rrmathsqrtrLs rGrNzResBlk.forward7 NN1  q 1 1 149Q<<rH rQrRrSr> LeakyReLUr<rrrrNrTrUs@rGrxrx|s-9R\#->-> V------ZZZ          rHrxc&eZdZdfd ZdZxZS) StyleEncoder0c Xtg}|ttjd|dddgz }d}t |D]-}t |dz|}|t||dgz }|}.|tjdgz }|ttj||ddd gz }|tj dgz }|tjdgz }tj ||_ tj |||_ dS) Nr.r-r0r7r}ryr)r;r<rr>rArangeminrxrAdaptiveAvgPool2d SequentialsharedLinearunshared) rDrE style_dim max_conv_dimblocks repeat_num_rrFs rGr<zStyleEncoder.__init__s) =1faA!>!>??@@ z""  A&(L11G vfg&AAAB BFFF2<$$%%=7GQ1!E!EFFGG2'**++2<$$%%mV,  '955 rHc||}||dd}||}|S)Nrrd)rviewsizer)rDrMhss rGrNzStyleEncoder.forwardsD KKNN FF166!99b ! ! MM!  rH)rrrrPrUs@rGrrsL666666&rHrc&eZdZdfd ZdZxZS) LinearNormTlinearcHtt|tj||||_tjj|jj tjj |dS)Nr)gain) r;rr<rkr>r linear_layerinitxavier_uniform_weightcalculate_gain)rDin_dimout_dimr w_init_gainrFs rGr<zLinearNorm.__init__s j$((***!HOOFG$OGG  %%   $--k:: & < < < < rArrrxrrrmain) rDrE num_domainsrrrlidrrFs rGr<zDiscriminator2d.__init__s8 =1faA!>!>??@@$$  C&(L11G vfg&AAAB BFFF2<$$%%=7GQ1!E!EFFGG2<$$%%2'**++=7KAq!I!IJJKKM6* rHcg}|jD]"}||}||#|d}||dd}||fS)Nrdr)rappendrr)rDrMfeatureslouts rG get_featurezDiscriminator2d.get_featuresj  A!A OOA    rlhhsxx{{B''H}rHcb||\}}|}||fSrJ)rsqueeze)rDrMrrs rGrNzDiscriminator2d.forwards0((++ XkkmmH}rH)rr.rr)rQrRrSr<rrNrTrUs@rGrrs[++++++"rHrcbeZdZejddddffd ZdZdZdZdZ d Z xZ S) ResBlk1dryFr*c dt||_||_||_||k|_|||||_|jdkrtj |_ dSttj ||dd|d|_ dS)Nr*r-r0r.r2) r;r<r{r|downsample_typerr dropout_pr>r?poolrConv1d)rDrErr{r|r}rrFs rGr<zResBlk1d.__init__s  ") G+ FG,,,"  6 ) ) DIII#BIff!TU^dno$p$p$pqqDIIIrHc ttj||ddd|_ttj||ddd|_|jr6tj|d|_tj|d|_|j r.ttj||dddd|_ dSdSr) rr>rrrr|InstanceNorm1drrrrrs rGrzResBlk1d._build_weightss 661a!C!CDD  67Aq!!D!DEE > @*6$???DJ*6$???DJ ? X&ry!QPU'V'V'VWWDLLL X XrHc|jdkr|S|jddzdkr1tj||ddgd}t j|dS)Nr*rdr0rrerf)rrjrkrlrmrh avg_pool1drLs rGr}zResBlk1d.downsamplesg  6 ) )Hwr{Q!##Iq!G*"6"6r":":;DDD<1%% %rHch|jr||}||}|SrJrrLs rGrzResBlk1d._shortcuts1 ?  QA OOA  rHc|jr||}||}tj||j|j}||}||}|jr| |}||}tj||j|j}| |}|S)Nptraining) r|rr{rhdropoutrrrrrrrLs rGrzResBlk1d._residuals >  1 A IIaLL Ia4>DM B B B JJqMM IIaLL >  1 A IIaLL Ia4>DM B B B JJqMMrHc||||z}|tjdz SrrrLs rGrNzResBlk1d.forward&rrH) rQrRrSr>rr<rr}rrrNrTrUs@rGrrs-9R\#->-> Vs r r r r r rXXX&&& "       rHrc&eZdZdfd ZdZxZS) LayerNormh㈵>ct||_||_t jt j||_t jt j ||_ dSrJ) r;r<channelsepsr> Parameterrkonesgammazerosbeta)rDrrrFs rGr<zLayerNorm.__init__+sa   \%*X"6"677 LX!6!677 rHc|dd}tj||jf|j|j|j}|ddS)Nr.rd) transposerh layer_normrrrrrLs rGrNzLayerNorm.forward3sK KK2   LT],dj$)TX N N{{1b!!!rHrrPrUs@rGrr*sL888888"""""""rHrcPeZdZejdffd ZdZdZdZxZ S) TextEncoderryc ttj|||_t ||dz|_t |dz||_tttddd|d|dz|_ |dz dz}tj |_ t|D]s}|j tjt#tj||||t'||tjd tt+|j |_dS) Nr0rconv1d_kernel_sizeqkv_proj_blocksize num_headsmlstm mlstm_blockcontext_length num_blocks embedding_dimr.)r3r6ry)r;r<r> Embedding embeddingrprepare_projectionpost_projectionr!r"r#cfg ModuleListcnnrrrrrrDropoutr lstm) rDrr3depth n_symbolsr{r6rrFs rGr<zTextEncoder.__init__:sV i:: *8HM B B'A h??(0@.>?@UVbc/*/*/*1&1&1&4>w}Q?OQSTT\\]deexQ 1 1! 4 455 rH) rQrRrSr>rr<rNrr$rTrUs@rGrr9s|EQR\RUEVEV+/+/+/+/+/+/X(((TrHrc$eZdZfdZdZxZS)AdaIN1dcttj|d|_tj||dz|_dS)NFrr0)r;r<r>rnormrfc)rDr num_featuresrFs rGr<zAdaIN1d.__init__sK %l5AAA )I|A~66rHc||}||d|dd}tj|dd\}}d|z||z|zS)Nrr.r0chunksrg)r*rrrkchunkr)rDrMrrrrs rGrNzAdaIN1d.forwardsr GGAJJ FF166!99affQii + +k!A1555 tE TYYq\\)D00rHrPrUs@rGr'r'sG77777 1111111rHr'c$eZdZfdZdZxZS) UpSample1dcVt||_dSrJr`ras rGr<zUpSample1d.__init__rbrHcJ|jdkr|Stj|ddS)Nr*r0rrrs)r=rhrvrLs rGrNzUpSample1d.forwards+ ?f $ $H=CCC CrHrPrUs@rGr2r2sN%%%%%DDDDDDDrHr2c\eZdZdejdddffd ZdZdZdZd Z xZ S) AdainResBlk1d@ryr*r c t||_||_t ||_||k|_||||tj ||_ |dkrtj |_ dSttj||dd|dd|_ dS)Nr*r-r0r.)r3r4r5r6rY)r;r<r{ upsample_typer2upsamplerrr>rrr?rrConvTranspose1d)rDrErrr{r:rrFs rGr<zAdainResBlk1d.__init__s  %"8,,  G+ FGY777z),, v   DIII#B$6vvST]^gmwxJK%L%L%LMMDIIIrHc lttj||ddd|_ttj||ddd|_t |||_t |||_|jr.ttj||dddd|_ dSdS)Nr-r.rFr) rr>rrrr'rrrr)rDrErrs rGrzAdainResBlk1d._build_weightss 67Aq!!D!DEE  7GQ1!E!EFF Y// Y00 ? X&ry!QPU'V'V'VWWDLLL X XrHch||}|jr||}|SrJ)r:rrrLs rGrzAdainResBlk1d._shortcuts1 MM!   ?  QArHc||||}||}||}|||}|||}||}|||}|SrJ)rr{rrrrr)rDrMrs rGrzAdainResBlk1d._residuals JJq!   IIaLL IIaLL JJt||A ' ' JJq!   IIaLL JJt||A ' 'rHc|||}|||ztjdz }|Sr)rrrr)rDrMrrs rGrNzAdainResBlk1d.forwards=nnQ""T^^A&&&$)A,,6 rHrrUs@rGr6r6s24<2<;L;L C M M M M M M XXX rHr6c&eZdZdfd ZdZxZS) AdaLayerNormrct||_||_t j||dz|_dSr)r;r<rrr>rr*)rDrrrrFs rGr<zAdaLayerNorm.__init__sB   )Ixz22rHc8|dd}|dd}||}||d|dd}t j|dd\}}|dd|dd}}t j||jf|j }d|z|z|z}|ddddS)Nrdrr.rr0r-)r) rr*rrrkr/rhrrrr0s rGrNzAdaLayerNorm.forwards KKB   KK2   GGAJJ FF166!99affQii + +k!A1555 tooa,,dnnQ.C.Ct LT],$( ; ; ; Y!Od "{{1b!!++B333rHrrPrUs@rGrArAsL333333 4 4 4 4 4 4 4rHrAc4eZdZd fd Zd dZdZdZxZS) ProsodyPredictor2皙?c 0tttt ddd|d||z|_ttt ddddd||z|_t|||||_t|j|_ tj ||z||_ t|||_t|j|_tj|_|jt)|||||jt)||d z|d | |jt)|d z|d z||tj|_|jt)|||||jt)||d z|d | |jt)|d z|d z||tj|d zd d d d |_tj|d zd d d d |_dS)Nrrrrri)sty_dimd_modelnlayersr)rr0T)r:rr.r)r;r<r!r"r#rcfg_predDurationEncoder text_encoderr r r>rrr duration_projrrF0rr6NrF0_projN_proj)rDrd_hidrKmax_durrrFs rGr<zProsodyPredictor.__init__8s (0@.>?@UVbc/*/*/*1&1&1& 49/027)2C " " ".0@.>?@UVbc/*/*/*1&1&1& 48/027)2C " " " 4,I494;4;=== $DH-- "$)EI,=u"E"E'88%dm44 -// }UE9PPPQQQ }UEQJ D\cdddeee }UaZ!YRYZZZ[[[  mE5)wOOOPPP  mE5A:y4[bcccddd  mEQJ IQXYYYZZZy!Q1a88 i Aq!Q77 rHNFcr|r||}}||dd}||}|dd} |jD]} | | |} || } |dd} |jD]} | | |} || } | d| dfS|||||} | j d} | j d}| }| }| |j d}||}||}|dd}|ddd}|t$j|d|j}| dd|z}|d|fS)Nrdrr.rr0g?)r)rrrrPrRrQrSrrNrjrrrrrmr permuterOr> functionalrr)rDtextsstyle text_lengths alignmentrf0rMrrPblockrQd batch_size text_sizerdurationens rGrNzProsodyPredictor.forwards > ,%qA AKKB//00A''**AR$$B " "U2q\\b!!B B##A  E!QKK AA::a==!))A,,. .!!% a@@AJ I ),,..4466M A\())33A66A ! A''**A Br""A !Aa  A))"-*?*?3QUQ^*?*_*_``H++b"%% 1B##B''+ +rHc||dd}||}|dd}|jD]}|||}||}|dd}|jD]}|||}||}|d|dfS)Nrdrr.)rrrrPrRrQrSr)rDrMrrPr^rQs rGF0NtrainzProsodyPredictor.F0Ntrains KK B++ , ,  # #A & &[[R W  Er1BB \\"   KKB  V  Ea AA KKNNzz!}}aiill**rHc2tj|d|jdd|}tj|dz|d}|Srrr!s rGr$zProsodyPredictor.length_to_maskr%rH)rFrG)NNNF)rQrRrSr<rNrer$rTrUs@rGrErE6s{E8E8E8E8E8E8P@,@,@,@,F+++2rHrEc>eZdZdfd ZdZdZdZdZdZxZS) rMrGc ttj|_t |D]b}|jtj||z|dzddd||jt||c||_ ||_ ||_ dS)Nr0r.T) num_layers batch_first bidirectionalr) r;r<r>rlstmsrrLSTMrArrJrI)rDrIrJrKrrrFs rGr<zDurationEncoder.__init__s ]__ w > >A J  bgg&7!(A,--1/3)0 222 3 3 3 J  l7G<< = = = =   rHc ||j}|ddd}||jd|jdd}t j||gd}||d ddd| dd}| }| dd}|j D]}t|tr|| dd| dd}t j||dddgd}||d ddd| dd}tjj||dd }|||\}} tjj|d \}} t+j||j|j }| dd}t j|jd|jd|jdg} || ddddd|jdf<| |j}| ddS) Nr0rr.rdaxisr rTF)rjenforce_sorted)rjr)rrrWrrjrkrlrrmrrrrl isinstancerAr>utilsrnnpack_padded_sequenceflatten_parameterspad_packed_sequencerhrrr) rDrMrZr[rmasksrrr^rx_pads rGrNzDurationEncoder.forwards\()) IIaA   LLQWQZ 4 4 Iq!f2 & & & ur**44Q::C@@@ KK1  $((**0022 KKB  Z ' 'E%.. 'E!++b"--u55??BGGIq!))Ar1"5"56Q???ur22<> rHc2tj|d|jdd|}tj|dz|d}|Srrr!s rGr$zDurationEncoder.length_to_maskr%rHc||ddtj|jz}||jd|jdd}tj||gd}| |}| |dd}|Sr{r|rs rGrzDurationEncoder.inference$rrHc2tj|d|jdd|}tj|dz|d}|Srrr!s rGr$zDurationEncoder.length_to_mask,r%rH)rG)rQrRrSr<rNrr$rTrUs@rGrMrMs"!#!#!#F rHrMctdd}tj|dd}|||}|S)Nr.) num_classseq_lenr map_locationnet)rrkloadload_state_dicttrain)pathF0_modelparamsrs rGload_F0_modelsr3sW3///H Z5 1 1 1% 8F V$$$A OrHku-nlp/deberta-v3-base-japanesectj|}|dditj||}|SN num_labels)config)r from_pretrainedupdater )rr model_ckptkotodama_prompts rGload_KotoDama_Prompterr>sO  $Z 0 0CJJ#&5d3GGGO rH.line-corporation/line-distilbert-base-japanesectj|}|dditj||}|Sr)r rrr)rrrkotodama_samplers rGload_KotoDama_TextSamplerrJsP  $Z 0 0CJJ#%4T#FFF rHchd}d}||}|||}|}|S)Nct|5}tj|}dddn #1swxYwY|d}|S)N model_params)openyaml safe_load)rfr model_configs rG _load_configz%load_ASR_models.._load_configws $ZZ '1^A&&F ' ' ' ' ' ' ' ' ' ' ' ' ' ' 'n- s 155ctdi|}tj|dd}|||S)Nrrmodel)rrkrr)r model_pathrrs rG _load_modelz$load_ASR_models.._load_model}sG&&&&JU;;;GD f%%% rH)r)ASR_MODEL_PATHASR_MODEL_CONFIGrrasr_model_config asr_modelrs rGload_ASR_modelsrusZ  $|$455 ,n==IA rHcD|jjdvs Jd|jjdkrqddlm}||j|j|j|jj|jj|jj |jj |jj |jj |jj  }nZddlm}||j|j|j|jj|jj|jj |jj |jj }t|jd|j|j }t%|j|j|j|j|j } t+|j|j|j } t+|j|j|j } |jr5t1d|jd z|jj|jd zd |jj} n+t;d|jd z|jjd|jj} t=d|jj|jj|jj |jd z|jd z} tC| j"tG|jj$j%|jj$j&|jj$j'd| _| | j_(| | _"tS|tUj+|jj|j| ||| | | ||tYt[t]|j/j0|j/j1|j/j2||}|S)N)istftnethifiganzDecoder type unknownrr)Decoder) rErrresblock_kernel_sizesupsample_ratesupsample_initial_channelresblock_dilation_sizesupsample_kernel_sizesgen_istft_n_fftgen_istft_hop_size)rErrrrrrrr)rr3r r )rrTrKrUr)rErrr0)rcontext_embedding_featurescontext_features)rrr.) in_channelsembedding_max_lengthembedding_featuresembedding_mask_probarr)meanstdr )rsigma_distribution sigma_datadynamic_threshold)bert bert_encoder predictordecoderrNpredictor_encoder style_encoder diffusion text_alignerpitch_extractormpdmsdwdr rr)3rtypeModules.istftnetr hidden_dimrn_melsrrrrrrrModules.hifiganrn_layern_tokenrErUrrrE multispeakerrr hidden_sizer transformerrrmax_position_embeddingsrrunetrdistrrrrrr>rrrrslmhiddenrKinitial_channel)argsrrrr rrrrNrrrrrnetss rG build_modelrs] <  7 7 7 79O 7 7 7 |J&&,,,,,,'DNTXT_(, (J!%!<)-)N(, (L&*l&H $ Z^Zfptp|GKGSTTTI  t~\`\klllM$DK4>`d`oppp B(B$.2B?C{?V59^A5EBB'+n&@BB $BT^A-=?C{?VBB&*n&@BB *![@;2!^@!) I% N08K8PX\XfXkXoppp>&1 I *I IN 4;#:DOLL%/''-*,,+--$DHOTX5EtxG_``-)/   D: KrHFc jtj|d}|d}td|D]K}||vrC||vr> ||||dn#ddlm}||} |} t|d t ||d t | t|| | D]\\} } \} }|| | <||| dYnxYwtd |zM|s,|d }|d }||dnd}d}||||fS)Nrrrz,loading the ckpt using the correct function.T)strictr) OrderedDictz key length: z, state_dict key length: z %s loadedepochiters optimizer) rkrprintr collectionsrlen state_dictkeyszipitems)rrrload_only_paramsignore_modulesstaterkeyrrnew_state_dictk_mv_mk_cv_crrs rGload_checkpointrs Jt% 0 0 0E 5\F 8999 % % &==S66 Hc **6#;t*DDDD H333333#C[ !,BB3uSz/D/D/F/F/K/K/M/M+N+NBBilmwm|m|m~m~iiBBCCC.1%*2G2G2I2I2O2O2Q2QS]ScScSeSe.f.f..*JS# c*-N3''c **>$*GGGGG +# $ $ $ gg!!% "45555 )UE ))s #A##DE&)Nr)Nr)Yosos.pathrospcopyrrnprktorch.nnr>torch.nn.functionalrXrhtorch.nn.utilsrrrUtils.ASR.modelsrUtils.JDC.modelr transformersrr r r r Modules.KotoDama_samplerr rModules.diffusion.samplerrrModules.diffusion.modulesrrModules.diffusion.diffusionr)Modules.diffusion.audio_diffusion_pytorchrrrrrModules.discriminatorsrrrmunchrrdistutils.versionrtypingrrxlstmr r!r"r#r$r%r&Moduler(rWr^rorxrrrrrrr'r2r6rArErMrrrrrrrrHrGrs   IIIIIIIIII######""""""srrrrrrrrrrrrrCCCCCCCCGGGGGGGGGGGGGGGGAAAAAAvvvvvvvvvvvvvvjjjjjjjjjj +*****   "bi$}}}}}}}}$ { { { { {ry { { { ) ) ) ) ) RY) ) ) V296 $ $ $ $ $ $ $ $bi@: : : : : ry: : : x " " " " " " " "ccccc")cccN 1 1 1 1 1bi 1 1 1 D D D D D D D D,,,,,BI,,,\4444429444`iiiiiryiiiVNNNNNbiNNNd        V(WWWt>CSU******rH