# sparktts/modules/speaker/speaker_encoder.py
#
# NOTE: the numeric constants in this file (argument defaults, dim_context,
# and the demo shapes under __main__) are not legible in the compiled dump.
# They are assumed here from the upstream Spark-TTS source.

import torch
import torch.nn as nn

from typing import List, Tuple

from sparktts.modules.fsq.residual_fsq import ResidualFSQ
from sparktts.modules.speaker.ecapa_tdnn import ECAPA_TDNN_GLOB_c512
from sparktts.modules.speaker.perceiver_encoder import PerceiverResampler


class SpeakerEncoder(nn.Module):
    """
    Args:
        input_dim (int): acoustic feature dimension
        out_dim (int): output dimension of x-vector and d-vector
        latent_dim (int): latent dimension before quantization
        token_num (int): sequence length of speaker tokens
        fsq_levels (List[int]): number of levels for each quantizer
        fsq_num_quantizers (int): number of quantizers

    Return:
        speaker_embs: (B, T2, out_dim)
    """

    def __init__(
        self,
        input_dim: int = 100,
        out_dim: int = 512,
        latent_dim: int = 128,
        token_num: int = 32,
        fsq_levels: List[int] = [4, 4, 4, 4, 4, 4],
        fsq_num_quantizers: int = 1,
    ):
        super(SpeakerEncoder, self).__init__()

        # ECAPA-TDNN backbone: returns the global x-vector and, on request,
        # the frame-level features consumed by the resampler.
        self.speaker_encoder = ECAPA_TDNN_GLOB_c512(
            feat_dim=input_dim, embed_dim=out_dim
        )
        # Resamples variable-length features to a fixed number of latents.
        self.perceiver_sampler = PerceiverResampler(
            dim=latent_dim, dim_context=512 * 3, num_latents=token_num
        )
        # Residual finite scalar quantizer producing the speaker tokens.
        self.quantizer = ResidualFSQ(
            levels=fsq_levels,
            num_quantizers=fsq_num_quantizers,
            dim=latent_dim,
            is_channel_first=True,
            quantize_dropout=False,
        )

        # Projects the flattened quantized latents to the d-vector.
        self.project = nn.Linear(latent_dim * token_num, out_dim)

    def get_codes_from_indices(self, indices: torch.Tensor) -> torch.Tensor:
        zq = self.quantizer.get_codes_from_indices(indices.transpose(1, 2))
        return zq.transpose(1, 2)

    def get_indices(self, mels: torch.Tensor) -> torch.Tensor:
        mels = mels.transpose(1, 2)
        x = self.perceiver_sampler(mels).transpose(1, 2)
        zq, indices = self.quantizer(x)
        return indices

    def forward(self, mels: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            mels: (B, D_mel, T1)

        Return:
            x_vector: (B, out_dim)
            d_vector: (B, out_dim)
        """
        x_vector, features = self.speaker_encoder(mels, True)
        x = self.perceiver_sampler(features.transpose(1, 2)).transpose(1, 2)
        zq, indices = self.quantizer(x)
        x = zq.reshape(zq.shape[0], -1)
        d_vector = self.project(x)

        return x_vector, d_vector

    def tokenize(self, mels: torch.Tensor) -> torch.Tensor:
        """tokenize the input mel spectrogram"""
        _, features = self.speaker_encoder(mels, True)
        x = self.perceiver_sampler(features.transpose(1, 2)).transpose(1, 2)
        zq, indices = self.quantizer(x)
        return indices

    def detokenize(self, indices: torch.Tensor) -> torch.Tensor:
        """detokenize the input indices to d-vector"""
        zq = self.quantizer.get_output_from_indices(
            indices.transpose(1, 2)
        ).transpose(1, 2)
        x = zq.reshape(zq.shape[0], -1)
        d_vector = self.project(x)
        return d_vector


if __name__ == "__main__":
    model = SpeakerEncoder(
        input_dim=100,
        latent_dim=128,
        token_num=32,
        fsq_levels=[4, 4, 4, 4, 4, 4],
        fsq_num_quantizers=1,
    )
    mel = torch.randn(8, 200, 100)
    x_vector, d_vector = model(mel)
    print("x-vector shape", x_vector.shape)
    print("d-vector shape", d_vector.shape)

    indices = model.tokenize(mel)
    print("indices shape", indices.shape)
    d_vector_post = model.detokenize(indices)
    print("d-vector shape", d_vector_post.shape)

    # The compiled code compares d_vector_post.all() with d_vector.all(),
    # which only compares two booleans. An element-wise closeness check is
    # what the printed messages describe.
    if torch.allclose(d_vector_post, d_vector):
        print("d-vector post and d-vector are the same")
    else:
        print("d-vector post and d-vector are different")

    num_params = sum(param.numel() for param in model.parameters())
    print("{} M".format(num_params / 1e6))