HgddlZddlZddlmZddlmZmZmZddl m Z m Z ddl m Z ddlmZddlmZGdd Zed k(rddlZej,ej.j1rd nd Zed eZdZej7e\ZZej=ej?deZ ejBde dyy)N)Path)AnyDictTuple)Wav2Vec2FeatureExtractor Wav2Vec2Model) load_config) load_audio)BiCodecceZdZdZddedej ffd ZdZdej fdZ de jde jfd Z d ede ejejffd Zd ejdejfd ZdeeefdejfdZdede ejejffdZdejdejde j,fdZxZS)BiCodecTokenizerz!>~~5 6" "T[[/ >B%%:rc||_|jj|j|jj|jyN)rrrr )rrs rrzBiCodecTokenizer.to9s5  dkk" !!$++.rwavreturnct|jd|jdz|jdz|jdz}t|}||kDrtj|d|z|z}|d|S)z/Get reference audio clip for speaker embedding. sample_rateref_segment_durationlatent_hop_lengthN)intrlennptile)rr$ref_segment_length wav_lengths r get_ref_clipzBiCodecTokenizer.get_ref_clip>s  M*T[[9O-PP Q{{./ 0kk-. /  X  *''#$6 6:EFC&&''rwav_pathct||jd|jd}|j|}tj|j dj }||fS)z0load auido and get reference audio from wav pathr'volume_normalize) sampling_rater4r)r rr1torch from_numpy unsqueezefloat)rr2r$wav_refs r process_audiozBiCodecTokenizer.process_audioMsh ++m4![[);<  ##C(""7+55a8>>@G|rwavsc|j|ddddj}|j|j|jj}|j d|j dz|j dzdz }|S) zextract wav2vec2 features>ptT)r5return_tensorspaddingr! )r input_valuesr rr hidden_states)rr<inputsfeat feats_mixs rextract_wav2vec2_featuresz*BiCodecTokenizer.extract_wav2vec2_featuresZs !%   , %%fii0F0F0M0M&NO   r "T%7%7%; ;d>P>PQS>T T  rbatchcx|j|d}||d<|jj|\}}||fS)atokenize the batch of audio Args: batch: wavs (List[np.ndarray]): batch of audio ref_wavs (torch.Tensor): reference audio. shape: (batch_size, seq_len) Returns: semantic_tokens: semantic tokens. shape: (batch_size, seq_len, latent_dim) global_tokens: global tokens. shape: (batch_size, seq_len, global_dim) r$rI)rKrtokenize)rrLfeatssemantic_tokens global_tokenss rtokenize_batchzBiCodecTokenizer.tokenize_batchjsE..uU|<f )-)<))"r6r[r-pathlibrtypingrrr transformersrrsparktts.utils.filer sparktts.utils.audior sparktts.models.bicodecr r r] soundfilesfrrg is_available tokenizerr2rNrQrPrWrYr\writerrrts" ##@+++z8z8| z U\\EJJ$;$;$=&5 IF 4I*H%.%7%7%A"M?""=#8#8#;_MG BHH '%8r