HgL |ddlZddlZddlmZddlmZddlmZmZddl m Z ddl m Z ddl mZmZmZGdd Zy) N)Tuple)Path) AutoTokenizerAutoModelForCausalLM) load_config)BiCodecTokenizer) LEVELS_MAP GENDER_MAPTASK_TOKEN_MAPcNeZdZdZej dfdedej fdZdZdej fdZ dd e d ed e d e e ejffd Z de de de d e fdZej dd e d ed e de de de dededed ejfdZy)SparkTTSz2 Spark-TTS for text-to-speech generation. zcuda:0 model_dirdevicec||_||_t|d|_|jd|_|j y)a Initializes the SparkTTS model with the provided configurations and device. Args: model_dir (Path): Directory containing the model and config files. device (torch.device): The device (CPU/GPU) to run the model on. z /config.yaml sample_rateN)rrrconfigsr_initialize_inference)selfrrs D/aifs4su/xinshengwang/code/Inference/Space/Spark-TTS/cli/SparkTTS.py__init__zSparkTTS.__init__ sB ""i[ #=> << 6 ""$c6tj|jd|_t j|jd|_t |j|j|_|j j|jy)zDInitializes the tokenizer, model, and audio tokenizer for inference.z/LLM)rN) rfrom_pretrainedr tokenizerrmodelrraudio_tokenizerto)rs rrzSparkTTS._initialize_inference.si&66$..9I7NO)99T^^z<|bicodec_semantic_tts<|start_content|><|end_content|>z<|start_global_token|>z<|end_global_token|>z<|start_semantic_token|>)rtokenizejoinsqueezer ) rr r!r"global_token_idssemantic_token_idsi global_tokenssemantic_tokensinputss rprocess_promptzSparkTTS.process_prompt:s$04/C/C/L/L 0 ,,0@0H0H0J K1 2 & K  " gg6H6P6P6RS&qc,SOu%#!(&* Fu%#!(&F'''C L Ts B</ CgenderpitchspeedcV|tjvsJ|tjvsJ|tjvsJt|}t|}t|}d|d}d|d} d|d} dj| || g} tdd|dd | d g} dj| S) ah Process input for voice creation. Args: gender (str): female | male. pitch (str): very_low | low | moderate | high | very_high speed (str): very_low | low | moderate | high | very_high text (str): The text input to be converted to speech. Return: str: Input prompt z<|pitch_label_r&z<|speed_label_z <|gender_r%controllable_ttsr(r)z<|start_style_label|>z<|end_style_label|>)r keysr r+r ) rr4r5r6r gender_idpitch_level_idspeed_level_idpitch_label_tokensspeed_label_tokens gender_tokensattribte_tokenscontrol_tts_inputss rprocess_prompt_controlzSparkTTS.process_prompt_controlss&*** ))) )))v& #E*#E*-n-=R@-n-=R@#I;b1 '' .0B C  - .    #  ! ww)**r temperaturetop_ktop_pc ||j||||} n|j|||\} } |j| gdj|j} |j j d i| dd|| |d} t| j| Dcgc]\}}|t|d} }}|jj| dd}tjtjd |Dcgc] }t|c}j!j#d}|ltjtjd |Dcgc] }t|c}j!j#dj#d} |j$j' j|jj)d|j|j}|Scc}}wcc}wcc}w) ai Performs inference to generate speech from text, incorporating prompt audio and/or text. Args: text (str): The text input to be converted to speech. prompt_speech_path (Path): Path to the audio file used as a prompt. prompt_text (str, optional): Transcript of the prompt audio. gender (str): female | male. pitch (str): very_low | low | moderate | high | very_high speed (str): very_low | low | moderate | high | very_high temperature (float, optional): Sampling temperature for controlling randomness. Default is 0.8. top_k (float, optional): Top-k sampling parameter. Default is 50. top_p (float, optional): Top-p (nucleus) sampling parameter. Default is 0.95. Returns: torch.Tensor: Generated waveform as a tensor. Npt)return_tensorsi T)max_new_tokens do_samplerDrErC)skip_special_tokensrzbicodec_semantic_(\d+)zbicodec_global_(\d+))rBr3rrrrgeneratezip input_idslen batch_decodetorchtensorrefindallintlong unsqueezer detokenizer,)rr r!r"r4r5r6rCrDrEpromptr- model_inputs generated_idsrO output_idspredictstokenpred_semantic_idswavs r inferencezSparkTTS.inferences<  00tLF(,':':(+( $F$~~vht~DGG T , ++  #  *-\-C-C])S % : s9~' (  >>..}RV.WXYZ LL"**=VX`2ab#e*b c TV Yq\    bjjAXZb6cdUc%jde11 ""--    , 4 4Q 7   -   ; cesG=:HHr)NNNNNg?2gffffff?)__name__ __module__ __qualname____doc__rRrrrrrstrrTensorr3rBno_gradfloatrbrLrrr r sX@Lu||H?U %$ % %#-- 7(7(!7( 7( sELL ! 7(r-+-+-+ -+  -+^U]]_$( NN!N N  N  NNNNN NNrr )rTrRtypingrpathlibr transformersrrsparktts.utils.filersparktts.models.audio_tokenizerrsparktts.utils.token_parserr r r r rLrrrrs.  <+<NNVVr