---
license: apache-2.0
datasets:
- k2-fsa/TTS_eval_datasets
language:
- en
- zh
pipeline_tag: text-to-speech
---
This repository contains the models used for the objective evaluation of text-to-speech (TTS) models:
WER: Includes a HuBERT-based ASR model for the LibriSpeech-PC test set, a Paraformer-based ASR model for Chinese datasets, the Whisper-large-v3 model for general English test sets, and the WhisperD model for English dialogue speech.
cpWER: The WhisperD model is used to compute the concatenated minimum-permutation word error rate (cpWER) for English dialogue speech.
SIM-o: A WavLM-based speaker verification model is used to compute the speaker similarity between the prompt and the generated speech.
cpSIM: A speaker diarization model is used together with the WavLM-based model above to compute the concatenated maximum-permutation speaker similarity (cpSIM).
UTMOS: The UTMOS model is used to predict the mean opinion score (MOS) of the generated speech.
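
To make the WER and cpWER definitions above concrete, here is a minimal sketch of how the two scores could be computed once the ASR transcripts are available. It is not the evaluation script used with these models: the function names are hypothetical, tokenization is assumed to be plain whitespace splitting with no text normalization, and cpWER is assumed to take one concatenated transcript per speaker with the same number of speakers on both sides.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = cur
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate of one hypothesis (assumes a non-empty reference)."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)


def cpwer(ref_by_speaker, hyp_by_speaker):
    """Minimum WER over all assignments of hypothesis speakers to reference speakers.

    Each argument is a list with one concatenated transcript per speaker
    (assumption: both lists have the same length).
    """
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words
```

Because cpWER searches over speaker permutations, a hypothesis whose speaker labels are swapped relative to the reference is not penalized for the swap itself. For example, `cpwer(["a b c", "d e"], ["d x", "a b c"])` counts only the single substitution in the second speaker's transcript.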
For details of the evaluation metrics, see ZipVoice and ZipVoice-Dialog.
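
As a rough illustration of the SIM-o and cpSIM metrics listed above, the sketch below scores speaker embeddings with cosine similarity. It rests on several assumptions: the embeddings are taken as already extracted (for example by the WavLM-based speaker verification model, which is not shown here), cosine similarity is assumed as the scoring function, cpSIM is assumed to average the per-speaker similarities and keep the best speaker assignment, and all function names are hypothetical.

```python
import math
from itertools import permutations


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cpsim(prompt_embs, generated_embs):
    """Maximum average similarity over all speaker assignments.

    prompt_embs: one embedding per prompt speaker.
    generated_embs: one embedding per diarized speaker, computed on that
    speaker's concatenated segments (assumption: same number of speakers).
    """
    n = len(prompt_embs)
    return max(
        sum(cosine_similarity(p, g) for p, g in zip(prompt_embs, perm)) / n
        for perm in permutations(generated_embs)
    )


# Toy 2-D vectors standing in for real speaker embeddings.
sim_o = cosine_similarity([1.0, 0.0], [3.0, 0.0])
cp = cpsim([[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [3.0, 0.0]])
```

In this sketch SIM-o corresponds to a single `cosine_similarity` call between the prompt embedding and the generated-speech embedding, while cpSIM repeats that comparison per speaker and maximizes over permutations because diarization does not say which detected speaker matches which prompt.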