This repository contains models for the objective evaluation of text-to-speech (TTS) models:
WER: Includes a HuBERT-based ASR model for the LibriSpeech-PC test set, a Paraformer-based ASR model for Chinese datasets, the Whisper-large-v3 model for general English test sets, and the WhisperD model for English dialogue speech.
cpWER: The WhisperD model is used to compute the concatenated minimum-permutation word error rate (cpWER) for English dialogue speech.
SIM-o: A WavLM-based speaker verification model is used to compute the speaker similarity between the prompt and the generated speech.
cpSIM: A speaker diarization model is used together with the WavLM-based model above to compute the concatenated maximum-permutation speaker similarity (cpSIM).
UTMOS: The UTMOS model is used to predict mean opinion scores (MOS).
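
To make the WER and cpWER entries above concrete, here is a minimal pure-Python sketch of how the two scores could be computed once transcripts are available. It is an illustration under stated assumptions, not the evaluation code shipped with this repository: the function names `edit_distance`, `wer`, and `cp_wer` are hypothetical, transcripts are assumed to be already normalized and whitespace-tokenized, and the ASR step (HuBERT, Paraformer, Whisper-large-v3, or WhisperD) is left out entirely.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """cpWER over per-speaker concatenated transcripts (hypothetical helper).

    Each argument is a list with one concatenated transcript per speaker.
    Every assignment of hypothesis speakers to reference speakers is tried,
    and the one with the fewest total word errors is kept.
    """
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]
    # Assumption: pad with empty transcripts if the speaker counts differ.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words
```

The exhaustive search over permutations is only practical for a small number of speakers, which is the usual case for two-party dialogue.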
For details of the evaluation metrics, see ZipVoice and ZipVoice-Dialog.
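
As a rough illustration of the speaker similarity metrics (SIM-o and cpSIM), the sketch below scores speaker embeddings with cosine similarity and, for the dialogue case, picks the speaker assignment that maximizes the score. Several things here are assumptions rather than facts about this repository: the names `cosine_similarity` and `cp_sim` are hypothetical, embeddings are plain lists of floats standing in for the output of the WavLM-based model, diarization and per-speaker concatenation are assumed to have happened already, and averaging the similarity over speakers is one plausible reading of the metric.

```python
import math
from itertools import permutations


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cp_sim(prompt_embs, generated_embs):
    """Maximum-permutation speaker similarity (hypothetical helper).

    prompt_embs[i]:    embedding of speaker i's prompt audio.
    generated_embs[j]: embedding of the concatenated generated audio that
                       diarization attributed to speaker j.
    Assumption: both lists have the same number of speakers, and the score
    is the mean similarity under the best speaker assignment.
    """
    if len(prompt_embs) != len(generated_embs):
        raise ValueError("expected the same number of speakers on both sides")
    n = len(prompt_embs)
    return max(
        sum(cosine_similarity(p, g) for p, g in zip(prompt_embs, perm)) / n
        for perm in permutations(generated_embs)
    )
```

With a single speaker, `cp_sim` reduces to one cosine similarity between the prompt embedding and the generated-speech embedding, which corresponds to the SIM-o setting described above.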