Update README.md
README.md
CHANGED
@@ -8,7 +8,7 @@ language:
 pipeline_tag: text-to-speech
 ---
 
-This repository
+This repository contains models for the objective evaluation of text-to-speech (TTS) models:
 
 - **WER**: Includes [Hubert-based ASR model](https://huggingface.co/facebook/hubert-large-ls960-ft) for LibriSpeech-PC testset, [Paraformer-based ASR model](https://huggingface.co/funasr/paraformer-zh) for Chinese datasets, [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for general English test sets, [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) model for English dialogue speech.
 
@@ -19,4 +19,7 @@ This repository consists of models for objective evaluation of text-to-speech (T
 
 - **cpSIM**: A [speaker diarization model](https://huggingface.co/pyannote/speaker-diarization-3.1) is used along with the above wavlm-based model to compute concatenated maximum permutation speaker similarity ([cpSIM](https://arxiv.org/abs/2507.09318)).
 
-- **UTMOS**: The mos prediction model [UTMOS](https://github.com/sarulab-speech/UTMOS22) is used.
+- **UTMOS**: The mos prediction model [UTMOS](https://github.com/sarulab-speech/UTMOS22) is used.
+
+
+For details of the evaluation metrics, see [ZipVoice](https://arxiv.org/abs/2506.13053) and [ZipVoice-Dialog](https://arxiv.org/abs/2507.09318).
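For readers unfamiliar with the WER metric named in the README, the following is a minimal sketch of how a word error rate is conventionally computed from a reference transcript and an ASR hypothesis. It is not the repository's evaluation code: the function name `word_error_rate` is hypothetical, and the simple lowercasing and whitespace tokenization are assumptions. The actual pipeline, including which ASR model transcribes the audio and how text is normalized, is described in the linked papers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: lowercasing plus whitespace splitting stands in for
    whatever text normalization a real evaluation pipeline applies.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    # One word of the six-word reference is missing from the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the count of errors is divided by the reference length, a hypothesis with many inserted words can score above 1.0.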