huzy0 committed on
Commit 2fcf5a4 · verified · 1 Parent(s): 3ca00dc

Update README.md

Files changed (1)
  1. README.md +19 -6
README.md CHANGED
@@ -27,11 +27,14 @@ language:
 
 
 
- We introduce **MERaLiON-SpeechEncoder-2**, an update of [MERaLiON-SpeechEncoder-v1](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-v1) that greatly expands our pre-training data to **1.4 million hours** of unlabeled audio, with a **strong focus on Southeast Asian (SEA) languages and accents**. As a speech foundation model, it encodes speech into a general-purpose, multilingual acoustic representation that can serve as a high-performance backbone for a wide range of downstream tasks -- including automatic speech recognition (ASR), speech translation, speaker and language identification, and emotion recognition.
+ We introduce **MERaLiON-SpeechEncoder-2**, an update of [MERaLiON-SpeechEncoder-v1](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-v1) that greatly expands our pre-training data to **1.4 million hours** of unlabeled audio, with a **strong focus on Southeast Asian (SEA) languages and accents**. As a speech foundation model, it encodes speech into a general-purpose, multilingual acoustic representation that can serve as a high-performance backbone for a wide range of downstream tasks including automatic speech recognition (ASR), speech translation, speaker and language identification, and emotion recognition.
 
- Unlike many existing models optimized for high-resource, Western languages, MERaLiON-SpeechEncoder-2 is designed from the ground up to reflect the linguistic diversity and complexity of Southeast Asia. The model can be finetuned on custom datasets, allowing developers to build speech systems tailored to their specific needs.
+ Unlike many existing models optimized for high-resource, Western languages, MERaLiON-SpeechEncoder-2 is designed from the ground up to reflect the linguistic diversity and complexity of Southeast Asia. See below for a full breakdown of the language coverage of our pre-training data. **This model can be finetuned on custom datasets, allowing developers to build speech systems tailored to their specific needs.**
 
- <img src="data1.svg" width="650"/> <img src="data2.svg" width="650"/>
+ <p align="center">
+ <img src="data1.svg" width="640"/>
+ <img src="data2.svg" width="640"/>
+ </p>
 
  ## Model Highlights
 
@@ -42,7 +45,7 @@ With only 630M parameters (≈2.5 GB in memory), the model is easily deployable
  Building on our [v1 release](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-v1) (which focused on English and Singlish), this version expands to include English, Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese, along with codeswitching support across these languages. Given the wide coverage of languages in the training corpus, it may also be applicable beyond the officially supported languages.
 
  ### Competitive performance on downstream speech tasks
- The model retains near state-of-the-art results on the SUPERB benchmark for English, and showcases strong multilingual capabilities deomnstrated through its integration into a high-performance ASR system shown below.
+ The model retains near state-of-the-art results on the SUPERB benchmark for English, and showcases strong multilingual capabilities deomnstrated through its integration into a [high-performance ASR system shown below](#Automatic Speech Recognition (ASR)).
 
  ### Innovative pre-training techniques
  MERaLiON-SpeechEncoder-2 was trained from scratch with an novel extension of the BEST-RQ self-supervised objective, by using more informative latent targets. We also adopted the Muon optimizer, which has previously only been shown to outperform the popular AdamW for LLM training. We find its advantages also carry over to speech-based models.
@@ -51,12 +54,22 @@ MERaLiON-SpeechEncoder-2 was trained from scratch with an novel extension of the
 
  - **Developed by:** I<sup>2</sup>R, A\*STAR
  - **Model type:** Speech Encoder
- - **Language(s):** Primarily English (Global & Singapore), Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese. See [pre-training data](#Language coverage of pre-training data) for full breakdown of language coverage.
+ - **Language(s):** English (Global & Singapore), Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese.
  - **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
 
  For details on background, pre-training, tuning experiments and evaluation, please refer to our [technical report](https://arxiv.org/abs/2412.11538).
 
- ## Language coverage of pre-training data
+ ## Benchmarks
+
+ ### SUPERB
+ | Model | Overall Score | PR↓ | ASR↓ | KS↑ | QbE↑ | SID↑ | ASV↓ | SD↓ | ER↑ | IC↑ | SF (F1↑ / CER↓) |
+ |----------------------------------|---------------|------|------|-------|--------|-------|------|------|-------|-------|----------------------|
+ | HuBERT large | 82.25 | 3.53 | 3.62 | 95.29 | 0.0354 | 90.33 | 5.98 | 5.75 | 67.62 | 98.76 | 89.91 / 21.76 |
+ | WavLM large | 84.77 | 3.06 | 3.44 | 97.86 | 0.0886 | 95.49 | 3.77 | 3.24 | 70.62 | 99.31 | 92.21 / 18.36 |
+ | MERaLiON-SpeechEncoder-V1 | 82.62 | 3.14 | 4.16 | 97.63 | 0.0590 | 91.09 | 5.18 | 5.06 | 68.02 | 98.60 | 88.99 / 23.89 |
+ | MERaLiON-SpeechEncoder-2 | 82.72 | 3.40 | 4.96 | 97.57 | 0.0575 | 88.96 | 3.93 | 3.90 | 68.80 | 98.95 | 89.50 / 23.46 |
+
+ ### Automatic Speech Recognition (ASR)