huzy0 committed on
Commit 90306ea · verified · 1 Parent(s): e10c417

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -31,7 +31,7 @@ We introduce **MERaLiON-SpeechEncoder-2**, our next-generation multilingual spee
  Unlike many existing models optimized for high-resource, Western languages, MERaLiON-SpeechEncoder-2 is designed from the ground up to reflect the linguistic diversity and complexity of Southeast Asia. Our training data was curated to contain a substantial amount originating from Singapore and SEA, including 60,000 hours of Singapore-accented speech, with a further 160,000 hours covering Singapore’s official languages Chinese, Malay and Tamil, along with a smaller portion of dialects like Hokkien and Cantonese. SEA data amounts to 200,000 hours, including significant proportions of Malay, Thai, Indonesian, Vietnamese, with smaller amounts of Tagalog, Burmese, Javanese, Sundanese, Khmer and Lao. See below for a regional breakdown of the language coverage of our pre-training data.
 
  <p align="center">
- <img src="data2.svg" width="600"/>
+ <img src="data2.svg" width="580"/>
  </p>
 
  ## Model Highlights
@@ -139,7 +139,7 @@ with torch.no_grad():
  ## Downstream Use
 
  <p align="center">
- <img src="downstream.svg" width="600"/>
+ <img src="downstream.svg" width="580"/>
  </p>
 
  Speech encoders are normally used in finetuning setups to provide the frontend to downstream speech applications. We provide an example below of an ASR finetuning setup with Huggingface. Please refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for the full ASR finetuning recipe using Huggingface Trainer. Alternatively, the Huggingface model can be loaded to any other frameworks such as Pytorch or ESPnet for custom finetuning loops.