library_name: transformers
license: other
license_name: meralion-public-license
license_link: >-
https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf
tags:
- speech
- best-rq
- meralion
- meralion-2
language:
- en
- zh
- ms
- ta
- id
- th
- vi
🎧 MERaLiON-SpeechEncoder-2 🎧
We introduce MERaLiON-SpeechEncoder-2, an update of MERaLiON-SpeechEncoder-v1 that greatly expands our pre-training data to 1.4 million hours of unlabeled audio, with a strong focus on Southeast Asian (SEA) languages and accents. As a speech foundation model, it encodes speech into a general-purpose, multilingual acoustic representation that can serve as a high-performance backbone for a wide range of downstream tasks — including automatic speech recognition (ASR), speech translation, speaker and language identification, and emotion recognition.
Unlike many existing models optimized for high-resource, Western languages, MERaLiON-SpeechEncoder-2 is designed from the ground up to reflect the linguistic diversity and complexity of Southeast Asia. See below for a full breakdown of the language coverage of our pre-training data. This model can be finetuned on custom datasets, allowing developers to build speech systems tailored to their specific needs.
Model Highlights
Small model size
With only 630M parameters (≈2.5 GB in memory), the model is easily deployable on most commercial GPUs, eliminating the need for distributed or large-scale compute setups.
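The ≈2.5 GB figure is consistent with storing 630M parameters in 32-bit floats. The sketch below reproduces that arithmetic; the float32 assumption and the half-precision rows are ours for illustration, not something the card states, and the helper name `weight_footprint_gb` is hypothetical. It counts weights only and ignores activations, optimizer state, and framework overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_footprint_gb(num_params: int, dtype: str = "float32") -> float:
    """Size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # 630M parameters is the count quoted in this card.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_footprint_gb(630_000_000, dtype):.2f} GB")
```

Under the same assumptions, loading the weights in a 16-bit format would roughly halve the footprint.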
Natively multilingual
Building on our v1 release (which focused on English and Singlish), this version expands to include English, Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese, along with code-switching support across these languages. Given the wide coverage of languages in the training corpus, it may also be applicable beyond the officially supported languages.
Competitive performance on downstream speech tasks
The model retains near state-of-the-art results on the SUPERB benchmark for English, and shows strong multilingual capabilities, demonstrated through its integration into a [high-performance ASR system shown below](#automatic-speech-recognition-asr).
Innovative pre-training techniques
MERaLiON-SpeechEncoder-2 was trained from scratch with a novel extension of the BEST-RQ self-supervised objective that uses more informative latent targets. We also adopted the Muon optimizer, which had previously been shown to outperform the popular AdamW only for LLM training; we find that its advantages carry over to speech-based models.
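For readers unfamiliar with BEST-RQ, the sketch below illustrates how the baseline method derives discrete training targets: each input frame is passed through a frozen random projection, and the index of the nearest entry in a frozen random codebook becomes the label the encoder must predict at masked positions. This is only the standard formulation as we understand it, not the extension with more informative latent targets used for this model, whose details are not given here. The class name, dimensions, and Gaussian initialization are assumptions chosen to keep the example small.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook (never trained)."""

    def __init__(self, input_dim, code_dim, codebook_size, seed=0):
        rng = random.Random(seed)
        # Assumption: plain Gaussian init for both matrices, for simplicity.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
            for _ in range(code_dim)
        ]
        self.codebook = [
            l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(codebook_size)
        ]

    def target(self, frame):
        """Return the codebook index nearest to the projected frame."""
        projected = l2_normalize(
            [sum(w * x for w, x in zip(row, frame)) for row in self.projection]
        )
        # For unit vectors, the nearest neighbour in Euclidean distance
        # is the entry with the largest dot product.
        scores = [
            sum(c * p for c, p in zip(code, projected)) for code in self.codebook
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def targets(self, frames):
        """Label every frame of an utterance."""
        return [self.target(frame) for frame in frames]


if __name__ == "__main__":
    quantizer = RandomProjectionQuantizer(input_dim=8, code_dim=4, codebook_size=16)
    utterance = [[0.1 * (i + t) for i in range(8)] for t in range(5)]
    print(quantizer.targets(utterance))
```

Because the projection and codebook stay fixed, the labels depend only on the input features, so the encoder can be trained with an ordinary cross-entropy loss over the masked frames.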
Model Summary
- Developed by: I2R, A*STAR
- Model type: Speech Encoder
- Language(s): English (Global & Singapore), Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese.
- License: MERaLiON Public License
For details on background, pre-training, tuning experiments and evaluation, please refer to our technical report.
Benchmarks
SUPERB
Model | Overall Score | PR↓ | ASR↓ | KS↑ | QbE↑ | SID↑ | ASV↓ | SD↓ | ER↑ | IC↑ | SF (F1↑ / CER↓) |
---|---|---|---|---|---|---|---|---|---|---|---|
HuBERT large | 82.25 | 3.53 | 3.62 | 95.29 | 0.0354 | 90.33 | 5.98 | 5.75 | 67.62 | 98.76 | 89.91 / 21.76 |
WavLM large | 84.77 | 3.06 | 3.44 | 97.86 | 0.0886 | 95.49 | 3.77 | 3.24 | 70.62 | 99.31 | 92.21 / 18.36 |
MERaLiON-SpeechEncoder-v1 | 82.62 | 3.14 | 4.16 | 97.63 | 0.0590 | 91.09 | 5.18 | 5.06 | 68.02 | 98.60 | 88.99 / 23.89 |
MERaLiON-SpeechEncoder-2 | 82.72 | 3.40 | 4.96 | 97.57 | 0.0575 | 88.96 | 3.93 | 3.90 | 68.80 | 98.95 | 89.50 / 23.46 |
Automatic Speech Recognition (ASR)
Model Sources [optional]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
Direct Use
[More Information Needed]
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
[More Information Needed]
Bias, Risks, and Limitations
[More Information Needed]
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
Training Details
Training Data
[More Information Needed]
Training Procedure
Preprocessing [optional]
[More Information Needed]
Training Hyperparameters
- Training regime: [More Information Needed]
Speeds, Sizes, Times [optional]
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
Summary
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]