Update README.md
---
library_name: transformers
license: other
license_name: meralion-public-license
license_link: >-
  https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf
tags:
- speech
- best-rq
- meralion
- meralion-2
language:
- en
- zh
- ms
- ta
- id
- th
- vi
---
<h1 align="center">🎧 MERaLiON-SpeechEncoder-2 🎧</h1>

<p align="center">
<a href="https://meralion.org/demo/">💻 ASR Web Demo (Coming Soon!)</a>
</p>
We introduce **MERaLiON-SpeechEncoder-2**, an update of [MERaLiON-SpeechEncoder-v1](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-v1) that greatly expands our pre-training data to **1.4 million hours** of unlabeled audio, with a **strong focus on Southeast Asian (SEA) languages and accents**. As a speech foundation model, it encodes speech into a general-purpose, multilingual acoustic representation that can serve as a high-performance backbone for a wide range of downstream tasks, including automatic speech recognition (ASR), speech translation, speaker and language identification, and emotion recognition.
Unlike many existing models optimized for high-resource, Western languages, MERaLiON-SpeechEncoder-2 is designed from the ground up to reflect the linguistic diversity and complexity of Southeast Asia. The model can be finetuned on custom datasets, allowing developers to build speech systems tailored to their specific needs.
## Model Highlights

### Small model size
With only 630M parameters (≈2.5 GB in memory), the model is easily deployable on most commercial GPUs, eliminating the need for distributed or large-scale compute setups.
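As a back-of-the-envelope check on the quoted footprint (the 630M figure is from above; byte-per-parameter sizes are the standard ones):

```python
params = 630_000_000  # parameter count quoted above

# approximate weight memory by precision
fp32_gb = params * 4 / 1e9  # 4 bytes/param -> ≈ 2.52 GB, matching the ≈2.5 GB figure
fp16_gb = params * 2 / 1e9  # 2 bytes/param -> ≈ 1.26 GB if loaded in half precision

print(f"fp32: {fp32_gb:.2f} GB, fp16: {fp16_gb:.2f} GB")
```

Actual runtime usage will be somewhat higher once activations and framework overhead are included.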
### Natively multilingual
Building on our [v1 release](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-v1) (which focused on English and Singlish), this version expands to include English, Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese, along with code-switching support across these languages. Given the wide coverage of languages in the training corpus, it may also be applicable beyond the officially supported languages.
### Competitive performance on downstream speech tasks
The model retains near state-of-the-art results on the SUPERB benchmark for English, and demonstrates strong multilingual capabilities through its integration into the high-performance ASR system shown below.
### Innovative pre-training techniques
MERaLiON-SpeechEncoder-2 was trained from scratch with a novel extension of the BEST-RQ self-supervised objective that uses more informative latent targets. We also adopted the Muon optimizer, which had previously only been shown to outperform the popular AdamW for LLM training; we find its advantages also carry over to speech-based models.
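For intuition, vanilla BEST-RQ assigns each masked speech frame a discrete target by passing its features through a frozen random projection and picking the nearest entry in a frozen random codebook. The toy sketch below illustrates only that target computation, not the extended objective used here; all dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, feat_dim, code_dim, n_codes = 100, 80, 16, 64  # illustrative sizes

proj = rng.normal(size=(feat_dim, code_dim))     # frozen random projection
codebook = rng.normal(size=(n_codes, code_dim))  # frozen random codebook
frames = rng.normal(size=(n_frames, feat_dim))   # e.g. mel-filterbank features

# l2-normalise projected frames and codebook entries, as in BEST-RQ
z = frames @ proj
z /= np.linalg.norm(z, axis=1, keepdims=True)
cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)

# nearest codebook entry (max cosine similarity) is the discrete target
targets = (z @ cb.T).argmax(axis=1)  # shape (n_frames,), values in [0, n_codes)
```

Because both the projection and the codebook stay frozen, no quantizer has to be learned during pre-training; the encoder is simply trained to predict these targets for masked frames.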
## Model Summary
- **Developed by:** I<sup>2</sup>R, A\*STAR
- **Model type:** Speech Encoder
- **Language(s):** Primarily English (Global & Singapore), Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese. See [pre-training data](#language-coverage-of-pre-training-data) for a full breakdown of language coverage.
- **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
The following Hugging Face-compatible models are implemented:
- **`MeralionBestRqModel`**: The base BEST-RQ Conformer encoder. It outputs the final hidden states and is suitable for feature extraction or as a base for other heads.
- **`MeralionBestRqModelForCTC`**: The Conformer model with a linear CTC head for ASR.
- **`MeralionBestRqModelForLSTMCTC`**: The Conformer model with a more powerful CTC head that includes two LSTM layers before the final projection layer. This version can also be configured to use a weighted sum of all encoder hidden states.
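The weighted-sum option in the LSTM-CTC head follows a common pattern (popularised by SUPERB-style probing) of learning scalar mixing weights over all encoder layers. A minimal numpy sketch with made-up shapes; the actual head learns these weights jointly with the LSTM:

```python
import numpy as np

n_layers, batch, frames, hidden = 4, 1, 50, 8  # toy dimensions
layers = np.random.randn(n_layers, batch, frames, hidden)  # all hidden states

# one scalar weight per layer, normalised with a softmax (learnable in training)
logits = np.zeros(n_layers)
w = np.exp(logits) / np.exp(logits).sum()

# the weighted sum collapses the layer axis, giving (batch, frames, hidden)
mixed = np.tensordot(w, layers, axes=1)
```

Different layers of a speech encoder capture different information (phonetic vs. speaker vs. semantic), so letting the head choose its own mixture often beats using the last layer alone.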
For details on background, pre-training, tuning experiments and evaluation, please refer to our [technical report](https://arxiv.org/abs/2412.11538).
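A minimal usage sketch for feature extraction with the base encoder. The repository id, the availability of a bundled feature extractor, and the 16 kHz sampling rate are assumptions here, so verify them against the files in this repo; the heavy calls are kept inside a function so nothing downloads on import:

```python
import numpy as np

MODEL_ID = "MERaLiON/MERaLiON-SpeechEncoder-2"  # assumed repo id; verify on the Hub
SAMPLE_RATE = 16_000  # assumed input rate; confirm against the model config

def extract_features(waveform: np.ndarray):
    """Return frame-level hidden states from the base BEST-RQ encoder."""
    import torch
    from transformers import AutoFeatureExtractor, AutoModel

    fe = AutoFeatureExtractor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()
    inputs = fe(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state  # (batch, frames, hidden_size)

# one second of silence as a stand-in for real 16 kHz audio
dummy_audio = np.zeros(SAMPLE_RATE, dtype=np.float32)
```

`trust_remote_code=True` is needed because the `MeralionBestRq*` classes live in this repository rather than in the transformers library itself.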
## Language coverage of pre-training data

### Model Sources [optional]