---
library_name: transformers
license: other
license_name: meralion-public-license
license_link: https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-2/blob/main/MERaLiON-Public-Licence-v1_Speech-Encoder-2.pdf
tags:
- speech
- best-rq
- meralion
- meralion-2
language:
- en
- zh
- ms
- ta
- id
- th
- vi
---
<h1 align="center">🎧 MERaLiON-SpeechEncoder-2 🎧</h1>
<p align="center">
<a href="https://meralion.org/demo/">💻 ASR Web Demo (Coming Soon!)</a>
</p>
We introduce **MERaLiON-SpeechEncoder-2**, our next-generation multilingual speech encoder that was pre-trained from scratch on a greatly expanded corpus of **1.4 million hours** of unlabeled audio, with a **strong focus on Southeast Asian (SEA) languages and accents**. As a speech foundation model, it encodes speech into a general-purpose, multilingual acoustic representation that can serve as a high-performance backbone for a wide range of downstream tasks — including automatic speech recognition (ASR), speech translation, speaker and language identification, and emotion recognition. **This model can be finetuned on custom datasets, allowing developers to build speech systems tailored to their specific needs.**
Unlike many existing models optimized for high-resource, Western languages, MERaLiON-SpeechEncoder-2 is designed from the ground up to reflect the linguistic diversity and complexity of Southeast Asia. Our training data was curated to include a substantial amount of audio originating from Singapore and SEA: 60,000 hours of Singapore-accented speech, and a further 160,000 hours covering Singapore’s official languages Chinese, Malay, and Tamil, along with a smaller portion of dialects such as Hokkien and Cantonese. SEA data amounts to 200,000 hours, including significant proportions of Malay, Thai, Indonesian, and Vietnamese, with smaller amounts of Tagalog, Burmese, Javanese, Sundanese, Khmer, and Lao. See below for a regional breakdown of the language coverage of our pre-training data.
<p align="center">
<img src="data2.svg" width="620"/>
</p>
## Model Highlights
#### Small model size
With only **630M parameters (≈2.5 GB in memory)**, the model is easily deployable on most commercial GPUs, eliminating the need for distributed or large-scale compute setups.
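As a rough sanity check on the memory figure, the sketch below multiplies the parameter count by the storage size of each parameter. It assumes the ≈2.5 GB figure refers to 32-bit floating-point weights, which this card does not state explicitly. The half-precision line is included only as illustrative arithmetic, not as a statement about supported precisions.

```python
def param_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 630_000_000  # parameter count quoted in this section

# Assumption: weights stored as float32 (4 bytes per parameter).
fp32_gb = param_memory_gb(NUM_PARAMS, 4)
# Illustrative only: 2 bytes per parameter (float16 / bfloat16).
fp16_gb = param_memory_gb(NUM_PARAMS, 2)

print(f"float32 weights: {fp32_gb:.2f} GB")
print(f"16-bit weights:  {fp16_gb:.2f} GB")
```

This counts weights only; activations and, during finetuning, gradients and optimizer state add to the total.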
#### Natively multilingual
Building on [MERaLiON-SpeechEncoder-v1](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-v1) (which focused on English and Singlish), this version expands to include **English, Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese, along with codeswitching support across these languages**. Given the wide coverage of languages in the training corpus, it may also be applicable beyond the officially supported languages.
#### Competitive performance on downstream speech tasks
The model retains near state-of-the-art results on the SUPERB benchmark for English, and showcases strong multilingual capabilities demonstrated through its integration into a [high-performance ASR system](#automatic-speech-recognition-asr).
#### Innovative pre-training techniques
MERaLiON-SpeechEncoder-2 was trained from scratch with a **novel extension of the BEST-RQ** self-supervised objective that uses more informative latent targets. We also adopted the **Muon optimizer**, which had previously been shown to outperform the widely used AdamW optimizer only for LLM training. We find that its advantages also carry over to speech models.
## Model Summary
- **Developed by:** I<sup>2</sup>R, A\*STAR
- **Model type:** Speech Encoder
- **Language(s):** English (Global & Singapore), Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese.
- **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-2/blob/main/MERaLiON-Public-Licence-v1_Speech-Encoder-2.pdf)
For details on background, pre-training, tuning experiments and evaluation, please refer to our [technical report](https://arxiv.org/abs/2412.11538).
## Benchmarks
### SUPERB
| Model | Overall Score | PR↓ | ASR↓ | KS↑ | QbE↑ | SID↑ | ASV↓ | SD↓ | ER↑ | IC↑ | SF (F1↑ / CER↓) |
|----------------------------------|---------------|------|------|-------|--------|-------|------|------|-------|-------|----------------------|
| HuBERT large | 82.25 | 3.53 | 3.62 | 95.29 | 0.0354 | 90.33 | 5.98 | 5.75 | 67.62 | 98.76 | 89.91 / 21.76 |
| WavLM large | 84.77 | 3.06 | 3.44 | 97.86 | 0.0886 | 95.49 | 3.77 | 3.24 | 70.62 | 99.31 | 92.21 / 18.36 |
| MERaLiON-SpeechEncoder-v1 | 82.62 | 3.14 | 4.16 | 97.63 | 0.0590 | 91.09 | 5.18 | 5.06 | 68.02 | 98.60 | 88.99 / 23.89 |
| MERaLiON-SpeechEncoder-2 | 82.72 | 3.40 | 4.96 | 97.57 | 0.0575 | 88.96 | 3.93 | 3.90 | 68.80 | 98.95 | 89.50 / 23.46 |
[SUPERB](https://superbbenchmark.github.io/#/) is an English-based benchmark for speech encoders covering a wide range of downstream speech tasks across domains such as recognition, detection, semantics, speaker, and paralinguistics, where each task is finetuned separately with a frozen encoder.
MERaLiON-SpeechEncoder-2 is competitive with the state of the art, improving on our own v1 model in most speaker and paralinguistic tasks (ASV, SD, and ER).
### Automatic Speech Recognition (ASR)
<p align="center">
<img src="overall_wer.svg" width="720"/>
<img src="audiobench_wer.svg" width="720"/>
<img src="fleurs_wer.svg" width="720"/>
</p>
Leveraging the multilingual capabilities of MERaLiON-SpeechEncoder-2, we further finetuned the model for ASR on supervised speech data to produce the lightweight MERaLiON-SpeechEncoder-2-ASR-CTC, which is competitive with models many times its size in transcribing the target languages while offering much faster inference. It outperforms the popular Whisper large v3 across most languages in [AudioBench](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) and stays close to it on FLEURS. Our comprehensive internal benchmark, shown in the 'Overall ASR Performance' chart, also includes several private datasets in addition to AudioBench and FLEURS.
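The figure filenames above suggest that ASR quality is reported as word error rate (WER). For readers unfamiliar with the metric, here is a minimal sketch of how WER is commonly computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and example sentences are ours, and the text normalization and tokenization used for the reported numbers are not described in this card, so this is not the authors' evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")
```

For languages written without spaces between words, evaluations typically switch to a character-level variant or apply a word segmenter first; which approach applies here is not stated.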
## Direct Use
The following code snippet obtains latent features (i.e., encoded speech) with a forward pass through the model. The model expects 80-dimensional Mel-spectrogram features computed from audio sampled at 16 kHz. The feature extractor loaded via `AutoFeatureExtractor` carries out this conversion.
```python
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoFeatureExtractor

repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-2'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load model and feature extractor
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
model = model.to(device)
model.eval()  # disable dropout for inference

feature_extractor = AutoFeatureExtractor.from_pretrained(
    repo_id,
    trust_remote_code=True
)

# prepare data
data = load_dataset("distil-whisper/librispeech_long", "clean",
                    split="validation")

def batch_collater(data):
    # collect the raw waveform array of every sample
    return [sample['audio']['array'] for sample in data]

audio_array = batch_collater(data)
inputs = feature_extractor(audio_array, sampling_rate=16_000,
                           return_attention_mask=True,
                           return_tensors='pt', do_normalize=False)
inputs = inputs.to(device)

# model inference to obtain features
with torch.no_grad():
    output = model(input_values=inputs['input_values'],
                   attention_mask=inputs['attention_mask'],
                   output_hidden_states=True)

# output is a Wav2Vec2BaseModelOutput or tuple containing:
# last_hidden_state: torch.FloatTensor containing hidden states of the last layer of the model
# extract_features: torch.FloatTensor containing extracted features from the convolution downsampling layers
# hidden_states: tuple(torch.FloatTensor) containing hidden states of each layer of the model
# attentions: tuple(torch.FloatTensor) containing attention states of each layer of the model
```
## Downstream Use
Speech encoders are normally used in finetuning setups to provide the frontend for downstream speech applications. Below is an example ASR finetuning setup with Hugging Face Transformers. Please refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for the full ASR finetuning recipe using the Hugging Face Trainer. Alternatively, the model can be loaded into other frameworks such as PyTorch or ESPnet for custom finetuning loops.
```python
import torch
import json
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoFeatureExtractor, Wav2Vec2CTCTokenizer

repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-2'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# prepare data
def pre_processing(batch):
    batch["text"] = batch["text"].lower()
    return batch

def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

librispeech100h_train = load_dataset("openslr/librispeech_asr", split="train.clean.100")
librispeech100h_test = load_dataset("openslr/librispeech_asr", split="validation.clean")

librispeech100h_train = librispeech100h_train.remove_columns(
    ['file', 'speaker_id', 'chapter_id', 'id'])
librispeech100h_test = librispeech100h_test.remove_columns(
    ['file', 'speaker_id', 'chapter_id', 'id'])

librispeech100h_train = librispeech100h_train.map(pre_processing)
librispeech100h_test = librispeech100h_test.map(pre_processing)

# build a character-level vocabulary from both splits
vocab_train = librispeech100h_train.map(extract_all_chars, batched=True,
                                        batch_size=-1, keep_in_memory=True,
                                        remove_columns=librispeech100h_train.column_names)
vocab_test = librispeech100h_test.map(extract_all_chars, batched=True,
                                      batch_size=-1, keep_in_memory=True,
                                      remove_columns=librispeech100h_test.column_names)

vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}

# use "|" as the visible word delimiter in place of the space character
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open('ls_vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

# load model, feature extractor and tokenizer
feature_extractor = AutoFeatureExtractor.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json",
                                 unk_token="[UNK]", pad_token="[PAD]",
                                 word_delimiter_token="|")

model = AutoModelForCTC.from_pretrained(
    repo_id,
    trust_remote_code=True,
    vocab_size=len(vocab_dict),
    feat_proj_dropout=0.1,
    activation_dropout=0.1,
    hidden_dropout=0.1,
    conformer_conv_dropout=0.1,
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    attention_dropout=0.1,
)
model = model.to(device)
```
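To show how the vocabulary built above is used at inference time, the following is a minimal, dependency-free sketch of greedy CTC decoding: take the most likely token id per frame, merge consecutive repeats, drop the blank token, and turn the `|` word delimiter back into spaces. It assumes that `[PAD]` doubles as the CTC blank, as is conventional when `pad_token_id` is passed to a Hugging Face CTC model. The toy vocabulary, the frame ids, and the helper name `ctc_greedy_decode` are hypothetical. In practice the tokenizer's own decoding method would normally be used instead.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse per-frame token ids into text using the standard CTC rule."""
    # groupby merges runs of identical ids; blanks are then discarded.
    collapsed = [i for i, _ in groupby(frame_ids) if i != blank_id]
    text = "".join(id_to_token[i] for i in collapsed)
    return text.replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary in the same layout as ls_vocab.json.
vocab = {"|": 0, "a": 1, "c": 2, "t": 3, "[UNK]": 4, "[PAD]": 5}
id_to_token = {idx: tok for tok, idx in vocab.items()}

# Hypothetical per-frame argmax ids, e.g. the result of logits.argmax(-1).
frame_ids = [5, 2, 2, 5, 1, 1, 3, 5, 0, 0, 1, 5, 5]

decoded = ctc_greedy_decode(frame_ids, id_to_token, blank_id=vocab["[PAD]"])
print(decoded)
```

The blank token is what lets CTC emit genuinely doubled letters: two identical ids separated by a blank survive the merge step, whereas two adjacent identical ids collapse into one.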
### Compute and Infrastructure
MERaLiON-SpeechEncoder-2 was trained on the [**ASPIRE 2A+**](https://help.nscc.sg/aspire2aplus/about/) Supercomputer Cluster, provided by [**National Supercomputing Centre (NSCC)**](https://www.nscc.sg/), Singapore.
MERaLiON-SpeechEncoder-2 was trained on 64 H100 GPUs across 8 nodes for around 3.5 million steps in total. Training took approximately 15 days.
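For a sense of scale, the figures above can be converted into an approximate GPU-hour budget. This is back-of-the-envelope arithmetic that assumes all 64 GPUs were in use for the full 15 days, which the card does not state exactly.

```python
NUM_GPUS = 64       # H100 GPUs, as stated above
NUM_NODES = 8       # nodes, as stated above
TRAINING_DAYS = 15  # approximate wall-clock duration, as stated above

gpus_per_node = NUM_GPUS // NUM_NODES
# Assumption: every GPU is busy for the whole run.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24

print(f"GPUs per node: {gpus_per_node}")
print(f"Approximate budget: {gpu_hours:,} GPU-hours")
```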
## Citation
If you find our work useful, please cite our technical report:
```
@misc{huzaifah2024speechfoundationmodelsingapore,
title={MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond},
author={{MERaLiON Team}},
year={2024},
eprint={2412.11538},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.11538},
}
```