huzy0 committed on
Commit 6e5f833 · verified · 1 Parent(s): 58bc004

Update README.md

Files changed (1)
  1. README.md +141 -19
README.md CHANGED
@@ -2,20 +2,21 @@
2
  library_name: transformers
3
  license: other
4
  license_name: meralion-public-license
5
- license_link: https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf
 
6
  tags:
7
- - speech
8
- - best-rq
9
- - meralion
10
- - meralion-2
11
  language:
12
- - en
13
- - zh
14
- - ms
15
- - ta
16
- - id
17
- - th
18
- - vi
19
  ---
20
 
21
  <h1 align="center">🎧 MERaLiON-SpeechEncoder-2 🎧</h1>
@@ -44,10 +45,10 @@ With only 630M parameters (≈2.5 GB in memory), the model is easily deployable
44
  Building on [MERaLiON-SpeechEncoder-v1](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-v1) (which focused on English and Singlish), this version expands to include English, Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese, along with codeswitching support across these languages. Given the wide coverage of languages in the training corpus, it may also be applicable beyond the officially supported languages.
45
 
46
  #### Competitive performance on downstream speech tasks
47
- The model retains near state-of-the-art results on the SUPERB benchmark for English, and showcases strong multilingual capabilities demonstrated through its integration into a [high-performance ASR system shown below](#automatic-speech-recognition).
48
 
49
  #### Innovative pre-training techniques
50
- MERaLiON-SpeechEncoder-2 was trained from scratch with an novel extension of the BEST-RQ self-supervised objective, by using more informative latent targets. We also adopted the Muon optimizer, which has previously only been shown to outperform the popular AdamW for LLM training. We find its advantages also carry over to speech-based models.
51
 
52
  ## Model Summary
53
 
@@ -65,7 +66,7 @@ For details on background, pre-training, tuning experiments and evaluation, plea
65
  |----------------------------------|---------------|------|------|-------|--------|-------|------|------|-------|-------|----------------------|
66
  | HuBERT large | 82.25 | 3.53 | 3.62 | 95.29 | 0.0354 | 90.33 | 5.98 | 5.75 | 67.62 | 98.76 | 89.91 / 21.76 |
67
  | WavLM large | 84.77 | 3.06 | 3.44 | 97.86 | 0.0886 | 95.49 | 3.77 | 3.24 | 70.62 | 99.31 | 92.21 / 18.36 |
68
- | MERaLiON-SpeechEncoder-V1 | 82.62 | 3.14 | 4.16 | 97.63 | 0.0590 | 91.09 | 5.18 | 5.06 | 68.02 | 98.60 | 88.99 / 23.89 |
69
  | MERaLiON-SpeechEncoder-2 | 82.72 | 3.40 | 4.96 | 97.57 | 0.0575 | 88.96 | 3.93 | 3.90 | 68.80 | 98.95 | 89.50 / 23.46 |
70
 
71
  SUPERB is an English-based benchmark for speech encoders covering a wide range of downstream speech tasks across domains such as recognition, detection, semantics, speaker, and paralinguistics, where each task is finetuned separately with a frozen encoder.
@@ -76,14 +77,135 @@ MERaLiON-SpeechEncoder-2 is competitive to state-of-the-art, improving slightly
76
  ### Automatic Speech Recognition (ASR)
77
 
78
  <p align="center">
79
- <img src="overall_wer.svg" width="650"/>
80
- <img src="audiobench_wer.svg" width="650"/>
81
- <img src="fleurs_wer.svg" width="650"/>
82
  </p>
83
 
84
- Leveraging the multilingual capabilities of MERaLiON-SpeechEncoder-2, we further finetuned the model on supervised speech data to produce a lightweight MERaLiON-SpeechEncoder-2-ASR-CTC, which is competitive with models many times its size in transcribing the target languages while offering much faster inference. It outperforms the popular Whisper large v3 across most ASR benchmarks, including [Audiobench](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) and FLEURS. Our internal benchmarking, shown in the 'Overall ASR Performance' figure, also includes several private datasets in addition to Audiobench and FLEURS.
85
 
 
86
 
87
 
88
 
89
 
 
2
  library_name: transformers
3
  license: other
4
  license_name: meralion-public-license
5
+ license_link: >-
6
+ https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf
7
  tags:
8
+ - speech
9
+ - best-rq
10
+ - meralion
11
+ - meralion-2
12
  language:
13
+ - en
14
+ - zh
15
+ - ms
16
+ - ta
17
+ - id
18
+ - th
19
+ - vi
20
  ---
21
 
22
  <h1 align="center">🎧 MERaLiON-SpeechEncoder-2 🎧</h1>
 
45
  Building on [MERaLiON-SpeechEncoder-v1](https://huggingface.co/MERaLiON/MERaLiON-SpeechEncoder-v1) (which focused on English and Singlish), this version expands to include English, Chinese, Malay, Tamil, Thai, Indonesian, and Vietnamese, along with codeswitching support across these languages. Given the wide coverage of languages in the training corpus, it may also be applicable beyond the officially supported languages.
46
 
47
  #### Competitive performance on downstream speech tasks
48
+ The model retains near state-of-the-art results on the SUPERB benchmark for English, and showcases strong multilingual capabilities demonstrated through its integration into a [high-performance ASR system shown below](#automatic-speech-recognition-asr).
49
 
50
  #### Innovative pre-training techniques
51
+ MERaLiON-SpeechEncoder-2 was trained from scratch with a novel extension of the BEST-RQ self-supervised objective, by using more informative latent targets. We also adopted the Muon optimizer, which has previously only been shown to outperform the popular AdamW for LLM training. We find its advantages also carry over to speech-based models.
52
 
53
  ## Model Summary
54
 
 
66
  |----------------------------------|---------------|------|------|-------|--------|-------|------|------|-------|-------|----------------------|
67
  | HuBERT large | 82.25 | 3.53 | 3.62 | 95.29 | 0.0354 | 90.33 | 5.98 | 5.75 | 67.62 | 98.76 | 89.91 / 21.76 |
68
  | WavLM large | 84.77 | 3.06 | 3.44 | 97.86 | 0.0886 | 95.49 | 3.77 | 3.24 | 70.62 | 99.31 | 92.21 / 18.36 |
69
+ | MERaLiON-SpeechEncoder-v1 | 82.62 | 3.14 | 4.16 | 97.63 | 0.0590 | 91.09 | 5.18 | 5.06 | 68.02 | 98.60 | 88.99 / 23.89 |
70
  | MERaLiON-SpeechEncoder-2 | 82.72 | 3.40 | 4.96 | 97.57 | 0.0575 | 88.96 | 3.93 | 3.90 | 68.80 | 98.95 | 89.50 / 23.46 |
71
 
72
  SUPERB is an English-based benchmark for speech encoders covering a wide range of downstream speech tasks across domains such as recognition, detection, semantics, speaker, and paralinguistics, where each task is finetuned separately with a frozen encoder.
 
77
  ### Automatic Speech Recognition (ASR)
78
 
79
  <p align="center">
80
+ <img src="overall_wer.svg" width="680"/>
81
+ <img src="audiobench_wer.svg" width="680"/>
82
+ <img src="fleurs_wer.svg" width="680"/>
83
  </p>
84
 
85
+ Leveraging the multilingual capabilities of MERaLiON-SpeechEncoder-2, we further finetuned the model on supervised speech data to produce a lightweight MERaLiON-SpeechEncoder-2-ASR-CTC, which is competitive with models many times its size in transcribing the target languages while offering much faster inference. It outperforms the popular Whisper large v3 across most languages in [Audiobench](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) and maintains close performance on FLEURS. Our internal benchmarking, shown in the 'Overall ASR Performance' figure, also includes several private datasets in addition to Audiobench and FLEURS.
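
The figure file names (`overall_wer.svg` and so on) suggest the ASR results are reported as word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch in plain Python, assuming whitespace tokenization and no text normalization. The function name `word_error_rate` is illustrative only; it is not part of the MERaLiON release and is not the scoring script used by Audiobench or FLEURS.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

For languages written without whitespace word boundaries, character error rate is commonly used instead; the same routine applies if the inputs are split into characters rather than words.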
86
 
87
+ ### Direct Use
88
 
89
+ The following code snippet obtains latent features (i.e., encoded speech) directly by running a forward pass through the model.
90
+
91
+ ```python
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoModel, AutoFeatureExtractor
+
+ repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-2'
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ # load model and feature extractor
+ model = AutoModel.from_pretrained(
+     repo_id,
+     trust_remote_code=True,
+ )
+ model = model.to(device)
+
+ feature_extractor = AutoFeatureExtractor.from_pretrained(
+     repo_id,
+     trust_remote_code=True
+ )
+
+ # prepare data
+ data = load_dataset("distil-whisper/librispeech_long", "clean",
+                     split="validation")
+
+ def batch_collater(data):
+     tensors = []
+     for idx, sample in enumerate(data):
+         tensors.append(sample['audio']['array'])
+     return tensors
+
+ audio_array = batch_collater(data)
+ inputs = feature_extractor(audio_array, sampling_rate=16_000,
+                            return_attention_mask=True,
+                            return_tensors='pt', do_normalize=False)
+ inputs = inputs.to(device)
+
+ # model inference to obtain features
+ with torch.no_grad():
+     model.eval()
+     output = model(input_values=inputs['input_values'],
+                    attention_mask=inputs['attention_mask'],
+                    output_hidden_states=True)
133
+ ```
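
The snippet above requests `output_hidden_states=True`, which gives access to per-layer representations. One common way to consume them in frozen-encoder setups such as SUPERB is a softmax-weighted sum over layers. The sketch below illustrates that idea on toy nested lists in plain Python so that it runs without downloading the model. The helper names (`softmax`, `weighted_layer_sum`) and the assumption that the model returns one `[frames][dim]` array per layer are illustrative, not confirmed by the model card; in practice the same arithmetic would be done on torch tensors with learnable layer scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(hidden_states, layer_scores):
    """Mix per-layer features into a single [frames][dim] representation.

    hidden_states: list over layers, each a list of frames, each a list of floats.
    layer_scores:  one unnormalized scalar per layer (learned in a real setup).
    """
    if len(hidden_states) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    n_frames = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    mixed = [[0.0] * dim for _ in range(n_frames)]
    for weight, layer in zip(weights, hidden_states):
        for t in range(n_frames):
            for d in range(dim):
                mixed[t][d] += weight * layer[t][d]
    return mixed


if __name__ == "__main__":
    # Two toy "layers", one frame each, feature dimension 2.
    layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
    print(weighted_layer_sum(layers, [0.0, 0.0]))  # equal scores give the plain average
```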
134
+
135
+ ### Downstream Use
136
+
137
+ Speech encoders are normally used in finetuning setups to provide the frontend to downstream speech applications. We provide an example below of an ASR finetuning setup with Hugging Face. Please refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for the full ASR finetuning recipe with the Hugging Face Trainer. Alternatively, the Hugging Face model can be loaded into other frameworks such as PyTorch or ESPnet for custom finetuning loops.
138
+
139
+ ```python
+ import torch
+ import json
+ from datasets import load_dataset
+ from transformers import AutoModelForCTC, AutoFeatureExtractor, Wav2Vec2CTCTokenizer
+
+ repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-2'
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ # prepare data
+ def pre_processing(batch):
+     batch["text"] = batch["text"].lower()
+     return batch
+
+ def extract_all_chars(batch):
+     all_text = " ".join(batch["text"])
+     vocab = list(set(all_text))
+     return {"vocab": [vocab], "all_text": [all_text]}
+
+ librispeech100h_train = load_dataset("openslr/librispeech_asr", split="train.clean.100")
+ librispeech100h_test = load_dataset("openslr/librispeech_asr", split="validation.clean")
+ librispeech100h_train = librispeech100h_train.remove_columns(
+     ['file', 'speaker_id', 'chapter_id', 'id'])
+ librispeech100h_test = librispeech100h_test.remove_columns(
+     ['file', 'speaker_id', 'chapter_id', 'id'])
+
+ librispeech100h_train = librispeech100h_train.map(pre_processing)
+ librispeech100h_test = librispeech100h_test.map(pre_processing)
+
+ vocab_train = librispeech100h_train.map(extract_all_chars, batched=True,
+                                         batch_size=-1, keep_in_memory=True,
+                                         remove_columns=librispeech100h_train.column_names)
+ vocab_test = librispeech100h_test.map(extract_all_chars, batched=True,
+                                       batch_size=-1, keep_in_memory=True,
+                                       remove_columns=librispeech100h_test.column_names)
+ vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
+ vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}
+
+ vocab_dict["|"] = vocab_dict[" "]
+ del vocab_dict[" "]
+ vocab_dict["[UNK]"] = len(vocab_dict)
+ vocab_dict["[PAD]"] = len(vocab_dict)
+
+ with open('ls_vocab.json', 'w') as vocab_file:
+     json.dump(vocab_dict, vocab_file)
+
+ # load model, feature extractor and tokenizer
+ feature_extractor = AutoFeatureExtractor.from_pretrained(
+     repo_id,
+     trust_remote_code=True,
+ )
+
+ tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json",
+                                  unk_token="[UNK]", pad_token="[PAD]",
+                                  word_delimiter_token="|")
+
+ model = AutoModelForCTC.from_pretrained(
+     repo_id,
+     trust_remote_code=True,
+     vocab_size=len(vocab_dict),
+     feat_proj_dropout=0.1,
+     activation_dropout=0.1,
+     hidden_dropout=0.1,
+     conformer_conv_dropout=0.1,
+     ctc_loss_reduction="mean",
+     pad_token_id=tokenizer.pad_token_id,
+     attention_dropout=0.1,
+ )
+ model = model.to(device)
208
+ ```
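
To make the vocabulary step above easier to follow without downloading LibriSpeech, here is a small self-contained sketch that builds the same kind of character vocabulary from a list of transcripts and then applies greedy CTC decoding to a sequence of frame-level token ids. The helper names `build_ctc_vocab` and `greedy_ctc_decode` are hypothetical, and the sketch assumes that the `[PAD]` token doubles as the CTC blank, which is the convention used by `Wav2Vec2CTCTokenizer`. It illustrates the idea only and does not replace the tokenizer's own decoding.

```python
def build_ctc_vocab(transcripts):
    """Character vocabulary with '|' as word delimiter plus [UNK] and [PAD]."""
    chars = sorted(set(" ".join(transcripts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    if " " in vocab:
        # The space is replaced by a visible word delimiter, as in the snippet above.
        vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)  # assumed to serve as the CTC blank
    return vocab


def greedy_ctc_decode(frame_ids, vocab):
    """Collapse repeated ids, drop blanks, then map ids back to text."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    blank_id = vocab["[PAD]"]
    tokens = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens).replace("|", " ")


if __name__ == "__main__":
    vocab = build_ctc_vocab(["hi there", "hi"])
    print(vocab)
    # Frame-level argmax ids as a CTC head might emit them (toy values).
    print(greedy_ctc_decode([2, 2, 7, 3, 0, 0, 5, 2, 1, 4, 1], vocab))
```

A blank between two identical ids is what allows doubled letters to survive the collapse step, which is why the blank token must be part of the vocabulary passed to the CTC head.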
209
 
210
 
211