Whisper Multilingual Fine-tuned Model

This is a fine-tuned version of OpenAI's Whisper model for multilingual speech recognition.

Supported Languages

  • English (en)
  • Hindi (hi)
  • Bengali (bn)
  • Marathi (mr)
  • Tamil (ta)
  • Telugu (te)

Model Details

  • Base Model: Distil-Whisper Large V3 (distil-whisper/distil-large-v3)
  • Fine-tuned on: Custom multilingual dataset
  • Training Framework: Hugging Face Transformers
  • Model Type: Speech-to-Text
  • Parameters: 756M (F32, Safetensors)

Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("TheKingMonarch/whisper-multilang-finetuned")
model = WhisperForConditionalGeneration.from_pretrained("TheKingMonarch/whisper-multilang-finetuned")

# Clear the forced decoder IDs saved with the checkpoint so the
# language and task can be chosen freely at generation time
model.generation_config.forced_decoder_ids = None

# Load audio
audio, _ = librosa.load("audio.wav", sr=16000)

# Transcribe
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
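
For recordings longer than Whisper's 30-second window, the Transformers pipeline API can chunk the audio automatically. The following is a minimal sketch, assuming a recent transformers release; chunk_length_s=30 is an illustrative setting, not one published with this model.

from transformers import pipeline

# Build an ASR pipeline around the fine-tuned checkpoint;
# chunk_length_s splits long recordings into 30 s windows
asr = pipeline(
    "automatic-speech-recognition",
    model="TheKingMonarch/whisper-multilang-finetuned",
    chunk_length_s=30,
)

result = asr("long_audio.wav")
print(result["text"])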

Language-specific Usage

# For a specific language (e.g., Hindi)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="hi", task="transcribe")
predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
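
On recent transformers releases, the language and task can also be passed straight to generate(), which avoids managing forced_decoder_ids by hand; a minimal equivalent of the snippet above:

# Equivalent on newer transformers versions
predicted_ids = model.generate(inputs.input_features, language="hi", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]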

Training Details

  • Fine-tuned on a custom multilingual speech dataset
  • Optimized for Indian languages and English
  • Training Steps: 600
  • Final WER: 27.08%
  • Best WER: 26.73%, reached at step 550 (see the WER computation sketch after this list)
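
For reference, word error rate (WER) can be computed with Hugging Face's evaluate package (which wraps jiwer). A minimal sketch; the strings below are toy placeholders, not samples from this model's evaluation set.

import evaluate

# Word error rate = (substitutions + insertions + deletions) / reference words
wer_metric = evaluate.load("wer")
wer = wer_metric.compute(
    predictions=["the model transcribed this"],
    references=["the model transcribed these"],
)
print(f"WER: {wer * 100:.2f}%")  # 25.00% for this toy pair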

Training Metrics

Step    Training Loss    Validation Loss    WER (%)
  50       2.075000          1.930286        133.45
 100       1.206600          1.275027         89.54
 150       0.793800          0.712475         93.42
 200       0.528700          0.562679         88.92
 250       0.379900          0.473467         89.27
 300       0.289400          0.369892         69.88
 350       0.244300          0.291235         49.58
 400       0.268800          0.249055         42.80
 450       0.122200          0.209867         36.29
 500       0.084700          0.173593         31.44
 550       0.073400          0.155249         26.73
 600       0.044300          0.148559         27.08

Training Configuration

  • Base Model: Distil-Whisper Large V3 (distil-whisper/distil-large-v3)
  • Learning Rate: tuned during training (final value not reported)
  • Batch Size: tuned for the available hardware (value not reported)
  • Training Duration: 600 steps
  • Evaluation Strategy: every 50 steps
  • Early Stopping: based on WER improvement (a plausible trainer setup is sketched below)
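
The exact hyperparameters were not published. On a recent transformers release, a trainer setup consistent with the schedule above (600 steps, evaluation every 50, early stopping on WER) could look like the sketch below; the learning rate, batch size, patience, and output_dir are assumed values, not the ones actually used.

from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-multilang-finetuned",  # hypothetical path
    max_steps=600,                    # stated: 600 training steps
    eval_strategy="steps",
    eval_steps=50,                    # stated: evaluation every 50 steps
    save_steps=50,
    learning_rate=1e-5,               # assumed, not reported
    per_device_train_batch_size=16,   # assumed, not reported
    predict_with_generate=True,       # needed to compute WER at eval time
    load_best_model_at_end=True,
    metric_for_best_model="wer",      # model selection keyed on WER
    greater_is_better=False,          # lower WER is better
)

# Stop if WER fails to improve for a few evaluations (patience assumed)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)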

Limitations

  • Performance may vary across different accents and dialects
  • Best results on clear audio with minimal background noise
  • Optimized for the specific languages listed above

Citation

If you use this model, please cite:

@misc{whisper-multilang-finetuned,
  author = {TheKingMonarch},
  title = {Whisper Multilingual Fine-tuned Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TheKingMonarch/whisper-multilang-finetuned}
}