---
license: apache-2.0
language:
- en
- hi
library_name: rkllm
tags:
- text-to-speech
- tts
- hindi
- english
- llama
- audio
- speech
- india
- rkllm
- rockchip
- rk3588
datasets:
- proprietary
pipeline_tag: text-to-speech
base_model: maya-research/Veena
base_model_relation: quantized
co2_eq_emissions:
  emissions: 0
  source: "Not specified"
  training_type: "unknown"
  geographical_location: "unknown"
---

# Veena - Text to Speech for Indian Languages

Veena is a state-of-the-art neural text-to-speech (TTS) model specifically designed for Indian languages, developed by Maya Research. Built on a Llama architecture backbone, Veena generates natural, expressive speech in Hindi and English with remarkable quality and ultra-low latency.

## Model Overview

**Veena** is a 3B parameter autoregressive transformer model based on the Llama architecture. It is designed to synthesize high-quality speech from text in Hindi and English, including code-mixed scenarios. The model outputs audio at a 24kHz sampling rate using the SNAC neural codec.

* **Model type:** Autoregressive Transformer
* **Base Architecture:** Llama (3B parameters)
* **Languages:** Hindi, English
* **Audio Codec:** SNAC @ 24kHz
* **License:** Apache 2.0
* **Developed by:** Maya Research
* **Model URL:** [https://huggingface.co/maya-research/veena](https://huggingface.co/maya-research/veena)

## Key Features

* **4 Distinct Voices:** `kavya`, `agastya`, `maitri`, and `vinaya` - each with unique vocal characteristics.
* **Multilingual Support:** Native Hindi and English capabilities with code-mixed support.
* **Ultra-Fast Inference:** Sub-80ms latency on H100-80GB GPUs.
* **High-Quality Audio:** 24kHz output with the SNAC neural codec.
* **Production-Ready:** Optimized for real-world deployment with 4-bit quantization support.

## How to Get Started with the Model

### Installation

To use Veena, you need to install the `transformers`, `torch`, `torchaudio`, `snac`, and `bitsandbytes` libraries.

```bash
pip install transformers torch torchaudio
pip install snac bitsandbytes  # For audio decoding and quantization
```

### Basic Usage

The following Python code demonstrates how to generate speech from text using Veena with 4-bit quantization for efficient inference.

## Uses

Veena is ideal for a wide range of applications requiring high-quality, low-latency speech synthesis for Indian languages, including:

* **Accessibility:** Screen readers and voice-enabled assistance for visually impaired users.
* **Customer Service:** IVR systems, voice bots, and automated announcements.
* **Content Creation:** Dubbing for videos, e-learning materials, and audiobooks.
* **Automotive:** In-car navigation and infotainment systems.
* **Edge Devices:** Voice-enabled smart devices and IoT applications.

## Technical Specifications

### Architecture

Veena leverages a 3B parameter transformer-based architecture with several key innovations:

* **Base Architecture:** Llama-style autoregressive transformer (3B parameters)
* **Audio Codec:** SNAC (24kHz) for high-quality audio token generation
* **Speaker Conditioning:** Special speaker tokens, one for each of the four voices
* **Parameter-Efficient Training:** LoRA adaptation with differentiated ranks for attention and FFN modules.
* **Context Length:** 2048 tokens

### Training

#### Training Infrastructure

* **Hardware:** 8× NVIDIA H100 80GB GPUs
* **Distributed Training:** DDP with optimized communication
* **Precision:** BF16 mixed precision training with gradient checkpointing
* **Memory Optimization:** 4-bit quantization with NF4 + double quantization

#### Training Configuration

* **LoRA Configuration:**
    * `lora_rank_attention`: 192
    * `lora_rank_ffn`: 96
    * `lora_alpha`: 2× rank (384 for attention, 192 for FFN)
    * `lora_dropout`: 0.05
    * `target_modules`: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
    * `modules_to_save`: `["embed_tokens"]`
* **Optimizer Configuration:**
    * `optimizer`: AdamW (8-bit)
    * `optimizer_betas`: (0.9, 0.98)
    * `optimizer_eps`: 1e-5
    * `learning_rate_peak`: 1e-4
    * `lr_scheduler`: cosine
    * `warmup_ratio`: 0.02
* **Batch Configuration:**
    * `micro_batch_size`: 8
    * `gradient_accumulation_steps`: 4
    * `effective_batch_size`: 256

#### Training Data

Veena was trained on **proprietary, high-quality datasets** specifically curated for Indian language TTS.

* **Data Volume:** 15,000+ utterances per speaker (60,000+ total)
* **Languages:** Native Hindi and English utterances with code-mixed support
* **Speaker Diversity:** 4 professional voice artists with distinct characteristics
* **Audio Quality:** Studio-grade recordings at 24kHz sampling rate
* **Content Diversity:** Conversational, narrative, expressive, and informational styles

**Note:** The training datasets are proprietary and not publicly available.

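
The batch and LoRA figures listed under Training Configuration can be cross-checked with a little arithmetic. The sketch below assumes one data-parallel (DDP) process per GPU across the 8 H100s, which the card does not state explicitly. The function and variable names are illustrative and are not part of any Veena tooling.

```python
# Values copied from the Training Infrastructure and Training Configuration lists.
NUM_GPUS = 8                      # assumption: one DDP process per H100
MICRO_BATCH_SIZE = 8              # samples per GPU per forward/backward pass
GRADIENT_ACCUMULATION_STEPS = 4   # micro-batches accumulated per optimizer step
LORA_RANKS = {"attention": 192, "ffn": 96}
ALPHA_MULTIPLIER = 2              # "lora_alpha: 2x rank"


def effective_batch_size(num_gpus: int, micro_batch: int, accum_steps: int) -> int:
    """Samples contributing to one optimizer step under data parallelism."""
    return num_gpus * micro_batch * accum_steps


def lora_alphas(ranks: dict[str, int], multiplier: int) -> dict[str, int]:
    """Scale each module group's LoRA rank by a shared multiplier."""
    return {group: multiplier * rank for group, rank in ranks.items()}


batch = effective_batch_size(NUM_GPUS, MICRO_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS)
alphas = lora_alphas(LORA_RANKS, ALPHA_MULTIPLIER)
print(batch)   # 256
print(alphas)  # {'attention': 384, 'ffn': 192}
```

Under that assumption, 8 GPUs × 8 samples × 4 accumulation steps reproduces the listed `effective_batch_size` of 256, and doubling each rank reproduces the listed `lora_alpha` values of 384 and 192.
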
## Performance Benchmarks

| Metric              | Value                    |
| ------------------- | ------------------------ |
| Latency (H100-80GB) | <80ms                    |
| Latency (A100-40GB) | ~120ms                   |
| Latency (RTX 4090)  | ~200ms                   |
| Real-time Factor    | 0.05x                    |
| Throughput          | ~170k tokens/s (8×H100)  |
| Audio Quality (MOS) | 4.2/5.0                  |
| Speaker Similarity  | 92%                      |
| Intelligibility     | 98%                      |

## Risks, Limitations and Biases

* **Language Support:** Currently supports only Hindi and English. Performance on other Indian languages is not guaranteed.
* **Speaker Diversity:** Limited to 4 speaker voices, which may not represent the full diversity of Indian accents and dialects.
* **Hardware Requirements:** Requires a GPU for real-time or near-real-time inference. CPU performance will be significantly slower.
* **Input Length:** The model is limited to a maximum input length of 2048 tokens.
* **Bias:** The model's performance and voice characteristics are a reflection of the proprietary training data. It may exhibit biases present in the data.

## Future Updates

We are actively working on expanding Veena's capabilities:

* Support for Tamil, Telugu, Bengali, Marathi, and other Indian languages.
* Additional speaker voices with regional accents.
* Emotion and prosody control tokens.
* Streaming inference support.
* CPU optimization for edge deployment.

## Citing

If you use Veena in your research or applications, please cite:

```bibtex
@misc{veena2025,
  title={Veena: Open Source Text-to-Speech for Indian Languages},
  author={Maya Research Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/maya-research/veena-tts}
}
```

## Acknowledgments

We thank the open-source community and all contributors who made this project possible. Special thanks to the voice artists who provided high-quality recordings for training.