license: mit
title: '🎤 Long-Form Text-to-Speech Generator'
sdk: gradio
emoji: 🚀
colorFrom: indigo
colorTo: red
pinned: true
short_description: 'Unlimited text length: handle texts of any size.'

🎀 Long-Form Text-to-Speech Generator

A Hugging Face Space that converts text of any length into natural-sounding speech using free, open-source AI models.

✨ Features

  • 🚀 Unlimited Text Length: Handle texts of any size, from short sentences to entire articles
  • 🤖 Human-like Voice: Uses Microsoft's SpeechT5 model for natural speech synthesis
  • ⚡ Smart Text Processing: Intelligent chunking preserves sentence flow and natural pauses
  • 🆓 Completely Free: Uses only open-source models, no API keys required
  • 🔧 Auto-preprocessing: Handles abbreviations, numbers, and text normalization
  • 📱 Easy to Use: Simple web interface built with Gradio

🛠️ How It Works

  1. Text Preprocessing: Cleans and normalizes input text, handling abbreviations and numbers
  2. Smart Chunking: Splits long text at natural sentence boundaries (max 500 chars per chunk)
  3. Speech Generation: Processes each chunk using the SpeechT5 TTS model
  4. Audio Merging: Combines all audio segments with natural pauses between chunks
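
The chunking step (step 2) can be sketched roughly as follows. This is a minimal illustration, not the Space's actual code: the function names `split_into_chunks` and `_split_long` are hypothetical, and the sentence-boundary regex is an assumption. Only the 500-character limit comes from the description above.

```python
import re
from typing import List


def _split_long(sentence: str, max_chars: int) -> List[str]:
    """Break one over-long sentence at word boundaries (hypothetical helper)."""
    pieces, current = [], ""
    for word in sentence.split():
        # A single word longer than the limit is hard-cut as a last resort.
        while len(word) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def split_into_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Assumed boundary rule: split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        pieces = _split_long(sentence, max_chars) if len(sentence) > max_chars else [sentence]
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate      # still fits: keep growing this chunk
            else:
                chunks.append(current)   # full: start a new chunk
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Packing whole sentences keeps each chunk short enough for the model while avoiding cuts in the middle of a sentence, which is where merged audio tends to sound unnatural.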

🚀 Models Used

  • Text-to-Speech: microsoft/speecht5_tts - High-quality neural TTS
  • Vocoder: microsoft/speecht5_hifigan - Neural vocoder for audio generation
  • Speaker Embeddings: CMU Arctic dataset for consistent voice characteristics
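
As a rough sketch, the two checkpoints listed above can be loaded and called through the Transformers SpeechT5 classes as shown below. The function names `load_speecht5` and `synthesize_chunk` are hypothetical, and the README does not say how the Space obtains its speaker embedding, so the sketch simply assumes the caller passes a CMU Arctic x-vector as a tensor of shape (1, 512).

```python
TTS_MODEL_ID = "microsoft/speecht5_tts"
VOCODER_MODEL_ID = "microsoft/speecht5_hifigan"


def load_speecht5():
    """Load processor, acoustic model and vocoder (downloads weights on first call)."""
    # Imported lazily so that importing this module stays cheap.
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained(TTS_MODEL_ID)
    model = SpeechT5ForTextToSpeech.from_pretrained(TTS_MODEL_ID)
    vocoder = SpeechT5HifiGan.from_pretrained(VOCODER_MODEL_ID)
    return processor, model, vocoder


def synthesize_chunk(text, processor, model, vocoder, speaker_embeddings):
    """Return a 1-D float waveform for one chunk of text.

    speaker_embeddings is assumed to be a torch tensor of shape (1, 512),
    for example an x-vector computed from a CMU Arctic speaker.
    """
    import torch

    inputs = processor(text=text, return_tensors="pt")
    with torch.no_grad():
        speech = model.generate_speech(
            inputs["input_ids"],
            speaker_embeddings=speaker_embeddings,
            vocoder=vocoder,  # with a vocoder, the output is a waveform, not a spectrogram
        )
    return speech.cpu().numpy()
```

Reusing the same speaker embedding for every chunk is what keeps the voice consistent across a long text.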

💻 Usage

  1. Enter or paste your text in the input box (no length limit!)
  2. Click "Generate Speech"
  3. Wait for processing (longer texts take more time)
  4. Download or play the generated audio

📝 Tips for Best Results

  • Use proper punctuation for natural pauses
  • Well-formatted text produces better speech quality
  • The system automatically handles common abbreviations
  • Numbers are converted to spoken form
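
The last two tips refer to the preprocessing step. A minimal sketch of what such normalization might look like is shown below. The abbreviation table and the names `ABBREVIATIONS`, `number_to_words`, and `normalize_text` are illustrative assumptions, since the Space's actual rules are not documented here. The sketch only handles plain integers below one million; decimals, ordinals, and currency are left out.

```python
import re

# Hypothetical abbreviation table; the Space's real list is not documented.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister", "Mrs.": "Missus", "vs.": "versus"}

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0 <= n < 1_000_000 in English."""
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return _ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def normalize_text(text: str) -> str:
    """Expand known abbreviations, spell out integers, and collapse whitespace."""
    for abbr, full in ABBREVIATIONS.items():
        # Only match the abbreviation when it does not continue a longer word.
        text = re.sub(r"(?<!\w)" + re.escape(abbr), full, text)
    # Standalone runs of one to six digits become words; longer runs are left as-is.
    text = re.sub(r"\b\d{1,6}\b", lambda m: number_to_words(int(m.group())), text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_text("Dr. Smith has 21 dogs.")` yields `"Doctor Smith has twenty-one dogs."`.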

🔧 Technical Details

  • Architecture: Transformer-based neural TTS
  • Sample Rate: 16 kHz
  • Audio Format: WAV
  • Processing: CPU-optimized (works on free Hugging Face hardware)
  • Memory Efficient: Processes text in chunks to handle large documents
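
To illustrate how per-chunk audio could be joined and saved in the format listed above (16 kHz WAV), here is a small standard-library sketch. The names `merge_chunks` and `write_wav` and the 0.3-second default pause are assumptions; the README only says that chunks are joined with "natural pauses". Mono 16-bit PCM is also an assumption.

```python
import sys
import wave
from array import array
from typing import Iterable, List, Sequence

SAMPLE_RATE = 16_000  # Hz, as stated in the technical details


def merge_chunks(chunks: Sequence[Iterable[float]],
                 pause_seconds: float = 0.3,
                 sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Concatenate waveforms, inserting silence between consecutive chunks."""
    silence = [0.0] * int(round(pause_seconds * sample_rate))
    merged: List[float] = []
    for index, chunk in enumerate(chunks):
        if index > 0:
            merged.extend(silence)  # pause only between chunks, not at the ends
        merged.extend(float(sample) for sample in chunk)
    return merged


def write_wav(path: str, samples: Iterable[float],
              sample_rate: int = SAMPLE_RATE) -> None:
    """Write float samples in [-1.0, 1.0] as mono 16-bit PCM WAV."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
```

Because each chunk is synthesized separately and only the finished waveforms are concatenated, peak memory during synthesis depends on chunk size rather than on the length of the whole document.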

🚀 Local Installation

git clone <your-space-url>
cd <your-space-name>
pip install -r requirements.txt
python app.py

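📄 License

This project is released under the MIT license and uses open-source models. Please check the individual model and dataset licenses before redistributing:

  • SpeechT5 (microsoft/speecht5_tts, microsoft/speecht5_hifigan): MIT License, per the Hugging Face model cards
  • CMU Arctic: permissive, BSD-style license that allows research and commercial use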

🀝 Contributing

Feel free to submit issues and enhancement requests!

🔗 Links


Built with ❤️ using Hugging Face Transformers and Gradio