---
license: mit
title: 'Long-Form Text-to-Speech Generator'
sdk: gradio
emoji: π
colorFrom: indigo
colorTo: red
pinned: true
short_description: 'Unlimited text length: handle texts of any size.'
---
# Long-Form Text-to-Speech Generator
A Hugging Face Space that converts text of any length into natural-sounding speech using freely available open-source models.
## Features
- **Unlimited Text Length**: Handles text of any size, from short sentences to entire articles
- **Human-like Voice**: Uses Microsoft's SpeechT5 model for natural speech synthesis
- **Smart Text Processing**: Chunking at sentence boundaries preserves sentence flow and natural pauses
- **Completely Free**: Uses only open-source models; no API keys required
- **Auto-preprocessing**: Normalizes text, expanding abbreviations and numbers
- **Easy to Use**: Simple web interface built with Gradio
## How It Works
1. **Text Preprocessing**: Cleans and normalizes the input text, expanding abbreviations and numbers
2. **Smart Chunking**: Splits long text at natural sentence boundaries (max 500 characters per chunk)
3. **Speech Generation**: Synthesizes each chunk with the SpeechT5 TTS model
4. **Audio Merging**: Concatenates the audio segments, with natural pauses between chunks
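
The chunking step can be sketched roughly as follows. This is a minimal illustration, not the Space's actual code: the function name `chunk_text`, the sentence-splitting regex, and the hard split for over-long sentences are all assumptions. Only the 500-character limit comes from the steps above.

```python
import re

MAX_CHARS = 500  # chunk limit stated in the steps above


def chunk_text(text, max_chars=MAX_CHARS):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Heuristic: a sentence ends at ".", "!" or "?" followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split (assumed fallback).
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # the sentence still fits in the current chunk
        else:
            chunks.append(current)  # close the current chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Packing whole sentences keeps each chunk's prosody self-contained, which is why the split happens at sentence boundaries and not at a fixed character offset.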
## Models Used
- **Text-to-Speech**: `microsoft/speecht5_tts` - High-quality neural TTS
- **Vocoder**: `microsoft/speecht5_hifigan` - Neural vocoder that converts spectrograms to waveforms
- **Speaker Embeddings**: Derived from the CMU Arctic dataset for consistent voice characteristics
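
For orientation, here is a hedged sketch of how these checkpoints are typically loaded and called through the Transformers SpeechT5 classes. The helper names `load_speecht5` and `synthesize_chunk` are hypothetical, and the speaker embedding is taken as a parameter because this README does not say how the Space obtains it; a `(1, 512)` x-vector tensor is assumed.

```python
def load_speecht5():
    """Load the processor, acoustic model, and vocoder listed above."""
    # Imported lazily so the sketch can be read without the heavy dependencies installed.
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
    return processor, model, vocoder


def synthesize_chunk(text, speaker_embedding, processor, model, vocoder):
    """Return a 1-D float waveform for one chunk of text.

    speaker_embedding is assumed to be a torch tensor of shape (1, 512),
    for example an x-vector derived from CMU Arctic recordings.
    """
    import torch

    inputs = processor(text=text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        speech = model.generate_speech(
            inputs["input_ids"], speaker_embedding, vocoder=vocoder
        )
    return speech.numpy()
```

Reusing the same speaker embedding for every chunk is what keeps the voice consistent across a long text.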
## Usage
1. Enter or paste your text in the input box (there is no length limit)
2. Click "Generate Speech"
3. Wait for processing (longer texts take more time)
4. Play or download the generated audio
## Tips for Best Results
- Use proper punctuation so pauses fall in natural places
- Well-formatted text produces better speech quality
- Common abbreviations are expanded automatically
- Numbers are converted to spoken form
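
To make the abbreviation and number handling concrete, here is a small sketch of what such normalization might look like. The abbreviation table and the names `number_to_words` and `normalize_text` are illustrative assumptions, not the Space's implementation. Decimals, thousands separators, and ordinals are deliberately not handled.

```python
import re

# Hypothetical abbreviation table; the real list is not given in this README.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister", "Mrs.": "Missus", "vs.": "versus"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999,999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def normalize_text(text):
    """Expand known abbreviations, spell out integers, and collapse whitespace."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = re.sub(r"\b" + re.escape(abbreviation), expansion, text)
    # Integers of up to six digits are spelled out; longer digit runs are left as-is.
    text = re.sub(r"\b\d{1,6}\b", lambda m: number_to_words(int(m.group())), text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_text("Dr. Smith has 42 cats.")` returns `"Doctor Smith has forty-two cats."`.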
## Technical Details
- **Architecture**: Transformer-based neural TTS
- **Sample Rate**: 16 kHz
- **Audio Format**: WAV
- **Processing**: CPU-optimized (works on free Hugging Face hardware)
- **Memory Efficient**: Processes text in chunks to handle large documents
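
As an illustration of the output format, the sketch below joins per-chunk waveforms with short silences and writes a mono 16-bit WAV file at 16 kHz using only the standard library. The function names, the 0.3-second pause, and the 16-bit sample width are assumptions; the README specifies only the sample rate and the WAV container.

```python
import array
import sys
import wave

SAMPLE_RATE = 16_000  # Hz, as listed above


def merge_chunks(chunks, pause_seconds=0.3):
    """Concatenate float waveforms (values in [-1, 1]) with silence between them."""
    silence = [0.0] * round(SAMPLE_RATE * pause_seconds)
    merged = []
    for index, chunk in enumerate(chunks):
        if index > 0:
            merged.extend(silence)  # pause before every chunk except the first
        merged.extend(chunk)
    return merged


def write_wav(path, samples):
    """Write mono 16-bit PCM audio at SAMPLE_RATE."""
    # Clip to [-1, 1], then scale to the signed 16-bit range.
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV PCM data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm.tobytes())
```

In a real application the waveforms would more likely be NumPy arrays joined with `numpy.concatenate`; plain lists are used here to keep the example dependency-free.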
## Local Installation
```bash
git clone <your-space-url>
cd <your-space-name>
pip install -r requirements.txt
python app.py
```
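## License
This project is released under the MIT license and uses open-source models. Check the license of each component before reuse:
- SpeechT5 (`microsoft/speecht5_tts`, `microsoft/speecht5_hifigan`): listed as MIT on the Hugging Face model cards
- CMU Arctic: distributed under its own permissive license; see the license terms shipped with the dataset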
## Contributing
Issues and enhancement requests are welcome.
## Links
- [SpeechT5 Paper](https://arxiv.org/abs/2110.07205)
- [Hugging Face Transformers](https://huggingface.co/transformers/)
- [Gradio Documentation](https://gradio.app/docs/)

---

**Built with ❤️ using Hugging Face Transformers and Gradio**