Text2speech / README.md

Nick021402

	---
	license: mit
	title: '🎤 Long-Form Text-to-Speech Generator'
	sdk: gradio
	emoji: 🚀
	colorFrom: indigo
	colorTo: red
	pinned: true
	short_description: 'Unlimited Text Length: Handle texts of any size.'
	---
	# 🎤 Long-Form Text-to-Speech Generator

	A Hugging Face Space that converts text of any length into natural-sounding speech using free, open-source AI models.

	## ✨ Features

	- 🚀 Unlimited Text Length: Handle texts of any size, from short sentences to entire articles
	- 🤖 Human-like Voice: Uses Microsoft's SpeechT5 model for natural speech synthesis
	- ⚡ Smart Text Processing: Intelligent chunking preserves sentence flow and natural pauses
	- 🆓 Completely Free: Uses only open-source models, no API keys required
	- 🔧 Auto-preprocessing: Handles abbreviations, numbers, and text normalization
	- 📱 Easy to Use: Simple web interface built with Gradio

	## 🛠️ How It Works

	1. Text Preprocessing: Cleans and normalizes input text, handling abbreviations and numbers
	2. Smart Chunking: Splits long text at natural sentence boundaries (at most 500 characters per chunk)
	3. Speech Generation: Processes each chunk using the SpeechT5 TTS model
	4. Audio Merging: Combines all audio segments with natural pauses between chunks
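
The chunking step above could be sketched roughly as follows. This is a minimal illustration, not the Space's actual code: the names `split_into_chunks` and `_split_long`, the sentence-splitting regular expression, and the word-level fallback for oversized sentences are all assumptions. Only the 500-character limit comes from the list above.

```python
import re
from typing import List

MAX_CHARS = 500  # per-chunk limit stated in "How It Works"


def _split_long(sentence: str, max_chars: int) -> List[str]:
    """Hypothetical fallback: break an oversized sentence on word boundaries."""
    pieces: List[str] = []
    current = ""
    for word in sentence.split():
        # A single word longer than the limit is cut hard.
        while len(word) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Split after '.', '!' or '?' followed by whitespace, keeping the punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        pieces = _split_long(sentence, max_chars) if len(sentence) > max_chars else [sentence]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Packing sentences greedily keeps each chunk as long as the limit allows, which reduces the number of model calls while still ending every chunk at a sentence boundary whenever the sentences themselves fit.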

	## 🚀 Models Used

	- Text-to-Speech: `microsoft/speecht5_tts` - High-quality neural TTS
	- Vocoder: `microsoft/speecht5_hifigan` - Neural vocoder for audio generation
	- Speaker Embeddings: CMU Arctic dataset for consistent voice characteristics
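
For orientation, here is a hedged sketch of how these models are typically loaded with the Transformers library. It is not taken from this Space's `app.py`. The x-vector dataset ID `Matthijs/cmu-arctic-xvectors` and the speaker index are assumptions borrowed from the public SpeechT5 examples, the helper names `load_tts` and `synthesize` are hypothetical, and loading the dataset may need adjusting depending on your `datasets` version.

```python
TTS_MODEL_ID = "microsoft/speecht5_tts"
VOCODER_MODEL_ID = "microsoft/speecht5_hifigan"
# Assumption: precomputed CMU Arctic x-vectors hosted on the Hugging Face Hub.
XVECTOR_DATASET_ID = "Matthijs/cmu-arctic-xvectors"
# Assumption: speaker index used in the public SpeechT5 examples.
SPEAKER_INDEX = 7306


def load_tts():
    """Load the processor, TTS model, vocoder, and one speaker embedding."""
    # Heavy dependencies are imported lazily so this file imports without them.
    import torch
    from datasets import load_dataset
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained(TTS_MODEL_ID)
    model = SpeechT5ForTextToSpeech.from_pretrained(TTS_MODEL_ID)
    vocoder = SpeechT5HifiGan.from_pretrained(VOCODER_MODEL_ID)
    xvectors = load_dataset(XVECTOR_DATASET_ID, split="validation")
    # Shape (1, 512): one fixed speaker gives a consistent voice across chunks.
    speaker = torch.tensor(xvectors[SPEAKER_INDEX]["xvector"]).unsqueeze(0)
    return processor, model, vocoder, speaker


def synthesize(text, processor, model, vocoder, speaker):
    """Return a 1-D float waveform (16 kHz) for a single chunk of text."""
    import torch

    inputs = processor(text=text, return_tensors="pt")
    with torch.no_grad():
        speech = model.generate_speech(inputs["input_ids"], speaker, vocoder=vocoder)
    return speech.numpy()
```

Reusing the same speaker embedding for every chunk is what keeps the voice consistent across a long text.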

	## 💻 Usage

	1. Enter or paste your text in the input box (no length limit!)
	2. Click "Generate Speech"
	3. Wait for processing (longer texts take more time)
	4. Download or play the generated audio

	## 📝 Tips for Best Results

	- Use proper punctuation for natural pauses
	- Well-formatted text produces better speech quality
	- The system automatically handles common abbreviations
	- Numbers are converted to spoken form
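
To illustrate what abbreviation and number handling can look like, here is a small sketch. The abbreviation table, the function names, and the supported number range are assumptions made for this example; the preprocessing rules the Space actually applies are not listed in this README.

```python
import re

# Hypothetical abbreviation table; the Space's real list is not shown here.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister", "Mrs.": "Missus", "vs.": "versus"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out a non-negative integer; values of a million or more are read digit by digit."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    if n < 1_000_000:
        thousands, rest = divmod(n, 1000)
        return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")
    return " ".join(ONES[int(d)] for d in str(n))


def normalize(text: str) -> str:
    """Expand known abbreviations, spell out integers, and collapse whitespace."""
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(r"\b" + re.escape(abbr), full, text)
    # Only plain digit runs are handled; decimals, dates, and currency are out of scope.
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize("Dr. Smith has 42 cats.")` yields `"Doctor Smith has forty-two cats."`.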

	## 🔧 Technical Details

	- Architecture: Transformer-based neural TTS
	- Sample Rate: 16 kHz
	- Audio Format: WAV
	- Processing: CPU-optimized (works on free Hugging Face hardware)
	- Memory Efficient: Processes text in chunks to handle large documents
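
As a rough illustration of the output format, the sketch below joins per-chunk waveforms with short silences and writes a mono 16 kHz WAV file using only the Python standard library. The 0.3 second pause, the 16-bit sample width, and the function names are assumptions; this README does not state the pause length or bit depth the Space uses.

```python
import array
import sys
import wave

SAMPLE_RATE = 16_000  # Hz, as listed in "Technical Details"
PAUSE_SECONDS = 0.3   # assumed pause length between chunks


def merge_with_pauses(segments, pause_seconds=PAUSE_SECONDS, sample_rate=SAMPLE_RATE):
    """Concatenate float waveforms in [-1, 1], inserting silence between them."""
    silence = [0.0] * round(pause_seconds * sample_rate)
    merged = []
    for index, segment in enumerate(segments):
        if index > 0:
            merged.extend(silence)
        merged.extend(segment)
    return merged


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write samples as mono 16-bit PCM WAV, clipping values to [-1, 1]."""
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
```

Because only one chunk is synthesized at a time, peak memory depends mostly on the merged waveform rather than on the length of the input text.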

	## 🚀 Local Installation

	```bash
	git clone <your-space-url>
	cd <your-space-name>
	pip install -r requirements.txt
	python app.py
	```

	## 📄 License

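	This Space is released under the MIT license. The models and data it relies on carry their own licenses, so please check each one before reuse:
	- SpeechT5: MIT License (as listed on the `microsoft/speecht5_tts` model card)
	- CMU Arctic: permissive, BSD-style license (see the dataset's distribution terms)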

	## 🤝 Contributing

	Feel free to submit issues and enhancement requests!

	## 🔗 Links

	- [SpeechT5 Paper](https://arxiv.org/abs/2110.07205)
	- [Hugging Face Transformers](https://huggingface.co/transformers/)
	- [Gradio Documentation](https://gradio.app/docs/)

	---

	Built with ❤️ using Hugging Face Transformers and Gradio