# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands

### Setup and Installation

```bash
# Initial setup - creates necessary directories
./setup.sh

# Install Python dependencies
pip install -r requirements.txt

# Pre-installation requirements (if needed)
pip install -r pre-requirements.txt
```
### Running the Application

```bash
# Run the optimized Gradio interface (recommended)
python app_optimized.py

# Run the original Gradio interface
python app.py

# Run the FastAPI server for API access
python api_server.py
```
### Testing

```bash
# Run basic API tests
python test_api.py

# Run API client tests
python test_api_client.py

# Run performance tests
python test_performance.py

# Run optimized performance tests
python test_performance_optimized.py

# Run real-world performance tests
python test_performance_real.py
```
## Architecture Overview

This is a **Talking Head Generation System** that creates lip-synced videos from audio and source images. The project is structured in three phases, with Phase 3 focusing on performance optimization.
### Core Processing Pipeline

1. **Input**: Audio file (WAV) + source image (PNG/JPG)
2. **Audio Processing**: Extract features using the HuBERT model
3. **Motion Generation**: Generate facial motion from audio features
4. **Image Warping**: Apply motion to the source image
5. **Video Generation**: Create the final video with audio sync
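For orientation, here is a minimal, self-contained sketch of how data flows through these five stages. Every function body is a placeholder that only shows the shape of data passed between stages; none of the names are the repository's actual API (the real units live in `core/atomic_components/`):

```python
# Hypothetical stage functions -- shapes are illustrative, not the real model's.
import numpy as np

def extract_audio_features(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Stage 2: the real pipeline runs HuBERT; here we fake a feature
    # sequence of ~50 vectors per second of audio.
    n_frames = int(len(audio) / sr * 50)
    return np.zeros((n_frames, 1024))

def features_to_motion(features: np.ndarray) -> np.ndarray:
    # Stage 3: map each audio feature frame to facial motion parameters
    # (63 is an assumed motion dimensionality).
    return np.zeros((len(features), 63))

def warp_image(source: np.ndarray, motion: np.ndarray) -> np.ndarray:
    # Stage 4: apply each motion frame to the 320x320 source image.
    return np.broadcast_to(source, (len(motion),) + source.shape)

# Stage 5 (not shown): encode the warped frames to MP4 and mux the
# original audio back in, e.g. with ffmpeg.
audio = np.zeros(16000 * 2)                  # 2 s of silence at 16 kHz
source = np.zeros((320, 320, 3), np.uint8)   # pre-resized source image
frames = warp_image(source, features_to_motion(extract_audio_features(audio)))
print(frames.shape)                          # (100, 320, 320, 3)
```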
### Key Components

#### Model Management (`model_manager.py`)

- Downloads models from Hugging Face on first run (~2.5 GB)
- Manages PyTorch and TensorRT model variants
- Caches models in `/tmp/ditto_models`
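A minimal sketch of what the first-run download might look like, assuming `huggingface_hub` is installed. The cache path mirrors the description above, but the repo id `your-org/ditto-models` is a placeholder, not the project's real repository:

```python
from pathlib import Path
from huggingface_hub import snapshot_download

CACHE_DIR = Path("/tmp/ditto_models")

def ensure_models(repo_id: str = "your-org/ditto-models") -> Path:
    # snapshot_download is idempotent: it reuses the local cache when the
    # files (~2.5 GB on first run) are already present.
    local_dir = snapshot_download(repo_id=repo_id, cache_dir=str(CACHE_DIR))
    return Path(local_dir)
```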
#### Core Processing (`/core/`)

- **atomic_components/**: Basic processing units
  - `audio2motion.py`: Audio-to-motion conversion
  - `warping.py`: Image warping logic
- **aux_models/**: Supporting models (face detection, landmarks, HuBERT)
- **models/**: Main neural network architectures
- **optimization/**: Phase 3 performance optimizations
#### Phase 3 Optimizations (`/core/optimization/`)

- **resolution_optimization.py**: Fixed 320×320 processing
- **gpu_optimization.py**: Mixed precision, torch.compile
- **avatar_cache.py**: Pre-cached avatar system with tokens
- **cold_start_optimization.py**: Optimized model loading
- **inference_cache.py**: Result caching
- **parallel_processing.py**: CPU-GPU parallel execution
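For reference, the mixed-precision and `torch.compile` pattern named above looks roughly like this in standard PyTorch 2.x; the model here is a stand-in, not the project's actual network:

```python
import torch

# Stand-in model: map 1024-dim audio features to 63 motion parameters.
model = torch.nn.Linear(1024, 63).cuda().eval()
model = torch.compile(model)  # JIT-compile the forward pass (PyTorch 2.x)

x = torch.randn(50, 1024, device="cuda")
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    motion = model(x)  # matmuls run in fp16 where numerically safe
```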
### Performance Targets

- Process 16 seconds of audio in ~15 seconds (50-65% faster with Phase 3)
- First Frame Delay (FFD): <400 ms on A100
- Real-time factor (RTF): <1.0
- Latest target (2025-07-18): 2-second streaming chunks
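These metrics relate by plain arithmetic: RTF is processing time divided by audio duration, so the first target implies the third.

```python
processing_time = 15.0   # seconds to process the clip
audio_duration = 16.0    # seconds of input audio
rtf = processing_time / audio_duration
print(f"RTF = {rtf:.2f}")  # 0.94 -> under the <1.0 real-time target
```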
### API Endpoints

#### Gradio API

- `/process_talking_head`: Main processing endpoint
- `/process_talking_head_optimized`: Optimized variant with caching
- `/preload_avatar`: Upload and cache avatars
- `/clear_cache`: Clear the inference cache
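A hedged sketch of calling these endpoints with `gradio_client`. The endpoint names come from the list above, but the server URL, the number of positional arguments, and their order are assumptions about the interface, not verified signatures:

```python
from gradio_client import Client, handle_file

client = Client("http://localhost:7860")  # assumed local server address
result = client.predict(
    handle_file("speech.wav"),   # audio input (WAV)
    handle_file("face.png"),     # source image (PNG/JPG)
    api_name="/process_talking_head_optimized",
)
print(result)  # expected: path to the generated MP4
```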
#### FastAPI (`api_server.py`)

- `POST /generate`: Generate video from audio/image
- `GET /health`: Health check
- Additional endpoints for streaming support
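A hedged sketch of exercising the FastAPI server with `requests`. `POST /generate` and `GET /health` are listed above; the multipart field names (`audio`, `image`) and the response returning the MP4 bytes directly are assumptions about `api_server.py`:

```python
import requests

BASE = "http://localhost:8000"  # assumed default FastAPI port
assert requests.get(f"{BASE}/health").ok

with open("speech.wav", "rb") as wav, open("face.png", "rb") as img:
    resp = requests.post(f"{BASE}/generate", files={"audio": wav, "image": img})
resp.raise_for_status()

with open("output.mp4", "wb") as out:
    out.write(resp.content)  # assumes the endpoint returns the MP4 body
```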
### Important Notes

1. **GPU Requirements**: Requires an NVIDIA GPU with CUDA support. Optimized for A100.
2. **First Run**: Models are downloaded automatically on first run. Ensure sufficient disk space.
3. **Caching**: The system uses multiple cache levels:
   - Avatar cache: Pre-processed source images
   - Inference cache: Recent generation results
   - Model cache: Downloaded models
4. **Testing**: Always run the performance tests after optimization changes to verify improvements.
5. **Streaming**: The latest SOW targets 2-second chunk processing for real-time streaming applications.
6. **File Formats** (see the input-preparation sketch after this list):
   - Audio: WAV format required
   - Images: PNG or JPG (will be resized to 320×320)
   - Output: MP4 video
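A small sketch of pre-validating inputs against these formats, assuming Pillow is available. The pipeline resizes images itself; doing it up front just makes the 320×320 constraint explicit, and the function names here are illustrative:

```python
from PIL import Image

def prepare_image(path: str, out_path: str = "source_320.png") -> str:
    # Normalize to RGB and the pipeline's fixed 320x320 resolution.
    img = Image.open(path).convert("RGB").resize((320, 320))
    img.save(out_path)
    return out_path

def check_audio(path: str) -> None:
    # The pipeline requires WAV input; convert other formats first,
    # e.g. with ffmpeg, before calling the API.
    if not path.lower().endswith(".wav"):
        raise ValueError("audio must be WAV")
```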