# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands

### Setup and Installation

```bash
# Initial setup - creates necessary directories
./setup.sh

# Install Python dependencies
pip install -r requirements.txt

# Pre-installation requirements (if needed)
pip install -r pre-requirements.txt
```
### Running the Application

```bash
# Run the optimized Gradio interface (recommended)
python app_optimized.py

# Run the original Gradio interface
python app.py

# Run the FastAPI server for API access
python api_server.py
```
### Testing

```bash
# Run basic API tests
python test_api.py

# Run API client tests
python test_api_client.py

# Run performance tests
python test_performance.py

# Run optimized performance tests
python test_performance_optimized.py

# Run real-world performance tests
python test_performance_real.py
```
## Architecture Overview

This is a Talking Head Generation System that creates lip-synced videos from audio and source images. The project is structured in three phases with Phase 3 focusing on performance optimization.
### Core Processing Pipeline

- Input: Audio file (WAV) + Source image (PNG/JPG)
- Audio Processing: Extract features using HuBERT model
- Motion Generation: Generate facial motion from audio features
- Image Warping: Apply motion to source image
- Video Generation: Create final video with audio sync
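A minimal sketch of how these stages hand data to one another is shown below. The stage bodies are placeholders for illustration only (the real implementations live under `/core/`), and the 25 fps frame rate and feature dimensions are assumptions.

```python
# Illustrative pipeline skeleton; stage bodies are placeholders, not the
# repository's actual implementations (those live under /core/).
import numpy as np


def extract_audio_features(wav: np.ndarray, sample_rate: int) -> np.ndarray:
    """Audio Processing: the real pipeline runs HuBERT here."""
    num_frames = int(len(wav) / sample_rate * 25)            # assumes 25 fps output video
    return np.zeros((num_frames, 768), dtype=np.float32)     # placeholder feature matrix


def generate_motion(features: np.ndarray) -> np.ndarray:
    """Motion Generation: audio features -> per-frame facial motion parameters."""
    return np.zeros((features.shape[0], 64), dtype=np.float32)  # placeholder motion


def warp_frames(source: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Image Warping: apply each frame's motion to the 320x320 source image."""
    return np.repeat(source[None], len(motion), axis=0)      # placeholder: static frames


def run_pipeline(wav: np.ndarray, sample_rate: int, source: np.ndarray) -> np.ndarray:
    """Video Generation follows this step: encode the frames and mux in the audio."""
    features = extract_audio_features(wav, sample_rate)
    motion = generate_motion(features)
    return warp_frames(source, motion)


frames = run_pipeline(np.zeros(16000 * 16, np.float32), 16000, np.zeros((320, 320, 3), np.uint8))
print(frames.shape)  # (400, 320, 320, 3) for 16 s of audio at the assumed 25 fps
```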
### Key Components

#### Model Management (`model_manager.py`)

- Downloads models from Hugging Face on first run (~2.5GB)
- Manages PyTorch and TensorRT model variants
- Caches models in `/tmp/ditto_models`
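A minimal sketch of the first-run download pattern, assuming `huggingface_hub` is used; the repo ID and variant directory layout below are placeholders, and the real logic lives in `model_manager.py`.

```python
# Hypothetical first-run download/caching sketch; repo_id and layout are placeholders.
import os

from huggingface_hub import snapshot_download

MODEL_CACHE_DIR = "/tmp/ditto_models"


def ensure_models(repo_id: str) -> str:
    """Download model weights on first run; later calls reuse the local copy."""
    os.makedirs(MODEL_CACHE_DIR, exist_ok=True)
    # snapshot_download skips files that are already present locally.
    return snapshot_download(repo_id=repo_id, local_dir=MODEL_CACHE_DIR)


def variant_dir(repo_id: str, variant: str = "pytorch") -> str:
    """Select a model variant subdirectory; a 'pytorch'/'tensorrt' layout is assumed."""
    return os.path.join(ensure_models(repo_id), variant)
```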
#### Core Processing (`/core/`)

- `atomic_components/`: Basic processing units
  - `audio2motion.py`: Audio to motion conversion
  - `warping.py`: Image warping logic
- `aux_models/`: Supporting models (face detection, landmarks, HuBERT)
- `models/`: Main neural network architectures
- `optimization/`: Phase 3 performance optimizations
#### Phase 3 Optimizations (`/core/optimization/`)

- `resolution_optimization.py`: Fixed 320×320 processing
- `gpu_optimization.py`: Mixed precision, torch.compile
- `avatar_cache.py`: Pre-cached avatar system with tokens
- `cold_start_optimization.py`: Optimized model loading
- `inference_cache.py`: Result caching
- `parallel_processing.py`: CPU-GPU parallel execution
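The mixed-precision and `torch.compile` pattern behind `gpu_optimization.py` looks roughly like the sketch below; the wrapped module is a stand-in, not the project's actual network.

```python
# Mixed precision + torch.compile pattern (sketch); the model here is a stand-in.
import torch


def optimize(model: torch.nn.Module) -> torch.nn.Module:
    """Move the model to the GPU, switch to eval mode, and compile it (PyTorch 2.x)."""
    model = model.to("cuda").eval()
    return torch.compile(model)


@torch.inference_mode()
def infer(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Run a forward pass under automatic mixed precision (fp16 pays off on A100)."""
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        return model(batch.to("cuda"))
```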
## Performance Targets

- Process 16 seconds of audio in ~15 seconds (50-65% faster with Phase 3)
- First Frame Delay (FFD): <400ms on A100
- Real-time factor (RTF): <1.0
- Latest target (2025-07-18): 2-second streaming chunks
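For context, the real-time factor is processing time divided by audio duration, so the headline target corresponds to:

```python
# Real-time factor (RTF) = processing time / audio duration.
processing_time_s = 15.0   # ~15 s of processing ...
audio_duration_s = 16.0    # ... for 16 s of input audio

rtf = processing_time_s / audio_duration_s
print(f"RTF = {rtf:.2f}")  # 0.94, i.e. under the <1.0 target (faster than real time)
```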
## API Endpoints

### Gradio API

- `/process_talking_head`: Main processing endpoint
- `/process_talking_head_optimized`: Optimized with caching
- `/preload_avatar`: Upload and cache avatars
- `/clear_cache`: Clear inference cache
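These endpoints can be called programmatically with `gradio_client`. The sketch below assumes the app runs locally on port 7860 and takes an audio file followed by an image file; verify the exact parameters with `view_api()` against `app_optimized.py`.

```python
# Hedged gradio_client example; argument order and port are assumptions.
from gradio_client import Client, handle_file

client = Client("http://localhost:7860")
client.view_api()  # prints the exact signature of each endpoint

result = client.predict(
    handle_file("input.wav"),   # audio (assumed first parameter)
    handle_file("avatar.png"),  # source image (assumed second parameter)
    api_name="/process_talking_head_optimized",
)
print(result)  # typically a path to the generated MP4
```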
### FastAPI (`api_server.py`)

- `POST /generate`: Generate video from audio/image
- `GET /health`: Health check
- Additional endpoints for streaming support
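A minimal client-side sketch, assuming the server listens on port 8000, `/generate` accepts multipart uploads, and the MP4 is returned in the response body; check `api_server.py` for the actual request schema.

```python
# Hedged FastAPI client example; port, field names, and response format are assumptions.
import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/health", timeout=10).json())

# Generate a video from an audio file and a source image
with open("input.wav", "rb") as audio, open("avatar.png", "rb") as image:
    response = requests.post(
        f"{BASE_URL}/generate",
        files={"audio": audio, "image": image},  # assumed field names
        timeout=600,
    )
response.raise_for_status()

with open("output.mp4", "wb") as out:
    out.write(response.content)  # assumes the video bytes are returned directly
```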
## Important Notes

**GPU Requirements**: Requires an NVIDIA GPU with CUDA support. Optimized for A100.

**First Run**: Models are downloaded automatically on first run. Ensure sufficient disk space.

**Caching**: The system uses multiple cache levels:
- Avatar cache: Pre-processed source images
- Inference cache: Recent generation results
- Model cache: Downloaded models
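A common way to key the avatar and inference caches is by content hash. The sketch below is illustrative only; the actual keying scheme lives in `avatar_cache.py` and `inference_cache.py` and may differ.

```python
# Illustrative content-addressed cache keys; the real scheme may differ.
import hashlib


def file_digest(path: str) -> str:
    """Hash a file's bytes so identical inputs hit the same cache entry."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def inference_cache_key(audio_path: str, avatar_token: str) -> str:
    """Combine a pre-cached avatar token with the audio digest."""
    return f"{avatar_token}:{file_digest(audio_path)}"
```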
**Testing**: Always run performance tests after optimization changes to verify improvements.

**Streaming**: Latest SOW targets 2-second chunk processing for real-time streaming applications.

**File Formats**:
- Audio: WAV format required
- Images: PNG or JPG (will be resized to 320×320)
- Output: MP4 video
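A small sketch of preparing a source image to match these requirements, using Pillow for the resize (the pipeline resizes to 320×320 internally anyway, so this just previews what it will process):

```python
# Preview the 320x320 resize the pipeline applies to source images.
from PIL import Image


def prepare_source_image(path: str, out_path: str = "source_320.png") -> str:
    """Resize a PNG/JPG source image to the fixed 320x320 processing resolution."""
    img = Image.open(path).convert("RGB")
    img.resize((320, 320), Image.Resampling.LANCZOS).save(out_path)
    return out_path
```

If the audio is not already a WAV file, a standard `ffmpeg -i input.mp3 input.wav` conversion works before uploading.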