CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Commands

Setup and Installation

# Initial setup - creates necessary directories
./setup.sh

# Install Python dependencies
pip install -r requirements.txt

# Pre-installation requirements (if needed)
pip install -r pre-requirements.txt

Running the Application

# Run the optimized Gradio interface (recommended)
python app_optimized.py

# Run the original Gradio interface
python app.py

# Run the FastAPI server for API access
python api_server.py

Testing

# Run basic API tests
python test_api.py

# Run API client tests
python test_api_client.py

# Run performance tests
python test_performance.py

# Run optimized performance tests
python test_performance_optimized.py

# Run real-world performance tests
python test_performance_real.py

Architecture Overview

This is a talking head generation system that creates lip-synced videos from an audio clip and a source image. The project is structured in three phases, with Phase 3 focusing on performance optimization.

Core Processing Pipeline

  1. Input: Audio file (WAV) + Source image (PNG/JPG)
  2. Audio Processing: Extract features using HuBERT model
  3. Motion Generation: Generate facial motion from audio features
  4. Image Warping: Apply motion to source image
  5. Video Generation: Create final video with audio sync
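
The stages compose roughly as sketched below. This is an illustrative skeleton only; the function names are placeholders, not the real API under /core/.

# Illustrative pipeline skeleton -- function names are placeholders, not the real /core/ API.
import numpy as np

def extract_audio_features(wav_path: str) -> np.ndarray:
    # Stage 2: HuBERT feature extraction from the input WAV.
    raise NotImplementedError

def generate_motion(features: np.ndarray) -> np.ndarray:
    # Stage 3: audio features -> per-frame facial motion parameters.
    raise NotImplementedError

def warp_image(image_path: str, motion: np.ndarray) -> list:
    # Stage 4: apply the motion to the source image, frame by frame.
    raise NotImplementedError

def render_video(frames: list, wav_path: str, out_path: str) -> None:
    # Stage 5: mux the frames with the original audio into an MP4.
    raise NotImplementedError

def talking_head(wav_path: str, image_path: str, out_path: str) -> None:
    features = extract_audio_features(wav_path)
    motion = generate_motion(features)
    frames = warp_image(image_path, motion)
    render_video(frames, wav_path, out_path)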

Key Components

Model Management (model_manager.py)

  • Downloads models from Hugging Face on first run (~2.5GB)
  • Manages PyTorch and TensorRT model variants
  • Caches models in /tmp/ditto_models
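
First-run behavior is roughly equivalent to the following sketch, assuming huggingface_hub is used for the download; the repo id is a placeholder, so check model_manager.py for the actual source and file layout.

# Sketch of first-run model caching; the repo id is a placeholder, see model_manager.py.
import os
from huggingface_hub import snapshot_download

CACHE_DIR = "/tmp/ditto_models"

def ensure_models(repo_id: str = "some-org/some-model-repo") -> str:
    # First call downloads ~2.5GB; later calls reuse the cached copy.
    os.makedirs(CACHE_DIR, exist_ok=True)
    return snapshot_download(repo_id=repo_id, local_dir=CACHE_DIR)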

Core Processing (/core/)

  • atomic_components/: Basic processing units
    • audio2motion.py: Audio to motion conversion
    • warping.py: Image warping logic
  • aux_models/: Supporting models (face detection, landmarks, HuBERT)
  • models/: Main neural network architectures
  • optimization/: Phase 3 performance optimizations

Phase 3 Optimizations (/core/optimization/)

  • resolution_optimization.py: Fixed 320×320 processing
  • gpu_optimization.py: Mixed precision, torch.compile
  • avatar_cache.py: Pre-cached avatar system with tokens
  • cold_start_optimization.py: Optimized model loading
  • inference_cache.py: Result caching
  • parallel_processing.py: CPU-GPU parallel execution
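
The GPU optimizations rely on standard PyTorch mechanisms (autocast mixed precision and torch.compile). A minimal sketch with a stand-in module, not the project's actual network:

# Minimal mixed-precision + torch.compile sketch; the Linear layer is a stand-in module.
import torch

model = torch.compile(torch.nn.Linear(512, 512).cuda().eval())  # PyTorch 2.x graph compilation

x = torch.randn(1, 512, device="cuda")
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    y = model(x)  # matmuls run in fp16 where safe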

Performance Targets

  • Process 16 seconds of audio in ~15 seconds (50-65% faster with Phase 3)
  • First Frame Delay (FFD): <400ms on A100
  • Real-time factor (RTF): <1.0
  • Latest target (2025-07-18): 2-second streaming chunks
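
Real-time factor is processing time divided by audio duration; a quick helper for sanity-checking results against these targets:

# RTF < 1.0 means generation is faster than real time.
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

# 16 s of audio processed in ~15 s -> RTF ~0.94, under the 1.0 target.
print(real_time_factor(15.0, 16.0))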

API Endpoints

Gradio API

  • /process_talking_head: Main processing endpoint
  • /process_talking_head_optimized: Optimized with caching
  • /preload_avatar: Upload and cache avatars
  • /clear_cache: Clear inference cache
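
These endpoints can be called programmatically with gradio_client. A hedged sketch, assuming a recent gradio_client and the default local port 7860; the argument order and count depend on the function signatures in app_optimized.py:

# Sketch of calling the Gradio API; argument order/names depend on app_optimized.py.
from gradio_client import Client, handle_file

client = Client("http://localhost:7860")
result = client.predict(
    handle_file("input.wav"),   # audio file
    handle_file("avatar.png"),  # source image
    api_name="/process_talking_head",
)
print(result)  # path to the generated video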

FastAPI (api_server.py)

  • POST /generate: Generate video from audio/image
  • GET /health: Health check
  • Additional endpoints for streaming support
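
A hedged example of calling the FastAPI server with requests, assuming the default uvicorn port 8000; the multipart field names are assumptions, so check api_server.py for the actual request schema:

# Illustrative client call; the multipart field names are assumptions, see api_server.py.
import requests

with open("input.wav", "rb") as audio, open("avatar.png", "rb") as image:
    resp = requests.post(
        "http://localhost:8000/generate",
        files={"audio": audio, "image": image},
    )
resp.raise_for_status()
with open("output.mp4", "wb") as f:
    f.write(resp.content)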

Important Notes

  1. GPU Requirements: Requires an NVIDIA GPU with CUDA support; optimized for the A100.

  2. First Run: Models (~2.5GB) are downloaded automatically on first run; ensure sufficient disk space.

  3. Caching: The system uses multiple cache levels:

    • Avatar cache: Pre-processed source images
    • Inference cache: Recent generation results
    • Model cache: Downloaded models
  4. Testing: Always run performance tests after optimization changes to verify improvements.

  5. Streaming: The latest SOW (statement of work) targets 2-second chunk processing for real-time streaming applications.

  6. File Formats:

    • Audio: WAV format required
    • Images: PNG or JPG (will be resized to 320×320)
    • Output: MP4 video
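
An optional local preprocessing check that mirrors these constraints, using Pillow for the resize; the pipeline resizes internally anyway, so this is only a sanity check:

# Optional local check mirroring the documented constraints (PNG/JPG image, WAV audio, 320x320).
import wave
from PIL import Image

def check_inputs(image_path: str, wav_path: str) -> None:
    img = Image.open(image_path).convert("RGB")
    img.resize((320, 320)).save("avatar_320.png")  # same size the pipeline uses internally
    with wave.open(wav_path, "rb") as w:           # raises if this is not a valid WAV file
        print(w.getframerate(), "Hz,", round(w.getnframes() / w.getframerate(), 2), "s")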