# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands
### Setup and Installation
```bash
# Initial setup - creates necessary directories
./setup.sh
# Install Python dependencies
pip install -r requirements.txt
# Pre-installation requirements (if needed)
pip install -r pre-requirements.txt
```
### Running the Application
```bash
# Run the optimized Gradio interface (recommended)
python app_optimized.py
# Run the original Gradio interface
python app.py
# Run the FastAPI server for API access
python api_server.py
```
### Testing
```bash
# Run basic API tests
python test_api.py
# Run API client tests
python test_api_client.py
# Run performance tests
python test_performance.py
# Run optimized performance tests
python test_performance_optimized.py
# Run real-world performance tests
python test_performance_real.py
```
## Architecture Overview
This is a **Talking Head Generation System** that creates lip-synced videos from an audio clip and a source image. The project is structured in three phases, with Phase 3 focusing on performance optimization.
### Core Processing Pipeline
1. **Input**: Audio file (WAV) + Source image (PNG/JPG)
2. **Audio Processing**: Extract features using the HuBERT model (see the sketch after this list)
3. **Motion Generation**: Generate facial motion from audio features
4. **Image Warping**: Apply motion to source image
5. **Video Generation**: Create final video with audio sync
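For the audio-processing step, here is a minimal standalone sketch using Hugging Face `transformers`. The checkpoint name is an assumption; the repository's actual extractor lives under `core/aux_models/`.

```python
# Hedged sketch of HuBERT feature extraction (step 2 above).
# "facebook/hubert-base-ls960" is an assumed checkpoint, not necessarily
# the one this repo ships; see core/aux_models/ for the real extractor.
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

audio, sr = sf.read("input.wav")  # HuBERT checkpoints expect 16 kHz mono
inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, frames, 768)
```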
### Key Components
#### Model Management (`model_manager.py`)
- Downloads models from Hugging Face on first run (~2.5GB)
- Manages PyTorch and TensorRT model variants
- Caches models in `/tmp/ditto_models`
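A hedged sketch of this first-run flow using `huggingface_hub`; the repo ID here is a hypothetical placeholder, and `model_manager.py` remains the source of truth:

```python
# Sketch of first-run model caching. "example-org/ditto-models" is a
# hypothetical repo ID used for illustration only.
import os
from huggingface_hub import snapshot_download

CACHE_DIR = "/tmp/ditto_models"

def ensure_models() -> str:
    """Download model weights once, then reuse the local cache."""
    if not os.path.isdir(CACHE_DIR) or not os.listdir(CACHE_DIR):
        snapshot_download(repo_id="example-org/ditto-models",  # hypothetical
                          local_dir=CACHE_DIR)
    return CACHE_DIR
```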
#### Core Processing (`/core/`)
- **atomic_components/**: Basic processing units
  - `audio2motion.py`: Audio-to-motion conversion
  - `warping.py`: Image warping logic
- **aux_models/**: Supporting models (face detection, landmarks, HuBERT)
- **models/**: Main neural network architectures
- **optimization/**: Phase 3 performance optimizations
#### Phase 3 Optimizations (`/core/optimization/`)
- **resolution_optimization.py**: Fixed 320×320 processing
- **gpu_optimization.py**: Mixed precision and `torch.compile` (see the sketch after this list)
- **avatar_cache.py**: Pre-cached avatar system with tokens
- **cold_start_optimization.py**: Optimized model loading
- **inference_cache.py**: Result caching
- **parallel_processing.py**: CPU-GPU parallel execution
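The GPU optimizations above typically combine automatic mixed precision with `torch.compile`. A minimal generic PyTorch sketch of that pattern (not the module's actual code):

```python
# Generic mixed-precision + torch.compile pattern (PyTorch 2.x).
# The Linear layer is a stand-in for the real networks in core/models/.
import torch

model = torch.nn.Linear(512, 512).cuda().eval()
model = torch.compile(model)  # trace and optimize the forward graph

x = torch.randn(1, 512, device="cuda")
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)  # matmuls run in fp16 where numerically safe
```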
### Performance Targets
- Process 16 seconds of audio in ~15 seconds (50-65% faster with Phase 3)
- First Frame Delay (FFD): <400ms on A100
- Real-time factor (RTF): <1.0
- Latest target (2025-07-18): 2-second streaming chunks
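For context, the real-time factor is processing time divided by audio duration, so the headline figure above works out to just under the target:

```python
# RTF = processing time / audio duration; values below 1.0 are faster
# than real time. Using the 16 s of audio / ~15 s figure above:
def realtime_factor(processing_s: float, audio_s: float) -> float:
    return processing_s / audio_s

assert realtime_factor(15.0, 16.0) < 1.0  # ~0.94, meets the <1.0 target
```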
### API Endpoints
#### Gradio API
- `/process_talking_head`: Main processing endpoint
- `/process_talking_head_optimized`: Optimized with caching
- `/preload_avatar`: Upload and cache avatars
- `/clear_cache`: Clear inference cache
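A hedged example of calling these endpoints with `gradio_client`. The port and argument order are assumptions; `client.view_api()` prints the real signatures exposed by the app.

```python
# Hedged gradio_client example; argument order is an assumption.
from gradio_client import Client, handle_file

client = Client("http://localhost:7860")  # default Gradio port; adjust if needed
result = client.predict(
    handle_file("input.wav"),    # audio
    handle_file("source.png"),   # source image
    api_name="/process_talking_head",
)
print(result)  # typically a path to the generated MP4
```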
#### FastAPI (api_server.py)
- `POST /generate`: Generate video from audio/image
- `GET /health`: Health check
- Additional endpoints for streaming support
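A hedged example of calling `/generate` with `requests`; the port and multipart field names are assumptions, so check `api_server.py` for the actual request schema:

```python
# Hedged FastAPI client example; field names "audio"/"image" and the
# response format are assumptions -- verify against api_server.py.
import requests

with open("input.wav", "rb") as audio, open("source.png", "rb") as image:
    resp = requests.post(
        "http://localhost:8000/generate",  # default uvicorn port
        files={"audio": audio, "image": image},
    )
resp.raise_for_status()
with open("output.mp4", "wb") as f:
    f.write(resp.content)  # assumes the endpoint returns the MP4 bytes
```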
### Important Notes
1. **GPU Requirements**: Requires an NVIDIA GPU with CUDA support; optimized for the A100.
2. **First Run**: Models are downloaded automatically on first run. Ensure sufficient disk space.
3. **Caching**: The system uses multiple cache levels:
   - Avatar cache: Pre-processed source images
   - Inference cache: Recent generation results
   - Model cache: Downloaded models
4. **Testing**: Always run performance tests after optimization changes to verify improvements.
5. **Streaming**: Latest SOW targets 2-second chunk processing for real-time streaming applications.
6. **File Formats** (see the preparation snippet below):
   - Audio: WAV format required
   - Images: PNG or JPG (resized internally to 320×320)
   - Output: MP4 video
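To prepare inputs in these formats, a hedged snippet; the 16 kHz mono rate is an assumption based on typical HuBERT checkpoints, while the 320×320 resize mirrors what the pipeline does internally:

```python
# Normalize inputs before upload. The 16 kHz mono conversion is an
# assumption (common for HuBERT); the resize matches the pipeline's 320x320.
import subprocess
from PIL import Image

# Any-format audio -> 16 kHz mono WAV via ffmpeg
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.mp3", "-ar", "16000", "-ac", "1", "input.wav"],
    check=True,
)

# Source image -> RGB PNG at 320x320
Image.open("source.jpg").convert("RGB").resize((320, 320)).save("source.png")
```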