CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Commands

Setup and Installation

# Initial setup - creates necessary directories
./setup.sh

# Install Python dependencies
pip install -r requirements.txt

# Pre-installation requirements (if needed)
pip install -r pre-requirements.txt

Running the Application

# Run the optimized Gradio interface (recommended)
python app_optimized.py

# Run the original Gradio interface
python app.py

# Run the FastAPI server for API access
python api_server.py

Testing

# Run basic API tests
python test_api.py

# Run API client tests
python test_api_client.py

# Run performance tests
python test_performance.py

# Run optimized performance tests
python test_performance_optimized.py

# Run real-world performance tests
python test_performance_real.py

Architecture Overview

This is a talking head generation system that creates lip-synced videos from an audio clip and a source image. The project is structured in three phases, with Phase 3 focusing on performance optimization.

Core Processing Pipeline

  1. Input: Audio file (WAV) + Source image (PNG/JPG)
  2. Audio Processing: Extract features using HuBERT model
  3. Motion Generation: Generate facial motion from audio features
  4. Image Warping: Apply motion to source image
  5. Video Generation: Create final video with audio sync
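
The stages compose roughly as sketched below. This is an illustrative skeleton only; the function names are placeholders, not the real API under /core/.

# Illustrative pipeline skeleton -- function names are placeholders, not the real /core/ API.
import numpy as np

def extract_audio_features(wav_path: str) -> np.ndarray:
    # Stage 2: HuBERT feature extraction from the input WAV.
    raise NotImplementedError

def generate_motion(features: np.ndarray) -> np.ndarray:
    # Stage 3: audio features -> per-frame facial motion parameters.
    raise NotImplementedError

def warp_image(image_path: str, motion: np.ndarray) -> list:
    # Stage 4: apply the motion to the source image, frame by frame.
    raise NotImplementedError

def render_video(frames: list, wav_path: str, out_path: str) -> None:
    # Stage 5: mux the frames with the original audio into an MP4.
    raise NotImplementedError

def talking_head(wav_path: str, image_path: str, out_path: str) -> None:
    features = extract_audio_features(wav_path)
    motion = generate_motion(features)
    frames = warp_image(image_path, motion)
    render_video(frames, wav_path, out_path)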

Key Components

Model Management (model_manager.py)

  • Downloads models from Hugging Face on first run (~2.5GB)
  • Manages PyTorch and TensorRT model variants
  • Caches models in /tmp/ditto_models
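
First-run behavior is roughly equivalent to the following sketch, assuming huggingface_hub is used for the download; the repo id is a placeholder, so check model_manager.py for the actual source and file layout.

# Sketch of first-run model caching; the repo id is a placeholder, see model_manager.py.
import os
from huggingface_hub import snapshot_download

CACHE_DIR = "/tmp/ditto_models"

def ensure_models(repo_id: str = "some-org/some-model-repo") -> str:
    # First call downloads ~2.5GB; later calls reuse the cached copy.
    os.makedirs(CACHE_DIR, exist_ok=True)
    return snapshot_download(repo_id=repo_id, local_dir=CACHE_DIR)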

Core Processing (/core/)

  • atomic_components/: Basic processing units
    • audio2motion.py: Audio to motion conversion
    • warping.py: Image warping logic
  • aux_models/: Supporting models (face detection, landmarks, HuBERT)
  • models/: Main neural network architectures
  • optimization/: Phase 3 performance optimizations

Phase 3 Optimizations (/core/optimization/)

  • resolution_optimization.py: Fixed 320×320 processing
  • gpu_optimization.py: Mixed precision, torch.compile
  • avatar_cache.py: Pre-cached avatar system with tokens
  • cold_start_optimization.py: Optimized model loading
  • inference_cache.py: Result caching
  • parallel_processing.py: CPU-GPU parallel execution
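
The GPU optimizations rely on standard PyTorch mechanisms (autocast mixed precision and torch.compile). A minimal sketch with a stand-in module, not the project's actual network:

# Minimal mixed-precision + torch.compile sketch; the Linear layer is a stand-in module.
import torch

model = torch.compile(torch.nn.Linear(512, 512).cuda().eval())  # PyTorch 2.x graph compilation

x = torch.randn(1, 512, device="cuda")
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    y = model(x)  # matmuls run in fp16 where safe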

Performance Targets

  • Process 16 seconds of audio in ~15 seconds (50-65% faster with Phase 3)
  • First Frame Delay (FFD): <400ms on A100
  • Real-time factor (RTF): <1.0
  • Latest target (2025-07-18): 2-second streaming chunks
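
Real-time factor is processing time divided by audio duration; a quick helper for sanity-checking results against these targets:

# RTF < 1.0 means generation is faster than real time.
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

# 16 s of audio processed in ~15 s -> RTF ~0.94, under the 1.0 target.
print(real_time_factor(15.0, 16.0))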

API Endpoints

Gradio API

  • /process_talking_head: Main processing endpoint
  • /process_talking_head_optimized: Optimized with caching
  • /preload_avatar: Upload and cache avatars
  • /clear_cache: Clear inference cache
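
These endpoints can be called programmatically with gradio_client. A hedged sketch, assuming a recent gradio_client and the default local port 7860; the argument order and count depend on the function signatures in app_optimized.py:

# Sketch of calling the Gradio API; argument order/names depend on app_optimized.py.
from gradio_client import Client, handle_file

client = Client("http://localhost:7860")
result = client.predict(
    handle_file("input.wav"),   # audio file
    handle_file("avatar.png"),  # source image
    api_name="/process_talking_head",
)
print(result)  # path to the generated video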

FastAPI (api_server.py)

  • POST /generate: Generate video from audio/image
  • GET /health: Health check
  • Additional endpoints for streaming support
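
A hedged example of calling the FastAPI server with requests, assuming the default uvicorn port 8000; the multipart field names are assumptions, so check api_server.py for the actual request schema:

# Illustrative client call; the multipart field names are assumptions, see api_server.py.
import requests

with open("input.wav", "rb") as audio, open("avatar.png", "rb") as image:
    resp = requests.post(
        "http://localhost:8000/generate",
        files={"audio": audio, "image": image},
    )
resp.raise_for_status()
with open("output.mp4", "wb") as f:
    f.write(resp.content)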

Important Notes

  1. GPU Requirements: Requires an NVIDIA GPU with CUDA support; optimized for the A100.

  2. First Run: Models (~2.5GB) are downloaded automatically on first run; ensure sufficient disk space.

  3. Caching: The system uses multiple cache levels:

    • Avatar cache: Pre-processed source images
    • Inference cache: Recent generation results
    • Model cache: Downloaded models
  4. Testing: Always run performance tests after optimization changes to verify improvements.

  5. Streaming: The latest SOW (statement of work) targets 2-second chunk processing for real-time streaming applications.

  6. File Formats:

    • Audio: WAV format required
    • Images: PNG or JPG (will be resized to 320×320)
    • Output: MP4 video
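
An optional local preprocessing check that mirrors these constraints, using Pillow for the resize; the pipeline resizes internally anyway, so this is only a sanity check:

# Optional local check mirroring the documented constraints (PNG/JPG image, WAV audio, 320x320).
import wave
from PIL import Image

def check_inputs(image_path: str, wav_path: str) -> None:
    img = Image.open(image_path).convert("RGB")
    img.resize((320, 320)).save("avatar_320.png")  # same size the pipeline uses internally
    with wave.open(wav_path, "rb") as w:           # raises if this is not a valid WAV file
        print(w.getframerate(), "Hz,", round(w.getnframes() / w.getframerate(), 2), "s")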