# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands
### Setup and Installation
```bash
# Initial setup - creates necessary directories
./setup.sh
# Install Python dependencies
pip install -r requirements.txt
# Pre-installation requirements (if needed)
pip install -r pre-requirements.txt
```
### Running the Application
```bash
# Run the optimized Gradio interface (recommended)
python app_optimized.py
# Run the original Gradio interface
python app.py
# Run the FastAPI server for API access
python api_server.py
```
### Testing
```bash
# Run basic API tests
python test_api.py
# Run API client tests
python test_api_client.py
# Run performance tests
python test_performance.py
# Run optimized performance tests
python test_performance_optimized.py
# Run real-world performance tests
python test_performance_real.py
```
## Architecture Overview
This is a **Talking Head Generation System** that creates lip-synced videos from an audio clip and a source image. The project is structured in three phases, with Phase 3 focusing on performance optimization.
### Core Processing Pipeline
1. **Input**: Audio file (WAV) + Source image (PNG/JPG)
2. **Audio Processing**: Extract features using the HuBERT model (see the sketch after this list)
3. **Motion Generation**: Generate facial motion from audio features
4. **Image Warping**: Apply motion to source image
5. **Video Generation**: Create final video with audio sync
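For the audio-processing step, here is a minimal standalone sketch using Hugging Face `transformers`. The checkpoint name is an assumption; the repository's actual extractor lives under `core/aux_models/`.

```python
# Hedged sketch of HuBERT feature extraction (step 2 above).
# "facebook/hubert-base-ls960" is an assumed checkpoint, not necessarily
# the one this repo ships; see core/aux_models/ for the real extractor.
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

audio, sr = sf.read("input.wav")  # HuBERT checkpoints expect 16 kHz mono
inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, frames, 768)
```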
### Key Components
#### Model Management (`model_manager.py`)
- Downloads models from Hugging Face on first run (~2.5GB)
- Manages PyTorch and TensorRT model variants
- Caches models in `/tmp/ditto_models`
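A hedged sketch of this first-run flow using `huggingface_hub`; the repo ID here is a hypothetical placeholder, and `model_manager.py` remains the source of truth:

```python
# Sketch of first-run model caching. "example-org/ditto-models" is a
# hypothetical repo ID used for illustration only.
import os
from huggingface_hub import snapshot_download

CACHE_DIR = "/tmp/ditto_models"

def ensure_models() -> str:
    """Download model weights once, then reuse the local cache."""
    if not os.path.isdir(CACHE_DIR) or not os.listdir(CACHE_DIR):
        snapshot_download(repo_id="example-org/ditto-models",  # hypothetical
                          local_dir=CACHE_DIR)
    return CACHE_DIR
```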
#### Core Processing (`/core/`)
- **atomic_components/**: Basic processing units
  - `audio2motion.py`: Audio-to-motion conversion
  - `warping.py`: Image warping logic
- **aux_models/**: Supporting models (face detection, landmarks, HuBERT)
- **models/**: Main neural network architectures
- **optimization/**: Phase 3 performance optimizations
#### Phase 3 Optimizations (`/core/optimization/`)
- **resolution_optimization.py**: Fixed 320×320 processing
- **gpu_optimization.py**: Mixed precision and `torch.compile` (see the sketch after this list)
- **avatar_cache.py**: Pre-cached avatar system with tokens
- **cold_start_optimization.py**: Optimized model loading
- **inference_cache.py**: Result caching
- **parallel_processing.py**: CPU-GPU parallel execution
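The GPU optimizations above typically combine automatic mixed precision with `torch.compile`. A minimal generic PyTorch sketch of that pattern (not the module's actual code):

```python
# Generic mixed-precision + torch.compile pattern (PyTorch 2.x).
# The Linear layer is a stand-in for the real networks in core/models/.
import torch

model = torch.nn.Linear(512, 512).cuda().eval()
model = torch.compile(model)  # trace and optimize the forward graph

x = torch.randn(1, 512, device="cuda")
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)  # matmuls run in fp16 where numerically safe
```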
### Performance Targets
- Process 16 seconds of audio in ~15 seconds (50-65% faster with Phase 3)
- First Frame Delay (FFD): <400ms on A100
- Real-time factor (RTF): <1.0
- Latest target (2025-07-18): 2-second streaming chunks
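For context, the real-time factor is processing time divided by audio duration, so the headline figure above works out to just under the target:

```python
# RTF = processing time / audio duration; values below 1.0 are faster
# than real time. Using the 16 s of audio / ~15 s figure above:
def realtime_factor(processing_s: float, audio_s: float) -> float:
    return processing_s / audio_s

assert realtime_factor(15.0, 16.0) < 1.0  # ~0.94, meets the <1.0 target
```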
### API Endpoints
#### Gradio API
- `/process_talking_head`: Main processing endpoint
- `/process_talking_head_optimized`: Optimized with caching
- `/preload_avatar`: Upload and cache avatars
- `/clear_cache`: Clear inference cache
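A hedged example of calling these endpoints with `gradio_client`. The port and argument order are assumptions; `client.view_api()` prints the real signatures exposed by the app.

```python
# Hedged gradio_client example; argument order is an assumption.
from gradio_client import Client, handle_file

client = Client("http://localhost:7860")  # default Gradio port; adjust if needed
result = client.predict(
    handle_file("input.wav"),    # audio
    handle_file("source.png"),   # source image
    api_name="/process_talking_head",
)
print(result)  # typically a path to the generated MP4
```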
#### FastAPI (api_server.py)
- `POST /generate`: Generate video from audio/image
- `GET /health`: Health check
- Additional endpoints for streaming support
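A hedged example of calling `/generate` with `requests`; the port and multipart field names are assumptions, so check `api_server.py` for the actual request schema:

```python
# Hedged FastAPI client example; field names "audio"/"image" and the
# response format are assumptions -- verify against api_server.py.
import requests

with open("input.wav", "rb") as audio, open("source.png", "rb") as image:
    resp = requests.post(
        "http://localhost:8000/generate",  # default uvicorn port
        files={"audio": audio, "image": image},
    )
resp.raise_for_status()
with open("output.mp4", "wb") as f:
    f.write(resp.content)  # assumes the endpoint returns the MP4 bytes
```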
### Important Notes
1. **GPU Requirements**: Requires an NVIDIA GPU with CUDA support; optimized for the A100.
2. **First Run**: Models are downloaded automatically on first run. Ensure sufficient disk space.
3. **Caching**: The system uses multiple cache levels:
   - Avatar cache: Pre-processed source images
   - Inference cache: Recent generation results
   - Model cache: Downloaded models
4. **Testing**: Always run performance tests after optimization changes to verify improvements.
5. **Streaming**: Latest SOW targets 2-second chunk processing for real-time streaming applications.
6. **File Formats** (see the preparation snippet below):
   - Audio: WAV format required
   - Images: PNG or JPG (resized internally to 320×320)
   - Output: MP4 video
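To prepare inputs in these formats, a hedged snippet; the 16 kHz mono rate is an assumption based on typical HuBERT checkpoints, while the 320×320 resize mirrors what the pipeline does internally:

```python
# Normalize inputs before upload. The 16 kHz mono conversion is an
# assumption (common for HuBERT); the resize matches the pipeline's 320x320.
import subprocess
from PIL import Image

# Any-format audio -> 16 kHz mono WAV via ffmpeg
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.mp3", "-ar", "16000", "-ac", "1", "input.wav"],
    check=True,
)

# Source image -> RGB PNG at 320x320
Image.open("source.jpg").convert("RGB").resize((320, 320)).save("source.png")
```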