talkingAvater_bgk


oKen38461 committed about 1 month ago

Commit

d9a2a3d

1 Parent(s): 0f839d2

Added `tests/` to `.gitignore` following the removal of the test scripts. Also updated the API documentation section of `README.md`.

Files changed (4)

.huggingface.yaml +9 -0
CLAUDE.md +120 -0
app_streaming.py +195 -0
test_streaming.py +140 -0

.huggingface.yaml ADDED

	@@ -0,0 +1,9 @@

+# .huggingface.yaml
+sdk: gradio
+python_version: "3.10"
+hardware: "A100"
+timeout_seconds: 600            # allow time for the initial load
+accelerator: gpu
+python:
+  pip_install:
+    - -r requirements.txt

CLAUDE.md ADDED

	@@ -0,0 +1,120 @@

+# CLAUDE.md
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+## Commands
+### Setup and Installation
+```bash
+# Initial setup - creates necessary directories
+./setup.sh
+# Install Python dependencies
+pip install -r requirements.txt
+# Pre-installation requirements (if needed)
+pip install -r pre-requirements.txt
+```
+### Running the Application
+```bash
+# Run the optimized Gradio interface (recommended)
+python app_optimized.py
+# Run the original Gradio interface
+python app.py
+# Run the FastAPI server for API access
+python api_server.py
+```
+### Testing
+```bash
+# Run basic API tests
+python test_api.py
+# Run API client tests
+python test_api_client.py
+# Run performance tests
+python test_performance.py
+# Run optimized performance tests
+python test_performance_optimized.py
+# Run real-world performance tests
+python test_performance_real.py
+```
+## Architecture Overview
+This is a **Talking Head Generation System** that creates lip-synced videos from audio and source images. The project is structured in three phases with Phase 3 focusing on performance optimization.
+### Core Processing Pipeline
+1. **Input**: Audio file (WAV) + Source image (PNG/JPG)
+2. **Audio Processing**: Extract features using HuBERT model
+3. **Motion Generation**: Generate facial motion from audio features
+4. **Image Warping**: Apply motion to source image
+5. **Video Generation**: Create final video with audio sync
+### Key Components
+#### Model Management (`model_manager.py`)
+- Downloads models from Hugging Face on first run (~2.5GB)
+- Manages PyTorch and TensorRT model variants
+- Caches models in `/tmp/ditto_models`
+#### Core Processing (`/core/`)
+- **atomic_components/**: Basic processing units
+  - `audio2motion.py`: Audio to motion conversion
+  - `warping.py`: Image warping logic
+- **aux_models/**: Supporting models (face detection, landmarks, HuBERT)
+- **models/**: Main neural network architectures
+- **optimization/**: Phase 3 performance optimizations
+#### Phase 3 Optimizations (`/core/optimization/`)
+- **resolution_optimization.py**: Fixed 320×320 processing
+- **gpu_optimization.py**: Mixed precision, torch.compile
+- **avatar_cache.py**: Pre-cached avatar system with tokens
+- **cold_start_optimization.py**: Optimized model loading
+- **inference_cache.py**: Result caching
+- **parallel_processing.py**: CPU-GPU parallel execution
+### Performance Targets
+- Process 16 seconds of audio in ~15 seconds (50-65% faster with Phase 3)
+- First Frame Delay (FFD): <400ms on A100
+- Real-time factor (RTF): <1.0
+- Latest target (2025-07-18): 2-second streaming chunks
+### API Endpoints
+#### Gradio API
+- `/process_talking_head`: Main processing endpoint
+- `/process_talking_head_optimized`: Optimized with caching
+- `/preload_avatar`: Upload and cache avatars
+- `/clear_cache`: Clear inference cache
+#### FastAPI (api_server.py)
+- `POST /generate`: Generate video from audio/image
+- `GET /health`: Health check
+- Additional endpoints for streaming support
+### Important Notes
+1. **GPU Requirements**: Requires NVIDIA GPU with CUDA support. Optimized for A100.
+2. **First Run**: Models are downloaded automatically on first run. Ensure sufficient disk space.
+3. **Caching**: The system uses multiple cache levels:
+   - Avatar cache: Pre-processed source images
+   - Inference cache: Recent generation results
+   - Model cache: Downloaded models
+4. **Testing**: Always run performance tests after optimization changes to verify improvements.
+5. **Streaming**: Latest SOW targets 2-second chunk processing for real-time streaming applications.
+6. **File Formats**:
+   - Audio: WAV format required
+   - Images: PNG or JPG (will be resized to 320×320)
+   - Output: MP4 video
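
Review note on the performance targets in CLAUDE.md: the file states an RTF target of <1.0 and "16 seconds of audio in ~15 seconds" but does not define RTF. The sketch below assumes the common definition (processing time divided by audio duration); the helper name `real_time_factor` is hypothetical and not part of this repository.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (assumed RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# The figures quoted in CLAUDE.md: 16 s of audio processed in about 15 s.
rtf = real_time_factor(15.0, 16.0)
print(f"RTF = {rtf:.4f}")  # RTF = 0.9375
print("meets RTF < 1.0 target:", rtf < 1.0)
```

Under that definition the quoted figures sit just inside the target, which leaves little headroom for the 2-second streaming chunks mentioned as the latest goal.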

app_streaming.py ADDED

	@@ -0,0 +1,195 @@

+import os
+import queue
+import tempfile
+import threading
+import time
+import numpy as np
+import soundfile as sf
+import gradio as gr
+from stream_pipeline_offline import StreamSDK
+import torch
+from PIL import Image
+from pathlib import Path
+import cv2
+# Model configuration
+CFG_PKL = "checkpoints/ditto_cfg/v0.4_hubert_cfg_pytorch.pkl"
+DATA_ROOT = "checkpoints/ditto_pytorch"
+# Directory containing sample files
+EXAMPLES_DIR = (Path(__file__).parent / "example").resolve()
+# Load the SDK once globally (assumes concurrency_count=1)
+sdk: StreamSDK | None = None
+def init_sdk():
+    global sdk
+    if sdk is None:
+        sdk = StreamSDK(CFG_PKL, DATA_ROOT)
+    return sdk
+# Audio chunk size (seconds)
+CHUNK_SEC = 0.20  # 16000 * 0.20 = 3200 samples ≈ 5 frames at 25 fps
+def generator(mic, src_img):
+    """
+    Gradio generator function
+    mic     : (sr, np.ndarray) tuple (Gradio Audio with streaming=True)
+    src_img : path to the source image file
+    Yields  : (frame, video, status) tuples: a PIL.Image for the current frame, then the mp4 at the end
+    """
+    if mic is None:
+        yield None, None, "Please start the microphone input"
+        return
+    if src_img is None:
+        yield None, None, "Please upload a source image"
+        return
+        return
+    try:
+        sr, wav_full = mic
+        sdk = init_sdk()
+        # setup: online_mode=True enables streaming
+        tmp_out = tempfile.mktemp(suffix=".mp4")
+        sdk.setup(src_img, tmp_out, online_mode=True, max_size=1024)
+        N_total = int(np.ceil(len(wav_full) / sr * 25))  # approximate frame count
+        sdk.setup_Nd(N_total)
+        # Processing start time
+        start_time = time.time()
+        frame_count = 0
+        # Feed the audio in CHUNK_SEC-sized chunks
+        hop = int(sr * CHUNK_SEC)
+        for start_idx in range(0, len(wav_full), hop):
+            chunk = wav_full[start_idx : start_idx + hop]
+            if len(chunk) < hop:
+                chunk = np.pad(chunk, (0, hop - len(chunk)))
+            sdk.run_chunk(chunk)
+            # Fetch the most recently written frames from the queue
+            frames_processed = 0
+            while sdk.writer_queue.qsize() > 0 and frames_processed < 5:
+                try:
+                    frame = sdk.writer_queue.get_nowait()
+                    if frame is not None:
+                        # Convert the numpy array (H, W, 3) to a PIL Image
+                        pil_frame = Image.fromarray(frame)
+                        frame_count += 1
+                        elapsed = time.time() - start_time
+                        fps = frame_count / elapsed if elapsed > 0 else 0
+                        yield pil_frame, None, f"Processing... frames: {frame_count}, FPS: {fps:.1f}"
+                    frames_processed += 1
+                except queue.Empty:
+                    break
+            # Sleep briefly to limit CPU load
+            time.sleep(0.01)
+        # Process the remaining frames
+        print("Finished sending audio chunks; processing remaining frames...")
+        timeout_count = 0
+        while timeout_count < 50:  # wait up to 5 seconds
+            if sdk.writer_queue.qsize() > 0:
+                try:
+                    frame = sdk.writer_queue.get_nowait()
+                    if frame is not None:
+                        pil_frame = Image.fromarray(frame)
+                        frame_count += 1
+                        elapsed = time.time() - start_time
+                        fps = frame_count / elapsed if elapsed > 0 else 0
+                        yield pil_frame, None, f"Processing... frames: {frame_count}, FPS: {fps:.1f}"
+                    timeout_count = 0
+                except queue.Empty:
+                    time.sleep(0.1)
+                    timeout_count += 1
+            else:
+                time.sleep(0.1)
+                timeout_count += 1
+        # Close the SDK and generate the final MP4
+        print("Closing the SDK and generating the final MP4...")
+        sdk.close()  # joins the workers and muxes the mp4
+        # Processing complete
+        elapsed_total = time.time() - start_time
+        yield None, gr.Video(tmp_out), f"✅ Done! Total frames: {frame_count}, processing time: {elapsed_total:.1f} s"
+    except Exception as e:
+        import traceback
+        error_msg = f"❌ Error: {str(e)}\n{traceback.format_exc()}"
+        print(error_msg)
+        yield None, None, error_msg
+# Gradio UI
+with gr.Blocks(title="DittoTalkingHead Streaming") as demo:
+    gr.Markdown("""
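+    # DittoTalkingHead - Streaming Version
+    Processes audio in real time and displays the generated frames as they are produced.
+    ## How to Use
+    1. Upload a **source image** (PNG/JPG)
+    2. Click the **Start** button to begin microphone recording
+    3. The live frame updates while recording
+    4. The final MP4 is generated after recording stops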
+    """)
+    with gr.Row():
+        with gr.Column():
+            img_in = gr.Image(
+                type="filepath",
+                label="Source Image",
+                value=str(EXAMPLES_DIR / "reference.png") if (EXAMPLES_DIR / "reference.png").exists() else None
+            )
+            mic_in = gr.Audio(
+                sources=["microphone"],
+                streaming=True,
+                label="Microphone input (16 kHz)",
+                format="wav"
+            )
+        with gr.Column():
+            live_img = gr.Image(label="Live frame", type="pil")
+            final_mp4 = gr.Video(label="Final result (MP4)")
+            status_text = gr.Textbox(label="Status", value="Idle...")
+    btn = gr.Button("Start Streaming", variant="primary")
+    # Start streaming processing
+    btn.click(
+        fn=generator,
+        inputs=[mic_in, img_in],
+        outputs=[live_img, final_mp4, status_text],
+        stream_every=0.1  # update every 100 ms
+    )
+    # Examples
+    if EXAMPLES_DIR.exists():
+        gr.Examples(
+            examples=[
+                [str(EXAMPLES_DIR / "reference.png")]
+            ],
+            inputs=[img_in],
+            label="Sample image"
+        )
+# Launch settings
+if __name__ == "__main__":
+    # GPU optimization settings
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+        torch.backends.cudnn.benchmark = True
+    # Environment variable settings
+    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
+    print("=== DittoTalkingHead streaming version starting ===")
+    print(f"- Chunk size: {CHUNK_SEC} s")
+    print("- Max resolution: 1024px")
+    print(f"- GPU: {'available' if torch.cuda.is_available() else 'unavailable'}")
+    # Preload the model
+    print("Preloading the model...")
+    init_sdk()
+    print("✅ Model loaded")
+    demo.queue(concurrency_count=1, max_size=8).launch(
+        server_name="0.0.0.0",
+        server_port=7860,
+        share=False
+    )
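
Review note on the chunking arithmetic in `app_streaming.py`: the generator slices the recording into `CHUNK_SEC`-sized hops, zero-pads the final hop, and estimates the total frame count at 25 fps. The sketch below reproduces only that arithmetic on synthetic mono audio so it can be checked without the SDK; `split_into_chunks` and `estimate_frames` are hypothetical helper names, and the 16 kHz rate is taken from the microphone label rather than verified against the model.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, assumed from the "16 kHz" microphone label
CHUNK_SEC = 0.20     # same value as in app_streaming.py
FPS = 25             # frame rate used by the N_total estimate


def split_into_chunks(wav: np.ndarray, sr: int, chunk_sec: float) -> list:
    """Split 1-D audio into equal hops, zero-padding the last one."""
    hop = int(sr * chunk_sec)
    chunks = []
    for start in range(0, len(wav), hop):
        chunk = wav[start:start + hop]
        if len(chunk) < hop:
            chunk = np.pad(chunk, (0, hop - len(chunk)))
        chunks.append(chunk)
    return chunks


def estimate_frames(num_samples: int, sr: int, fps: int = FPS) -> int:
    """Approximate number of video frames for the given audio length."""
    return int(np.ceil(num_samples / sr * fps))


wav = np.zeros(int(SAMPLE_RATE * 1.1), dtype=np.float32)  # 1.1 s of silence
chunks = split_into_chunks(wav, SAMPLE_RATE, CHUNK_SEC)
print(len(chunks), len(chunks[-1]))            # 6 3200
print(estimate_frames(len(wav), SAMPLE_RATE))  # 28
```

Because the last hop is padded, the SDK receives slightly more audio (6 × 3200 = 19200 samples here) than the frame estimate was computed from (17600 samples), which may be worth keeping in mind when comparing received frames against `N_total`.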

test_streaming.py ADDED

	@@ -0,0 +1,140 @@

+"""
+Test script for the streaming implementation
+"""
+import numpy as np
+import soundfile as sf
+import tempfile
+import time
+from pathlib import Path
+from stream_pipeline_offline import StreamSDK
+# Test configuration
+CFG_PKL = "checkpoints/ditto_cfg/v0.4_hubert_cfg_pytorch.pkl"
+DATA_ROOT = "checkpoints/ditto_pytorch"
+EXAMPLES_DIR = Path("example")
+def test_streaming():
+    """Basic test of the streaming functionality"""
+    print("=== Streaming test started ===")
+    # Generate test audio (3-second sine wave)
+    duration = 3.0  # seconds
+    sample_rate = 16000
+    t = np.linspace(0, duration, int(sample_rate * duration))
+    audio_data = np.sin(2 * np.pi * 440 * t) * 0.5  # 440Hz
+    # Initialize the SDK
+    print("1. Initializing SDK...")
+    sdk = StreamSDK(CFG_PKL, DATA_ROOT)
+    print("✅ SDK initialized")
+    # Setup
+    print("\n2. Setting up in streaming mode...")
+    src_img = str(EXAMPLES_DIR / "reference.png")
+    tmp_out = tempfile.mktemp(suffix=".mp4")
+    sdk.setup(src_img, tmp_out, online_mode=True, max_size=1024)
+    N_total = int(np.ceil(duration * 25))  # 25fps
+    sdk.setup_Nd(N_total)
+    print("✅ Setup complete")
+    # Send the audio chunk by chunk
+    print("\n3. Sending audio in chunks...")
+    chunk_sec = 0.2  # 200ms
+    chunk_samples = int(sample_rate * chunk_sec)
+    chunks_sent = 0
+    frames_received = 0
+    start_time = time.time()
+    for i in range(0, len(audio_data), chunk_samples):
+        chunk = audio_data[i:i + chunk_samples]
+        if len(chunk) < chunk_samples:
+            chunk = np.pad(chunk, (0, chunk_samples - len(chunk)))
+        sdk.run_chunk(chunk)
+        chunks_sent += 1
+        # Check the queue for frames
+        while sdk.writer_queue.qsize() > 0:
+            try:
+                frame = sdk.writer_queue.get_nowait()
+                if frame is not None:
+                    frames_received += 1
+                    print(f"  Frame {frames_received} received (chunk {chunks_sent})")
+            except Exception:
+                break
+        time.sleep(0.05)  # brief pause
+    # Wait for the remaining frames
+    print("\n4. Processing remaining frames...")
+    timeout = 5.0  # 5-second timeout
+    timeout_start = time.time()
+    while time.time() - timeout_start < timeout:
+        if sdk.writer_queue.qsize() > 0:
+            try:
+                frame = sdk.writer_queue.get_nowait()
+                if frame is not None:
+                    frames_received += 1
+                    print(f"  Frame {frames_received} received")
+            except Exception:
+                pass
+        else:
+            time.sleep(0.1)
+    # Close
+    print("\n5. Closing SDK...")
+    sdk.close()
+    elapsed = time.time() - start_time
+    # Results
+    print("\n=== Test results ===")
+    print(f"✅ Chunks sent: {chunks_sent}")
+    print(f"✅ Frames received: {frames_received}")
+    print(f"✅ Processing time: {elapsed:.2f} s")
+    print(f"✅ Output file: {tmp_out}")
+    # Check against the expected result
+    expected_frames = int(duration * 25)  # 25fps
+    if frames_received >= expected_frames * 0.8:  # at least 80%
+        print("✅ Test passed!")
+    else:
+        print(f"⚠️ Fewer frames received than expected ({expected_frames})")
+    return True
+def test_writer_queue():
+    """Check the behavior of writer_queue"""
+    print("\n=== writer_queue check ===")
+    sdk = StreamSDK(CFG_PKL, DATA_ROOT)
+    # Check that the queue exists
+    if hasattr(sdk, 'writer_queue'):
+        print("✅ writer_queue exists")
+        print(f"  Queue size: {sdk.writer_queue.qsize()}")
+        print(f"  Max size: {sdk.writer_queue.maxsize}")
+    else:
+        print("❌ writer_queue not found")
+        return False
+    return True
+if __name__ == "__main__":
+    # Check writer_queue
+    if not test_writer_queue():
+        print("Basic requirements are not met")
+        exit(1)
+    # Streaming test
+    try:
+        test_streaming()
+    except Exception as e:
+        print(f"❌ Error: {e}")
+        import traceback
+        traceback.print_exc()
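
Review note on the queue polling: both `app_streaming.py` and `test_streaming.py` repeat the same "check `qsize()`, then `get_nowait()`" loop. A minimal sketch of a shared helper follows, assuming `writer_queue` behaves like a standard `queue.Queue` (the diff uses `qsize()`, `get_nowait()`, and `maxsize`, which is consistent with that but not confirmed). The name `drain_frames` is hypothetical, and unlike the original loop the limit here counts only non-`None` items.

```python
import queue
from typing import Optional


def drain_frames(q: queue.Queue, max_frames: Optional[int] = None) -> list:
    """Pop frames without blocking, skipping None entries.

    Stops when the queue is empty or max_frames frames were collected.
    """
    frames = []
    while max_frames is None or len(frames) < max_frames:
        try:
            item = q.get_nowait()
        except queue.Empty:
            break  # nothing left to read right now
        if item is not None:
            frames.append(item)
    return frames


# Stand-in strings play the role of image frames in this demo.
q = queue.Queue()
for item in ("f1", None, "f2", "f3"):
    q.put(item)

first_batch = drain_frames(q, max_frames=2)
second_batch = drain_frames(q)
print(first_batch)   # ['f1', 'f2']
print(second_batch)  # ['f3']
```

Catching `queue.Empty` directly, rather than relying on `qsize()` or a bare `except`, avoids masking unrelated errors and does not depend on the size check staying accurate between calls.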