Commit 0cff18c · Initial commit: Add project files and README
Parent(s): none (initial commit)
Browse files

- README.md (+78, -0)
- __pycache__/modal_whisper_app.cpython-310.pyc (+0, -0, binary)
- app.py (+237, -0)
- instructions.txt (+421, -0)
- modal_whisper_app.py (+162, -0)
- requirements.txt (+5, -0)
README.md
ADDED
@@ -0,0 +1,78 @@
# Video MCP Transcription Service

This project provides a video transcription service using Python, Modal, and Hugging Face's Whisper model. It allows users to submit a video URL (e.g., YouTube) or upload a video file, and it returns the transcribed audio text.

## Features

- Downloads videos from URLs using `yt-dlp`.
- Extracts audio and transcribes it using a Modal-deployed Whisper model.
- Provides a simple web interface using Gradio for local testing.
- The Modal function (`modal_whisper_app.py`) handles the heavy lifting of transcription in a serverless environment.

## Project Structure

- `app.py`: The main Gradio application for local interaction and calling the Modal function.
- `modal_whisper_app.py`: Defines the Modal app and the `transcribe_video_audio` function that runs the Whisper model.
- `requirements.txt`: Lists Python dependencies for the local `app.py`.

## Setup

### Prerequisites

- Python 3.10+
- Modal account and CLI installed and configured (`pip install modal-client`, then `modal setup`).
- `ffmpeg` installed locally (for `yt-dlp` and `moviepy` to process video/audio).
  - On Debian/Ubuntu: `sudo apt update && sudo apt install ffmpeg`
  - On macOS (using Homebrew): `brew install ffmpeg`

### Local Setup

1. **Clone the repository:**
   ```bash
   git clone https://github.com/jomasego/video_mcp.git
   cd video_mcp
   ```

2. **Install local dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Deploy the Modal function (if not already deployed or if changes were made):**
   Ensure your Modal CLI is authenticated.
   ```bash
   modal deploy modal_whisper_app.py
   ```
   This deploys the `transcribe_video_audio` function to Modal. You should see a success message with a deployment URL.

### Running the Local Application

1. **Start the Gradio app:**
   ```bash
   python3 app.py
   ```
2. Open your web browser and go to the URL provided by Gradio (usually `http://127.0.0.1:7860`).
3. Enter a video URL or upload a video file to get the transcription.

## Modal Function Details

The `modal_whisper_app.py` script defines a Modal function that:

- Uses a custom Docker image with `ffmpeg`, `transformers`, `torch`, `moviepy`, `soundfile`, and `huggingface_hub`.
- Takes video bytes as input.
- Uses `moviepy` to extract audio from the video.
- Uses the Hugging Face `transformers` pipeline with a specified Whisper model (e.g., `openai/whisper-large-v3`) to transcribe the audio.
- Requires a Hugging Face token stored as a Modal secret (`HF_TOKEN_SECRET`) if using gated models or for authenticated access.

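Once deployed, the function can be looked up and called from any Python process that is authenticated with Modal. A minimal sketch, mirroring what `app.py` does (the app and function names, and the `sample.mp4` placeholder, come from this repository's code):

```python
import modal

# Look up the deployed function by app name and function name.
transcribe = modal.Function.from_name("whisper-transcriber", "transcribe_video_audio")

# Read a local video and send its bytes to the remote function.
with open("sample.mp4", "rb") as f:  # placeholder local video file
    video_bytes = f.read()

print(transcribe.remote(video_bytes))  # blocks until the transcription is returned
```
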
## Future Work

- Deploy as an MCP (Model Context Protocol) server on Hugging Face Spaces.
- Develop a chat interface (e.g., using Claude 3.5 Sonnet) to interact with the transcription service, allowing users to ask questions about the video content based on the transcription.

## Troubleshooting

- **`ModuleNotFoundError: No module named 'moviepy.editor'` (in Modal logs):**
  This indicates `moviepy` might not be correctly installed in the Modal image. Ensure `moviepy` is in `pip_install` and/or `run_commands("pip install moviepy")` in `modal_whisper_app.py` and redeploy.
- **`yt-dlp` errors or warnings about `ffmpeg`:**
  Ensure `ffmpeg` is installed on your local system where `app.py` is run, and also within the Modal image (`apt_install("ffmpeg")`).
- **Modal authentication errors:**
  Ensure `modal setup` has been run and your Modal token is active. For Hugging Face Spaces, Modal tokens might need to be set as environment variables/secrets, as sketched below.
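For reference, both authentication paths can be sketched as follows (token values are placeholders; never commit real tokens):

```bash
# Local machine: authenticate the Modal CLI once
modal token set --token-id <YOUR_MODAL_TOKEN_ID> --token-secret <YOUR_MODAL_TOKEN_SECRET>
```

On a Hugging Face Space, add the same values as Repository secrets (e.g. `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET`, the variable names referenced in `app.py`) instead of hardcoding them.
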
__pycache__/modal_whisper_app.cpython-310.pyc
ADDED
Binary file (4.26 kB).
app.py
ADDED
@@ -0,0 +1,237 @@
import gradio as gr
import os
import requests
import tempfile
import subprocess
import re
import shutil  # Added for rmtree
import modal

def is_youtube_url(url_string: str) -> bool:
    """Checks if the given string is a YouTube URL."""
    # More robust regex to find YouTube video ID, accommodating various URL formats
    # and additional query parameters.
    youtube_regex = (
        r'(?:youtube(?:-nocookie)?\.com/(?:[^/\n\s]+/|watch(?:/|\?(?:[^&\n\s]+&)*v=)|embed(?:/|\?(?:[^&\n\s]+&)*feature=oembed)|shorts/|live/)|youtu\.be/)'
        r'([a-zA-Z0-9_-]{11})'  # This captures the 11-character video ID
    )
    # We use re.search because the video ID might not be at the start of the query string part of the URL.
    # re.match only matches at the beginning of the string (or beginning of line in multiline mode).
    # The regex now directly looks for the 'v=VIDEO_ID' or youtu.be/VIDEO_ID structure.
    # The first part of the regex matches the domain and common paths, the second part captures the ID.
    return bool(re.search(youtube_regex, url_string))

def download_video(url_string: str, temp_dir: str) -> str | None:
    """Downloads video from a URL (YouTube or direct link) to a temporary directory."""
    if is_youtube_url(url_string):
        print(f"Attempting to download YouTube video: {url_string}")
        # Define a fixed output filename pattern within the temp_dir
        output_filename_template = "downloaded_video.%(ext)s"  # yt-dlp replaces %(ext)s
        output_path_template = os.path.join(temp_dir, output_filename_template)

        cmd = [
            "yt-dlp",
            "-f", "bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4/best",  # Prefer mp4 format
            "--output", output_path_template,
            url_string
        ]
        print(f"Executing yt-dlp command: {' '.join(cmd)}")

        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=300, check=False)

            print(f"yt-dlp STDOUT:\n{result.stdout}")
            print(f"yt-dlp STDERR:\n{result.stderr}")

            if result.returncode == 0:
                # Find the actual downloaded file based on the template
                downloaded_file_path = None
                for item in os.listdir(temp_dir):
                    if item.startswith("downloaded_video."):
                        potential_path = os.path.join(temp_dir, item)
                        if os.path.isfile(potential_path):
                            downloaded_file_path = potential_path
                            print(f"YouTube video successfully downloaded to: {downloaded_file_path}")
                            break
                if downloaded_file_path:
                    return downloaded_file_path
                else:
                    print(f"yt-dlp seemed to succeed (exit code 0) but the output file 'downloaded_video.*' was not found in {temp_dir}.")
                    return None
            else:
                print(f"yt-dlp failed with return code {result.returncode}.")
                return None
        except subprocess.TimeoutExpired:
            print(f"yt-dlp command timed out after 300 seconds for URL: {url_string}")
            return None
        except Exception as e:
            print(f"An unexpected error occurred during yt-dlp execution for {url_string}: {e}")
            return None

    elif url_string.startswith(('http://', 'https://')) and url_string.lower().endswith(('.mp4', '.mov', '.avi', '.mkv', '.webm')):
        print(f"Attempting to download direct video link: {url_string}")
        try:
            response = requests.get(url_string, stream=True, timeout=300)  # 5 min timeout
            response.raise_for_status()  # Raises HTTPError for bad responses (4XX or 5XX)

            filename = os.path.basename(url_string) or "downloaded_video_direct.mp4"
            video_file_path = os.path.join(temp_dir, filename)

            with open(video_file_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            print(f"Direct video downloaded successfully to: {video_file_path}")
            return video_file_path
        except requests.exceptions.RequestException as e:
            print(f"Error downloading direct video link {url_string}: {e}")
            return None
        except Exception as e:
            print(f"An unexpected error occurred during direct video download for {url_string}: {e}")
            return None
    else:
        print(f"Input '{url_string}' is not a recognized YouTube URL or direct video link for download.")
        return None


def process_video_input(input_string: str) -> str:
    """
    Processes the video (from URL or local file path) and returns its transcription status.
    """
    if not input_string:
        return "Error: No video URL or file path provided."

    video_path_to_process = None
    created_temp_dir = None  # To store path of temp directory if created for download

    try:
        if input_string.startswith(('http://', 'https://')):
            print(f"Input is a URL: {input_string}")
            created_temp_dir = tempfile.mkdtemp()
            print(f"Created temporary directory for download: {created_temp_dir}")
            downloaded_path = download_video(input_string, created_temp_dir)

            if downloaded_path and os.path.exists(downloaded_path):
                video_path_to_process = downloaded_path
            else:
                # Error message is already printed by download_video or this block
                print(f"Failed to download or locate video from URL: {input_string}")
                # Cleanup is handled in finally, so just return error
                return "Error: Failed to download video from URL."

        elif os.path.exists(input_string):
            print(f"Input is a local file path: {input_string}")
            video_path_to_process = input_string
        else:
            return f"Error: Input '{input_string}' is not a valid URL or an existing file path."

        if video_path_to_process:
            print(f"Processing video: {video_path_to_process}")
            print(f"Video path to process: {video_path_to_process}")
            try:
                print("Reading video file into bytes...")
                with open(video_path_to_process, "rb") as video_file:
                    video_bytes_content = video_file.read()
                print(f"Read {len(video_bytes_content)} bytes from video file.")

                # Ensure MODAL_TOKEN_ID and MODAL_TOKEN_SECRET are set as environment variables
                # in your Hugging Face Space. For local `python app.py` runs, Modal CLI's
                # authenticated state is usually used.
                # os.environ["MODAL_TOKEN_ID"] = "your_modal_token_id"  # Replace or set in HF Space
                # os.environ["MODAL_TOKEN_SECRET"] = "your_modal_token_secret"  # Replace or set in HF Space

                print("Looking up Modal function 'whisper-transcriber/transcribe_video_audio'...")
                # The function name should match what was deployed.
                # It's typically 'AppName/FunctionName' or just 'FunctionName' if app is default.
                # Based on your deployment log, app name is 'whisper-transcriber'
                # and function is 'transcribe_video_audio'
                try:
                    f = modal.Function.from_name("whisper-transcriber", "transcribe_video_audio")
                    print("Modal function looked up successfully.")
                except modal.Error as e:
                    print("Modal function 'whisper-transcriber/transcribe_video_audio' not found. Trying with just function name.")
                    # Fallback or alternative lookup, though the above should be correct for named apps.
                    # This might be needed if the app name context is implicit.
                    # For a named app 'whisper-transcriber' and function 'transcribe_video_audio',
                    # the lookup `modal.Function.lookup("whisper-transcriber", "transcribe_video_audio")` is standard.
                    # If it was deployed as part of the default app, then just "transcribe_video_audio" might work.
                    # Given the deployment log, the first lookup should be correct.
                    return "Error: Could not find the deployed Modal function. Please check deployment status and name."

                print("Calling Modal function for transcription...")
                # Using .remote() for asynchronous execution, .call() for synchronous
                # For Gradio, synchronous (.call()) might be simpler to handle the response directly.
                transcription = f.remote(video_bytes_content)  # Use .remote() for Modal function call
                print(f"Received transcription from Modal: {transcription[:100]}...")
                return transcription
            except FileNotFoundError:
                print(f"Error: Video file not found at {video_path_to_process} before sending to Modal.")
                return "Error: Video file disappeared before processing."
            except modal.Error as e:  # Using modal.Error as the base Modal exception
                print(f"Modal specific error: {e}")
                return f"Error during Modal operation: {str(e)}"
            except Exception as e:
                print(f"An unexpected error occurred while calling Modal: {e}")
                import traceback
                traceback.print_exc()
                return f"Error: Failed to get transcription. {str(e)}"
        else:
            # This case should ideally be caught by earlier checks
            return "Error: No video available to process after input handling."

    finally:
        if created_temp_dir and os.path.exists(created_temp_dir):
            print(f"Cleaning up temporary directory: {created_temp_dir}")
            try:
                shutil.rmtree(created_temp_dir)
                print(f"Successfully removed temporary directory: {created_temp_dir}")
            except Exception as e:
                print(f"Error removing temporary directory {created_temp_dir}: {e}")

# Gradio Interface for the API endpoint
api_interface = gr.Interface(
    fn=process_video_input,
    inputs=gr.Textbox(label="Video URL or Local File Path for Transcription",
                      placeholder="Enter YouTube URL, direct video URL (.mp4, .mov, etc.), or local file path..."),
    outputs="text",
    title="Video Transcription API",
    description="Provide a video URL or local file path to get its audio transcription status.",
    allow_flagging="never"
)

# Gradio Interface for a simple user-facing demo
def demo_process_video(input_string: str) -> str:
    """
    A simple demo function for the Gradio UI.
    It calls the same backend logic as the API.
    """
    print(f"Demo received input: {input_string}")
    result = process_video_input(input_string)  # Call the core logic
    return result

demo_interface = gr.Interface(
    fn=demo_process_video,
    inputs=gr.Textbox(label="Upload Video URL or Local File Path for Demo",
                      placeholder="Enter YouTube URL, direct video URL (.mp4, .mov, etc.), or local file path..."),
    outputs="text",
    title="Video Transcription Demo",
    description="Provide a video URL or local file path to see its transcription status.",
    allow_flagging="never"
)

# Combine interfaces into a Blocks app
with gr.Blocks() as app:
    gr.Markdown("# Contextual Video Data Server")
    gr.Markdown("This Hugging Face Space acts as a backend for processing video context for AI models.")

    with gr.Tab("API Endpoint (for AI Models)"):
        gr.Markdown("### Use this endpoint from another application (e.g., another Hugging Face Space).")
        gr.Markdown("The `process_video_input` function is exposed here.")
        api_interface.render()

    with gr.Tab("Demo (for Manual Testing)"):
        gr.Markdown("### Manually test video URLs or paths and observe the response.")
        demo_interface.render()

# Launch the Gradio application
if __name__ == "__main__":
    app.launch(server_name="0.0.0.0")
instructions.txt
ADDED
@@ -0,0 +1,421 @@
Understood. It's crucial to handle credentials securely.

**Important Security Note for AI Agents & Public Repositories:**

When providing credentials to an AI agent, especially for code generation that might end up in a public repository (like Hugging Face Spaces), it's vital to *never hardcode them directly into the source code*. This is a major security risk.

Instead, we should always use environment variables. This keeps your sensitive keys out of your code and out of public view. Both Hugging Face Spaces and Modal support setting environment variables.

I will include instructions for SWE-1 on how to use these credentials via environment variables.

|
11 |
+
---
|
12 |
+
|
13 |
+
### Instructions for AI Agent (SWE-1) - "Contextual Video Data Server" (Updated with Credentials Handling)
|
14 |
+
|
15 |
+
**Project Name:** Contextual Video Data Server (Your Hugging Face Space)
|
16 |
+
|
17 |
+
**Goal:** To build a Gradio application deployed on Hugging Face Spaces that acts as a video processing and data serving backend. It will accept video uploads, transcribe their audio using a Modal-deployed Whisper model, and expose an API endpoint to serve this transcribed text. This server will be consumed by another Hugging Face Space (the "Model's Frontend") which will then interact with the Anthropic API.
|
18 |
+
|
19 |
+
**Credentials to be Used (via Environment Variables):**
|
20 |
+
|
21 |
+
* **Anthropic API Key:** `YOUR_ANTHROPIC_API_KEY_HERE` (This will be used by the *other* Hugging Face Space, the "Model's Frontend," not directly by this "MCP Tool/Server" Space. However, it's good to keep in mind for future steps).
|
22 |
+
* **Modal Token ID:** `ak-MZoZD4vvq8KDMQJolFnix2`
|
23 |
+
* **Modal Token Secret:** `as-pVhhGl2cv30MhKUV3sXJKb`
|
24 |
+
* **HuggingFace Token:** `YOUR_HUGGINGFACE_TOKEN_HERE` (This is typically for logging into `huggingface_hub` for model downloads/uploads if needed, and also used by the Hugging Face Spaces platform itself for cloning repos etc.)
|
25 |
+
|
26 |
+
**High-Level Plan:**
|
27 |
+
|
28 |
+
1. **Gradio App with API Endpoint:** Create a Gradio application that can upload videos and expose a function via an API.
|
29 |
+
2. **Modal Backend for Whisper Transcription:** Develop a Modal application to perform audio extraction and Whisper transcription.
|
30 |
+
3. **Integration:** Connect the Gradio app to the Modal backend.
|
31 |
+
|
32 |
+
---
|
33 |
+
|
34 |
+
**Detailed Instructions for SWE-1:**
|
35 |
+
|
36 |
+
#### Part 1: Gradio Application Setup (The "MCP Tool/Server" Frontend)
|
37 |
+
|
38 |
+
**Objective:** Create a basic Gradio application that handles video uploads and defines a function that can be exposed as an API endpoint. This function will initially just return a placeholder string.
|
39 |
+
|
40 |
+
**Dependencies:**
|
41 |
+
* `gradio`
|
42 |
+
* `moviepy`
|
43 |
+
* `requests` (added for future integration with Modal)
|
44 |
+
|
45 |
+
**Files to Create:**
|
46 |
+
* `app.py`
|
47 |
+
* `requirements.txt`
|
48 |
+
|
49 |
+
**`requirements.txt` content:**
|
50 |
+
```
|
51 |
+
gradio
|
52 |
+
moviepy
|
53 |
+
requests
|
54 |
+
```
|
55 |
+
|
56 |
+
**`app.py` content (initial structure):**
|
57 |
+
|
58 |
+
```python
|
59 |
+
import gradio as gr
|
60 |
+
import os
|
61 |
+
import requests
|
62 |
+
import tempfile
|
63 |
+
|
64 |
+
# Placeholder for the function that will process the video and return transcription.
|
65 |
+
# This function will eventually call our Modal backend.
|
66 |
+
def process_video_for_api(video_path: str) -> str:
|
67 |
+
"""
|
68 |
+
Processes the uploaded video and returns its transcription.
|
69 |
+
This is the function that will be exposed via the Gradio API.
|
70 |
+
"""
|
71 |
+
if video_path is None:
|
72 |
+
return "Error: No video file uploaded."
|
73 |
+
|
74 |
+
# In this initial version, we just return a placeholder.
|
75 |
+
# Later, this will call the Modal function.
|
76 |
+
print(f"Received video for processing: {video_path}")
|
77 |
+
return f"Video {os.path.basename(video_path)} received. Transcription pending from Modal."
|
78 |
+
|
79 |
+
# Gradio Interface for the API endpoint
|
80 |
+
# This interface will primarily be consumed by the "Model's Frontend" Space.
|
81 |
+
api_interface = gr.Interface(
|
82 |
+
fn=process_video_for_api,
|
83 |
+
inputs=gr.Video(label="Video File for Transcription"),
|
84 |
+
outputs="text",
|
85 |
+
title="Video Transcription API",
|
86 |
+
description="Upload a video to get its audio transcription for AI context.",
|
87 |
+
allow_flagging="never"
|
88 |
+
)
|
89 |
+
|
90 |
+
# Gradio Interface for a simple user-facing demo (optional, but good for testing)
|
91 |
+
def demo_process_video(video_path: str) -> str:
|
92 |
+
"""
|
93 |
+
A simple demo function for the Gradio UI.
|
94 |
+
It calls the same backend logic as the API.
|
95 |
+
"""
|
96 |
+
print(f"Demo received video: {video_path}")
|
97 |
+
result = process_video_for_api(video_path) # Call the core logic
|
98 |
+
return result
|
99 |
+
|
100 |
+
demo_interface = gr.Interface(
|
101 |
+
fn=demo_process_video,
|
102 |
+
inputs=gr.Video(label="Upload Video for Demo"),
|
103 |
+
outputs="text",
|
104 |
+
title="Video Transcription Demo",
|
105 |
+
description="Upload a video to see its immediate transcription status (from the API).",
|
106 |
+
allow_flagging="never"
|
107 |
+
)
|
108 |
+
|
109 |
+
# Combine interfaces into a Blocks app for a better user experience in the Space.
|
110 |
+
with gr.Blocks() as app:
|
111 |
+
gr.Markdown("# Contextual Video Data Server")
|
112 |
+
gr.Markdown("This Hugging Face Space acts as a backend for processing video context for AI models.")
|
113 |
+
|
114 |
+
with gr.Tab("API Endpoint (for AI Models)"):
|
115 |
+
gr.Markdown("### Use this endpoint from another application (e.g., another Hugging Face Space).")
|
116 |
+
gr.Markdown("The `process_video_for_api` function is exposed here.")
|
117 |
+
api_interface.render()
|
118 |
+
|
119 |
+
with gr.Tab("Demo (for Manual Testing)"):
|
120 |
+
gr.Markdown("### Manually test video uploads and observe the response.")
|
121 |
+
demo_interface.render()
|
122 |
+
|
123 |
+
# Launch the Gradio application
|
124 |
+
if __name__ == "__main__":
|
125 |
+
app.launch()
|
126 |
+
```
|

**Implementation Instructions:**

1. **Create Project Folder:** Create a new folder for your Hugging Face Space project (e.g., `video-data-server-space`).
2. **Create `requirements.txt`:** Inside this folder, create a file named `requirements.txt` and paste the content provided above.
3. **Create `app.py`:** Inside the same folder, create a file named `app.py` and paste the Python code provided above.
4. **Local Testing (Optional but Recommended):**
    * Open your terminal or command prompt.
    * Navigate to your project folder (`cd video-data-server-space`).
    * Install dependencies: `pip install -r requirements.txt`
    * Run the Gradio app: `python app.py`
    * Open the URL provided by Gradio (usually `http://127.0.0.1:7860`) in your web browser.
    * Test uploading a video. You should see the placeholder response.
5. **Hugging Face Spaces Deployment:**
    * Create a new Space on Hugging Face.
    * Choose "Gradio" as the SDK.
    * Select "Public" or "Private" as per your preference.
    * Select a hardware configuration (CPU Basic is fine for this initial placeholder).
    * Upload your `app.py` and `requirements.txt` files to the Space.
    * **Crucially, set environment variables for your Hugging Face Token (if you intend to use it within the Space for private models or repo access) and the Modal API URL (once it's known).** You do this in the Space settings under "Settings" -> "Repository secrets".
        * `HF_TOKEN`: `YOUR_HUGGINGFACE_TOKEN_HERE` (Though for this specific app, it's not strictly necessary unless you're accessing private HF models or repos from within the Space).
        * `MODAL_API_URL`: (Will be set in Part 3 after Modal deployment)
    * Once deployed, the Space will be accessible. The API endpoint (`process_video_for_api`) will be available via the Space's URL. The exact API path will be shown in the Gradio documentation within the Space.

#### Part 2: Modal Backend for Whisper Transcription

**Objective:** Create a Modal application that can perform audio extraction from a video and transcribe it using OpenAI's Whisper model via the Hugging Face Transformers library. This will be an independent service that your Gradio app calls.

**Dependencies:**
* `modal-client`
* `huggingface_hub`
* `transformers`
* `accelerate`
* `soundfile`
* `ffmpeg-python`
* `moviepy`
* `torch` (CPU version is fine for small models, GPU version if using larger models on a GPU-enabled Modal function)

**Files to Create:**
* `modal_whisper_app.py`

**`modal_whisper_app.py` content:**

```python
import modal
import io
import torch
from transformers import pipeline
import moviepy.editor as mp
import os
import tempfile

# Modal Stub for our application
stub = modal.Stub(name="video-whisper-transcriber")

# Define the image for our Modal function
# We'll use a specific Hugging Face Transformers image or a custom one with dependencies
whisper_image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")  # ffmpeg is essential for moviepy
    .pip_install(
        "transformers",
        "accelerate",
        "soundfile",
        "moviepy",
        "huggingface_hub",
        "torch"  # install torch for CPU by default
    )
    # If you need GPU, specify the CUDA version for torch and use GPU
    # .pip_install("torch --index-url https://download.pytorch.org/whl/cu121")
)

@stub.function(
    image=whisper_image,
    # Configure resources for the function. For larger Whisper models, you might need a GPU.
    # For 'tiny.en' or 'base.en', CPU might be sufficient, but GPU will be faster.
    # gpu="A10G"  # Uncomment and adjust if you need GPU (e.g., "A10G", "T4", etc.)
    timeout=600  # 10 minutes timeout for potentially long videos
)
@modal.web_endpoint(method="POST")  # Expose this function as a web endpoint
def transcribe_video_audio(video_bytes: bytes) -> str:
    """
    Receives video bytes, extracts audio, and transcribes it using OpenAI Whisper.
    """
    if not video_bytes:
        return "Error: No video bytes provided."

    print("Received video bytes for transcription.")

    # Save the received bytes to a temporary video file
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as temp_video_file:
        temp_video_file.write(video_bytes)
        temp_video_path = temp_video_file.name

    try:
        # Load the video and extract audio
        video = mp.VideoFileClip(temp_video_path)

        # Save audio to a temporary WAV file
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_audio_file:
            temp_audio_path = temp_audio_file.name
            video.audio.write_audiofile(temp_audio_path, logger=None)  # logger=None to suppress ffmpeg output

        # Initialize the Whisper ASR pipeline
        # Using a small, English-only model for faster processing
        # You can change 'tiny.en' to 'base.en', 'small.en', or 'medium.en' if needed.
        # Ensure you have enough memory/GPU if using larger models.
        # Use GPU if available on Modal, otherwise CPU.
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

        pipe = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-tiny.en",  # Using the cheapest model as requested
            torch_dtype=torch_dtype,
            device=device,
        )

        # Transcribe the audio
        print(f"Transcribing audio from {temp_audio_path} using Whisper on {device}...")
        transcription_result = pipe(temp_audio_path, generate_kwargs={"task": "transcribe"})
        transcribed_text = transcription_result["text"]
        print("Transcription complete.")

        return transcribed_text

    except Exception as e:
        print(f"An error occurred during transcription: {e}")
        return f"Error during video processing: {e}"
    finally:
        # Clean up temporary files
        if 'temp_video_path' in locals() and os.path.exists(temp_video_path):
            os.remove(temp_video_path)
        if 'temp_audio_path' in locals() and os.path.exists(temp_audio_path):
            os.remove(temp_audio_path)

# You can add local testing code if needed
@stub.local_entrypoint()
def main():
    print("To deploy this Modal application, run `modal deploy modal_whisper_app.py`.")
    print("Ensure your Modal token is set using `modal token set --token-id <ID> --token-secret <SECRET>`")

```

**Implementation Instructions:**

1. **Install Modal CLI:** If you haven't already, install the Modal CLI: `pip install modal-client`
2. **Authenticate Modal:** Run the following command in your terminal to set your Modal credentials as environment variables for the CLI:
    ```bash
    modal token set --token-id ak-MZoZD4vvq8KDMQJolFnix2 --token-secret as-pVhhGl2cv30MhKUV3sXJKb
    ```
3. **Create `modal_whisper_app.py`:** Create a file named `modal_whisper_app.py` and paste the content provided above.
4. **Review Dependencies and Resources:**
    * The code defaults to CPU for `torch` and `whisper-tiny.en`. If you want to use a GPU for faster processing with Modal, uncomment `gpu="A10G"` (or your preferred GPU type) and adjust the `torch` installation line to include CUDA support (e.g., `pip_install("torch --index-url https://download.pytorch.org/whl/cu121")`). Remember to use the cheapest model (`tiny.en`) as requested.
    * Consider the `timeout` for longer videos.
5. **Deploy to Modal:**
    * Open your terminal or command prompt.
    * Navigate to the directory where you saved `modal_whisper_app.py`.
    * Deploy the Modal application: `modal deploy modal_whisper_app.py`
    * Modal will provide you with a URL for the `transcribe_video_audio` endpoint (e.g., `https://your-workspace-name.modal.run/transcribe_video_audio`). **Keep this URL handy, as you'll need it in the next step.**

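Before wiring Gradio to this endpoint, you can sanity-check the deployment from step 5 by POSTing a video to it directly. A hedged sketch only: it sends the raw video bytes as the request body, mirroring the `requests.post(MODAL_API_URL, data=video_bytes)` call used in Part 3; the URL is the placeholder from step 5 and `sample.mp4` stands in for any small local video.

```bash
# Raw-body POST of a local video file to the deployed Modal web endpoint.
curl -X POST --data-binary @sample.mp4 \
  "https://your-workspace-name.modal.run/transcribe_video_audio"
```
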
#### Part 3: Integration: Connecting Gradio to Modal

**Objective:** Modify the Gradio application (`app.py`) to call the deployed Modal endpoint for video transcription instead of returning a placeholder.

**Dependencies:**
* `requests` (already added in Part 1)

**Files to Modify:**
* `app.py`
* `requirements.txt` (already updated in Part 1)

**`app.py` modification:**

You'll need to replace the placeholder logic in `process_video_for_api` with a call to your Modal endpoint.

```python
import gradio as gr
import os
import requests
import tempfile

# --- IMPORTANT ---
# This URL MUST be set as an environment variable in your Hugging Face Space.
# Name the environment variable MODAL_API_URL.
# During local testing, you can uncomment and set it here temporarily.
MODAL_API_URL = os.environ.get("MODAL_API_URL", "YOUR_MODAL_WHISPER_ENDPOINT_URL_HERE")
# Example if testing locally: MODAL_API_URL = "https://your-workspace-name.modal.run/transcribe_video_audio"
# --- IMPORTANT ---

def process_video_for_api(video_path: str) -> str:
    """
    Processes the uploaded video and returns its transcription by calling the Modal backend.
    """
    if MODAL_API_URL == "YOUR_MODAL_WHISPER_ENDPOINT_URL_HERE":
        return "Error: MODAL_API_URL is not set. Please configure it in your Hugging Face Space secrets."

    if video_path is None:
        return "Error: No video file uploaded."

    print(f"Received video for processing: {video_path}")

    try:
        # Gradio provides a temporary path. We need to read the bytes to send to Modal.
        with open(video_path, "rb") as video_file:
            video_bytes = video_file.read()

        print(f"Sending video bytes to Modal at {MODAL_API_URL}...")
        response = requests.post(MODAL_API_URL, data=video_bytes)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

        transcribed_text = response.text
        print("Transcription received from Modal.")
        return transcribed_text

    except requests.exceptions.RequestException as e:
        print(f"Error calling Modal backend: {e}")
        return f"Error communicating with transcription service: {e}"
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return f"An unexpected error occurred during processing: {e}"

# The rest of your app.py remains the same.

# Gradio Interface for the API endpoint
api_interface = gr.Interface(
    fn=process_video_for_api,
    inputs=gr.Video(label="Video File for Transcription"),
    outputs="text",
    title="Video Transcription API",
    description="Upload a video to get its audio transcription for AI context.",
    allow_flagging="never"
)

# Gradio Interface for a simple user-facing demo (optional, but good for testing)
def demo_process_video(video_path: str) -> str:
    """
    A simple demo function for the Gradio UI.
    It calls the same backend logic as the API.
    """
    print(f"Demo received video: {video_path}")
    result = process_video_for_api(video_path)  # Call the core logic
    return result

demo_interface = gr.Interface(
    fn=demo_process_video,
    inputs=gr.Video(label="Upload Video for Demo"),
    outputs="text",
    title="Video Transcription Demo",
    description="Upload a video to see its immediate transcription status (from the API).",
    allow_flagging="never"
)

# Combine interfaces into a Blocks app for a better user experience in the Space.
with gr.Blocks() as app:
    gr.Markdown("# Contextual Video Data Server")
    gr.Markdown("This Hugging Face Space acts as a backend for processing video context for AI models.")

    with gr.Tab("API Endpoint (for AI Models)"):
        gr.Markdown("### Use this endpoint from another application (e.g., another Hugging Face Space).")
        gr.Markdown("The `process_video_for_api` function is exposed here.")
        api_interface.render()

    with gr.Tab("Demo (for Manual Testing)"):
        gr.Markdown("### Manually test video uploads and observe the response.")
        demo_interface.render()

# Launch the Gradio application
if __name__ == "__main__":
    app.launch()
```

**Implementation Instructions:**

1. **Update `app.py`:**
    * Paste the updated `process_video_for_api` function into your `app.py`.
    * Note the line `MODAL_API_URL = os.environ.get("MODAL_API_URL", "YOUR_MODAL_WHISPER_ENDPOINT_URL_HERE")`. This tells the application to fetch the Modal API URL from an environment variable named `MODAL_API_URL`.
2. **Configure Hugging Face Space Secrets:**
    * Go to your Hugging Face Space settings.
    * Navigate to "Settings" -> "Repository secrets".
    * Add a new secret:
        * **Name:** `MODAL_API_URL`
        * **Value:** Paste the actual URL you obtained after deploying your `modal_whisper_app.py` (e.g., `https://your-workspace-name.modal.run/transcribe_video_audio`).
    * (Optional but recommended for general practice) Add `HF_TOKEN` with your Hugging Face token.
3. **Redeploy Gradio Space:**
    * If you're using Git for your Hugging Face Space, commit and push your changes.
    * If you're using the Hugging Face UI, upload the modified `app.py` to your Space.
    * The Space will automatically rebuild and redeploy, now using the environment variable.
4. **Test the Full Flow:**
    * Once your Gradio Space is live, go to the "Demo" tab.
    * Upload a video.
    * The Gradio app will now send the video to your Modal backend, which will transcribe it, and then the transcription will be returned and displayed in the Gradio UI.
    * You can also test the API endpoint directly using tools like `curl` or Postman, or by building a small test script, pointing it to your Space's API URL (e.g., `https://your-username-video-data-server.hf.space/run/process_video_for_api`).

|
421 |
+
This robust setup ensures your credentials are secure and your architecture is well-defined for the hackathon!
|
modal_whisper_app.py
ADDED
@@ -0,0 +1,162 @@
import modal
import os
import tempfile
import io

# Define the Modal image
whisper_image = (
    modal.Image.debian_slim(python_version="3.10")
    .apt_install("ffmpeg")
    .run_commands("pip install moviepy")  # Force install moviepy
    .pip_install(
        "transformers[torch]",
        "accelerate",
        "soundfile",
        "moviepy",  # Essential for audio extraction from video
        "huggingface_hub",
        "ffmpeg-python"
    )
)

app = modal.App(name="whisper-transcriber")  # Changed from modal.Stub to modal.App

# Environment variable for model name, configurable in Modal UI or via .env
MODEL_NAME = os.environ.get("HF_MODEL_NAME", "openai/whisper-base")

# Hugging Face Token - retrieve from memory and set as Modal Secret
# IMPORTANT: Create a Modal Secret named 'my-huggingface-secret' with your actual HF_TOKEN.
# Example: modal secret create my-huggingface-secret HF_TOKEN=your_hf_token_here
HF_TOKEN_SECRET = modal.Secret.from_name("my-huggingface-secret")

@app.function(
    image=whisper_image,
    secrets=[HF_TOKEN_SECRET],
    timeout=1200
)
def transcribe_video_audio(video_bytes: bytes) -> str:
    # Imports moved inside the function to avoid local ModuleNotFoundError during `modal deploy`
    from moviepy.editor import VideoFileClip
    import soundfile as sf
    import torch
    from transformers import pipeline
    from huggingface_hub import login

    if not video_bytes:
        return "Error: No video data received."

    # Login to Hugging Face Hub using the token from Modal secrets
    hf_token = os.environ.get("HF_TOKEN")
    if hf_token:
        try:
            login(token=hf_token)
            print("Successfully logged into Hugging Face Hub.")
        except Exception as e:
            print(f"Hugging Face Hub login failed: {e}. Proceeding, but private models may not be accessible.")
    else:
        print("HF_TOKEN secret not found. Proceeding without login (works for public models).")

    print(f"Processing video for transcription using model: {MODEL_NAME}")

    # Initialize pipeline inside the function.
    # For production/frequent use, consider @stub.cls to load the model once per container lifecycle.
    print("Loading Whisper model...")
    device_map = "cuda:0" if torch.cuda.is_available() else "cpu"
    # Use float16 for GPU for faster inference and less memory, float32 for CPU
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    transcriber = pipeline(
        "automatic-speech-recognition",
        model=MODEL_NAME,
        torch_dtype=torch_dtype,
        device=device_map,
    )
    print(f"Whisper model loaded on device: {device_map} with dtype: {torch_dtype}")

    video_path = None
    audio_path = None

    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tmp_video_file:
            tmp_video_file.write(video_bytes)
            video_path = tmp_video_file.name
        print(f"Temporary video file saved: {video_path}")

        print("Extracting audio from video...")
        video_clip = VideoFileClip(video_path)
        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_audio_file:
            audio_path = tmp_audio_file.name
            video_clip.audio.write_audiofile(audio_path, codec='pcm_s16le', logger=None)
        video_clip.close()
        print(f"Audio extracted to: {audio_path}")

        audio_input, samplerate = sf.read(audio_path)
        if audio_input.ndim > 1:
            audio_input = audio_input.mean(axis=1)  # Convert to mono

        print(f"Audio data shape: {audio_input.shape}, Samplerate: {samplerate}")
        print("Starting transcription...")
        # Pass audio as a dictionary for more control, or directly as numpy array
        # Adding chunk_length_s for handling long audio files better.
        result = transcriber(audio_input.copy(), chunk_length_s=30, batch_size=8, return_timestamps=False)
        transcribed_text = result["text"]

        print(f"Transcription successful. Length: {len(transcribed_text)}")
        if len(transcribed_text) > 100:
            print(f"Transcription preview: {transcribed_text[:100]}...")
        else:
            print(f"Transcription: {transcribed_text}")

        return transcribed_text

    except Exception as e:
        print(f"Error during transcription process: {e}")
        import traceback
        traceback.print_exc()
        return f"Error: Transcription failed. Details: {str(e)}"
    finally:
        for p in [video_path, audio_path]:
            if p and os.path.exists(p):
                try:
                    os.remove(p)
                    print(f"Removed temporary file: {p}")
                except Exception as e_rm:
                    print(f"Error removing temporary file {p}: {e_rm}")

# This is a local entrypoint for testing the Modal function if you run `modal run modal_whisper_app.py`
@app.local_entrypoint()
def main():
    # This is just an example of how you might test.
    # You'd need a sample video file (e.g., "sample.mp4") in the same directory.
    # For actual deployment, this main function isn't strictly necessary as Gradio will call the webhook.
    sample_video_path = "sample.mp4"
    if not os.path.exists(sample_video_path):
        print(f"Sample video {sample_video_path} not found. Skipping local test run.")
        return

    with open(sample_video_path, "rb") as f:
        video_bytes_content = f.read()

    print(f"Testing transcription with {sample_video_path}...")
    transcription = transcribe_video_audio.remote(video_bytes_content)
    print("----")
    print(f"Transcription Result: {transcription}")
    print("----")

# To call this function from another Python script (after deployment):
# import modal
# Ensure the app name matches the one in modal.App(name=...)
# The exact lookup method might vary slightly with modal.App, often it's:
# deployed_app = modal.App.lookup("whisper-transcriber")
# or by accessing the function directly if the app is deployed with a name.
# For a deployed function, you might use its tag or webhook URL directly.
# Example using a direct function call if deployed and accessible:
# f = modal.Function.lookup("whisper-transcriber/transcribe_video_audio")  # Or similar based on deployment output
# For invoking:
# result = f.remote(your_video_bytes)  # for async
# print(result)
# Or, if you have the app object:
# result = app.functions.transcribe_video_audio.remote(your_video_bytes)
# Consult Modal documentation for the precise invocation method for your Modal version and deployment style.

# Note: When deploying to Modal, Modal uses the `app.serve()` or `app.deploy()` mechanism.
# The Gradio app will call the deployed Modal function via its HTTP endpoint.
requirements.txt
ADDED
@@ -0,0 +1,5 @@
gradio
moviepy
requests
yt-dlp
modal