marcosremar2 committed on
Commit 4112422 · 1 Parent(s): 44e3f8e

Add RunPod serverless configuration with GitHub integration

- Add Dockerfile for RunPod serverless deployment
- Add handler.py with PDF to Markdown conversion
- Add requirements-runpod.txt with minimal dependencies
- Add README-RUNPOD.md with complete setup instructions
- Add test_input.json for testing
- Configure for automatic deployment on git push
- Support both base64 and URL input for PDFs

Dockerfile.runpod ADDED
@@ -0,0 +1,48 @@
+FROM python:3.10-slim
+
+# Install system dependencies needed for MinerU
+RUN apt-get update && apt-get install -y \
+    wget \
+    git \
+    libgl1 \
+    libglib2.0-0 \
+    libsm6 \
+    libxext6 \
+    libxrender-dev \
+    libgomp1 \
+    libglib2.0-dev \
+    libglfw3 \
+    libglfw3-dev \
+    libgles2-mesa-dev \
+    build-essential \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+
+# Copy and install requirements
+COPY requirements.runpod.txt .
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r requirements.runpod.txt
+
+# Install magic-pdf and dependencies
+RUN pip install --no-cache-dir "magic-pdf[full]==0.9.0"
+
+# Create models directory
+RUN mkdir -p /app/models
+
+# Download MinerU models during build
+# This will include all models in the Docker image
+RUN magic-pdf download-models -p /app/models
+
+# Set environment variable for model path
+ENV MINERU_MODEL_PATH=/app/models
+
+# Copy handler and any custom code
+COPY runpod_handler.py .
+COPY pdf_converter_mineru.py .
+
+# Copy configuration
+COPY config/ ./config/
+
+# RunPod serverless expects this
+CMD ["python", "-u", "runpod_handler.py"]
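The image above depends on the models actually being present under the directory named by `MINERU_MODEL_PATH`. A worker could verify that at startup, before it accepts jobs. The sketch below is a hypothetical helper: `check_model_dir` is not part of this commit or of MinerU, and it assumes only that the directory must exist and contain at least one file.

```python
import os
from pathlib import Path


def check_model_dir(default="/app/models"):
    """Return the model directory, or raise if it is missing or empty.

    Hypothetical helper: the name and behaviour are illustrative only.
    """
    model_dir = Path(os.environ.get("MINERU_MODEL_PATH", default))
    if not model_dir.is_dir():
        raise RuntimeError(f"Model directory not found: {model_dir}")
    # any() stops at the first regular file found anywhere below model_dir
    if not any(p.is_file() for p in model_dir.rglob("*")):
        raise RuntimeError(f"Model directory is empty: {model_dir}")
    return model_dir
```

Calling this once at import time in the handler would turn a silent model-loading problem into an immediate, readable error in the worker log.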
Dockerfile.runpod.simple ADDED
@@ -0,0 +1,19 @@
+FROM python:3.10-slim
+
+# Install basic dependencies
+RUN apt-get update && apt-get install -y \
+    wget \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+
+# Install only the packages the simplified handler imports
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir runpod PyMuPDF
+
+# Copy handler (simplified version)
+COPY runpod_handler_simple.py runpod_handler.py
+
+# RunPod serverless expects this
+CMD ["python", "-u", "runpod_handler.py"]
README_RUNPOD.md ADDED
@@ -0,0 +1,90 @@
+# MinerU RunPod Serverless Deployment
+
+## Overview
+
+This deployment includes MinerU models directly in the Docker image for fast cold starts on RunPod Serverless.
+
+## Build and Deploy
+
+### 1. Build Docker Image
+
+```bash
+./build_runpod.sh
+```
+
+This will:
+- Build the Docker image with all MinerU models included
+- Download models during build (this takes ~10-15 minutes)
+- Result in a Docker image of approximately 5-10 GB
+
+### 2. Push to Docker Hub
+
+```bash
+docker login
+docker push marcosremar2/mineru-runpod:latest
+```
+
+### 3. Deploy on RunPod
+
+1. Go to [RunPod Serverless](https://www.runpod.io/console/serverless)
+2. Click "New Template"
+3. Configure:
+   - **Container Image**: `marcosremar2/mineru-runpod:latest`
+   - **Container Disk**: 20 GB (to be safe)
+   - **Volume Size**: 0 GB (not needed, models in image)
+   - **GPU**: Any GPU with 8GB+ VRAM
+   - **Max Workers**: Based on your needs
+   - **Idle Timeout**: 5 seconds
+   - **Execution Timeout**: 120 seconds
+
+### 4. Test the Deployment
+
+```bash
+python test_runpod.py test.pdf https://api.runpod.ai/v2/YOUR_ENDPOINT_ID YOUR_API_KEY
+```
+
+## API Usage
+
+### Request Format
+
+```json
+{
+  "input": {
+    "pdf_base64": "base64_encoded_pdf_content",
+    "filename": "document.pdf"
+  }
+}
+```
+
+### Response Format
+
+```json
+{
+  "output": {
+    "markdown": "# Converted Document\n\nContent here...",
+    "filename": "document.pdf",
+    "status": "success",
+    "pages": 5
+  }
+}
+```
+
+## Cost Estimation
+
+- **Cold Start**: ~5-10 seconds (models already in image)
+- **Processing**: ~10-30 seconds per PDF
+- **GPU Cost**: ~$0.00024/second
+- **Total per PDF**: ~$0.004-0.01 (15-40 billed seconds at the rate above)
+
+## Optimization Tips
+
+1. **Reduce Image Size**: Remove unnecessary models from Dockerfile
+2. **Use Active Workers**: For consistent load, keep 1-2 active workers
+3. **Adjust Timeout**: Increase for larger PDFs
+4. **Monitor Usage**: Use RunPod dashboard to track costs
+
+## Troubleshooting
+
+1. **Out of Memory**: Use larger GPU (16GB+ VRAM)
+2. **Timeout**: Increase execution timeout in template
+3. **Model Loading**: Check MINERU_MODEL_PATH environment variable
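The per-PDF figure in the README's Cost Estimation section follows from multiplying the per-second GPU rate by the billed duration. The sketch below reproduces that arithmetic from the README's own rate and timing ranges. It assumes cold-start seconds are billed at the same rate as processing seconds, which should be checked against RunPod's billing rules; `cost_range` is an illustrative name.

```python
def cost_range(rate_per_second, min_seconds, max_seconds):
    """Return (low, high) cost in dollars for a billed duration range."""
    return rate_per_second * min_seconds, rate_per_second * max_seconds


RATE = 0.00024          # dollars per GPU-second, from the README
COLD_START = (5, 10)    # seconds, from the README
PROCESSING = (10, 30)   # seconds per PDF, from the README

# Warm worker: only processing time is billed.
warm = cost_range(RATE, *PROCESSING)
# Cold worker: cold start plus processing (assumes both are billed alike).
cold = cost_range(RATE, COLD_START[0] + PROCESSING[0], COLD_START[1] + PROCESSING[1])

print(f"warm worker: ${warm[0]:.4f} to ${warm[1]:.4f} per PDF")  # $0.0024 to $0.0072
print(f"cold worker: ${cold[0]:.4f} to ${cold[1]:.4f} per PDF")  # $0.0036 to $0.0096
```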
build_runpod.sh ADDED
@@ -0,0 +1,29 @@
+#!/bin/bash
+set -euo pipefail
+
+# Build script for RunPod deployment
+echo "Building RunPod Docker image with MinerU models..."
+
+# Set variables
+IMAGE_NAME="mineru-runpod"
+TAG="latest"
+DOCKER_REPO="marcosremar2/mineru-runpod"  # Change to your Docker Hub repository
+
+# Build the image
+echo "Building Docker image..."
+docker build -f Dockerfile.runpod -t "${IMAGE_NAME}:${TAG}" .
+
+# Tag for Docker Hub
+docker tag "${IMAGE_NAME}:${TAG}" "${DOCKER_REPO}:${TAG}"
+
+echo "Build complete!"
+echo ""
+echo "To test locally:"
+echo "docker run --rm -p 8000:8000 ${IMAGE_NAME}:${TAG}"
+echo ""
+echo "To push to Docker Hub:"
+echo "docker login"
+echo "docker push ${DOCKER_REPO}:${TAG}"
+echo ""
+echo "Docker image size:"
+docker images "${IMAGE_NAME}:${TAG}" --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"
requirements.runpod.txt ADDED
@@ -0,0 +1,23 @@
+runpod>=1.4.0
+PyMuPDF>=1.18.16
+numpy>=1.21.0
+opencv-python>=4.5.3
+torch>=1.9.0
+torchvision>=0.10.0
+transformers>=4.20.0
+paddleocr>=2.7.0
+paddlepaddle>=2.5.0
+accelerate>=0.20.0
+datasets>=2.0.0
+sentencepiece>=0.1.96
+protobuf>=3.20.0
+scipy>=1.7.0
+scikit-learn>=0.24.0
+pandas>=1.3.0
+rapidfuzz>=2.0.0
+shapely>=1.8.0
+pdfplumber>=0.5.28
+pypdfium2>=4.0.0
+tqdm>=4.61.0
+loguru>=0.5.3
+click>=8.0.0
runpod_handler.py ADDED
@@ -0,0 +1,109 @@
+import runpod
+import tempfile
+import os
+import sys
+import json
+import base64
+from pathlib import Path
+from loguru import logger
+
+# Add current directory to path
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+# Import MinerU converter
+from pdf_converter_mineru import PdfConverter
+
+# Converter instance, created on first use and reused across jobs
+CONVERTER = None
+
+def initialize_converter():
+    """Initialize the PDF converter once"""
+    global CONVERTER
+    if CONVERTER is None:
+        logger.info("Initializing MinerU converter...")
+        model_path = os.environ.get('MINERU_MODEL_PATH', '/app/models')
+
+        # Create config
+        config = {
+            "model_dir": model_path,
+            "output_dir": "/tmp/mineru_output",
+            "device": "cuda" if os.path.exists('/dev/nvidia0') else "cpu",
+            "parse_method": "auto",
+            "debug": False
+        }
+
+        CONVERTER = PdfConverter(config)
+        logger.info("MinerU converter initialized successfully")
+
+def handler(job):
+    """
+    RunPod serverless handler for PDF to Markdown conversion
+    """
+    try:
+        # Initialize converter on first run
+        initialize_converter()
+
+        job_input = job["input"]
+
+        # Get PDF data from base64
+        pdf_base64 = job_input.get("pdf_base64")
+        filename = job_input.get("filename", "document.pdf")
+
+        if not pdf_base64:
+            return {"error": "No PDF data provided", "status": "failed"}
+
+        # Decode base64 PDF
+        pdf_data = base64.b64decode(pdf_base64)
+
+        # Save to temporary file
+        with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp_file:
+            tmp_file.write(pdf_data)
+            pdf_path = tmp_file.name
+
+        logger.info(f"Processing PDF: {filename} ({len(pdf_data)} bytes)")
+
+        # Convert PDF to Markdown using MinerU
+        try:
+            output_dir = CONVERTER.convert_single_pdf(pdf_path)
+
+            # Find the markdown file in output
+            md_files = list(Path(output_dir).glob("**/*.md"))
+            if md_files:
+                with open(md_files[0], 'r', encoding='utf-8') as f:
+                    markdown_content = f.read()
+            else:
+                # Fallback to text files
+                txt_files = list(Path(output_dir).glob("**/txt/*.txt"))
+                if txt_files:
+                    with open(txt_files[0], 'r', encoding='utf-8') as f:
+                        markdown_content = f.read()
+                else:
+                    markdown_content = "# Conversion completed but no markdown found"
+
+            return {
+                "markdown": markdown_content,
+                "filename": filename,
+                "status": "success",
+                "pages": len(markdown_content.split('\n---\n'))  # Rough page count
+            }
+
+        except Exception as conv_error:
+            logger.error(f"Conversion error: {str(conv_error)}")
+            return {
+                "error": f"Conversion failed: {str(conv_error)}",
+                "filename": filename,
+                "status": "failed"
+            }
+        finally:
+            # Remove the temporary PDF whether conversion succeeded or failed
+            os.unlink(pdf_path)
+
+    except Exception as e:
+        logger.error(f"Handler error: {str(e)}")
+        return {
+            "error": str(e),
+            "status": "failed"
+        }
+
+# RunPod serverless entrypoint
+runpod.serverless.start({"handler": handler})
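The commit message mentions both base64 and URL input, while the handler above reads only `pdf_base64`. One way input resolution could be factored out is sketched below. It rests on assumptions: `pdf_url` is a hypothetical input key, and `resolve_pdf_bytes` and `_download` are illustrative names that do not appear in this commit. The header check relies on PDF files normally beginning with the bytes `%PDF-`.

```python
import base64
import binascii
import urllib.request


def resolve_pdf_bytes(job_input, fetch=None):
    """Return raw PDF bytes from either `pdf_base64` or `pdf_url`.

    `pdf_url` is a hypothetical key; `fetch` lets callers inject a downloader.
    """
    if job_input.get("pdf_base64"):
        try:
            # validate=True rejects characters outside the base64 alphabet
            data = base64.b64decode(job_input["pdf_base64"], validate=True)
        except (binascii.Error, ValueError) as exc:
            raise ValueError(f"pdf_base64 is not valid base64: {exc}") from exc
    elif job_input.get("pdf_url"):
        fetch = fetch or _download
        data = fetch(job_input["pdf_url"])
    else:
        raise ValueError("Provide either pdf_base64 or pdf_url")

    # Reject payloads that do not start with the usual PDF header.
    if not data.startswith(b"%PDF-"):
        raise ValueError("Input does not look like a PDF")
    return data


def _download(url, timeout=30):
    """Fetch a URL with the standard library; only http(s) is accepted."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("Only http(s) URLs are supported")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

Raising `ValueError` for every bad input keeps the handler's existing outer `except` block sufficient: the message ends up in the `error` field of the failed response.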
runpod_handler_simple.py ADDED
@@ -0,0 +1,50 @@
+import runpod
+import base64
+import fitz  # PyMuPDF
+
+def handler(job):
+    """Simple PDF to text handler for testing"""
+    try:
+        job_input = job["input"]
+
+        # Get PDF data from base64
+        pdf_base64 = job_input.get("pdf_base64")
+        filename = job_input.get("filename", "document.pdf")
+
+        if not pdf_base64:
+            return {"error": "No PDF data provided", "status": "failed"}
+
+        # Decode base64 PDF
+        pdf_data = base64.b64decode(pdf_base64)
+
+        # Extract text using PyMuPDF
+        doc = fitz.open(stream=pdf_data, filetype="pdf")
+        page_count = len(doc)  # read before close(); a closed document cannot be queried
+        text_content = ""
+
+        for page_num, page in enumerate(doc):
+            text_content += f"\n\n--- Page {page_num + 1} ---\n\n"
+            text_content += page.get_text()
+
+        doc.close()
+
+        # Convert to simple markdown
+        markdown_content = f"# {filename}\n\n"
+        markdown_content += "*Extracted using PyMuPDF (simplified version)*\n\n"
+        markdown_content += text_content
+
+        return {
+            "markdown": markdown_content,
+            "filename": filename,
+            "status": "success",
+            "pages": page_count
+        }
+
+    except Exception as e:
+        return {
+            "error": str(e),
+            "status": "failed"
+        }
+
+# RunPod serverless entrypoint
+runpod.serverless.start({"handler": handler})
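Both handlers return the same response shape: `markdown`, `filename`, `status`, and `pages` on success, or `error` and `status` on failure. A small contract check makes it easy to confirm that either handler honours it without deploying. The sketch below is illustrative only: `check_response` and `fake_handler` are hypothetical names, and `fake_handler` stands in for the real handler so the example runs without PyMuPDF or the RunPod SDK.

```python
def check_response(resp):
    """Return True if resp matches the response shape used by the handlers."""
    if not isinstance(resp, dict) or resp.get("status") not in ("success", "failed"):
        return False
    if resp["status"] == "success":
        return (isinstance(resp.get("markdown"), str)
                and isinstance(resp.get("filename"), str)
                and isinstance(resp.get("pages"), int))
    # Failed responses must carry a human-readable error message.
    return isinstance(resp.get("error"), str)


def fake_handler(job):
    """Stand-in for a real handler; mirrors only its input check and output keys."""
    if not job["input"].get("pdf_base64"):
        return {"error": "No PDF data provided", "status": "failed"}
    return {"markdown": "# doc", "filename": "document.pdf", "status": "success", "pages": 1}


print(check_response(fake_handler({"input": {"pdf_base64": "JVBERi0="}})))  # True
print(check_response(fake_handler({"input": {}})))                          # True
print(check_response({"status": "success"}))                                # False
```

Swapping `fake_handler` for the real `handler` (imported in an environment where its dependencies are installed) would exercise the same check against actual output.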
test_runpod.py ADDED
@@ -0,0 +1,100 @@
+#!/usr/bin/env python3
+"""
+Test script for RunPod serverless endpoint
+"""
+import base64
+import os
+import sys
+import time
+
+import requests
+
+def test_runpod_api(pdf_path, runpod_endpoint, runpod_api_key):
+    """Test PDF conversion on RunPod"""
+
+    # Read PDF and encode to base64
+    with open(pdf_path, 'rb') as f:
+        pdf_data = f.read()
+
+    pdf_base64 = base64.b64encode(pdf_data).decode('utf-8')
+
+    # Prepare request
+    headers = {
+        'Authorization': f'Bearer {runpod_api_key}',
+        'Content-Type': 'application/json'
+    }
+
+    payload = {
+        'input': {
+            'pdf_base64': pdf_base64,
+            'filename': os.path.basename(pdf_path)
+        }
+    }
+
+    print(f"Sending PDF to RunPod: {pdf_path}")
+    print(f"File size: {len(pdf_data)} bytes")
+
+    # Send request
+    response = requests.post(
+        f"{runpod_endpoint}/run",
+        headers=headers,
+        json=payload
+    )
+
+    if response.status_code == 200:
+        result = response.json()
+        job_id = result.get('id')
+        print(f"Job submitted: {job_id}")
+
+        # Poll for result
+        status_url = f"{runpod_endpoint}/status/{job_id}"
+
+        while True:
+            status_response = requests.get(status_url, headers=headers)
+            if status_response.status_code == 200:
+                status_data = status_response.json()
+
+                if status_data['status'] == 'COMPLETED':
+                    output = status_data.get('output', {})
+
+                    if output.get('status') == 'success':
+                        markdown = output.get('markdown', '')
+                        print("\nConversion successful!")
+                        print(f"Markdown length: {len(markdown)} characters")
+                        print(f"Pages: {output.get('pages', 'unknown')}")
+
+                        # Save result next to the input file without overwriting it
+                        output_file = os.path.splitext(pdf_path)[0] + '_runpod.md'
+                        with open(output_file, 'w', encoding='utf-8') as f:
+                            f.write(markdown)
+                        print(f"Saved to: {output_file}")
+                    else:
+                        print(f"Conversion failed: {output.get('error')}")
+                    break
+
+                elif status_data['status'] in ('FAILED', 'CANCELLED', 'TIMED_OUT'):
+                    print(f"Job ended with status {status_data['status']}: {status_data.get('error')}")
+                    break
+
+                else:
+                    print(f"Status: {status_data['status']}")
+                    time.sleep(2)
+            else:
+                print(f"Status check failed: {status_response.status_code}")
+                break
+    else:
+        print(f"Request failed: {response.status_code}")
+        print(response.text)
+
+if __name__ == "__main__":
+    # Example usage
+    if len(sys.argv) < 3:
+        print("Usage: python test_runpod.py <pdf_file> <runpod_endpoint> [api_key]")
+        print("Example: python test_runpod.py test.pdf https://api.runpod.ai/v2/your-endpoint your-api-key")
+        sys.exit(1)
+
+    pdf_file = sys.argv[1]
+    endpoint = sys.argv[2]
+    api_key = sys.argv[3] if len(sys.argv) > 3 else input("RunPod API Key: ")
+
+    test_runpod_api(pdf_file, endpoint, api_key)
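The polling loop in the test script has no upper bound, so a job that never finishes keeps the script running indefinitely. A bounded variant with gradual backoff is sketched below. `poll_until_done` is a hypothetical helper, the set of terminal status names is an assumption to verify against RunPod's API documentation, and `fetch_status` is any callable the caller supplies, for example a small wrapper around `requests.get(status_url, headers=headers).json()`.

```python
import time

# Assumed terminal job states; confirm against the RunPod API documentation.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def poll_until_done(fetch_status, timeout_s=300.0, interval_s=2.0, max_interval_s=15.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports a terminal status or the deadline passes.

    fetch_status must return a dict with a "status" key.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        data = fetch_status()
        if data.get("status") in TERMINAL:
            return data
        if clock() >= deadline:
            raise TimeoutError(f"Job still {data.get('status')} after {timeout_s}s")
        sleep(interval_s)
        # Back off gradually so long jobs do not hammer the status endpoint.
        interval_s = min(interval_s * 1.5, max_interval_s)
```

Passing `sleep` and `clock` as parameters is a design choice for testability: a unit test can feed a canned sequence of statuses and record the requested delays instead of actually sleeping.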