VARCO-VISION-2.0-1.7B-OCR

Introduction
VARCO-VISION-2.0-1.7B-OCR is a lightweight yet powerful OCR-specialized model derived from VARCO-VISION-2.0-1.7B, designed to deliver efficient and accurate text recognition in real-world scenarios. Unlike conventional vision-language models (VLMs) that primarily focus on transcribing visible text, this model performs both recognition and spatial localization by detecting bounding boxes around each character, enabling structured, layout-aware OCR outputs.
The model supports both Korean and English, making it well-suited for multilingual environments where mixed-script documents are common. Each recognized character is paired with its precise position in the image, formatted as `<char>{characters}</char><bbox>{x1}, {y1}, {x2}, {y2}</bbox>`, where the coordinates correspond to the top-left (`x1`, `y1`) and bottom-right (`x2`, `y2`) corners of the character's bounding box.
While VARCO-VISION-2.0-14B demonstrates strong OCR capabilities as part of its broader multimodal reasoning skills, deploying such a large model for single-task use cases can be computationally inefficient. VARCO-VISION-2.0-1.7B-OCR addresses this with a task-optimized design that retains high accuracy while significantly reducing resource requirements, making it ideal for real-time or resource-constrained applications.
🚨News🎙️
- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at link
- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at link
- 📰 2025-07-18: We updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
- 📰 2025-07-16: We released VARCO-VISION-2.0-14B at link
- 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at link
VARCO-VISION-2.0 Family
| Model Name | Base Models (Vision / Language) | HF Link |
|---|---|---|
| VARCO-VISION-2.0-14B | siglip2-so400m-patch16-384 / Qwen3-14B | link |
| VARCO-VISION-2.0-1.7B | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| VARCO-VISION-2.0-1.7B-OCR | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| GME-VARCO-VISION-Embedding | Qwen2-VL-7B-Instruct | link |
Model Architecture
VARCO-VISION-2.0 follows the architecture of LLaVA-OneVision.
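If you want to confirm this from the released checkpoint, the snippet below is one hedged way to inspect the components via the Hugging Face `transformers` config API; the attribute names follow the standard LLaVA-OneVision config layout (`vision_config`, `text_config`) rather than anything specific to this model card.

```python
from transformers import AutoConfig

# Load only the configuration (no weights) and inspect the vision / language components.
config = AutoConfig.from_pretrained("NCSOFT/VARCO-VISION-2.0-1.7B-OCR")
print(type(config).__name__)             # expected: LlavaOnevisionConfig
print(config.vision_config.model_type)   # SigLIP2 vision tower (see the table above)
print(config.text_config.model_type)     # Qwen3 language model (see the table above)
```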
Evaluation
OCR Benchmark
| Benchmark | CLOVA OCR | PaddleOCR | EasyOCR | VARCO-VISION-2.0-1.7B-OCR |
|---|---|---|---|---|
| CORD | 93.9 | 91.4 | 77.8 | 95.6 |
| ICDAR2013 | 94.4 | 92.0 | 85.0 | 95.5 |
| ICDAR2015 | 84.1 | 73.7 | 57.9 | 75.4 |
Usage
To use this model, we recommend installing `transformers` version 4.53.1 or higher.
Additionally, for best results, we recommend upscaling input images so that the longer side is at least 2,304 pixels, as shown in the example below.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_name = "NCSOFT/VARCO-VISION-2.0-1.7B-OCR"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

image = Image.open("path/to/image.jpg")

# Image upscaling for OCR performance boost: resize so the longer side is at least 2,304 pixels
w, h = image.size
target_size = 2304
if max(w, h) < target_size:
    scaling_factor = target_size / max(w, h)
    new_w = int(w * scaling_factor)
    new_h = int(h * scaling_factor)
    image = image.resize((new_w, new_h))

# The text prompt is left empty for the OCR task
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": ""},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
# Trim the prompt tokens so only the newly generated text is decoded
generate_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
print(output)
```
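Since the output interleaves characters with their boxes, the predictions can be overlaid on the input image. The sketch below continues from the `output` and `image` variables defined above and is illustrative only: the tag pattern mirrors the documented format, and treating coordinates as normalized when they fall within [0, 1] (and as pixels otherwise) is a heuristic assumption, not documented behavior.

```python
import re
from PIL import ImageDraw

# Illustrative post-processing: parse character/box pairs from `output` and draw them on `image`.
pairs = re.findall(
    r"<char>(.*?)</char><bbox>\s*([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\s*</bbox>",
    output,
)

draw = ImageDraw.Draw(image)
w, h = image.size
for char, x1, y1, x2, y2 in pairs:
    box = [float(x1), float(y1), float(x2), float(y2)]
    # Heuristic assumption: if coordinates look normalized to [0, 1], scale them to pixel space.
    if max(box) <= 1.0:
        box = [box[0] * w, box[1] * h, box[2] * w, box[3] * h]
    draw.rectangle(box, outline="red", width=2)

image.save("ocr_boxes.jpg")
```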