VARCO-VISION-2.0-1.7B-OCR

Introduction
VARCO-VISION-2.0-1.7B-OCR is a lightweight yet powerful OCR-specialized model derived from VARCO-VISION-2.0-1.7B, designed to deliver efficient and accurate text recognition in real-world scenarios. Unlike conventional vision-language models (VLMs) that primarily focus on transcribing visible text, this model performs both recognition and spatial localization by detecting bounding boxes around each character, enabling structured, layout-aware OCR outputs.
The model supports both Korean and English, making it well-suited for multilingual environments where mixed-script documents are common. Each recognized character is paired with its precise position in the image, formatted as `<char>{characters}</char><bbox>{x1}, {y1}, {x2}, {y2}</bbox>`, where the coordinates correspond to the top-left (`x1`, `y1`) and bottom-right (`x2`, `y2`) corners of the character's bounding box.
While VARCO-VISION-2.0-14B demonstrates strong OCR capabilities as part of its broader multimodal reasoning skills, deploying such a large model for single-task use cases can be computationally inefficient. VARCO-VISION-2.0-1.7B-OCR addresses this with a task-optimized design that retains high accuracy while significantly reducing resource requirements, making it ideal for real-time or resource-constrained applications.
🚨News🎙️
- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at link
- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at link
- 📰 2025-07-18: We updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
- 📰 2025-07-16: We released VARCO-VISION-2.0-14B at link
- 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at link
VARCO-VISION-2.0 Family
| Model Name | Base Models (Vision / Language) | HF Link |
|---|---|---|
| VARCO-VISION-2.0-14B | siglip2-so400m-patch16-384 / Qwen3-14B | link |
| VARCO-VISION-2.0-1.7B | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| VARCO-VISION-2.0-1.7B-OCR | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| GME-VARCO-VISION-Embedding | Qwen2-VL-7B-Instruct | link |
Model Architecture
VARCO-VISION-2.0 follows the architecture of LLaVA-OneVision.
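If you want to confirm this from the released checkpoint, the snippet below is one hedged way to inspect the components via the Hugging Face `transformers` config API; the attribute names follow the standard LLaVA-OneVision config layout (`vision_config`, `text_config`) rather than anything specific to this model card.

```python
from transformers import AutoConfig

# Load only the configuration (no weights) and inspect the vision / language components.
config = AutoConfig.from_pretrained("NCSOFT/VARCO-VISION-2.0-1.7B-OCR")
print(type(config).__name__)             # expected: LlavaOnevisionConfig
print(config.vision_config.model_type)   # SigLIP2 vision tower (see the table above)
print(config.text_config.model_type)     # Qwen3 language model (see the table above)
```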
Evaluation
OCR Benchmark
| Benchmark | CLOVA OCR | PaddleOCR | EasyOCR | VARCO-VISION-2.0-1.7B-OCR |
|---|---|---|---|---|
| CORD | 93.9 | 91.4 | 77.8 | 95.6 |
| ICDAR2013 | 94.4 | 92.0 | 85.0 | 95.5 |
| ICDAR2015 | 84.1 | 73.7 | 57.9 | 75.4 |
Usage
To use this model, we recommend installing `transformers` version 4.53.1 or higher.
Additionally, for best results, we recommend upscaling input images so that the longer side is at least 2,304 pixels, as shown in the example below.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_name = "NCSOFT/VARCO-VISION-2.0-1.7B-OCR"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

image = Image.open("path/to/image.jpg")

# Image upscaling for OCR performance boost: resize so the longer side is at least 2,304 pixels
w, h = image.size
target_size = 2304
if max(w, h) < target_size:
    scaling_factor = target_size / max(w, h)
    new_w = int(w * scaling_factor)
    new_h = int(h * scaling_factor)
    image = image.resize((new_w, new_h))

# The text prompt is left empty for the OCR task
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": ""},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)
# Trim the prompt tokens so only the newly generated text is decoded
generate_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
print(output)
```
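Since the output interleaves characters with their boxes, the predictions can be overlaid on the input image. The sketch below continues from the `output` and `image` variables defined above and is illustrative only: the tag pattern mirrors the documented format, and treating coordinates as normalized when they fall within [0, 1] (and as pixels otherwise) is a heuristic assumption, not documented behavior.

```python
import re
from PIL import ImageDraw

# Illustrative post-processing: parse character/box pairs from `output` and draw them on `image`.
pairs = re.findall(
    r"<char>(.*?)</char><bbox>\s*([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\s*</bbox>",
    output,
)

draw = ImageDraw.Draw(image)
w, h = image.size
for char, x1, y1, x2, y2 in pairs:
    box = [float(x1), float(y1), float(x2), float(y2)]
    # Heuristic assumption: if coordinates look normalized to [0, 1], scale them to pixel space.
    if max(box) <= 1.0:
        box = [box[0] * w, box[1] * h, box[2] * w, box[3] * h]
    draw.rectangle(box, outline="red", width=2)

image.save("ocr_boxes.jpg")
```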