	---
	license: apache-2.0
	base_model: Qwen/Qwen2.5-VL-7B-Instruct
	tags:
	- vision-language
	- medical
	- multimodal
	- qwen2.5-vl
	datasets:
	- UCSC-VLAA/MedVLThinker-pmc_vqa-gpt_4o_reasoning-tokenized
	- UCSC-VLAA/MedVLThinker-m23k-tokenized
	- UCSC-VLAA/MedVLThinker-pmc_vqa
	- UCSC-VLAA/MedVLThinker-Eval
	language:
	- en
	pipeline_tag: image-text-to-text
	---

	# MedVLThinker-7B-SFT_PMC

	Code: https://github.com/UCSC-VLAA/MedVLThinker

	## Model Description

	MedVLThinker-7B-SFT_PMC is a 7B-parameter medical vision-language model built on Qwen2.5-VL-7B-Instruct.
	It was trained with supervised fine-tuning (SFT) on the PMC-VQA dataset.

	## Model Details

	- Base Model: Qwen/Qwen2.5-VL-7B-Instruct
	- Model Size: 7B parameters
	- Training Method: Supervised Fine-tuning
	- Training Data: PMC-VQA dataset

	## Usage

	```python
	from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
	from qwen_vl_utils import process_vision_info
	import torch

	# Load the model. The base model is Qwen2.5-VL, so use the Qwen2.5-VL class
	# (requires a transformers release that includes Qwen2.5-VL support).
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    "UCSC-VLAA/MedVLThinker-7B-SFT_PMC",
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)
	processor = AutoProcessor.from_pretrained("UCSC-VLAA/MedVLThinker-7B-SFT_PMC")

	# Example usage
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	                "image": "path/to/medical/image.jpg",
	            },
	            {"type": "text", "text": "What can you see in this medical image?"},
	        ],
	    }
	]

	# Preparation for inference: render the chat template and collect vision inputs
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	)
	# Move inputs to the device the model was placed on by device_map="auto"
	inputs = inputs.to(model.device)

	# Inference: generate, then strip the prompt tokens from each output sequence
	generated_ids = model.generate(**inputs, max_new_tokens=128)
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text)
	```
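
The message payload and the prompt-trimming step in the example above can be factored into two small helpers. The sketch below is only illustrative: `build_messages` and `trim_prompt_tokens` are hypothetical names that are not part of this repository or of `transformers`, and the token IDs are made-up plain lists standing in for real tensors.

```python
def build_messages(image_path, question):
    """Build a single-turn chat payload with one image and one text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt tokens from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs (made up) to show what the trimming step keeps.
prompt_ids = [[101, 7, 8], [101, 9]]
full_ids = [[101, 7, 8, 42, 43], [101, 9, 55]]
trimmed = trim_prompt_tokens(prompt_ids, full_ids)
print(trimmed)  # [[42, 43], [55]]
```

Because `generate` returns the prompt followed by the new tokens, slicing each output by the length of its input leaves only the model's answer for decoding.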

	## Citation

	```bibtex
	@article{medvlthinker2025,
	  title={MedVLThinker: Simple Baselines for Multimodal Medical Reasoning},
	  author={Your Team},
	  journal={arXiv preprint},
	  year={2025}
	}
	```

	## License

	This model is released under the Apache 2.0 license.