|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- mistralai/Devstral-Small-2507 |
|
--- |
|
|
|
# Devstral-Vision-Small-2507 |
|
|
|
Created by [Eric Hartford](https://erichartford.com/) at [Quixi AI](https://erichartford.com/) |
|
|
|
## Model Description |
|
|
|
Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507) with the vision understanding of [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506). |
|
|
|
This model enables vision-augmented software engineering, allowing developers to:
|
- Analyze screenshots and UI mockups to generate code |
|
- Debug visual rendering issues with actual screenshots |
|
- Convert designs and wireframes directly into implementation |
|
- Understand and modify codebases with visual context |
|
|
|
### Model Details |
|
|
|
- **Base Architecture**: Mistral Small 3.2 with vision encoder |
|
- **Parameters**: 24B (language model) + vision components |
|
- **Context Window**: 128k tokens |
|
- **License**: Apache 2.0 |
|
- **Language Model**: Fine-tuned Devstral weights for superior coding performance |
|
- **Vision Model**: Mistral-Small vision encoder and multimodal projector |
|
|
|
## How It Was Created |
|
|
|
This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components: |
|
|
|
1. Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model) |
|
2. Replaced only the core language model weights with Devstral-Small-2507's fine-tuned weights |
|
3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings |
|
4. Kept Mistral's tokenizer to maintain proper image token handling |
|
|
|
The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding. |
|
|
|
The conversion [script](make_devstral_vision.py) used to produce this model is included in the repository.
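
For illustration, here is a simplified sketch of the transplant idea. It is not the script itself: the checkpoint key names (`model.layers...` in the donor, a `language_model.` prefix in the recipient) are assumptions based on common Hugging Face layouts for Mistral-style multimodal checkpoints, and actual key layouts vary across `transformers` versions.

```python
# Simplified sketch of the weight transplant (illustrative only; see the
# linked script for the real procedure). Assumes HF-format checkpoints where
# the donor uses keys like "model.layers..." and the multimodal recipient
# nests its language model under a "language_model." prefix. Loading both
# 24B models this way needs roughly 100 GB of RAM.
import torch
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText

# Recipient: the complete multimodal model (vision encoder + projector + LM).
recipient = AutoModelForImageTextToText.from_pretrained(
    "mistralai/Mistral-Small-3.2-24B-Instruct-2506", torch_dtype=torch.bfloat16
)
# Donor: the text-only coding model whose language weights we want.
donor = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507", torch_dtype=torch.bfloat16
)

target_sd = recipient.state_dict()
for name, tensor in donor.state_dict().items():
    if "embed_tokens" in name:
        continue  # keep Mistral's token embeddings (step 3 above)
    key = f"language_model.{name}"  # map text-only keys into the multimodal namespace
    if key in target_sd and target_sd[key].shape == tensor.shape:
        target_sd[key] = tensor  # swap in the Devstral weight; vision weights stay untouched

recipient.load_state_dict(target_sd)
recipient.save_pretrained("Devstral-Vision-Small-2507")
```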
|
|
|
## Intended Use |
|
|
|
### Primary Use Cases |
|
- **Visual Software Engineering**: Analyze UI screenshots, mockups, and designs to generate implementation code |
|
- **Code Review with Visual Context**: Review code changes alongside their visual output |
|
- **Debugging Visual Issues**: Debug rendering problems by analyzing screenshots |
|
- **Design-to-Code**: Convert visual designs directly into code |
|
- **Documentation with Visual Examples**: Generate documentation that references visual elements |
|
|
|
### Example Applications |
|
- Building UI components from screenshots |
|
- Debugging CSS/styling issues with visual feedback |
|
- Converting Figma/design mockups to code |
|
- Analyzing and reproducing visual bugs |
|
- Creating visual test cases |
|
|
|
## Usage |
|
|
|
### With OpenHands |
|
|
|
The model is optimized for use with [OpenHands](https://github.com/All-Hands-AI/OpenHands) for agentic coding tasks: |
|
|
|
```bash
# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tensor-parallel-size 2

# Configure OpenHands to use the model:
#   Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
#   Base URL:     http://localhost:8000/v1
```
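
Once the server is running, any OpenAI-compatible client can send the model screenshots alongside text. The sketch below uses the official `openai` Python package against the local endpoint configured above; `screenshot.png` is a placeholder path.

```python
# Query the vLLM OpenAI-compatible endpoint with an image (illustrative sketch).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local screenshot as a data URL so it can travel in the request body.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cognitivecomputations/Devstral-Vision-Small-2507",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate React code that reproduces this UI."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=2000,
)
print(response.choices[0].message.content)
```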
|
|
|
### With Transformers |
|
|
|
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"

# Multimodal checkpoints load through the image-text-to-text auto class,
# not AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image
image = Image.open("screenshot.png")

# Build a chat message; the chat template inserts the image tokens for us
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {
                "type": "text",
                "text": "Analyze this UI screenshot and generate React code to reproduce it.",
            },
        ],
    }
]

# Process inputs
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate (temperature only takes effect with sampling enabled)
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    do_sample=True,
    temperature=0.7,
)

# Decode only the newly generated tokens, not the prompt
response = processor.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
```
|
|
|
 |
|
|
|
 |
|
|
|
|
|
## Performance Expectations |
|
|
|
### Coding Performance |
|
Because the language model weights come unchanged from Devstral, coding performance should match it:

- 53.6% on SWE-Bench Verified (Devstral-Small-2507 with OpenHands scaffolding)
|
- Superior performance on multi-file editing and codebase exploration |
|
- Excellent tool use and agentic behavior |
|
|
|
### Vision Performance |
|
Maintains Mistral-Small's vision capabilities: |
|
- Strong understanding of UI elements and layouts |
|
- Accurate interpretation of charts, diagrams, and visual documentation |
|
- Reliable screenshot analysis for debugging |
|
|
|
## Hardware Requirements |
|
|
|
- **GPU Memory**: ~48GB in bfloat16, ~24GB with 4-bit quantization
|
- **Recommended**: 2x RTX 4090 or better for optimal performance |
|
- **Minimum**: Single GPU with 24GB VRAM using quantization |
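
For the quantized path, one option is to load the model in 4-bit with bitsandbytes, as in the sketch below. This is untested for this model: the `BitsAndBytesConfig` values are common defaults rather than validated recommendations, and real memory use grows with context length and batch size.

```python
# Illustrative 4-bit load with bitsandbytes (common defaults, not validated
# for this model; requires the bitsandbytes package and a CUDA GPU).
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```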
|
|
|
## Limitations |
|
|
|
- Vision capabilities are limited to what Mistral-Small-3.2 supports |
|
- Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning) |
|
- Large model size may be prohibitive for some deployment scenarios |
|
- Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.) |
|
|
|
## Ethical Considerations |
|
|
|
This model inherits both the capabilities and limitations of its parent models. Users should: |
|
- Review generated code for security vulnerabilities |
|
- Verify visual interpretations are accurate |
|
- Be aware of potential biases in code generation |
|
- Use appropriate safety measures in production deployments |
|
|
|
## Citation |
|
|
|
If you use this model, please cite: |
|
|
|
```bibtex |
|
@misc{devstral-vision-2507, |
|
author = {Hartford, Eric}, |
|
title = {Devstral-Vision-Small-2507}, |
|
year = {2025}, |
|
publisher = {HuggingFace}, |
|
url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507} |
|
} |
|
``` |
|
|
|
## Acknowledgments |
|
|
|
This model builds upon the excellent work by: |
|
- [Mistral AI](https://mistral.ai/) for both Mistral-Small and Devstral |
|
- [All Hands AI](https://www.all-hands.dev/) for their collaboration on Devstral |
|
- The open-source community for testing and feedback |
|
|
|
## License |
|
|
|
Apache 2.0, the same as the base models.
|
|
|
--- |
|
|
|
*Created with dolphin passion 🐬 by Cognitive Computations* |