|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- mistralai/Devstral-Small-2507 |
|
--- |
|
|
|
# Devstral-Vision-Small-2507 |
|
|
|
Created by [Eric Hartford](https://erichartford.com/) at [Quixi AI](https://erichartford.com/) |
|
|
|
## Model Description |
|
|
|
Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507) with the vision understanding of [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506). |
|
|
|
This model enables vision-augmented software engineering, allowing developers to:
|
- Analyze screenshots and UI mockups to generate code |
|
- Debug visual rendering issues with actual screenshots |
|
- Convert designs and wireframes directly into implementation |
|
- Understand and modify codebases with visual context |
|
|
|
### Model Details |
|
|
|
- **Base Architecture**: Mistral Small 3.2 with vision encoder |
|
- **Parameters**: 24B (language model) + vision components |
|
- **Context Window**: 128k tokens |
|
- **License**: Apache 2.0 |
|
- **Language Model**: Fine-tuned Devstral weights for superior coding performance |
|
- **Vision Model**: Mistral-Small vision encoder and multimodal projector |
|
|
|
## How It Was Created |
|
|
|
This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components: |
|
|
|
1. Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model) |
|
2. Replaced only the core language model weights with Devstral-Small-2507's fine-tuned weights |
|
3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings |
|
4. Kept Mistral's tokenizer to maintain proper image token handling |
|
|
|
The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding. |
|
|
|
The conversion [script](make_devstral_vision.py) used to produce this model is included in the repository.
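
For illustration, here is a simplified sketch of the transplant idea. It is not the script itself: the checkpoint key names (`model.layers...` in the donor, a `language_model.` prefix in the recipient) are assumptions based on common Hugging Face layouts for Mistral-style multimodal checkpoints, and actual key layouts vary across `transformers` versions.

```python
# Simplified sketch of the weight transplant (illustrative only; see the
# linked script for the real procedure). Assumes HF-format checkpoints where
# the donor uses keys like "model.layers..." and the multimodal recipient
# nests its language model under a "language_model." prefix. Loading both
# 24B models this way needs roughly 100 GB of RAM.
import torch
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText

# Recipient: the complete multimodal model (vision encoder + projector + LM).
recipient = AutoModelForImageTextToText.from_pretrained(
    "mistralai/Mistral-Small-3.2-24B-Instruct-2506", torch_dtype=torch.bfloat16
)
# Donor: the text-only coding model whose language weights we want.
donor = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507", torch_dtype=torch.bfloat16
)

target_sd = recipient.state_dict()
for name, tensor in donor.state_dict().items():
    if "embed_tokens" in name:
        continue  # keep Mistral's token embeddings (step 3 above)
    key = f"language_model.{name}"  # map text-only keys into the multimodal namespace
    if key in target_sd and target_sd[key].shape == tensor.shape:
        target_sd[key] = tensor  # swap in the Devstral weight; vision weights stay untouched

recipient.load_state_dict(target_sd)
recipient.save_pretrained("Devstral-Vision-Small-2507")
```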
|
|
|
## Intended Use |
|
|
|
### Primary Use Cases |
|
- **Visual Software Engineering**: Analyze UI screenshots, mockups, and designs to generate implementation code |
|
- **Code Review with Visual Context**: Review code changes alongside their visual output |
|
- **Debugging Visual Issues**: Debug rendering problems by analyzing screenshots |
|
- **Design-to-Code**: Convert visual designs directly into code |
|
- **Documentation with Visual Examples**: Generate documentation that references visual elements |
|
|
|
### Example Applications |
|
- Building UI components from screenshots |
|
- Debugging CSS/styling issues with visual feedback |
|
- Converting Figma/design mockups to code |
|
- Analyzing and reproducing visual bugs |
|
- Creating visual test cases |
|
|
|
## Usage |
|
|
|
### With OpenHands |
|
|
|
The model is optimized for use with [OpenHands](https://github.com/All-Hands-AI/OpenHands) for agentic coding tasks: |
|
|
|
```bash
# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tensor-parallel-size 2

# Configure OpenHands to use the model:
#   Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
#   Base URL:     http://localhost:8000/v1
```
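
Once the server is running, any OpenAI-compatible client can send the model screenshots alongside text. The sketch below uses the official `openai` Python package against the local endpoint configured above; `screenshot.png` is a placeholder path.

```python
# Query the vLLM OpenAI-compatible endpoint with an image (illustrative sketch).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local screenshot as a data URL so it can travel in the request body.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cognitivecomputations/Devstral-Vision-Small-2507",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate React code that reproduces this UI."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=2000,
)
print(response.choices[0].message.content)
```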
|
|
|
### With Transformers |
|
|
|
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"

# Multimodal checkpoints load through the image-text-to-text auto class,
# not AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image
image = Image.open("screenshot.png")

# Build a chat message; the chat template inserts the image tokens for us
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {
                "type": "text",
                "text": "Analyze this UI screenshot and generate React code to reproduce it.",
            },
        ],
    }
]

# Process inputs
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate (temperature only takes effect with sampling enabled)
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    do_sample=True,
    temperature=0.7,
)

# Decode only the newly generated tokens, not the prompt
response = processor.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
```
|
|
|
 |
|
|
|
 |
|
|
|
|
|
## Performance Expectations |
|
|
|
### Coding Performance |
|
Because the language model weights come unchanged from Devstral, coding performance should match it:

- 53.6% on SWE-Bench Verified (Devstral-Small-2507 with OpenHands scaffolding)
|
- Superior performance on multi-file editing and codebase exploration |
|
- Excellent tool use and agentic behavior |
|
|
|
### Vision Performance |
|
Maintains Mistral-Small's vision capabilities: |
|
- Strong understanding of UI elements and layouts |
|
- Accurate interpretation of charts, diagrams, and visual documentation |
|
- Reliable screenshot analysis for debugging |
|
|
|
## Hardware Requirements |
|
|
|
- **GPU Memory**: ~48GB in bfloat16, ~24GB with 4-bit quantization
|
- **Recommended**: 2x RTX 4090 or better for optimal performance |
|
- **Minimum**: Single GPU with 24GB VRAM using quantization |
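
For the quantized path, one option is to load the model in 4-bit with bitsandbytes, as in the sketch below. This is untested for this model: the `BitsAndBytesConfig` values are common defaults rather than validated recommendations, and real memory use grows with context length and batch size.

```python
# Illustrative 4-bit load with bitsandbytes (common defaults, not validated
# for this model; requires the bitsandbytes package and a CUDA GPU).
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```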
|
|
|
## Limitations |
|
|
|
- Vision capabilities are limited to what Mistral-Small-3.2 supports |
|
- Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning) |
|
- Large model size may be prohibitive for some deployment scenarios |
|
- Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.) |
|
|
|
## Ethical Considerations |
|
|
|
This model inherits both the capabilities and limitations of its parent models. Users should: |
|
- Review generated code for security vulnerabilities |
|
- Verify visual interpretations are accurate |
|
- Be aware of potential biases in code generation |
|
- Use appropriate safety measures in production deployments |
|
|
|
## Citation |
|
|
|
If you use this model, please cite: |
|
|
|
```bibtex |
|
@misc{devstral-vision-2507, |
|
author = {Hartford, Eric}, |
|
title = {Devstral-Vision-Small-2507}, |
|
year = {2025}, |
|
publisher = {HuggingFace}, |
|
url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507} |
|
} |
|
``` |
|
|
|
## Acknowledgments |
|
|
|
This model builds upon the excellent work by: |
|
- [Mistral AI](https://mistral.ai/) for both Mistral-Small and Devstral |
|
- [All Hands AI](https://www.all-hands.dev/) for their collaboration on Devstral |
|
- The open-source community for testing and feedback |
|
|
|
## License |
|
|
|
Apache 2.0, the same as the base models.
|
|
|
--- |
|
|
|
*Created with dolphin passion 🐬 by Cognitive Computations* |