---
license: apache-2.0
base_model:
- mistralai/Devstral-Small-2507
---
# Devstral-Vision-Small-2507
Created by [Eric Hartford](https://erichartford.com/) at [Quixi AI](https://erichartford.com/)
## Model Description
Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507) with the vision understanding of [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506).
This model enables vision-augmented software engineering tasks, allowing developers to:
- Analyze screenshots and UI mockups to generate code
- Debug visual rendering issues with actual screenshots
- Convert designs and wireframes directly into implementation
- Understand and modify codebases with visual context
### Model Details
- **Base Architecture**: Mistral Small 3.2 with vision encoder
- **Parameters**: 24B (language model) + vision components
- **Context Window**: 128k tokens
- **License**: Apache 2.0
- **Language Model**: Fine-tuned Devstral weights for superior coding performance
- **Vision Model**: Mistral-Small vision encoder and multimodal projector
## How It Was Created
This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:
1. Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model)
2. Replaced only the core language model weights with Devstral-Small-2507's fine-tuned weights
3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings
4. Kept Mistral's tokenizer to maintain proper image token handling
The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding.
The [script](make_devstral_vision.py) used to perform the merge is included in this repository.
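The steps above amount to a key-filtering merge over the two checkpoints' state dicts. Below is a toy sketch of that idea (not the actual script linked above); the key names and preserve-list are simplified assumptions, not the real checkpoint layout.

```python
# Toy sketch of the "transplant": start from the multimodal model's state
# dict, swap in the coder model's language weights, and preserve the vision
# encoder, multimodal projector, and token embeddings.
# Key prefixes below are illustrative assumptions only.
PRESERVE = ("vision_tower.", "multi_modal_projector.", "embed_tokens")

def transplant(multimodal_sd: dict, coder_sd: dict) -> dict:
    merged = {}
    for key, weight in multimodal_sd.items():
        if any(p in key for p in PRESERVE):
            merged[key] = weight              # keep vision + embeddings
        elif key in coder_sd:
            merged[key] = coder_sd[key]       # swap in coder LM weights
        else:
            merged[key] = weight              # anything unmatched stays
    return merged
```

In the real procedure the values are tensors loaded from safetensors shards; plain floats stand in for them here.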
## Intended Use
### Primary Use Cases
- **Visual Software Engineering**: Analyze UI screenshots, mockups, and designs to generate implementation code
- **Code Review with Visual Context**: Review code changes alongside their visual output
- **Debugging Visual Issues**: Debug rendering problems by analyzing screenshots
- **Design-to-Code**: Convert visual designs directly into code
- **Documentation with Visual Examples**: Generate documentation that references visual elements
### Example Applications
- Building UI components from screenshots
- Debugging CSS/styling issues with visual feedback
- Converting Figma/design mockups to code
- Analyzing and reproducing visual bugs
- Creating visual test cases
## Usage
### With OpenHands
The model is optimized for use with [OpenHands](https://github.com/All-Hands-AI/OpenHands) for agentic coding tasks:
```bash
# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tensor-parallel-size 2

# Configure OpenHands to use the model:
#   Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
#   Base URL:     http://localhost:8000/v1
```
### With Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image
image = Image.open("screenshot.png")

# Create a prompt
prompt = "Analyze this UI screenshot and generate React code to reproduce it."

# Process inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)

# Generate (do_sample=True is required for temperature to take effect)
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    do_sample=True,
    temperature=0.7
)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```


## Performance Expectations
### Coding Performance
Inherits Devstral's exceptional performance on coding tasks:
- 53.6% on SWE-Bench Verified (when used with OpenHands)
- Superior performance on multi-file editing and codebase exploration
- Excellent tool use and agentic behavior
### Vision Performance
Maintains Mistral-Small's vision capabilities:
- Strong understanding of UI elements and layouts
- Accurate interpretation of charts, diagrams, and visual documentation
- Reliable screenshot analysis for debugging
## Hardware Requirements
- **GPU Memory**: ~48GB for full precision, ~24GB with 4-bit quantization
- **Recommended**: 2x RTX 4090 or better for optimal performance
- **Minimum**: Single GPU with 24GB VRAM using quantization
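A back-of-envelope check on the figures above, assuming weight storage dominates (the KV cache and activations add several GB on top, which is part of why a 4-bit deployment needs roughly 24GB rather than the ~12GB the weights alone would suggest):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage only; ignores KV cache and activations."""
    return n_params * bits_per_param / 8 / 1e9

params = 24e9                             # 24B-parameter language model
bf16_gb = weight_memory_gb(params, 16)    # ~48 GB: the "full precision" figure
int4_gb = weight_memory_gb(params, 4)     # ~12 GB of weights under 4-bit
```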
## Limitations
- Vision capabilities are limited to what Mistral-Small-3.2 supports
- Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning)
- Large model size may be prohibitive for some deployment scenarios
- Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.)
## Ethical Considerations
This model inherits both the capabilities and limitations of its parent models. Users should:
- Review generated code for security vulnerabilities
- Verify visual interpretations are accurate
- Be aware of potential biases in code generation
- Use appropriate safety measures in production deployments
## Citation
If you use this model, please cite:
```bibtex
@misc{devstral-vision-2507,
author = {Hartford, Eric},
title = {Devstral-Vision-Small-2507},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}
```
## Acknowledgments
This model builds upon the excellent work by:
- [Mistral AI](https://mistral.ai/) for both Mistral-Small and Devstral
- [All Hands AI](https://www.all-hands.dev/) for their collaboration on Devstral
- The open-source community for testing and feedback
## License
Apache 2.0, the same license as the base models.
---
*Created with dolphin passion 🐬 by Cognitive Computations*