---
license: apache-2.0
base_model:
- mistralai/Devstral-Small-2507
---

# Devstral-Vision-Small-2507

Created by [Eric Hartford](https://erichartford.com/) at [Quixi AI](https://erichartford.com/)

## Model Description

Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507) with the vision understanding of [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506).

This model enables vision-augmented software engineering tasks, allowing developers to:
- Analyze screenshots and UI mockups to generate code
- Debug visual rendering issues with actual screenshots
- Convert designs and wireframes directly into implementation
- Understand and modify codebases with visual context

### Model Details

- **Base Architecture**: Mistral Small 3.2 with vision encoder
- **Parameters**: 24B (language model) + vision components
- **Context Window**: 128k tokens
- **License**: Apache 2.0
- **Language Model**: Fine-tuned Devstral weights for superior coding performance
- **Vision Model**: Mistral-Small vision encoder and multimodal projector

## How It Was Created

This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:

1. Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model)
2. Replaced only the core language model weights with Devstral-Small-2507's fine-tuned weights
3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings
4. Kept Mistral's tokenizer to maintain proper image token handling

The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding.

The full merge [script](make_devstral_vision.py) is included in this repository; a simplified sketch of the approach follows.
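
For illustration, here is a minimal sketch of the transplant at the safetensors level. It assumes both checkpoints are downloaded locally, and the guarded substrings below are assumptions about the parameter names; in practice the multimodal checkpoint may prefix its language-model keys (e.g. `language_model.`), and mapping those names is the part the real script handles.

```python
from pathlib import Path
from safetensors.torch import load_file, save_file

def load_shards(model_dir: str) -> dict:
    """Load every *.safetensors shard of a local checkpoint into one dict."""
    state = {}
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        state.update(load_file(str(shard)))
    return state

# Paths are illustrative; both checkpoints are assumed to be local clones.
target = load_shards("Mistral-Small-3.2-24B-Instruct-2506")  # multimodal base
donor = load_shards("Devstral-Small-2507")                   # text-only coder

# Components to keep from the multimodal model. These substrings are
# assumptions about the key layout, not verified parameter names.
PRESERVE = ("vision_tower", "multi_modal_projector", "embed_tokens")

for name, tensor in donor.items():
    # The real script also reconciles key prefixes that differ between the
    # text-only and multimodal layouts; that mapping is elided here.
    if name in target and not any(p in name for p in PRESERVE):
        assert target[name].shape == tensor.shape, name
        target[name] = tensor

out = Path("Devstral-Vision-Small-2507")
out.mkdir(exist_ok=True)
save_file(target, str(out / "model.safetensors"))
```

Working at the state-dict level keeps the preserved components explicit and avoids instantiating either model class during the merge.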

## Intended Use

### Primary Use Cases
- **Visual Software Engineering**: Analyze UI screenshots, mockups, and designs to generate implementation code
- **Code Review with Visual Context**: Review code changes alongside their visual output
- **Debugging Visual Issues**: Debug rendering problems by analyzing screenshots
- **Design-to-Code**: Convert visual designs directly into code
- **Documentation with Visual Examples**: Generate documentation that references visual elements

### Example Applications
- Building UI components from screenshots
- Debugging CSS/styling issues with visual feedback
- Converting Figma/design mockups to code
- Analyzing and reproducing visual bugs
- Creating visual test cases

## Usage

### With OpenHands

The model is optimized for use with [OpenHands](https://github.com/All-Hands-AI/OpenHands) for agentic coding tasks:

```bash
# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --tensor-parallel-size 2

# Configure OpenHands to use the model
# Set Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
# Set Base URL: http://localhost:8000/v1
```
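
Before pointing OpenHands at the server, you can sanity-check the endpoint with the OpenAI-compatible chat API that vLLM exposes; the screenshot path below is illustrative.

```python
import base64
from openai import OpenAI  # pip install openai

# vLLM serves an OpenAI-compatible API at the base URL configured above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local screenshot as a data URL (the path is illustrative).
with open("screenshot.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cognitivecomputations/Devstral-Vision-Small-2507",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this UI and list its main components."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```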

### With Transformers

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"

# The checkpoint uses Mistral's multimodal architecture, so load it through
# the image-text-to-text auto class rather than AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image
image = Image.open("screenshot.png")

# Build a chat message; the chat template inserts the image tokens the
# model expects ahead of the instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this UI screenshot and generate React code to reproduce it."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Process inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)

# Generate (do_sample=True so that temperature actually takes effect)
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    do_sample=True,
    temperature=0.7
)

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/GUij-XVX7zaoU9UjG4n19.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/wLHwLZti9Na0O-UOVh-Nh.png)


## Performance Expectations

### Coding Performance
Inherits Devstral's exceptional performance on coding tasks:
- 53.6% on SWE-Bench Verified (when used with OpenHands)
- Superior performance on multi-file editing and codebase exploration
- Excellent tool use and agentic behavior

### Vision Performance
Maintains Mistral-Small's vision capabilities:
- Strong understanding of UI elements and layouts
- Accurate interpretation of charts, diagrams, and visual documentation
- Reliable screenshot analysis for debugging

## Hardware Requirements

- **GPU Memory**: ~48GB in bfloat16, ~24GB with 4-bit quantization
- **Recommended**: 2x RTX 4090 or better for optimal performance
- **Minimum**: Single GPU with 24GB VRAM using 4-bit quantization (see the loading sketch below)
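
As a rough sketch of the quantized path, the model can be loaded in 4-bit with bitsandbytes. The exact footprint depends on context length and the vision components, so treat the 24GB figure as approximate.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# NF4 quantization with bf16 compute; requires `pip install bitsandbytes`.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "cognitivecomputations/Devstral-Vision-Small-2507",
    quantization_config=bnb_config,
    device_map="auto",
)
```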

## Limitations

- Vision capabilities are limited to what Mistral-Small-3.2 supports
- Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning)
- Large model size may be prohibitive for some deployment scenarios
- Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.)

## Ethical Considerations

This model inherits both the capabilities and limitations of its parent models. Users should:
- Review generated code for security vulnerabilities
- Verify visual interpretations are accurate
- Be aware of potential biases in code generation
- Use appropriate safety measures in production deployments

## Citation

If you use this model, please cite:

```bibtex
@misc{devstral-vision-2507,
  author = {Hartford, Eric},
  title = {Devstral-Vision-Small-2507},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}
```

## Acknowledgments

This model builds upon the excellent work by:
- [Mistral AI](https://mistral.ai/) for both Mistral-Small and Devstral
- [All Hands AI](https://www.all-hands.dev/) for their collaboration on Devstral
- The open-source community for testing and feedback

## License

Apache 2.0, the same license as the base models.

---

*Created with dolphin passion 🐬 by Cognitive Computations*