---
license: apache-2.0
base_model:
- mistralai/Devstral-Small-2507
---

# Devstral-Vision-Small-2507

Created by [Eric Hartford](https://erichartford.com/) at [Quixi AI](https://erichartford.com/)

## Model Description

Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507) with the vision understanding of [Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506).

This model enables vision-augmented software engineering tasks, allowing developers to:
- Analyze screenshots and UI mockups to generate code
- Debug visual rendering issues with actual screenshots
- Convert designs and wireframes directly into implementation
- Understand and modify codebases with visual context

### Model Details

- **Base Architecture**: Mistral Small 3.2 with vision encoder
- **Parameters**: 24B (language model) + vision components
- **Context Window**: 128k tokens
- **License**: Apache 2.0
- **Language Model**: Fine-tuned Devstral weights for superior coding performance
- **Vision Model**: Mistral-Small vision encoder and multimodal projector

## How It Was Created

This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:

1. Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model)
2. Replaced only the core language model weights with Devstral-Small-2507's fine-tuned weights
3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings
4. Kept Mistral's tokenizer to maintain proper image token handling

The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding.

The full merge [script](make_devstral_vision.py) is included in this repository; a simplified sketch of the approach follows.
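
For illustration, here is a minimal sketch of the transplant at the safetensors level. It assumes both checkpoints are downloaded locally, and the guarded substrings below are assumptions about the parameter names; in practice the multimodal checkpoint may prefix its language-model keys (e.g. `language_model.`), and mapping those names is the part the real script handles.

```python
from pathlib import Path
from safetensors.torch import load_file, save_file

def load_shards(model_dir: str) -> dict:
    """Load every *.safetensors shard of a local checkpoint into one dict."""
    state = {}
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        state.update(load_file(str(shard)))
    return state

# Paths are illustrative; both checkpoints are assumed to be local clones.
target = load_shards("Mistral-Small-3.2-24B-Instruct-2506")  # multimodal base
donor = load_shards("Devstral-Small-2507")                   # text-only coder

# Components to keep from the multimodal model. These substrings are
# assumptions about the key layout, not verified parameter names.
PRESERVE = ("vision_tower", "multi_modal_projector", "embed_tokens")

for name, tensor in donor.items():
    # The real script also reconciles key prefixes that differ between the
    # text-only and multimodal layouts; that mapping is elided here.
    if name in target and not any(p in name for p in PRESERVE):
        assert target[name].shape == tensor.shape, name
        target[name] = tensor

out = Path("Devstral-Vision-Small-2507")
out.mkdir(exist_ok=True)
save_file(target, str(out / "model.safetensors"))
```

Working at the state-dict level keeps the preserved components explicit and avoids instantiating either model class during the merge.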

## Intended Use

### Primary Use Cases
- **Visual Software Engineering**: Analyze UI screenshots, mockups, and designs to generate implementation code
- **Code Review with Visual Context**: Review code changes alongside their visual output
- **Debugging Visual Issues**: Debug rendering problems by analyzing screenshots
- **Design-to-Code**: Convert visual designs directly into code
- **Documentation with Visual Examples**: Generate documentation that references visual elements

### Example Applications
- Building UI components from screenshots
- Debugging CSS/styling issues with visual feedback
- Converting Figma/design mockups to code
- Analyzing and reproducing visual bugs
- Creating visual test cases

## Usage

### With OpenHands

The model is optimized for use with [OpenHands](https://github.com/All-Hands-AI/OpenHands) for agentic coding tasks:

```bash
# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --tensor-parallel-size 2

# Configure OpenHands to use the model
# Set Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
# Set Base URL: http://localhost:8000/v1
```
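
Before pointing OpenHands at the server, you can sanity-check the endpoint with the OpenAI-compatible chat API that vLLM exposes; the screenshot path below is illustrative.

```python
import base64
from openai import OpenAI  # pip install openai

# vLLM serves an OpenAI-compatible API at the base URL configured above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local screenshot as a data URL (the path is illustrative).
with open("screenshot.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cognitivecomputations/Devstral-Vision-Small-2507",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this UI and list its main components."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```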

### With Transformers

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"

# The checkpoint uses Mistral's multimodal architecture, so load it through
# the image-text-to-text auto class rather than AutoModelForCausalLM.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image
image = Image.open("screenshot.png")

# Build a chat message; the chat template inserts the image tokens the
# model expects ahead of the instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this UI screenshot and generate React code to reproduce it."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Process inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)

# Generate (do_sample=True so that temperature actually takes effect)
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    do_sample=True,
    temperature=0.7
)

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/GUij-XVX7zaoU9UjG4n19.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/wLHwLZti9Na0O-UOVh-Nh.png)


## Performance Expectations

### Coding Performance
Inherits Devstral's exceptional performance on coding tasks:
- 53.6% on SWE-Bench Verified (when used with OpenHands)
- Superior performance on multi-file editing and codebase exploration
- Excellent tool use and agentic behavior

### Vision Performance
Maintains Mistral-Small's vision capabilities:
- Strong understanding of UI elements and layouts
- Accurate interpretation of charts, diagrams, and visual documentation
- Reliable screenshot analysis for debugging

## Hardware Requirements

- **GPU Memory**: ~48GB in bfloat16, ~24GB with 4-bit quantization
- **Recommended**: 2x RTX 4090 or better for optimal performance
- **Minimum**: Single GPU with 24GB VRAM using 4-bit quantization (see the loading sketch below)
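
As a rough sketch of the quantized path, the model can be loaded in 4-bit with bitsandbytes. The exact footprint depends on context length and the vision components, so treat the 24GB figure as approximate.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# NF4 quantization with bf16 compute; requires `pip install bitsandbytes`.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "cognitivecomputations/Devstral-Vision-Small-2507",
    quantization_config=bnb_config,
    device_map="auto",
)
```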

## Limitations

- Vision capabilities are limited to what Mistral-Small-3.2 supports
- Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning)
- Large model size may be prohibitive for some deployment scenarios
- Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.)

## Ethical Considerations

This model inherits both the capabilities and limitations of its parent models. Users should:
- Review generated code for security vulnerabilities
- Verify visual interpretations are accurate
- Be aware of potential biases in code generation
- Use appropriate safety measures in production deployments

## Citation

If you use this model, please cite:

```bibtex
@misc{devstral-vision-2507,
  author = {Hartford, Eric},
  title = {Devstral-Vision-Small-2507},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}
```

## Acknowledgments

This model builds upon the excellent work by:
- [Mistral AI](https://mistral.ai/) for both Mistral-Small and Devstral
- [All Hands AI](https://www.all-hands.dev/) for their collaboration on Devstral
- The open-source community for testing and feedback

## License

Apache 2.0, the same license as the base models.

---

*Created with dolphin passion 🐬 by Cognitive Computations*