Magistral-Small-2506-Vision
Inspired by https://huggingface.co/ngxson/Devstral-Small-Vision-2505-GGUF, a similar vision experiment for Devstral, this is an experimental checkpoint of Magistral-Small-2506 with vision support.
Magistral Small is a GRPO-trained reasoning fine-tune of Mistral Small 3.1, which is a vision-capable LLM.
In its technical report, Mistral states that Magistral was fine-tuned on text-only data. Nevertheless, the authors report results on the MMMU, MMMU-Pro, and MathVista multimodal benchmarks that show modest improvements over the base model despite the text-only training, suggesting that Magistral's reasoning capabilities generalized to multimodal inputs.
Mistral removed the vision encoder from Magistral's official release, possibly because of the performance gap between text-only and multimodal inputs.
In this model, I grafted Mistral Small 3.1's vision encoder onto Magistral Small. No further training was done, so the text-only performance of this model should match Mistral's official release.
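For illustration, here is a rough, hypothetical sketch of what such a graft can look like: take the multimodal Mistral Small 3.1 checkpoint and overwrite its language-model weights with Magistral's, leaving the vision encoder and projector untouched. The model IDs, auto classes, and the `language_model.` key prefix are assumptions about the checkpoint layout, not the exact procedure used for this model (the actual notebook is linked below).

```python
# Hypothetical sketch only: copy Magistral's decoder weights into the
# multimodal Mistral Small 3.1 checkpoint, keeping its vision encoder and
# multimodal projector as-is. Model IDs and the "language_model." key prefix
# are assumptions about the checkpoint layout.
import torch
from transformers import AutoModelForImageTextToText, AutoModelForCausalLM

donor = AutoModelForImageTextToText.from_pretrained(
    "mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16
)
reasoner = AutoModelForCausalLM.from_pretrained(
    "mistralai/Magistral-Small-2506", torch_dtype=torch.bfloat16
)

donor_state = donor.state_dict()
with torch.no_grad():
    for name, tensor in reasoner.state_dict().items():
        target = f"language_model.{name}"  # assumed prefix for the decoder weights
        if target in donor_state:
            donor_state[target].copy_(tensor)

donor.save_pretrained("Magistral-Small-2506-Vision")
```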
The model was tested with vLLM and should work with any toolkit supporting Mistral Small 3.1. The Transformers implementation of Mistral 3 does not work well.
Make sure to use the system prompt provided in the SYSTEM_PROMPT.txt file (taken from Mistral's docs) and the recommended sampling parameters: temperature=0.7, top_p=0.95.
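As a usage sketch, this is what a request against a vLLM OpenAI-compatible server might look like with those settings. The server URL, question, and image URL are placeholders; SYSTEM_PROMPT.txt is the file mentioned above.

```python
# Usage sketch, assuming the model is served via vLLM's OpenAI-compatible API,
# e.g. `vllm serve OptimusePrime/Magistral-Small-2506-Vision`.
# The server URL, question, and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("SYSTEM_PROMPT.txt") as f:
    system_prompt = f.read()

response = client.chat.completions.create(
    model="OptimusePrime/Magistral-Small-2506-Vision",
    temperature=0.7,
    top_p=0.95,
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image and reason about what it shows."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```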
The code used to create this model can be found here: https://colab.research.google.com/drive/1UuMo4VSgVoD4GfLrFgHUJvCv0cdALR7m?usp=sharing. It requires ~150 GB of RAM (no VRAM is needed) since it loads three 24B models in BF16. 4-bit bitsandbytes quantization could be used to cut the memory requirement to roughly a quarter.
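A minimal sketch of such 4-bit loading, assuming the standard bitsandbytes integration in Transformers (note that bitsandbytes quantization generally requires a CUDA GPU; the model ID below is illustrative):

```python
# Minimal sketch: load a 24B checkpoint in 4-bit via bitsandbytes to reduce
# the memory footprint to roughly a quarter of BF16.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Magistral-Small-2506",
    quantization_config=bnb_config,
    device_map="auto",
)
```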
There may still be configuration errors in this model that reduce performance. Let me know if you encounter any weird behavior!
Base model: mistralai/Mistral-Small-3.1-24B-Base-2503