mistral.rs v0.5.0
We are excited to announce mistral.rs v0.5.0, packed with new features that make LLM inference easy and fast!
Thank you to all contributors. Alongside countless smaller improvements, fixes, and optimizations, this release includes the following highlights:
- Support for many more models:
  - Gemma 3
  - Qwen 2.5 VL
  - Mistral Small 3.1
  - Phi 4 Multimodal (image only)
- Native tool calling support (see the example below) for:
  - Llama 3.1/3.2/3.3
  - Mistral Small 3
  - Mistral Nemo
  - Hermes 2 Pro
  - Hermes 3
- Tensor Parallelism support (NCCL)
- FlashAttention V3 support and integration with PagedAttention
- 30x reduction in ISQ times on Metal
- Revamped prefix cacher system
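As an illustration of how native tool calling can be exercised, here is a minimal sketch that talks to the OpenAI-compatible HTTP server. The base URL, port, model id, and the `get_weather` tool definition are placeholders for this example, not part of mistral.rs itself.

```python
from openai import OpenAI  # pip install openai

# Placeholders: point base_url at wherever your mistralrs-server instance is
# listening, and use the model id of the model you actually loaded.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

# Hypothetical tool definition, used purely for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "What's the weather like in Boston?"}],
    tools=tools,
    tool_choice="auto",
)

# With native tool calling, the model's tool invocation comes back as a
# structured tool_calls entry instead of free-form text.
message = resp.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    print(message.content)
```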
This release broadens model support and lets users on everything from low-end to high-end hardware work within the same inference platform: build your app locally, then deploy it to the cluster!
We have also implemented many optimizations for Metal devices; the results are shown below.
Metal comparison versus llama.cpp and MLX
Comparing tokens per second (T/s) against llama.cpp and MLX v0.24.0 shows that mistral.rs v0.5.0 delivers very similar performance on Metal. You can reproduce these results here; these tests were conducted on an M3 Max machine. A rough sketch for checking decode throughput on your own setup follows the tables.
Llama 3.2 3B, 8-bit

| Platform | Prompt T/s | Decode T/s |
|---|---|---|
| mistral.rs | 1116.60 | 71.44 |
| llama.cpp | 1532.91 | 76.87 |
| MLX | 1422.471 | 94.61 |
Llama 3.1 8B, 8-bit

| Platform | Prompt T/s | Decode T/s |
|---|---|---|
| mistral.rs | 606.36 | 37.94 |
| llama.cpp | 736.68 | 39.20 |
| MLX | 670.71 | 44.216 |
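For a quick sanity check of decode throughput on your own machine, the sketch below streams a completion from any OpenAI-compatible server and approximates decode T/s by counting streamed chunks. This is not the benchmark harness used for the numbers above; the base URL, port, and model id are placeholders.

```python
import time

from openai import OpenAI  # pip install openai

# Placeholders: point this at the OpenAI-compatible server you are running
# (e.g. a local mistralrs-server instance) and the model id it exposes.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")
MODEL = "your-model-id"

stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Write a short story about a lighthouse."}],
    max_tokens=256,
    stream=True,
)

first_token_time = None
chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunks += 1  # roughly one token per streamed chunk

elapsed = time.perf_counter() - first_token_time if first_token_time else 0.0
if elapsed > 0:
    print(f"~{chunks / elapsed:.1f} decode tokens/s over {chunks} chunks")
```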