mistral.rs v0.5.0
We are excited to announce mistral.rs v0.5.0, packed with new features that make LLM inference easy and fast!
Thank you to all contributors. Alongside countless smaller improvements, fixes, and optimizations, this release includes the following highlights:
- Support for many more models:
  - Gemma 3
  - Qwen 2.5 VL
  - Mistral Small 3.1
  - Phi 4 Multimodal (image only)
- Native tool calling support (see the example below) for:
  - Llama 3.1/3.2/3.3
  - Mistral Small 3
  - Mistral Nemo
  - Hermes 2 Pro
  - Hermes 3
- Tensor Parallelism support (NCCL)
- FlashAttention V3 support and integration with PagedAttention
- 30x reduction in ISQ times on Metal
- Revamped prefix cacher system
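As an illustration of how native tool calling can be exercised, here is a minimal sketch that talks to the OpenAI-compatible HTTP server. The base URL, port, model id, and the `get_weather` tool definition are placeholders for this example, not part of mistral.rs itself.

```python
from openai import OpenAI  # pip install openai

# Placeholders: point base_url at wherever your mistralrs-server instance is
# listening, and use the model id of the model you actually loaded.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

# Hypothetical tool definition, used purely for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "What's the weather like in Boston?"}],
    tools=tools,
    tool_choice="auto",
)

# With native tool calling, the model's tool invocation comes back as a
# structured tool_calls entry instead of free-form text.
message = resp.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    print(message.content)
```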
This release broadens model support and lets users on everything from low-end to high-end hardware work within the same inference platform: build your app locally, then deploy it to the cluster!
We have also implemented many optimizations for Metal devices; the results are shown below.
Metal comparison versus llama.cpp and MLX
Comparing tokens per second (T/s) against llama.cpp and MLX v0.24.0 shows that mistral.rs v0.5.0 delivers very similar performance on Metal. You can reproduce these results here; these tests were conducted on an M3 Max machine. A rough sketch for checking decode throughput on your own setup follows the tables.
Llama 3.2 3B, 8-bit

| Platform | Prompt T/s | Decode T/s |
|---|---|---|
| mistral.rs | 1116.60 | 71.44 |
| llama.cpp | 1532.91 | 76.87 |
| MLX | 1422.471 | 94.61 |
Llama 3.1 8B, 8-bit

| Platform | Prompt T/s | Decode T/s |
|---|---|---|
| mistral.rs | 606.36 | 37.94 |
| llama.cpp | 736.68 | 39.20 |
| MLX | 670.71 | 44.216 |
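For a quick sanity check of decode throughput on your own machine, the sketch below streams a completion from any OpenAI-compatible server and approximates decode T/s by counting streamed chunks. This is not the benchmark harness used for the numbers above; the base URL, port, and model id are placeholders.

```python
import time

from openai import OpenAI  # pip install openai

# Placeholders: point this at the OpenAI-compatible server you are running
# (e.g. a local mistralrs-server instance) and the model id it exposes.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")
MODEL = "your-model-id"

stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Write a short story about a lighthouse."}],
    max_tokens=256,
    stream=True,
)

first_token_time = None
chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunks += 1  # roughly one token per streamed chunk

elapsed = time.perf_counter() - first_token_time if first_token_time else 0.0
if elapsed > 0:
    print(f"~{chunks / elapsed:.1f} decode tokens/s over {chunks} chunks")
```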