mistral.rs v0.5.0

Community Article · Published March 24, 2025

We are excited to announce mistral.rs v0.5.0 with many new and exciting features making LLM inference easy and fast!

Thank you to all contributors for this release. This release includes the following highlights but also countless improvements, fixes, and optimizations:

  • Support for many more models:
    • Gemma 3
    • Qwen 2.5 VL
    • Mistral Small 3.1
    • Phi 4 Multimodal (image only)
  • Native tool calling support (see the example after this list) for:
    • Llama 3.1/3.2/3.3
    • Mistral Small 3
    • Mistral Nemo
    • Hermes 2 Pro
    • Hermes 3
  • Tensor Parallelism support (NCCL)
  • FlashAttention V3 support and integration in PagedAttention
  • 30x reduction in ISQ (in-situ quantization) times on Metal
  • Revamped prefix cacher system

This release broadens model support and enables users on everything from low-end to high-end hardware to work within the same inference platform. You can build your app locally and then deploy it to the cluster!

We have also implemented many optimizations for Metal devices! The results can be found below.

Metal comparison versus llama.cpp and MLX

Comparing tokens per second (T/s) against llama.cpp and MLX v0.24.0 shows that mistral.rs v0.5.0 delivers very similar performance on Metal. You can reproduce these results here.

These tests were conducted on an M3 Max machine.

Llama 3.2 3B, 8-bit

| Platform   | Prompt T/s | Decode T/s |
|------------|------------|------------|
| mistral.rs | 1116.60    | 71.44      |
| llama.cpp  | 1532.91    | 76.87      |
| mlx        | 1422.471   | 94.61      |

Llama 3.1 8B, 8-bit

| Platform   | Prompt T/s | Decode T/s |
|------------|------------|------------|
| mistral.rs | 606.36     | 37.94      |
| llama.cpp  | 736.68     | 39.20      |
| mlx        | 670.71     | 44.216     |

Community

Always happy when you guys ship!

Nice work!

I want to ask why you think it is slower than llama.cpp and MLX; what is the bottleneck? The Metal kernels are open source in both projects (MIT licensed), so I don't suppose it is missing kernel implementations.
