Gemini 2.5 Pro, thinking by default! We're excited to launch our best Gemini model yet for reasoning, multimodality, and coding! #1 on LMArena, and state-of-the-art on Humanity's Last Exam, AIME, GPQA, and more!
TL;DR:
- 💻 Best Gemini coding model yet, particularly for web development (excels on LiveCodeBench)
- 🧠 "Thinking" on by default, with up to 64k output tokens
- 🌌 1 million-token multimodal input context covering text, image, video, audio, and PDF
- 🛠️ Function calling, structured output, Google Search & code execution (minimal API sketch below)
- 🏆 #1 on LMArena & SOTA on AIME, GPQA, and Humanity's Last Exam
- 💡 Knowledge cutoff of January 2025
- 🤗 Available for free as Experimental in AI Studio, the Gemini API & the Gemini app
- 🚀 Rate limits: free tier, 2 RPM / 50 requests per day
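If you want to try it from the Gemini API, here's a minimal sketch using the google-genai Python SDK; the experimental model id below is an assumption, so check AI Studio for the current one:

```python
# pip install google-genai
# Minimal call against the experimental model; the model id is an assumption
# and may have changed by the time you read this.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # free-tier key from AI Studio

response = client.models.generate_content(
    model="gemini-2.5-pro-exp-03-25",  # assumed experimental id
    contents="Build a single-file HTML/JS snake game.",
)
print(response.text)  # thinking happens server-side; you get the final answer
```

Function calling, structured output, and the built-in tools are exposed through the same `generate_content` call via its `config` parameter.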
The Gemma 3 family is out! I was reading the tech report, and this section was really interesting to me from a methods / scientific-fairness point of view.
Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**. (Which everybody does, but people usually don't say so.)
For a tech report, it makes a lot of sense to report model performance when the model is used optimally! On leaderboards, on the other hand, the comparison is apples to apples, but potentially suboptimal for a given model family (much like some users interact sub-optimally with models).
It also contains a cool section (6) on training-data memorization rates! It's important to see whether your model will output the training data it has seen verbatim: always an issue for privacy/copyright/... but also very much for evaluation!
Because if your model knows its evals by heart, you're not testing for generalization.
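As a toy illustration of what such a verbatim-memorization check looks like (this is not the Gemma team's actual protocol, just the general idea): prompt the model with a prefix from a training document and test whether its greedy continuation reproduces the true suffix. A minimal sketch with 🤗 transformers, using a placeholder model id:

```python
# Toy verbatim-memorization check: prompt with a training-text prefix and
# compare the greedy continuation against the true suffix.
# Model id and prefix/suffix lengths are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-pt"  # placeholder; any causal LM works here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def is_memorized(text: str, prefix_tokens: int = 50, suffix_tokens: int = 50) -> bool:
    # `text` must be long enough to contain prefix_tokens + suffix_tokens tokens.
    ids = tok(text, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_tokens]
    suffix = ids[prefix_tokens:prefix_tokens + suffix_tokens]
    out = model.generate(
        prefix.unsqueeze(0),
        max_new_tokens=suffix_tokens,
        do_sample=False,  # greedy decoding
    )
    continuation = out[0, prefix_tokens:prefix_tokens + suffix_tokens]
    return bool((continuation == suffix).all())
```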
🔥 Agents can do anything! @microsoft Research just announced the release of Magma 8B!
Magma is a new vision-language model (VLM) with 8B parameters for multi-modal agents, designed to handle complex interactions across virtual and real environments, and it's MIT licensed!
Magma comes with exciting new features such as:
- Introduces the Set-of-Mark and Trace-of-Mark techniques for fine-tuning
- Leverages a large amount of unlabeled video data to learn spatial-temporal grounding and planning
- Strong generalization and the ability to be fine-tuned for other agentic tasks
- SOTA on multi-modal benchmarks spanning UI navigation, robotics manipulation, image/video understanding, and spatial understanding and reasoning
- Generates goal-driven visual plans and actions for agentic use cases
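If you want to poke at it locally, here's a rough loading sketch with 🤗 transformers; the repo ships custom code on the Hub, so `trust_remote_code=True` is needed, and the prompt format and processor interface shown here are assumptions, so check the model card for the real usage:

```python
# Rough sketch for loading Magma-8B with transformers.
# The prompt template and generation arguments below are assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Magma-8B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("screenshot.png")  # e.g. a UI screenshot for a navigation task
inputs = processor(
    images=image,
    text="<image>\nWhat should I click to log in?",  # assumed prompt format
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```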
The most difficult part was getting the model running in the first place, but the next steps are simple:
✂️ Implement sentence splitting, allowing for streamed responses
🌍 Multilingual support (only phonemization left)
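For the sentence-splitting part, the rough idea is to cut the incoming text at sentence boundaries so each sentence can be synthesized and streamed as soon as it's ready; here's a naive regex-based sketch (not the project's actual implementation, and `synthesize` is a placeholder for whatever TTS call is used):

```python
# Naive sentence splitter so audio can be generated and streamed per sentence.
# A real implementation would handle abbreviations, decimals, quotes, etc.
import re
from typing import Callable, Iterator

_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")

def split_sentences(text: str) -> list[str]:
    return [s for s in _SENTENCE_END.split(text.strip()) if s]

def stream_tts(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield audio chunks sentence by sentence instead of waiting for the full text."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)
```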