From developing LLM applications over the past couple of years, I've realized that, regardless of the hype, nothing beats testing LLMs on your own specific use cases with your own evaluation metrics. For example, I compared o3-mini, R1, and Gemini Flash Thinking (https://www.youtube.com/watch?v=iBS_FsLcSN0) and found that for certain use cases they are no better than regular non-reasoning models. I am very curious to learn what people are using reasoning models for and how they are evaluating them!
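To make "your own use cases, your own metrics" concrete, here is a minimal sketch of a comparison harness. Everything in it is an assumption for illustration: the `evaluate` and `exact_match` helpers, the stub models, and the two test cases are hypothetical, and the stubs stand in for real API calls to whichever models you want to compare.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "model" is any callable that maps a prompt to a response,
# and a "metric" scores a response against an expected answer.
Model = Callable[[str], str]
Metric = Callable[[str, str], float]


def exact_match(response: str, expected: str) -> float:
    """Return 1.0 if the normalized response equals the expected answer, else 0.0."""
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0


def evaluate(
    models: Dict[str, Model],
    cases: List[Tuple[str, str]],
    metric: Metric,
) -> Dict[str, float]:
    """Return the mean metric score per model over all (prompt, expected) cases."""
    if not cases:
        raise ValueError("at least one test case is required")
    scores: Dict[str, float] = {}
    for name, model in models.items():
        total = sum(metric(model(prompt), expected) for prompt, expected in cases)
        scores[name] = total / len(cases)
    return scores


# Stub models (assumptions): replace these with real calls to your providers.
def stub_reasoning_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"


def stub_regular_model(prompt: str) -> str:
    if "2 + 2" in prompt:
        return "4"
    if "France" in prompt:
        return "Paris"
    return "unknown"


# Test cases drawn from your own use case (these two are placeholders).
cases = [("What is 2 + 2?", "4"), ("Capital of France?", "paris")]

results = evaluate(
    {"reasoning": stub_reasoning_model, "regular": stub_regular_model},
    cases,
    exact_match,
)
print(results)
```

Swapping `exact_match` for a metric that reflects what matters in your application (for example a rubric score or a domain-specific check) is the point of the exercise; the harness itself stays the same.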
Shon Fernandez
flexicious
Recent Activity
Commented on an article 5 days ago: Let's talk about LLM evaluation