gpt-oss-120b: Exceptional Reasoning, Not Yet AGI Scale
I'm excited to share that gpt-oss-120b now leads my French LLM reasoning leaderboard (https://huggingface.co/spaces/Deepmama/LLM-FR_Leaderboard).
This model delivers outstanding performance on benchmarks designed to evaluate reasoning, logical inference, and critical thinking in French. On these tasks, it clearly outperforms larger or better-known models (Qwen3, DeepSeek-R1, ...).
That said, it is weaker on semantic puzzle datasets, which depend more on model size and memorization than on deep reasoning.
👉 Conclusion: superb reasoning abilities, but still below AGI-scale capabilities, especially on tasks that demand broad semantic compression or massive world knowledge. It is also weak at instruction following: its results on IFEval translated into French are disappointing.