arxiv:2504.08120

DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

Published on Apr 10 · Submitted by Rexhaif on Apr 15

Abstract

Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3-mini) with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories, including state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on the WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model- and task-dependent: while OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms compared to its non-reasoning variant, with the exception of certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning token usage positively correlates with evaluation quality in o3-mini models. Furthermore, our results show that distillation of reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.
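
For readers unfamiliar with how LLM-based NLG evaluation is meta-evaluated, the sketch below shows the general shape of the setup: prompt a model for a direct-assessment score and correlate its scores with human judgments at the segment level. This is a minimal illustration, not the authors' exact protocol; the prompt wording, the `chat` callable, and the helper names are assumptions.

```python
# Hypothetical sketch: using an LLM as a direct-assessment judge for MT
# and meta-evaluating it against human judgments with segment-level
# Kendall's tau, in the spirit of WMT-style evaluation.
import re
from scipy.stats import kendalltau

PROMPT = (
    "Score the following translation from {src_lang} to {tgt_lang} "
    "on a scale from 0 (worst) to 100 (best).\n"
    "Source: {source}\nTranslation: {translation}\n"
    "Respond with a single number."
)

def parse_score(reply: str) -> float:
    """Pull the first number out of the model's reply; default to 0."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0

def llm_score(chat, source, translation,
              src_lang="English", tgt_lang="German") -> float:
    """`chat` is any callable str -> str wrapping your LLM API (assumption)."""
    reply = chat(PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang,
                               source=source, translation=translation))
    return parse_score(reply)

def segment_level_tau(chat, segments, human_scores):
    """Kendall's tau between LLM scores and human scores over (src, hyp) pairs."""
    scores = [llm_score(chat, src, hyp) for src, hyp in segments]
    tau, _ = kendalltau(scores, human_scores)
    return tau
```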

Community

Our paper started with a simple question: would LLMs that are specifically trained to reason actually be better at evaluating translations and summaries? The results are quite interesting! While OpenAI's o3-mini models improved with more reasoning, DeepSeek-R1 actually performed worse than its non-reasoning version in most tests. It's like having a math genius who struggles with essay grading! We also found that reasoning abilities don't compress well in the DeepSeek family - distillation maintained quality in 32B-parameter models but fell apart in 8B models. The best part of this research? Our team retreat in the snowy Tyrolean Alps where we started the paper - the mountain air definitely helped clear our thinking!
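
On the reasoning-token finding: a simple way to picture the correlation analysis is to relate per-item reasoning-token counts to how closely the model's scores track human judgments. The sketch below is hypothetical (including the made-up numbers), assumes Spearman's rho as the correlation measure, and is not the paper's actual analysis code.

```python
# Hypothetical sketch: does spending more reasoning tokens correlate with
# better evaluation quality? One simple check is Spearman's rho between
# per-item reasoning-token counts and per-item agreement with humans
# (here: negated absolute deviation from the human score).
from scipy.stats import spearmanr

def reasoning_vs_quality(reasoning_tokens, llm_scores, human_scores):
    # Higher values of `quality` mean closer agreement with the human score.
    quality = [-abs(l - h) for l, h in zip(llm_scores, human_scores)]
    rho, p = spearmanr(reasoning_tokens, quality)
    return rho, p

# Example with made-up numbers:
rho, p = reasoning_vs_quality(
    reasoning_tokens=[120, 480, 300, 950, 60],
    llm_scores=[70, 85, 60, 90, 40],
    human_scores=[65, 88, 72, 92, 55],
)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
```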
