arxiv:2501.16273

Return of the Encoder: Maximizing Parameter Efficiency for SLMs

Published on Jan 27 · Submitted by melfeki11 on Jan 28

Abstract

The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters or fewer - our systematic analysis across GPU, CPU, and NPU platforms reveals that encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput compared to decoder-only models on edge devices. These gains may be attributed to the encoder-decoder's one-time input processing and efficient separation of understanding and generation phases. We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large, scalable decoder-only teachers while preserving their architectural advantages, achieving an average improvement of up to 6 performance points across diverse tasks, with significant gains in asymmetric sequence tasks where input and output distributions can benefit from different processing approaches. Our systematic investigation further demonstrates that, when combined with modern advances such as Rotary Positional Embeddings (RoPE) and vision encoders, encoder-decoder architectures provide a more practical path toward deploying capable language models in resource-constrained environments. Our findings challenge the prevailing trend toward decoder-only scaling, showing that architectural choices become increasingly crucial as parameter budgets decrease, particularly for on-device and edge deployments where computational efficiency is paramount.

Community

Paper author and submitter

[1/8] The field's obsession with decoder-only models has led us to overlook a fundamental truth:
For models under 1B parameters, encoder-decoder architectures are demonstrably better - and this advantage might extend even to larger models.

[Figure: downstream task results at 330M parameters (tasks_330m.png)]

[2/8] The evidence is striking.
While everyone chases massive decoder-only models, the numbers tell a different story. Small encoder-decoder models achieve:

  • 47% lower first-token latency
  • 4.7x higher throughput
  • 2-4% better performance across various downstream tasks

And this holds across GPU, CPU, and NPU platforms.
[Figure: inference time across GPU, CPU, and NPU platforms (InferenceTime.png)]
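If you want to sanity-check these trends yourself, here is a minimal measurement sketch. It is not the paper's benchmark harness: the checkpoints are small public stand-ins (flan-t5-small vs. GPT-2) rather than our models, so absolute numbers will vary with hardware, but the first-token-latency and decode-throughput definitions are the same.

```python
# Minimal latency/throughput comparison sketch. Uses public stand-in checkpoints
# (flan-t5-small vs. GPT-2), NOT the paper's models; numbers are illustrative only.
import time
import torch
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

def measure(model, tokenizer, prompt, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # First-token latency: prefill the prompt and emit a single new token.
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)
    first_token_s = time.perf_counter() - start

    # Decode throughput: new tokens per second over a longer continuation.
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         min_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    prompt_len = 0 if model.config.is_encoder_decoder else inputs["input_ids"].shape[-1]
    return first_token_s, (out.shape[-1] - prompt_len) / elapsed

prompt = "Summarize the following report:\n" + "lorem ipsum dolor sit amet " * 40

candidates = [
    ("encoder-decoder (stand-in)", AutoModelForSeq2SeqLM, "google/flan-t5-small"),
    ("decoder-only (stand-in)",    AutoModelForCausalLM,  "gpt2"),
]
for name, cls, repo in candidates:
    model = cls.from_pretrained(repo).eval()
    tokenizer = AutoTokenizer.from_pretrained(repo)
    with torch.no_grad():
        latency, tps = measure(model, tokenizer, prompt)
    print(f"{name:28s} first token: {latency * 1000:7.1f} ms   decode: {tps:6.1f} tok/s")
```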

[3/8] "But decoder-only models are better at learning from large teachers, right?"
Wrong. Our distillation framework lets encoder-decoder models learn from large, scalable decoder-only teachers while keeping their efficiency advantages.
Result: up to +6 average performance points across tasks compared to decoder-only models.
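For readers wondering what "learning from a decoder-only teacher" can look like mechanically, here is a minimal sketch of one common form of cross-architecture logit distillation. It assumes the teacher and student share a tokenizer/vocabulary and that labels are unpadded token ids; the paper's actual framework may differ in its loss design and alignment details.

```python
# Sketch of cross-architecture logit distillation (a common recipe, not
# necessarily the paper's exact framework). Assumes teacher and student share a
# vocabulary and that `labels` holds real token ids (no -100 padding).
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, attention_mask, labels, T=2.0, alpha=0.5):
    # Encoder-decoder student: input goes through the encoder once; the decoder is
    # teacher-forced on the labels and returns per-position logits plus a CE loss.
    s_out = student(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

    # Decoder-only teacher: sees input and target concatenated into one sequence.
    with torch.no_grad():
        t_logits = teacher(input_ids=torch.cat([input_ids, labels], dim=-1)).logits
    # Keep only the positions whose next-token predictions correspond to the labels.
    t_logits = t_logits[:, input_ids.shape[1] - 1 : -1, :]

    # Temperature-scaled KL between teacher and student output distributions.
    kl = F.kl_div(
        F.log_softmax(s_out.logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Blend hard-label cross-entropy with the soft teacher signal.
    return alpha * kl + (1.0 - alpha) * s_out.loss
```

The point is that the teacher only has to supply per-position distributions over the target tokens, so its architecture does not need to match the student's.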

[4/8] The scaling behavior reveals something fascinating:
The performance gap between architectures widens as we scale up to 1B parameters: from 330M to 1B, encoder-decoder maintains a consistent 6-7% lead over decoder-only models.

[Figure: scaling behavior from 330M to 1B parameters (Scaling.png)]

[5/8] "Surely this only works for text?"
Think again. The architecture's advantages carry over to vision-language tasks:

  • +11.2% on VQAv2
  • +8.2% on TextVQA
  • +7.3% on ChartQA

All while maintaining the same efficiency benefits.

[Figure: vision-language task comparison (vision_model_comparison.png)]
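For context, the usual recipe for attaching a vision encoder to a text encoder-decoder is a small projector that maps frozen image-patch features into the text model's hidden size, after which the encoder treats them like ordinary tokens. The sketch below is illustrative only; module names, dimensions, and the MLP projector are assumptions, not the paper's implementation.

```python
# Illustrative bridge between a frozen vision encoder and a text encoder-decoder.
# Dimensions and the 2-layer MLP projector are assumptions, not the paper's design.
import torch
import torch.nn as nn

class VisionToEncoderBridge(nn.Module):
    def __init__(self, vision_dim: int = 768, text_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features: torch.Tensor, text_embeddings: torch.Tensor):
        # patch_features: (batch, n_patches, vision_dim) from a frozen image encoder
        # text_embeddings: (batch, n_text_tokens, text_dim) from the text embedding table
        visual_tokens = self.proj(patch_features)
        # The bidirectional encoder attends over image and text jointly in one pass;
        # the decoder then cross-attends to this fused sequence at every output step.
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Usage: pass the fused sequence to the seq2seq model via `inputs_embeds`
# (supported by Hugging Face encoder-decoder models) and generate as usual.
```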

[6/8] Take the latest reasoning models (OpenAI's o1/o3, DeepSeek-R1):
They show that reasoning workloads hinge on processing a short prompt efficiently and then generating long, complex outputs.
Why are encoder-decoders perfect for this?

  • One-time prompt processing (no repeated KV cache waste)
  • Fixed memory footprint after encoding
  • Natural separation of understanding/generation

Decoder-only models keep paying for the input throughout generation: the prompt's KV cache must be held and attended over for every new token, and all of the model's parameters run for each output step. For large inputs (e.g., processing or analyzing a big codebase), that is overhead you can't afford.
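A back-of-envelope sketch of where the decode-phase savings come from. The model size and encoder/decoder parameter split below are hypothetical illustrations (not the paper's configuration), and the "~2 FLOPs per parameter per token" rule ignores attention-over-context terms.

```python
# Back-of-envelope decode-phase compute. Numbers and the encoder/decoder split are
# illustrative assumptions, not the paper's configuration.
TOTAL_PARAMS = 330e6      # same parameter budget for both architectures
DECODER_SHARE = 1 / 3     # hypothetical share of parameters living in the decoder
OUTPUT_TOKENS = 1024      # long generated output (e.g., a chain-of-thought answer)

# Rough rule of thumb: a forward pass costs ~2 FLOPs per parameter per token
# (attention-over-context terms omitted for simplicity).
flops_decoder_only = 2 * TOTAL_PARAMS * OUTPUT_TOKENS                     # all params per token
flops_encoder_decoder = 2 * TOTAL_PARAMS * DECODER_SHARE * OUTPUT_TOKENS  # only the decoder per token

print(f"decode FLOPs, decoder-only:    {flops_decoder_only:.2e}")
print(f"decode FLOPs, encoder-decoder: {flops_encoder_decoder:.2e}")
print(f"per-token decode compute ratio: {flops_decoder_only / flops_encoder_decoder:.1f}x")
```

This captures only the parameter-count side of the argument; the one-time encoding of the input is what drives the first-token latency gap.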

[7/8] About that encoder-decoder "bottleneck" everyone worries about:
We've shown strong performance up to 1B parameters, and T5-family models have delivered impressive results at scales up to 20B parameters. The fascinating part? Nobody knows where this supposed bottleneck actually kicks in.
Perhaps this constraint is a feature, not a bug - pushing models to learn more efficient representations instead of relying on brute-force scaling.
Add residual connections between the encoder and decoder? We might push past any theoretical limitations while keeping all the efficiency benefits.
The real question: did we abandon an inherently more efficient architecture too soon?

[Final/8] The message is clear: When efficiency matters, encoder-decoder isn't just an alternative - it's the better choice.
Time to rethink our approach to building smaller, more efficient language models. The future isn't just about scaling down - it's about smarter architecture choices.
Paper: https://arxiv.org/pdf/2501.16273
Code: microsoft/encoder-decoder-slm

