Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models
Abstract
Test-time training (TTT) improves performance by letting foundation models specialize on the test task, reducing in-distribution test error by focusing model capacity on the concepts relevant to that task.
Recent empirical studies have explored the idea of continuing to train a model at test time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly attributed its benefits to out-of-distribution adaptation or to the use of privileged data. However, the growing scale of foundation models, with most test data effectively in-distribution, calls these explanations into question. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization: focusing capacity on the concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model's key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.
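For concreteness, the basic TTT procedure can be sketched as follows. This is a minimal illustration assuming a pretrained PyTorch classifier and a small labeled batch from the test task; the names `pretrained_model`, `task_inputs`, `task_targets`, and `test_input` are placeholders, and the paper's exact TTT objective and data selection may differ.

```python
# Minimal test-time training (TTT) sketch: continue training a pretrained model
# on data from the test task for a few gradient steps before predicting.
# `pretrained_model`, `task_inputs`, `task_targets`, and `test_input` are
# placeholders, not names from the paper.
import copy
import torch
import torch.nn.functional as F

def test_time_train(pretrained_model, task_inputs, task_targets,
                    steps=10, lr=1e-4):
    # Specialize a copy so the original (general) model stays untouched.
    model = copy.deepcopy(pretrained_model)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(task_inputs), task_targets)
        loss.backward()
        optimizer.step()
    model.eval()
    return model

# Usage: specialize on the task data, then predict on the test point.
# specialized = test_time_train(pretrained_model, task_inputs, task_targets)
# with torch.no_grad():
#     prediction = specialized(test_input).argmax(dim=-1)
```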
Community
Our motivation for this work: While the last decade was about building larger and more general zero-shot foundation models, this decade will be about "specializing" models through continued training.
In this work, we develop a theoretical understanding of why TTT can work in foundation models, even when test data is in-distribution and similar data has been seen during pre-training. We posit that today's foundation models are underparameterized and, as a result, cannot simultaneously ("globally") approximate the ground truth across the full data distribution. TTT offers a mechanism to "specialize" the model to a local region around the test point, allowing it to temporarily "forget" irrelevant pre-trained knowledge and "free up" capacity to better learn the concepts relevant to the immediate task. We call this "specialization after generalization."
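One simple way to probe this effect (not the paper's experimental protocol) is to compare loss on a local neighborhood of the test point against loss on a random global sample, before and after TTT: under the specialization hypothesis, the local loss should drop while the global loss may rise. The sketch below assumes an embedding function `embed`, labeled pre-training data `train_x`/`train_y`, and a `ttt_fn` such as the one sketched earlier; all of these are placeholders.

```python
# Illustrative probe of "specialization after generalization" (not the paper's
# protocol): fine-tune on the k nearest neighbors of a test point, then compare
# local vs. global loss before and after. `embed`, `train_x`, `train_y`, and
# `ttt_fn` (e.g. the test_time_train sketch above) are assumed placeholders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def nearest_neighbor_idx(test_x, train_x, embed, k=64):
    # Indices of the k training points closest to the test point in embedding space.
    e_test = F.normalize(embed(test_x.unsqueeze(0)), dim=-1).squeeze(0)
    e_train = F.normalize(embed(train_x), dim=-1)
    return (e_train @ e_test).topk(k).indices

@torch.no_grad()
def avg_loss(model, x, y):
    return F.cross_entropy(model(x), y).item()

def specialization_gap(pretrained_model, test_x, train_x, train_y, embed, ttt_fn):
    idx_local = nearest_neighbor_idx(test_x, train_x, embed)
    idx_global = torch.randperm(len(train_x))[: len(idx_local)]

    before_local = avg_loss(pretrained_model, train_x[idx_local], train_y[idx_local])
    before_global = avg_loss(pretrained_model, train_x[idx_global], train_y[idx_global])

    # Specialize on the local neighborhood only.
    specialized = ttt_fn(pretrained_model, train_x[idx_local], train_y[idx_local])

    after_local = avg_loss(specialized, train_x[idx_local], train_y[idx_local])
    after_global = avg_loss(specialized, train_x[idx_global], train_y[idx_global])

    # Expected under specialization: local loss improves, global loss may degrade.
    return before_local - after_local, after_global - before_global
```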
We model this phenomenon under the linear representation hypothesis (LRH), which posits a large, linear, sparsely activated concept space. Analyzing the behavior of TTT with sparse autoencoders (SAEs), we make some surprising observations. Moreover, in an idealized model based on the LRH, we prove that TTT can achieve a smaller in-distribution test error than exponentially larger globally trained models.
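As a rough illustration of the kind of SAE analysis mentioned above, the sketch below defines a hypothetical top-k sparse autoencoder and counts how many active concepts two embeddings share. The architecture, dimensions, and choice of k are assumptions for illustration, not the paper's setup.

```python
# Hypothetical top-k sparse autoencoder used to probe concept overlap between
# two semantically related inputs. Dimensions, k, and training details are
# placeholders; the paper's SAE setup on ImageNet may differ.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model=768, n_concepts=16384, k=32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_concepts)
        self.decoder = nn.Linear(n_concepts, d_model, bias=False)

    def encode(self, x):
        # Keep only the k largest pre-activations per input (sparse concept code).
        pre = self.encoder(x)
        topk = pre.topk(self.k, dim=-1)
        codes = torch.zeros_like(pre)
        codes.scatter_(-1, topk.indices, torch.relu(topk.values))
        return codes

    def forward(self, x):
        return self.decoder(self.encode(x))

def shared_concepts(sae, emb_a, emb_b):
    # Count concepts active for both inputs; under the linear representation
    # hypothesis, semantically related inputs should share a handful of these.
    active_a = sae.encode(emb_a) > 0
    active_b = sae.encode(emb_b) > 0
    return (active_a & active_b).sum().item()

# Usage (with a trained `sae` and two image embeddings `emb_a`, `emb_b`):
# print(shared_concepts(sae, emb_a, emb_b))
```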
Librarian Bot (automated): the following papers, recommended by the Semantic Scholar API, are similar to this paper.
- Test time training enhances in-context learning of nonlinear functions (2025)
- Evolution of Concepts in Language Model Pre-Training (2025)
- Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks (2025)
- Cross-Model Semantics in Representation Learning (2025)
- Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling (2025)
- Understanding and Enhancing Mask-Based Pretraining towards Universal Representations (2025)
- SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs (2025)