arXiv:2510.10224

Text2Token: Unsupervised Text Representation Learning with Token Target Prediction

Published on Oct 11, 2025

Abstract

Text2Token, an unsupervised generative framework for text representation learning, constructs synthetic token targets and achieves performance competitive with state-of-the-art embedders trained with contrastive learning.

AI-generated summary

Unsupervised text representation learning (TRL) is a fundamental task in natural language processing and benefits search and recommendation over the web's unlabeled texts. A recent empirical study finds that high-quality representations align with the key tokens of the input text, uncovering a potential connection between the representation space and the vocabulary space. Inspired by these findings, we revisit generative tasks and develop Text2Token, an unsupervised generative framework for TRL. The framework is based on a token target prediction task that uses a carefully constructed target token distribution as the supervisory signal. To construct a high-quality target token distribution, we analyze the token-alignment properties of advanced embedders and identify two essential categories of key tokens: (1) meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these insights, we propose two methods, data-driven and model-derived, to construct synthetic token targets from the data or from the LLM backbone. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with LLM2Vec, the state-of-the-art embedder trained with unsupervised contrastive learning. Our analysis further shows that the vocabulary and representation spaces are optimized jointly toward the optimal solution during training, providing new ideas and insights for future work.
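
Since the page carries no code, a minimal sketch may help make the objective concrete. The following is a rough PyTorch-style illustration of what the data-driven variant of token target prediction could look like: the pooled text representation is projected into vocabulary space through the backbone's LM head and trained against a synthetic target distribution that places mass on the tokens of the input text. The function names, shapes, and the uniform target construction are assumptions for illustration, not the authors' released code.

import torch
import torch.nn.functional as F

# Hypothetical sketch of the token target prediction objective
# (data-driven variant). Names and details are assumptions, not the
# authors' implementation.

def build_data_driven_targets(input_ids, vocab_size, pad_id=0):
    # Synthetic target distribution over the vocabulary: uniform mass on
    # the (non-padding) tokens that appear in the input text itself,
    # i.e. the "meaningful tokens in the text".
    targets = torch.zeros(input_ids.size(0), vocab_size)
    for i, ids in enumerate(input_ids):
        present = ids[ids != pad_id].unique()
        targets[i, present] = 1.0
    return targets / targets.sum(dim=-1, keepdim=True)

def token_target_loss(pooled_hidden, lm_head, targets):
    # Map the pooled text representation into vocabulary space via the
    # LM head, then match the synthetic target token distribution.
    log_probs = F.log_softmax(lm_head(pooled_hidden), dim=-1)  # (batch, vocab)
    return F.kl_div(log_probs, targets, reduction="batchmean")

In practice, pooled_hidden would come from pooling the LLM backbone's hidden states over the input, and the model-derived variant would instead build targets from semantically related tokens proposed by the backbone itself; those details are left to the paper.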
