A question about how the model is trained
My understanding: the Qwen3 Embedding series takes the Qwen3 LLM as its architecture and initialization, removes the output head that is useless for embeddings, and then runs a series of contrastive-learning training stages to strengthen its sentence-representation ability.
My question is: why choose the last token of the final decoder layer's output (i.e., the terminator token) as the sentence representation to optimize? (I couldn't find an explanation of this in the technical report.)
My understanding is that, in LLM training, the last token of the final decoder layer's output essentially serves as a sequence-termination predictor, whereas the goal of a sentence representation is to capture the overall semantics of the sequence; the two seem unrelated (at the very least, the "this sequence ends here" signal carries no useful information about the sentence's overall meaning). So why not start the optimization from the mean pooling over all tokens of the final decoder layer's output instead? That mean pooling already aggregates information from every token in the sequence and therefore already has some whole-sequence semantic character, so wouldn't optimizing it yield a larger gain?
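To make the comparison concrete, here is a minimal sketch of the two pooling strategies I'm contrasting, written with Hugging Face transformers; the checkpoint name, the explicit right-padding setting, and the use of last_hidden_state are my illustrative assumptions, not the official Qwen3 Embedding implementation:

```python
# Minimal sketch: last-token pooling vs. mean pooling over the final decoder
# layer's hidden states. Checkpoint name and padding handling are assumptions
# for illustration only.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen3-Embedding-0.6B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "right"          # assume right padding so the last real token is easy to locate
model = AutoModel.from_pretrained(model_name)

texts = ["What is the capital of China?", "The capital of China is Beijing."]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state      # (batch, seq_len, dim)
mask = batch["attention_mask"]                     # (batch, seq_len), 0 on padding

# Option A: last-token pooling -- hidden state at the last non-padding position
# (the terminator/EOS position if the tokenizer appends one).
last_idx = mask.sum(dim=1) - 1
last_token_emb = hidden[torch.arange(hidden.size(0)), last_idx]

# Option B: mean pooling -- average over all non-padding positions.
mask_f = mask.unsqueeze(-1).to(hidden.dtype)
mean_emb = (hidden * mask_f).sum(dim=1) / mask_f.sum(dim=1)
```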
The possible explanations I can think of: 1) a single token is easier to train, with straightforward gradient propagation and higher training efficiency (but this can't be the key reason, since models such as Tencent's Conan-embedding-v1 optimize mean pooling, which shows mean pooling is perfectly trainable); 2) because of the attention mask used during LLM training, only the token at the very end of the sequence can see the whole sequence, while every other token only sees a prefix, so including their embeddings in a mean pooling would add little (a toy illustration of this is sketched below).
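Here is the toy illustration of point 2; nothing model-specific, just the shape of a causal mask:

```python
# Toy illustration of point 2: under a causal mask, row i marks the positions
# token i can attend to; only the last row is all ones, so only the last
# token's hidden state is conditioned on the full sequence.
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))
print(causal_mask)
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```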
I'd appreciate a reply from the developers.
One more note: my tentative explanation above was based on the Qwen3 Embedding series, because the technical report says the Qwen3 Embedding models use causal attention, i.e., they are trained with is_causal=True. In that case, optimizing the last token's representation in contrastive learning still makes sense to me, since it is the only token that can see the entire sequence.
However, the gte-qwen2 series are also text embedding models, and I see that their config.json explicitly sets is_causal=False, i.e., they are trained with bidirectional attention. If the attention is bidirectional, why still optimize the last token? Shouldn't mean pooling be the thing to optimize in that case? During training, mean pooling ought to show stronger whole-sequence representational ability, since it is the aggregation of all tokens to begin with.
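For reference, this is roughly how the setting can be checked programmatically; the repository name below is just one example gte-qwen2 checkpoint, and the is_causal value is simply read back from its config.json:

```python
# Read the attention setting straight from the checkpoint's config.json.
# Repository name is an example gte-qwen2 checkpoint; trust_remote_code is
# passed because the repo ships custom modeling code.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)
print(getattr(cfg, "is_causal", None))  # False here, i.e. bidirectional attention
```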
Just to clarify, I'm not an official developer for these models, but I'm happy to share my understanding based on my experience!
You're likely correct about the second point regarding mean pooling in a decoder-only model. Because these models use causal attention (each position attends only to itself and earlier positions), only the final token's hidden state is conditioned on the entire sequence.
As for gte-qwen2, while I haven't used that specific model, I have used the thenlper/gte-* series of embedding models. These models are based on the paper "Towards General Text Embeddings with Multi-stage Contrastive Learning." While a standard encoder-only model (which uses bidirectional attention) inherently develops some sentence understanding during pre-training (e.g., with masked language modeling), it usually requires further fine-tuning to become an effective embedding model. For instance, embedding models are often optimized for specific downstream tasks like question answering (e.g., matching "What is the capital of China?" with "The capital of China is Beijing.") or information retrieval, and these tasks have subtle differences.
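For intuition, here is a minimal sketch of the in-batch contrastive (InfoNCE-style) objective that this kind of fine-tuning typically uses; the temperature value and the use of in-batch negatives are my own assumptions for illustration, not necessarily the exact recipe from the gte paper:

```python
# Minimal sketch of an in-batch contrastive (InfoNCE-style) loss: row i of
# doc_emb is the positive for query i, and all other rows act as negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """query_emb, doc_emb: (batch, dim) sentence embeddings from the model."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```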
Regarding the efficiency of using the first token versus mean pooling, I'm not certain; in theory, given the attention mechanism, the computational cost should be similar either way. That said, the thenlper/gte-* models typically use the first token (the CLS token) as the embedding representation.
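In case it helps, here is a minimal sketch of first-token (CLS) pooling with an encoder-style checkpoint, following the claim above; whether a specific gte checkpoint actually recommends CLS or mean pooling is worth confirming against its own model card:

```python
# Minimal sketch of first-token (CLS) pooling with an encoder-style embedding
# model; checkpoint name is an example, and the pooling choice follows the
# claim above rather than a verified model-card recommendation.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "thenlper/gte-base"  # one of the thenlper/gte-* checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

batch = tokenizer(["What is the capital of China?"], return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, dim)

cls_emb = hidden[:, 0]                          # first token ([CLS]) as the sentence embedding
```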