CoDi-Embedding-V1

CoDi-Embedding-V1 is an embedding model that supports both Chinese and English retrieval, with particularly strong performance in Chinese retrieval. It achieved state-of-the-art (SOTA) results on the Chinese MTEB benchmark as of August 20, 2025. Built on the MiniCPM-Embedding model, CoDi-Embedding-V1 extends the maximum sequence length from 512 to 4,096 tokens, significantly enhancing its capability for long-document retrieval. The model employs a mean pooling strategy in which instruction tokens are excluded from pooling to improve retrieval effectiveness.
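
The instruction-excluded pooling step can be illustrated with a short sketch. This is a minimal PyTorch illustration, not the model's actual implementation; the function and argument names (instruction_masked_mean_pool, instruction_len) are assumptions, as is the layout in which instruction tokens come first in the sequence:

import torch

def instruction_masked_mean_pool(hidden_states, attention_mask, instruction_len):
    # hidden_states: (batch, seq_len, hidden_dim) last-layer token states
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    # instruction_len: number of leading instruction tokens to exclude (assumed
    # layout: instruction tokens first, then the query/document text)
    mask = attention_mask.clone()
    mask[:, :instruction_len] = 0                      # exclude instruction tokens from pooling
    mask = mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)         # sum of the remaining token states
    counts = mask.sum(dim=1).clamp(min=1e-9)           # pooled-token count, avoid divide-by-zero
    return summed / counts                             # mean over text tokens only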

Model Description

  • Maximum Sequence Length: 4096 tokens
  • Output Dimensionality: 2304
  • Model Size: 2.4B parameters

Requirements

transformers>=4.37.2

Usage

Sentence Transformers

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("Youtu-RAG/CoDi-Embedding-V1")

# Example query and candidate documents (Chinese; English translations in comments)
queries = ["结算业务系统用户使用"]  # "Used by users of the settlement business system"
documents = [
    # "Based on the entered unfreeze-date range, query the list of account freezes
    # expiring within that period."
    "根据解冻日输入范围,查询出该时间范围内到期的账户冻结列表。",
    # "When a smart time deposit matures on a holiday, the maturity date can be set
    # to move up or roll over; maturity reminders for smart time-deposit certificates
    # support the same adjustment."
    "智能定期存款到期日为节假日时处理设置提前或顺延,支持智能定期证实书提前或顺延到期提醒。",
    # "An account expiry date is set when the account is opened; expiry reminders
    # follow the institution-wide system parameter settings."
    "账户开户时设置了账户到期日,账户到期提醒是根据全机构系统参数设置",
]

# Encode queries with the "query" prompt so instruction tokens are excluded from
# pooling; documents are encoded without a prompt.
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Get the similarity scores for the embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
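
The similarity matrix has shape (num_queries, num_documents), and each embedding has the 2304 output dimensionality listed in the Model Description. A minimal follow-up sketch, continuing from the code above, that ranks the candidate documents for the first query:

# Embeddings are 2304-dimensional, per the Model Description
print(document_embeddings.shape)  # (3, 2304)

# Rank the candidate documents for the first query by similarity score
ranking = similarity[0].argsort(descending=True)
for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"{rank}. score={similarity[0][idx].item():.4f}  {documents[idx]}")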