Update README.md
---
language:
- en
- zh
base_model:
- openbmb/MiniCPM-Embedding
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

## CoDi-Embedding-V1
CoDi-Embedding-V1 is an embedding model that supports both Chinese and English retrieval, with particularly strong performance on Chinese retrieval. It achieved state-of-the-art (SOTA) results on the Chinese MTEB benchmark as of August 20, 2025. Built on [MiniCPM-Embedding](https://huggingface.co/openbmb/MiniCPM-Embedding), CoDi-Embedding-V1 extends the maximum sequence length from 512 to 4,096 tokens, significantly improving long-document retrieval. The model uses a mean-pooling strategy in which instruction tokens are excluded from the pool to improve retrieval effectiveness.
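The instruction-excluded pooling can be pictured as masking out the instruction prefix before averaging token embeddings. The following is a minimal sketch of the idea, not the model's actual implementation; the function name and the `instruction_len` argument are hypothetical:

```python
import torch

def mean_pool_excluding_instruction(
    hidden_states: torch.Tensor,   # (batch, seq_len, dim)
    attention_mask: torch.Tensor,  # (batch, seq_len), 1 for real tokens
    instruction_len: int,
) -> torch.Tensor:
    """Mean-pool token embeddings while skipping the instruction prefix."""
    mask = attention_mask.clone()
    mask[:, :instruction_len] = 0                  # exclude instruction tokens
    mask = mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)     # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)       # (batch, 1), avoid div by zero
    return summed / counts
```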
### Model Description
- **Maximum Sequence Length:** 4096 tokens
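To confirm the 4,096-token limit at runtime, sentence-transformers exposes it as a model property. A minimal sketch, with the model ID as a placeholder assumption:

```python
from sentence_transformers import SentenceTransformer

# The model ID is a placeholder assumption; point it at the actual checkpoint.
model = SentenceTransformer("CoDi/CoDi-Embedding-V1", trust_remote_code=True)

# Inputs longer than max_seq_length are truncated before encoding.
print(model.max_seq_length)  # expected: 4096
```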
### Usage

A minimal, runnable sketch assuming the standard sentence-transformers API; the model ID, query prompt, and example texts are placeholders:

```python
from sentence_transformers import SentenceTransformer

# Model ID and query prompt are assumed placeholders, not the card's exact values.
model = SentenceTransformer("CoDi/CoDi-Embedding-V1", trust_remote_code=True)

# Queries typically carry an instruction prompt; documents are encoded as-is.
query_embeddings = model.encode(["中国的首都是哪里？"], prompt="Query: ")
document_embeddings = model.encode(["北京是中华人民共和国的首都。", "巴黎是法国的首都。"])

# Get the similarity scores for the embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
```
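`model.similarity` returns a score matrix of shape `(len(queries), len(documents))`, computed with the model's configured similarity function (cosine by default in sentence-transformers).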