Update README.md
README.md
CHANGED
@@ -37,14 +37,14 @@ The model is designed for **semantic search**, **retrieval**, and **sentence sim

---

+**About Distillation:**
The model was trained to **replicate the embedding space of Deepvk/USER-BGE-M3**, while maintaining the simplicity and flexibility of E5.
+To achieve this:

- Teacher embeddings were precomputed with `Deepvk/USER-BGE-M3`.
-- Student embeddings were trained to minimize **MSE** with the teacher’s embeddings.
-- A projection layer (768→1024) was added to match the teacher
-- **No prefixes**
+- Student embeddings were trained to minimize the **MSE** with the teacher’s embeddings.
+- A projection layer (768→1024) was added to match the dimensionality of the teacher model.
+- **No prefixes (like “query:” or “passage:”)** were used — the student encodes sentences directly.

---

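For illustration, here is a minimal sketch of the distillation objective described above. This is hypothetical code, not the actual training script: batch construction, pooling choice, learning rate, and the normalization step are assumptions.

```python
# Hypothetical distillation step (illustrative): an E5 student with a 768->1024
# projection is regressed onto precomputed USER-BGE-M3 teacher embeddings via MSE.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
student = AutoModel.from_pretrained("intfloat/multilingual-e5-base")
projection = nn.Linear(768, 1024)  # match the teacher's embedding dimensionality

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(projection.parameters()),
    lr=2e-5,  # learning rate is an assumption
)

def encode_student(texts):
    # No "query:" / "passage:" prefixes: sentences are encoded directly.
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    hidden = student(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling
    return F.normalize(projection(pooled), dim=-1)

# One toy step; in the real setup teacher_vecs are precomputed with Deepvk/USER-BGE-M3.
texts = ["Hello world", "Привет мир"]
teacher_vecs = F.normalize(torch.randn(len(texts), 1024), dim=-1)  # placeholder values

loss = F.mse_loss(encode_student(texts), teacher_vecs)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The projection layer exists only to map E5's 768-dimensional output into the teacher's 1024-dimensional space, so the student's vectors can be compared directly against BGE-M3 vectors.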
@@ -84,70 +84,37 @@ Key points:

## 📊 Evaluation Results

-The model was evaluated against the **teacher (`Deepvk/USER-BGE-M3`)** and the **original `intfloat/multilingual-e5-base`** on validation and test datasets
-### 🔹 TL;DR Summary
-- The **distilled E5-base student** reproduces the **Deepvk/USER-BGE-M3** embedding space with **high fidelity**.
-- The **original E5-base** embeddings are **incompatible** with the teacher’s space (cosine ≈ 0).
-- Recall@1: **EN ≈ 86% (Student)** vs **87.7% (Teacher)**
-- Recall@1: **RU ≈ 65.2% (Student)** vs **59.9% (Teacher)** — student even outperforms teacher on Russian.
+The model was evaluated against the **teacher (`Deepvk/USER-BGE-M3`)** and the **original `intfloat/multilingual-e5-base`** on validation and test datasets.

---

-This table combines **main validation/test metrics** with **additional EN/RU benchmarks**.
-Note: EN/RU Benchmarks are external datasets used to test retrieval performance; they are **not part of the training/validation splits**.
-| Dataset / Split | Model | MSE | Cosine mean / Cosine_Pos | Cosine std / Cosine_Pos_std | Cosine_Neg | Cosine_Neg_std | MRR | Recall@1 | Recall@5 | Recall@10 |
-|--------------------|--------------------|----------|-------------------------|-----------------------------|------------------------|----------------|--------|----------|----------|-----------|
-| **Validation** | Teacher (BGE-M3) | 0.000000 | 1.0000 | 0.0000 | — | — | 0.9244 | 0.8746 | 0.9851 | 0.9966 |
-| | Student (E5-distilled) | 0.000288 | 0.8389 | 0.0498 | — | — | 0.9158 | 0.8607 | 0.9829 | 0.9955 |
-| | e5-base (original) | 0.001866 | -0.0042 | 0.0297 | — | — | 0.0003 | 0.0000 | 0.0002 | 0.0003 |
-| **Test** | Teacher (BGE-M3) | 0.000000 | 1.0000 | 0.0000 | — | — | 0.9273 | 0.8771 | 0.9908 | 0.9962 |
-| | Student (E5-distilled) | 0.000276 | 0.8462 | 0.0425 | — | — | 0.9176 | 0.8608 | 0.9896 | 0.9956 |
-| | e5-base (original) | 0.001867 | -0.0027 | 0.0293 | — | — | 0.0002 | 0.0000 | 0.0001 | 0.0002 |
-| **EN Benchmark** (MS MARCO) | Teacher (BGE-M3) | — | 0.6710 | 0.0724 | 0.5575 | 0.0676 | 0.6362 | 0.4385 | 0.9205 | 1.0000 |
-| | Student (E5-distilled) | — | 0.7233 | 0.0670 | 0.6269 | 0.0615 | 0.5912 | 0.3745 | 0.9130 | 1.0000 |
-| | e5-base (original) | — | 0.8886 | 0.0259 | 0.8427 | 0.0264 | 0.6852 | 0.5100 | 0.9380 | 1.0000 |
-| **RU Benchmark** (SberQuad) | Teacher (BGE-M3) | — | 0.6070 | 0.0871 | 0.5790 | 0.1140 | 0.7995 | 0.5990 | 1.0000 | 1.0000 |
-| | Student (E5-distilled) | — | 0.6716 | 0.0777 | 0.6435 | 0.1016 | 0.8263 | 0.6525 | 1.0000 | 1.0000 |
-| | e5-base (original) | — | 0.8467 | 0.0323 | 0.8412 | 0.0426 | 0.7458 | 0.4915 | 1.0000 | 1.0000 |
+### 🔹 TL;DR

-- Student closely reproduces the teacher’s embedding space.
-- Slight drop in Recall@1 (−6.4 p.p.), but Recall@10 remains perfect.
-- Student embeddings are **more compact**, with slightly higher cosine similarities.
-- Original e5-base is incompatible with teacher’s space.
+- The **distilled E5-base student** reproduces the **Deepvk/USER-BGE-M3** embedding space with **very high fidelity**.
+- The **original E5-base** embeddings are **incompatible** with the BGE space (cosine ≈ 0).
+- **Recall@1: 86% (Student)** vs **87.7% (Teacher)** — nearly identical retrieval performance.

---

-### 🔹 RU Benchmark (SberQuad)
-| Model | Recall@1 | Recall@5 | Recall@10 | Cosine_Pos | Cosine_Pos_std | Cosine_Neg | Cosine_Neg_std | MRR |
-|------------------------|----------|----------|-----------|------------|----------------|------------|----------------|-------|
-| Teacher (USER-BGE-M3) | 0.5990 | 1.0000 | 1.0000 | 0.6070 | 0.0871 | 0.5790 | 0.1140 | 0.7995 |
-| Student (E5-distilled) | 0.6525 | 1.0000 | 1.0000 | 0.6716 | 0.0777 | 0.6435 | 0.1016 | 0.8263 |
-| multilingual-e5-base | 0.4915 | 1.0000 | 1.0000 | 0.8467 | 0.0323 | 0.8412 | 0.0426 | 0.7458 |
-| **Δ Student–Teacher** | +0.0535 | 0.0000 | 0.0000 | +0.0646 | −0.0093 | +0.0645 | −0.0125 | +0.0268 |
+### 🔹 Main Metrics

+| Split | Model | MSE | Cosine mean | Cosine std | MRR | Recall@1 | Recall@5 | Recall@10 |
+|--------------|--------------------|----------:|-------------:|------------:|--------:|----------:|----------:|----------:|
+| **Validation** | Teacher (BGE-M3) | 0.000000 | 1.0000 | 0.0000 | 0.9244 | 0.8746 | 0.9851 | 0.9966 |
+| | **Student (E5-distilled)** | **0.000288** | **0.8389** | **0.0498** | **0.9158** | **0.8607** | **0.9829** | **0.9955** |
+| | E5-base (original) | 0.001866 | -0.0042 | 0.0297 | 0.0003 | 0.0000 | 0.0002 | 0.0003 |
+| **Test** | Teacher (BGE-M3) | 0.000000 | 1.0000 | 0.0000 | 0.9273 | 0.8771 | 0.9908 | 0.9962 |
+| | **Student (E5-distilled)** | **0.000276** | **0.8462** | **0.0425** | **0.9176** | **0.8608** | **0.9896** | **0.9956** |
+| | E5-base (original) | 0.001867 | -0.0027 | 0.0293 | 0.0002 | 0.0000 | 0.0001 | 0.0002 |

---

### 🔹 Conclusions

-- 🔄 **Final Result:** a bilingual, lightweight student preserving teacher quality, without prefix requirements.
+- ✅ **Student ≈ Teacher** — the distilled model learned the teacher’s semantic space almost perfectly.
+- ❌ **Original E5 ≠ Teacher** — default E5 embeddings are unrelated to BGE’s space.
+- 📈 **Stable generalization** — validation and test results match closely.
+- 🧩 The new student is a **drop-in BGE-compatible encoder**, with **no prefix requirement**.

---

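As a reference for how numbers like these are typically produced, here is an illustrative sketch (not the repository's evaluation code) of the cosine-alignment and Recall@k / MRR computations over precomputed embedding matrices; the function names and workflow are assumptions.

```python
# Illustrative evaluation helpers (assumed workflow, not the repo's script):
# cosine alignment between matching student/teacher vectors, and Recall@k / MRR
# for retrieving each query's positive passage by cosine similarity.
import numpy as np

def cosine_alignment(student_vecs: np.ndarray, teacher_vecs: np.ndarray):
    s = student_vecs / np.linalg.norm(student_vecs, axis=1, keepdims=True)
    t = teacher_vecs / np.linalg.norm(teacher_vecs, axis=1, keepdims=True)
    cos = (s * t).sum(axis=1)  # row-wise cosine between paired vectors
    return cos.mean(), cos.std()

def retrieval_metrics(query_vecs, doc_vecs, positive_idx, ks=(1, 5, 10)):
    # positive_idx[i] is the index in doc_vecs of query i's relevant passage.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    order = np.argsort(-(q @ d.T), axis=1)  # best match first
    ranks = np.array([int(np.where(order[i] == positive_idx[i])[0][0]) + 1
                      for i in range(len(positive_idx))])
    out = {f"Recall@{k}": float((ranks <= k).mean()) for k in ks}
    out["MRR"] = float((1.0 / ranks).mean())
    return out
```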
@@ -167,4 +134,4 @@ Note: EN/RU Benchmarks are external datasets used to test retrieval performance;
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("skatzR/USER-BGE-M3-E5-Base-Distilled")
-embeddings = model.encode(["Hello world", "Привет мир"],
+embeddings = model.encode(["Hello world", "Привет мир"], normalize_embeddings=True)
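Because the student shares the teacher's embedding space, its vectors can be compared directly with USER-BGE-M3 vectors. A hypothetical mixed-model example follows; it assumes the teacher checkpoint (named `Deepvk/USER-BGE-M3` in this README) also loads through `sentence-transformers`, and the corpus and query strings are made up.

```python
from sentence_transformers import SentenceTransformer, util

student = SentenceTransformer("skatzR/USER-BGE-M3-E5-Base-Distilled")
teacher = SentenceTransformer("Deepvk/USER-BGE-M3")  # teacher model, as named in this README

# Index a corpus with the teacher, then query it with the lighter student.
corpus = ["Moscow is the capital of Russia.", "Berlin is the capital of Germany."]
corpus_emb = teacher.encode(corpus, normalize_embeddings=True)
query_emb = student.encode(["Какой город является столицей России?"], normalize_embeddings=True)

scores = util.cos_sim(query_emb, corpus_emb)  # cross-model cosine similarity
best = int(scores.argmax())
print(corpus[best], float(scores[0][best]))
```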