---
license: apache-2.0
base_model:
- intfloat/multilingual-e5-base
language:
- ru
- en
tags:
- sentence-embeddings
- semantic-search
- distillation
- student-model
- multilingual
---

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-model-blue)](https://huggingface.co/skatzR/USER-BGE-M3-E5-Base-Distilled)
# 🧩 Student-Distilled Sentence Embeddings: Deepvk/USER-bge-m3 → intfloat/multilingual-e5-base

✨ This repository contains a **student model distilled from [`Deepvk/USER-BGE-M3`](https://huggingface.co/deepvk/USER-bge-m3)** using [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) as the base encoder.  
The model is designed for **semantic search**, **retrieval**, and **sentence similarity** tasks in **Russian 🇷🇺** and **English 🇬🇧**, optimized for **practical use without prefixes**.

---

# πŸ” Model Card

| Property           | Value                                                                 |
|--------------------|----------------------------------------------------------------------|
| **Teacher Model**  | [`Deepvk/USER-BGE-M3`](https://huggingface.co/deepvk/USER-bge-m3)   |
| **Base Model**     | [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) |
| **Distillation Type** | Embedding-level distillation (teacher → student) |
| **Embedding Dim**  | 1024 |
| **Projection**     | Dense layer (768 → 1024) |
| **Loss Function**  | Mean Squared Error (MSE) |
| **Libraries**      | `sentence-transformers`, `torch` |
| **License**        | Apache-2.0 |
| **Hardware**       | CPU & GPU supported |

---

**About Distillation:**  
The model was trained to **replicate the embedding space of Deepvk/USER-BGE-M3**, while maintaining the simplicity and flexibility of E5.  
To achieve this:

- Teacher embeddings were precomputed with `Deepvk/USER-BGE-M3`.  
- Student embeddings were trained to minimize the **MSE** with the teacher's embeddings.  
- A projection layer (768 → 1024) was added to match the dimensionality of the teacher model.  
- **No prefixes (like "query:" or "passage:")** were used; the student encodes sentences directly.
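
As a rough illustration of this recipe, the sketch below shows how an embedding-level distillation of this kind can be set up with `sentence-transformers`. The placeholder sentences, batch size, and the choice of an identity-activation (purely linear) projection are assumptions for illustration, not the original training script.

```python
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Teacher: frozen, only used to precompute 1024-dim target embeddings.
teacher = SentenceTransformer("deepvk/USER-bge-m3")

# Student: E5-base encoder + mean pooling + Dense projection 768 -> 1024.
encoder = models.Transformer("intfloat/multilingual-e5-base")
pooling = models.Pooling(encoder.get_word_embedding_dimension(), pooling_mode="mean")
projection = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=1024,
    activation_function=torch.nn.Identity(),  # assumption: linear projection
)
student = SentenceTransformer(modules=[encoder, pooling, projection])

# Precompute teacher embeddings and use them as regression targets for MSE.
sentences = ["Пример предложения", "An example sentence"]  # placeholder training data
targets = teacher.encode(sentences, convert_to_numpy=True)
train_examples = [InputExample(texts=[s], label=t) for s, t in zip(sentences, targets)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student)
```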

---

## 🚀 Features

- ⚡ **Fast inference**: optimized E5-base architecture with no prefix processing  
- 🧠 **High-quality semantic understanding**: inherits BGE's retrieval capability  
- 🌍 **Multilingual (RU/EN)**: strong in Russian, solid in English  
- 🔄 **Teacher-compatible**: embeddings align closely with Deepvk/USER-BGE-M3  
- 🛠 **Sentence-transformers ready**: plug-and-play for semantic search, clustering, and retrieval

---

## 🧠 Intended Use

**✅ Recommended for:**
- Semantic search and retrieval systems  
- Text embedding and similarity pipelines  
- Multilingual tasks focused on Russian and English  
- Clustering and topic discovery  

**❌ Not ideal for:**
- Prefix-based retrieval setups (e.g., original E5 behavior)
- Cross-encoder scoring tasks  

---

## 📚 Training Details

- **Training Objective:** Mimic teacher embeddings (Deepvk/USER-BGE-M3)  
- **Dataset Composition:** Retrieval/Semantic ratio = 60/40  
- **Language Distribution:** Russian / English ≈ 80 / 20  
- **Training Duration:** 5 epochs with warmup and cosine evaluation  
- **Optimizer:** AdamW with automatic mixed precision (AMP)
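
Continuing the sketch above (reusing `teacher`, `student`, `train_dataloader`, and `train_loss`), the training call with the classic `sentence-transformers` fit API might look roughly like the following; the learning rate, warmup-step count, and the `MSEEvaluator` dev split are illustrative assumptions, not the original configuration.

```python
import torch
from sentence_transformers import evaluation

# Optional: track how closely student embeddings match the teacher on held-out data.
dev_sentences = ["Проверочное предложение", "A held-out validation sentence"]  # placeholder
dev_evaluator = evaluation.MSEEvaluator(
    source_sentences=dev_sentences,
    target_sentences=dev_sentences,
    teacher_model=teacher,
)

student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    epochs=5,                           # as stated in Training Details
    warmup_steps=1000,                  # assumption
    optimizer_class=torch.optim.AdamW,
    optimizer_params={"lr": 2e-5},      # assumption
    use_amp=True,                       # automatic mixed precision
    output_path="USER-BGE-M3-E5-Base-Distilled",
)
```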

---

## 📊 Evaluation Results

The model was evaluated against the **teacher (`Deepvk/USER-BGE-M3`)** and the **original `intfloat/multilingual-e5-base`** on validation and test datasets.

---

### 🔹 TL;DR

- The **distilled E5-base student** reproduces the **Deepvk/USER-BGE-M3** embedding space with **very high fidelity**.  
- The **original E5-base** embeddings are **incompatible** with the BGE space (cosine ≈ 0).  
- **Recall@1: 86% (Student)** vs **87.7% (Teacher)**: nearly identical retrieval performance.  

---

### 🔹 Main Metrics

| Split       | Model               | MSE      | Cosine mean | Cosine std | MRR    | Recall@1 | Recall@5 | Recall@10 |
|--------------|--------------------|----------:|-------------:|------------:|--------:|----------:|----------:|----------:|
| **Validation** | Teacher (BGE-M3)   | 0.000000 | 1.0000 | 0.0000 | 0.9244 | 0.8746 | 0.9851 | 0.9966 |
|               | **Student (E5-distilled)** | **0.000288** | **0.8389** | **0.0498** | **0.9158** | **0.8607** | **0.9829** | **0.9955** |
|               | E5-base (original) | 0.001866 | -0.0042 | 0.0297 | 0.0003 | 0.0000 | 0.0002 | 0.0003 |
| **Test** | Teacher (BGE-M3) | 0.000000 | 1.0000 | 0.0000 | 0.9273 | 0.8771 | 0.9908 | 0.9962 |
|               | **Student (E5-distilled)** | **0.000276** | **0.8462** | **0.0425** | **0.9176** | **0.8608** | **0.9896** | **0.9956** |
|               | E5-base (original) | 0.001867 | -0.0027 | 0.0293 | 0.0002 | 0.0000 | 0.0001 | 0.0002 |
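
For orientation, metrics of this shape can be reproduced roughly as follows: MSE and the cosine statistics compare the student's and teacher's embeddings of the same sentences, while MRR and Recall@k treat each teacher embedding as the single relevant item for its student-encoded counterpart. The sentence list, the normalization choice, and the retrieval setup below are assumptions for illustration, not the original evaluation script.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("deepvk/USER-bge-m3")
student = SentenceTransformer("skatzR/USER-BGE-M3-E5-Base-Distilled")

sentences = [
    "Как оформить возврат товара?",
    "How do I return a product?",
    "Погода сегодня солнечная.",
]  # hypothetical evaluation sentences

t = teacher.encode(sentences, normalize_embeddings=True)
s = student.encode(sentences, normalize_embeddings=True)

mse = float(np.mean((s - t) ** 2))
cos = np.sum(s * t, axis=1)  # both sides normalized, so dot product = cosine
print(f"MSE={mse:.6f}  cosine mean={cos.mean():.4f}  std={cos.std():.4f}")

# Retrieval view: for each student embedding, rank all teacher embeddings and
# check where the matching sentence lands (MRR, Recall@k).
sims = s @ t.T                      # (n, n) cosine similarity matrix
order = (-sims).argsort(axis=1)     # indices sorted by descending similarity
pos = np.array([int(np.where(order[i] == i)[0][0]) for i in range(len(sentences))])
mrr = float(np.mean(1.0 / (pos + 1)))
recall = {k: float(np.mean(pos < k)) for k in (1, 5, 10)}
print(f"MRR={mrr:.4f}  Recall@1/5/10={recall[1]:.2f}/{recall[5]:.2f}/{recall[10]:.2f}")
```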

---

### 🔹 Conclusions

- ✅ **Student ≈ Teacher**: the distilled model learned the teacher's semantic space almost perfectly.  
- ❌ **Original E5 ≠ Teacher**: default E5 embeddings are unrelated to BGE's space.  
- 📈 **Stable generalization**: validation and test results match closely.  
- 🧩 The new student is a **drop-in BGE-compatible encoder**, with **no prefix requirement**.

---

## 📂 Model Structure

- `USER-BGE-M3-E5-Base-Distilled`, the trained model folder, contains:  
  - Transformer encoder (`intfloat/multilingual-e5-base`)  
  - Pooling layer  
  - Dense projection layer (768 → 1024)  
- Fully compatible with `sentence-transformers` API.
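
Loading the model and printing it should reveal exactly this module stack; a quick sanity check:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("skatzR/USER-BGE-M3-E5-Base-Distilled")
print(model)                                     # Transformer -> Pooling -> Dense
print(model.get_sentence_embedding_dimension())  # expected: 1024
```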

---

## 🧩 Using the Model

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("skatzR/USER-BGE-M3-E5-Base-Distilled")
embeddings = model.encode(["Hello world", "Привет мир"], normalize_embeddings=True)
```
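
Since the student lives in the teacher's embedding space, its vectors can be compared directly with `deepvk/USER-bge-m3` outputs. A minimal compatibility check (downloading the teacher is only needed for this comparison; the example sentence is arbitrary):

```python
from sentence_transformers import SentenceTransformer, util

student = SentenceTransformer("skatzR/USER-BGE-M3-E5-Base-Distilled")
teacher = SentenceTransformer("deepvk/USER-bge-m3")

text = "Семантический поиск по документам"  # "Semantic search over documents"
e_student = student.encode(text, normalize_embeddings=True)
e_teacher = teacher.encode(text, normalize_embeddings=True)

# High values indicate the spaces align; the reported cosine mean on the
# evaluation sets above is roughly 0.84.
print(util.cos_sim(e_student, e_teacher).item())
```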