# 🧩 Student-Distilled Sentence Embeddings: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 → deepvk/USER-bge-m3
✨ This repository contains a student model distilled from deepvk/USER-bge-m3, using sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 as the base.
It is designed for fast inference, semantic search, sentence similarity, and clustering tasks in Russian 🇷🇺 and English 🇬🇧.
## Model Card

| Property | Value |
|---|---|
| Teacher Model | deepvk/USER-bge-m3 |
| Base Model | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |
| Distillation | Student model trained to mimic teacher embeddings |
| Embedding Dim | 1024 |
| Libraries | `sentence-transformers`, `torch` |
| Supported HW | CPU & GPU |
| License | Apache-2.0 |
**Note on distillation.** To allow the student to correctly mimic the teacher:
- We used the output embeddings from MiniLM-L12-v2 (384-dimensional).
- We added a projection (Dense) layer to upsample them to 1024 dimensions, matching the teacher.
- Training used a combined cosine-similarity + MSE loss against the teacher embeddings, so the student learned to reproduce the teacher's embedding space accurately (see the sketch below).
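A minimal sketch of how such a student could be assembled with `sentence-transformers`. The pooling mode, loss weighting, and function names below are assumptions for illustration; the actual training script is not part of this repository.

```python
# Sketch only: a MiniLM student with a Dense projection to 1024-d, plus a
# combined MSE + cosine objective against teacher embeddings.
# Hyperparameters (pooling mode, loss weighting, activation of the Dense
# layer) are assumptions, not the released recipe.
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer, models

word = models.Transformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
dense = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),  # 384
    out_features=1024,                                        # match the teacher dimension
)
# The Dense module ends up stored as the 2_Dense folder when the model is saved.
student = SentenceTransformer(modules=[word, pooling, dense])

def distillation_loss(student_emb, teacher_emb, alpha=0.5):
    """MSE plus (1 - cosine similarity) between student and teacher vectors."""
    mse = F.mse_loss(student_emb, teacher_emb)
    cos = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return alpha * mse + (1.0 - alpha) * cos
```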
## Features

- ⚡ Fast inference – smaller student model for quicker embeddings
- 📦 Lightweight – reduced memory footprint compared to the teacher
- Drop-in replacement – embeddings compatible with the teacher model
- Multilingual – supports Russian 🇷🇺 and English 🇬🇧
- Full `sentence-transformers` compatibility – use in any pipeline (embeddings, clustering, semantic search) with minimal adjustments
## Intended Use

✅ Recommended for:
- Semantic search & retrieval pipelines
- Sentence similarity & clustering (see the clustering sketch after this list)
- Lightweight deployment for inference

❌ Not ideal for:
- Tasks requiring absolute maximum precision (the teacher model gives slightly better quality)
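As an illustration of the clustering use case, here is a minimal sketch using scikit-learn's KMeans. scikit-learn is an assumed extra dependency, and the example sentences are invented.

```python
# Sketch: clustering sentence embeddings from the distilled model with KMeans.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")
sentences = [
    "Cats make quiet apartment pets",
    "Dogs are loyal companions",
    "The central bank raised interest rates",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # e.g. [0, 0, 1]: the two pet sentences fall into one cluster
```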
## ⚖️ Pros & Cons of the Distilled Model

**Pros ✅**
- Smaller and faster than the teacher
- Easy to integrate using `sentence-transformers`
- Good cosine similarity to teacher embeddings

**Cons ❌**
- Slight drop in precision vs. the full teacher model
- Performance depends on training coverage and languages
## Training Data

- Volume: 730,000 sentences (sentence pairs)
- Data Type: retrieval / semantic ≈ 60/40
- Language Distribution: Russian / English ≈ 80/20
- Training Goal: focus on the Russian language while minimizing quality loss in English
- Purpose: distillation of knowledge from the deepvk/USER-bge-m3 teacher model into a compact student model
## Evaluation

We evaluated the models on validation and test splits, comparing each against the teacher (deepvk/USER-bge-m3): the teacher itself (as a reference), the distilled student, and the original paraphrase-multilingual-MiniLM-L12-v2.

### 🔹 TL;DR
- Our distilled model (Student) reproduces the embedding space of deepvk/USER-bge-m3 (Teacher) with minimal loss.
- Recall@1: 82% (Student) vs 87% (Teacher).
- The original MiniLM is completely incompatible with the teacher's space (Recall ≈ 0%).
### 🔹 Main Metrics

| Split | Model | MSE | Cosine mean | Cosine std | MRR | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|---|---|---|
| Validation | Teacher | 0.000000 | 1.0000 | 0.0000 | 0.9244 | 0.8746 | 0.9851 | 0.9966 |
| Validation | Student | 0.000365 | 0.7896 | 0.0643 | 0.8906 | 0.8248 | 0.9726 | 0.9893 |
| Validation | MiniLM | 0.018173 | -0.0012 | 0.0303 | 0.0003 | 0.0000 | 0.0002 | 0.0002 |
| Validation | Student vs MiniLM | 0.017826 | -0.0019 | 0.0295 | 0.0002 | 0.0000 | 0.0000 | 0.0002 |
| Test | Teacher | 0.000000 | 1.0000 | 0.0000 | 0.9273 | 0.8771 | 0.9908 | 0.9962 |
| Test | Student | 0.000362 | 0.7921 | 0.0556 | 0.8832 | 0.8107 | 0.9763 | 0.9895 |
| Test | MiniLM | 0.015902 | -0.0033 | 0.0302 | 0.0003 | 0.0000 | 0.0002 | 0.0002 |
| Test | Student vs MiniLM | 0.015553 | -0.0037 | 0.0295 | 0.0002 | 0.0000 | 0.0001 | 0.0002 |

All rows compare the named model's embeddings to the teacher's, except the "Student vs MiniLM" rows, which compare the student to the original MiniLM.
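For reference, a minimal sketch of how alignment metrics of this kind could be computed. This is not the evaluation script used for the numbers above; the function name and the Recall@k formulation (the student embedding of sentence *i* should retrieve the teacher embedding of the same sentence) are illustrative assumptions.

```python
# Illustrative only: MSE, mean/std cosine, and Recall@k between two sets of
# embeddings for the same sentences (e.g. student vs teacher).
import numpy as np

def alignment_metrics(student: np.ndarray, teacher: np.ndarray, k: int = 1):
    # student, teacher: (n_sentences, 1024) arrays encoding the same sentences
    mse = float(np.mean((student - teacher) ** 2))
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    cos = np.sum(s * t, axis=1)            # per-sentence cosine to the teacher
    sims = s @ t.T                         # student i vs every teacher j
    ranks = np.argsort(-sims, axis=1)      # best teacher match first
    recall_k = float(np.mean([i in ranks[i, :k] for i in range(len(s))]))
    return {"mse": mse, "cos_mean": float(cos.mean()),
            "cos_std": float(cos.std()), f"recall@{k}": recall_k}
```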
### 🔹 Conclusions

- Student ≈ Teacher → the distilled model successfully learned the teacher's embedding space.
- MiniLM ≠ Teacher → the original MiniLM is not aligned with the teacher.
- Student vs MiniLM ≈ 0 → the student learned a new embedding space, not just a copy of MiniLM.
- Validation ≈ Test → stable results, no overfitting.
## Files

- `USER-BGE-M3-MiniLM-L12-v2-Distilled` – trained student model directory; includes all weights and the full architecture, including the custom `2_Dense` projection layer used for distillation
- `tokenizer.json`, `config.json` – model configuration and tokenizer
- Optional internal folders (such as `2_Dense`) contain layers specific to the distillation setup and are handled automatically by `sentence-transformers`
## 🧩 Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")
embeddings = model.encode(["Hello world", "Привет мир"])
```
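A short follow-up sketch for the semantic-search use case, ranking candidate sentences against a query with cosine similarity (`util.cos_sim` comes with `sentence-transformers`; the query and corpus strings are invented):

```python
# Sketch: semantic search with the distilled model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")
query = model.encode("How do I set up semantic search?", convert_to_tensor=True)
corpus = model.encode(
    ["Setting up semantic search", "Рецепт борща", "Sentence similarity example"],
    convert_to_tensor=True,
)
scores = util.cos_sim(query, corpus)[0]  # cosine similarity per corpus sentence
best = scores.argmax().item()            # index of the closest sentence
```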