🧩 Student-Distilled Sentence Embeddings: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 → deepvk/USER-bge-m3

✨ This repository contains a student model distilled from deepvk/USER-bge-m3, using sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 as the base model.
It is designed for fast inference, semantic search, sentence similarity, and clustering tasks in Russian 🇷🇺 and English 🇬🇧.


πŸ” Model Card

| Property | Value |
|---|---|
| Teacher Model | deepvk/USER-bge-m3 |
| Base Model | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |
| Distillation | Student model trained to mimic the teacher's embeddings |
| Embedding Dim | 1024 |
| Parameters | ~118M (F32, Safetensors) |
| Libraries | sentence-transformers, torch |
| Supported HW | CPU & GPU |
| License | Apache-2.0 |

Note on Distillation:
To allow the student to correctly mimic the teacher:

  • We used the output embeddings from MiniLM-L12-v2 (384-dimensional).
  • We added a projection (Dense) layer to upsample them to 1024 dimensions, matching the teacher.
  • Training used a combined cosine similarity + MSE loss against the teacher embeddings, so the student learned to reproduce the teacher's embedding space accurately.
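
A minimal sketch of this setup with the public sentence-transformers modules is shown below. The Dense layer's activation and the loss weighting are assumptions; the card does not specify them.

```python
# Sketch of the student architecture: MiniLM-L12-v2 body + mean pooling +
# a Dense projection up to the teacher's 1024-dimensional space.
import torch
from sentence_transformers import SentenceTransformer, models

base = models.Transformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
pooling = models.Pooling(base.get_word_embedding_dimension())  # 384-dim sentence embedding
projection = models.Dense(
    in_features=384,
    out_features=1024,                        # match the teacher's embedding dimension
    activation_function=torch.nn.Identity(),  # assumption: activation not stated in the card
)
student = SentenceTransformer(modules=[base, pooling, projection])

def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity + MSE objective against precomputed teacher embeddings."""
    cos = 1.0 - torch.nn.functional.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    mse = torch.nn.functional.mse_loss(student_emb, teacher_emb)
    return cos + mse  # equal weighting is an assumption
```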

🚀 Features

  • ⚡ Fast inference: smaller student model for quicker embeddings
  • 📦 Lightweight: reduced memory footprint compared to the teacher
  • 🔄 Drop-in replacement: embeddings compatible with the teacher model (see the comparison sketch after this list)
  • 🌍 Multilingual: supports Russian 🇷🇺 and English 🇬🇧
  • 🛠 Full sentence-transformers compatibility: use it in any pipeline (embeddings, clustering, semantic search) with minimal adjustments
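
A quick way to sanity-check the drop-in claim is to compare the two models' embeddings directly (model IDs as listed in this card; actual similarity values will vary by input):

```python
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("deepvk/USER-bge-m3")
student = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")

sentences = ["Привет, как дела?", "Machine learning is fun."]
t = teacher.encode(sentences, normalize_embeddings=True)
s = student.encode(sentences, normalize_embeddings=True)

# Per-sentence cosine similarity between teacher and student embeddings
print((t * s).sum(axis=1))
```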

🧠 Intended Use

✅ Recommended for:

  • Semantic search & retrieval pipelines (example below)
  • Sentence similarity & clustering
  • Lightweight deployment for inference

❌ Not ideal for:

  • Tasks requiring absolute maximum precision (the teacher model gives slightly better quality)
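
For instance, a minimal semantic-search setup using the library's util.semantic_search helper (corpus and query below are purely illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")

corpus = [
    "Кошка спит на диване.",                      # "The cat is sleeping on the sofa."
    "The weather in London is rainy today.",
    "Как приготовить борщ в домашних условиях.",  # "How to cook borscht at home."
]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode("рецепт борща", convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f'{hit["score"]:.3f}  {corpus[hit["corpus_id"]]}')
```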

βš–οΈ Pros & Cons of Distilled Model

Pros βœ…

  • Smaller and faster than teacher
  • Easy to integrate using sentence-transformers
  • Good cosine similarity to teacher embeddings

Cons ❌

  • Slight drop in precision vs full teacher model
  • Performance depends on training coverage and languages

📚 Training Data

  • Volume: 730,000 sentences (sentence pairs)
  • Data Type: retrieval / semantic (60/40)
  • Language Distribution: Russian / English (80/20)
  • Training Goal: focus on Russian while minimizing quality loss in English
  • Purpose: distill knowledge from the deepvk/USER-bge-m3 teacher into a compact student model

📊 Evaluation

We evaluated alignment with the teacher (deepvk/USER-bge-m3) on validation and test splits for three models: the teacher itself, the distilled student, and the original paraphrase-multilingual-MiniLM-L12-v2.


🔹 TL;DR

  • Our distilled model (Student) reproduces the embedding space of deepvk/USER-bge-m3 (Teacher) with minimal loss.
  • Recall@1: 82% (Student) vs 87% (Teacher).
  • The original MiniLM is completely incompatible with the teacher's space (Recall ≈ 0%).

🔹 Main Metrics

| Split | Model | MSE | Cosine mean | Cosine std | MRR | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|---|---|---|
| Validation | Teacher | 0.000000 | 1.0000 | 0.0000 | 0.9244 | 0.8746 | 0.9851 | 0.9966 |
| Validation | Student | 0.000365 | 0.7896 | 0.0643 | 0.8906 | 0.8248 | 0.9726 | 0.9893 |
| Validation | MiniLM | 0.018173 | -0.0012 | 0.0303 | 0.0003 | 0.0000 | 0.0002 | 0.0002 |
| Validation | Student vs MiniLM | 0.017826 | -0.0019 | 0.0295 | 0.0002 | 0.0000 | 0.0000 | 0.0002 |
| Test | Teacher | 0.000000 | 1.0000 | 0.0000 | 0.9273 | 0.8771 | 0.9908 | 0.9962 |
| Test | Student | 0.000362 | 0.7921 | 0.0556 | 0.8832 | 0.8107 | 0.9763 | 0.9895 |
| Test | MiniLM | 0.015902 | -0.0033 | 0.0302 | 0.0003 | 0.0000 | 0.0002 | 0.0002 |
| Test | Student vs MiniLM | 0.015553 | -0.0037 | 0.0295 | 0.0002 | 0.0000 | 0.0001 | 0.0002 |
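
These metrics can be approximated with a short script along the lines below. The evaluation sentences are not published with this card, and the exact Recall@k protocol is an assumption: here queries are encoded with the candidate model and retrieved against teacher-encoded passages, which is one plausible reading of the cross-model comparison.

```python
# Hedged sketch: teacher-alignment metrics (MSE, mean cosine) plus a
# cross-model Recall@1 where candidate queries search teacher-encoded passages.
import numpy as np
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("deepvk/USER-bge-m3")
candidate = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")

# Replace with your own held-out (query, passage) pairs.
queries = ["Как приготовить борщ?", "What is the capital of France?"]
passages = ["Пошаговый рецепт борща.", "Paris is the capital of France."]

q_teacher = teacher.encode(queries, normalize_embeddings=True)
q_candidate = candidate.encode(queries, normalize_embeddings=True)
p_teacher = teacher.encode(passages, normalize_embeddings=True)

mse = float(np.mean((q_teacher - q_candidate) ** 2))
mean_cos = float((q_teacher * q_candidate).sum(axis=1).mean())

sims = q_candidate @ p_teacher.T  # candidate queries vs teacher passages
recall_at_1 = float((sims.argmax(axis=1) == np.arange(len(queries))).mean())

print(f"MSE={mse:.6f}  cosine={mean_cos:.4f}  Recall@1={recall_at_1:.4f}")
```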

🔹 Conclusions

  • Student ≈ Teacher → the distilled model successfully learned the teacher's embedding space.
  • MiniLM ≠ Teacher → the original MiniLM is not aligned with the teacher.
  • Student vs MiniLM ≈ 0 → the student learned a new embedding space rather than simply copying MiniLM.
  • Validation ≈ Test → stable results, no overfitting.

📂 Files

  • USER-BGE-M3-MiniLM-L12-v2-Distilled: the trained student model directory, containing all weights and the full architecture, including the custom 2_Dense projection layer used for distillation
  • tokenizer.json, config.json: tokenizer and model configuration
  • Internal folders such as 2_Dense contain layers specific to the distillation setup and are handled automatically by sentence-transformers

🧩 Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")
embeddings = model.encode(["Hello world", "Привет мир"])
print(embeddings.shape)  # (2, 1024)
```