🧩 Student-Distilled Sentence Embeddings: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 → deepvk/USER-bge-m3

✨ This repository contains a student model distilled from deepvk/USER-bge-m3, using sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 as the base model.
It is designed for fast inference, semantic search, sentence similarity, and clustering tasks in Russian 🇷🇺 and English 🇬🇧.


πŸ” Model Card

| Property | Value |
|---|---|
| Teacher Model | deepvk/USER-bge-m3 |
| Base Model | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |
| Distillation | Student model trained to mimic the teacher's embeddings |
| Embedding Dim | 1024 |
| Parameters | ~118M (F32, Safetensors) |
| Libraries | sentence-transformers, torch |
| Supported HW | CPU & GPU |
| License | Apache-2.0 |

Note on Distillation:
To allow the student to correctly mimic the teacher:

  • We used the output embeddings from MiniLM-L12-v2 (384-dimensional).
  • We added a projection (Dense) layer to upsample them to 1024 dimensions, matching the teacher.
  • Training used a combined cosine similarity + MSE loss against the teacher embeddings, so the student learned to reproduce the teacher's embedding space accurately.
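
A minimal sketch of this setup with the public sentence-transformers modules is shown below. The Dense layer's activation and the loss weighting are assumptions; the card does not specify them.

```python
# Sketch of the student architecture: MiniLM-L12-v2 body + mean pooling +
# a Dense projection up to the teacher's 1024-dimensional space.
import torch
from sentence_transformers import SentenceTransformer, models

base = models.Transformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
pooling = models.Pooling(base.get_word_embedding_dimension())  # 384-dim sentence embedding
projection = models.Dense(
    in_features=384,
    out_features=1024,                        # match the teacher's embedding dimension
    activation_function=torch.nn.Identity(),  # assumption: activation not stated in the card
)
student = SentenceTransformer(modules=[base, pooling, projection])

def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity + MSE objective against precomputed teacher embeddings."""
    cos = 1.0 - torch.nn.functional.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    mse = torch.nn.functional.mse_loss(student_emb, teacher_emb)
    return cos + mse  # equal weighting is an assumption
```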

🚀 Features

  • ⚡ Fast inference: smaller student model for quicker embeddings
  • 📦 Lightweight: reduced memory footprint compared to the teacher
  • 🔄 Drop-in replacement: embeddings compatible with the teacher model (see the comparison sketch after this list)
  • 🌍 Multilingual: supports Russian 🇷🇺 and English 🇬🇧
  • 🛠 Full sentence-transformers compatibility: use it in any pipeline (embeddings, clustering, semantic search) with minimal adjustments
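
A quick way to sanity-check the drop-in claim is to compare the two models' embeddings directly (model IDs as listed in this card; actual similarity values will vary by input):

```python
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("deepvk/USER-bge-m3")
student = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")

sentences = ["Привет, как дела?", "Machine learning is fun."]
t = teacher.encode(sentences, normalize_embeddings=True)
s = student.encode(sentences, normalize_embeddings=True)

# Per-sentence cosine similarity between teacher and student embeddings
print((t * s).sum(axis=1))
```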

🧠 Intended Use

✅ Recommended for:

  • Semantic search & retrieval pipelines (example below)
  • Sentence similarity & clustering
  • Lightweight deployment for inference

❌ Not ideal for:

  • Tasks requiring absolute maximum precision (the teacher model gives slightly better quality)
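
For instance, a minimal semantic-search setup using the library's util.semantic_search helper (corpus and query below are purely illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")

corpus = [
    "Кошка спит на диване.",                      # "The cat is sleeping on the sofa."
    "The weather in London is rainy today.",
    "Как приготовить борщ в домашних условиях.",  # "How to cook borscht at home."
]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode("рецепт борща", convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f'{hit["score"]:.3f}  {corpus[hit["corpus_id"]]}')
```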

βš–οΈ Pros & Cons of Distilled Model

Pros βœ…

  • Smaller and faster than teacher
  • Easy to integrate using sentence-transformers
  • Good cosine similarity to teacher embeddings

Cons ❌

  • Slight drop in precision vs full teacher model
  • Performance depends on training coverage and languages

📚 Training Data

  • Volume: 730,000 sentences (sentence pairs)
  • Data Type: retrieval / semantic (60/40)
  • Language Distribution: Russian / English (80/20)
  • Training Goal: focus on Russian while minimizing quality loss in English
  • Purpose: distill knowledge from the deepvk/USER-bge-m3 teacher into a compact student model

📊 Evaluation

We evaluated alignment with the teacher (deepvk/USER-bge-m3) on validation and test splits for three models: the teacher itself, the distilled student, and the original paraphrase-multilingual-MiniLM-L12-v2.


🔹 TL;DR

  • Our distilled model (Student) reproduces the embedding space of deepvk/USER-bge-m3 (Teacher) with minimal loss.
  • Recall@1: 82% (Student) vs 87% (Teacher).
  • The original MiniLM is completely incompatible with the teacher's space (Recall ≈ 0%).

🔹 Main Metrics

| Split | Model | MSE | Cosine mean | Cosine std | MRR | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|---|---|---|
| Validation | Teacher | 0.000000 | 1.0000 | 0.0000 | 0.9244 | 0.8746 | 0.9851 | 0.9966 |
| Validation | Student | 0.000365 | 0.7896 | 0.0643 | 0.8906 | 0.8248 | 0.9726 | 0.9893 |
| Validation | MiniLM | 0.018173 | -0.0012 | 0.0303 | 0.0003 | 0.0000 | 0.0002 | 0.0002 |
| Validation | Student vs MiniLM | 0.017826 | -0.0019 | 0.0295 | 0.0002 | 0.0000 | 0.0000 | 0.0002 |
| Test | Teacher | 0.000000 | 1.0000 | 0.0000 | 0.9273 | 0.8771 | 0.9908 | 0.9962 |
| Test | Student | 0.000362 | 0.7921 | 0.0556 | 0.8832 | 0.8107 | 0.9763 | 0.9895 |
| Test | MiniLM | 0.015902 | -0.0033 | 0.0302 | 0.0003 | 0.0000 | 0.0002 | 0.0002 |
| Test | Student vs MiniLM | 0.015553 | -0.0037 | 0.0295 | 0.0002 | 0.0000 | 0.0001 | 0.0002 |
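
These metrics can be approximated with a short script along the lines below. The evaluation sentences are not published with this card, and the exact Recall@k protocol is an assumption: here queries are encoded with the candidate model and retrieved against teacher-encoded passages, which is one plausible reading of the cross-model comparison.

```python
# Hedged sketch: teacher-alignment metrics (MSE, mean cosine) plus a
# cross-model Recall@1 where candidate queries search teacher-encoded passages.
import numpy as np
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("deepvk/USER-bge-m3")
candidate = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")

# Replace with your own held-out (query, passage) pairs.
queries = ["Как приготовить борщ?", "What is the capital of France?"]
passages = ["Пошаговый рецепт борща.", "Paris is the capital of France."]

q_teacher = teacher.encode(queries, normalize_embeddings=True)
q_candidate = candidate.encode(queries, normalize_embeddings=True)
p_teacher = teacher.encode(passages, normalize_embeddings=True)

mse = float(np.mean((q_teacher - q_candidate) ** 2))
mean_cos = float((q_teacher * q_candidate).sum(axis=1).mean())

sims = q_candidate @ p_teacher.T  # candidate queries vs teacher passages
recall_at_1 = float((sims.argmax(axis=1) == np.arange(len(queries))).mean())

print(f"MSE={mse:.6f}  cosine={mean_cos:.4f}  Recall@1={recall_at_1:.4f}")
```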

🔹 Conclusions

  • Student ≈ Teacher → the distilled model successfully learned the teacher's embedding space.
  • MiniLM ≠ Teacher → the original MiniLM is not aligned with the teacher.
  • Student vs MiniLM ≈ 0 → the student learned a new embedding space rather than simply copying MiniLM.
  • Validation ≈ Test → stable results, no overfitting.

📂 Files

  • USER-BGE-M3-MiniLM-L12-v2-Distilled: the trained student model directory, containing all weights and the full architecture, including the custom 2_Dense projection layer used for distillation
  • tokenizer.json, config.json: tokenizer and model configuration
  • Internal folders such as 2_Dense contain layers specific to the distillation setup and are handled automatically by sentence-transformers

🧩 Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("skatzR/USER-BGE-M3-MiniLM-L12-v2-Distilled")
embeddings = model.encode(["Hello world", "Привет мир"])
print(embeddings.shape)  # (2, 1024)
```