Multimodal - MiniCPM-o 2.6 is a new SOTA any-to-any model by OpenBMB (vision, speech and text!) - VideoChat-Flash-Qwen2.5-2B is one of the new video multimodal models by OpenGVLab, which come in 2B & 7B sizes and 224 & 448 resolutions - ByteDance released a larger SA2VA with 26B parameters - Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance
💬 LLMs - MiniMax-Text-01 is a huge new language model (456B total params, 45.9B active) by MiniMaxAI with a context length of 4M tokens 🤯 - Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B - kyutai released Helium-1-Preview-2B, a new small multilingual LM - Wayfarer-12B is a new LLM that can write D&D-style adventures 🧙🏻‍♂️ - ReaderLM-v2 is a new HTML parsing model by Jina AI - Dria released Dria-Agent-α-3B, a new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder - Unsloth released faster, more memory-efficient versions of Phi-4 and Llama 3.3
🖼️ Vision - MatchAnything is a new foundation model for image matching - FitDiT is a high-fidelity virtual try-on (VTON) model based on the DiT architecture
🗣️ Audio - OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities
Retrieval - lightblue released LB-reranker-0.5B-v1.0, a new reranker based on Qwen2.5 that can handle 95+ languages - cde-small-v2 is a new SOTA small retrieval model by @jxm
🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models, with training scripts, datasets, and metrics.
We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. for classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
- my training scripts, using the Sentence Transformers library
- my Weights & Biases reports with losses & metrics
- my list of 30 training and 13 evaluation datasets
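For a sense of what such a training script looks like, here is a minimal sketch using the public Sentence Transformers APIs; the tokenizer, dataset, and Matryoshka dimensions below are illustrative assumptions, not the exact released recipe:

```python
# Minimal sketch of training a static embedding model with Sentence Transformers.
# The tokenizer, dataset and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from tokenizers import Tokenizer
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.models import StaticEmbedding

# A StaticEmbedding module is a token-embedding lookup + mean pooling: no attention.
static = StaticEmbedding(
    Tokenizer.from_pretrained("google-bert/bert-base-uncased"), embedding_dim=1024
)
model = SentenceTransformer(modules=[static])

# Any (anchor, positive) pair dataset works; this one is only an example.
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")

# In-batch negatives loss, wrapped in MatryoshkaLoss so truncated embeddings remain useful.
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[1024, 512, 256, 128, 64, 32],
)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save("static-embedding-sketch")
```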
The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107,500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: no Transformer blocks, no attention, not even a matrix multiplication. Super speed!
- No maximum sequence length! Embed texts of any length (note: longer texts may embed worse)
- Linear instead of quadratic complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
🪆 Matryoshka support: lets you truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)
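A minimal usage sketch (the model id below is assumed to be the English retrieval model from this release; swap it for the actual repository if it differs), with Matryoshka truncation to 256 dimensions via truncate_dim:

```python
from sentence_transformers import SentenceTransformer

# Assumed model id for the released English retrieval model; adjust if needed.
model = SentenceTransformer(
    "sentence-transformers/static-retrieval-mrl-en-v1", device="cpu", truncate_dim=256
)
embeddings = model.encode([
    "How do static embedding models work?",
    "They are a token-embedding lookup plus mean pooling, with no attention.",
])
print(embeddings.shape)  # (2, 256) thanks to Matryoshka truncation
```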
Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings
The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.
A new benchmark (DPAB-α) has been released that evaluates LLM function calling using both Pythonic and JSON approaches.
It shows that Pythonic function calling often outperforms traditional JSON-based methods, especially for complex multi-step tasks.
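To make the difference concrete, here is an illustrative comparison of the two output styles for the same two-step task (the tool names get_weather and send_message are hypothetical, not taken from DPAB-α):

```python
# Pythonic function calling: the model emits executable code, so step 2 can
# consume the result of step 1 directly, within a single model turn.
pythonic_call = """
temp = get_weather(city="Berlin")["temperature_c"]
send_message(to="alice", body=f"It is {temp}°C in Berlin right now.")
"""

# JSON function calling: each call is a standalone structured object, so the
# model typically needs an extra turn to see the first tool's result before
# it can fill in the arguments of the second call.
json_calls = [
    {"name": "get_weather", "arguments": {"city": "Berlin"}},
    # ...tool result returned to the model here...
    {"name": "send_message",
     "arguments": {"to": "alice", "body": "It is 7°C in Berlin right now."}},
]
```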
Key findings from benchmarks:
- Claude 3.5 Sonnet leads with 87% on Pythonic vs 45% on JSON
- Smaller models show impressive results (Dria-Agent-α-3B: 72% Pythonic)
- Even larger models like DeepSeek V3 (685B) show significant gaps (63% Pythonic vs 33% JSON)
If you're building or using LLM agents, these results suggest that how you implement function calling could impact performance - it might be worth reconsidering JSON-only approaches.
✨ MiniMax-Text-01: - 456B parameters with 45.9B activated per token - Combines Lightning Attention, Softmax Attention, and MoE for optimal performance - Training context up to 1M tokens, inference handles 4M tokens
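Lightning Attention belongs to the linear-attention family; the sketch below is not MiniMax's kernel, just a generic, non-causal illustration (with an assumed elu+1 feature map) of why linear attention can scale to million-token contexts while standard softmax attention has quadratic cost:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # Linear attention: compute phi(Q) @ (phi(K)^T V); memory and compute grow
    # linearly with sequence length. Non-causal, illustrative only.
    q, k = F.elu(q) + 1, F.elu(k) + 1             # simple non-negative feature map
    kv = torch.einsum("bnd,bne->bde", k, v)        # (dim x dim) summary, O(n)
    z = 1 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

def softmax_attention(q, k, v):
    # Standard attention: the (seq_len x seq_len) score matrix makes this O(n^2).
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

q = k = v = torch.randn(1, 1024, 64)
print(linear_attention(q, k, v).shape)   # torch.Size([1, 1024, 64])
print(softmax_attention(q, k, v).shape)  # torch.Size([1, 1024, 64])
```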
✨ MiniMax-VL-01: - ViT-MLP-LLM framework (non-transformer) - Handles image inputs from 336×336 to 2016×2016 - 694M image-caption pairs + 512B tokens processed across 4 stages