Tom Aarsen
tomaarsen
AI & ML interests
NLP: text embeddings, information retrieval, named entity recognition, few-shot text classification
Recent Activity
liked a Space about 24 hours ago: Foaster/Werewolf_benchmark
liked a model 1 day ago: asmud/indonesian-embedding-small
replied to s-emanuilov's post 1 day ago:
Ran an MTEB evaluation on Bulgarian tasks comparing EmbeddingGemma-300M (https://huggingface.co/google/embeddinggemma-300m) vs Multilingual-E5-Large (https://huggingface.co/intfloat/multilingual-e5-large).
EmbeddingGemma-300M scored 71.6% on average, while E5-Large got 75.9%. Pretty solid results for EmbeddingGemma considering it's roughly half the size and uses far fewer resources.
EmbeddingGemma actually beats E5-Large on sentiment analysis and natural language inference. E5-Large wins on retrieval and bitext mining tasks.
The 300M model also has a 4x longer context window (2048 vs 512 tokens) and a lower carbon footprint, which is a nice bonus.
Both models work well for Bulgarian, but they have different strengths depending on what you need.
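For reference, here is a minimal sketch of how a comparison like this could be reproduced with the `mteb` library. The task selection and output folder are my own illustrative choices, not necessarily the exact setup used for the numbers above:

```python
# Sketch: evaluate an embedding model on Bulgarian MTEB tasks.
# Task filter and output folder are illustrative assumptions.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# Select MTEB tasks that cover Bulgarian ("bul" in ISO 639-3).
tasks = mteb.get_tasks(languages=["bul"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/embeddinggemma-300m")

# Each result holds the per-task scores; the averages quoted above
# would be computed across these tasks.
for result in results:
    print(result.task_name, result.scores)
```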
Blog article about the usage: https://huggingface.co/blog/embeddinggemma
PS: Don't forget to use the recommended library versions :D
```
pip install git+https://github.com/huggingface/[email protected]
pip install "sentence-transformers>=5.0.0"
```
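Once those versions are installed, a quick sketch of loading the model with Sentence Transformers (the Bulgarian example sentences are mine, just for illustration):

```python
from sentence_transformers import SentenceTransformer

# Load google/embeddinggemma-300m; requires the transformers and
# sentence-transformers versions recommended above.
model = SentenceTransformer("google/embeddinggemma-300m")

# Illustrative Bulgarian sentences (assumed example, not from the post).
sentences = [
    "Каква е прогнозата за времето утре?",  # "What is the weather forecast for tomorrow?"
    "Утре се очаква слънчево време.",       # "Sunny weather is expected tomorrow."
]

embeddings = model.encode(sentences)
print(embeddings.shape)

# Similarity between the two sentence embeddings.
print(model.similarity(embeddings[0], embeddings[1]))
```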