Tom Aarsen
tomaarsen
AI & ML interests
NLP: text embeddings, information retrieval, named entity recognition, few-shot text classification
Recent Activity
liked a Space about 24 hours ago: Foaster/Werewolf_benchmark
liked a model 1 day ago: asmud/indonesian-embedding-small
replied to s-emanuilov's post 1 day ago:
Ran MTEB evaluation on Bulgarian tasks comparing EmbeddingGemma-300M (https://huggingface.co/google/embeddinggemma-300m) vs Multilingual-E5-Large (https://huggingface.co/intfloat/multilingual-e5-large).
EmbeddingGemma-300M scored 71.6% on average while E5-Large got 75.9%. Pretty solid results for EmbeddingGemma considering it's half the size and uses way fewer resources.
EmbeddingGemma actually beats E5-Large on sentiment analysis and natural language inference. E5-Large wins on retrieval and bitext mining tasks.
The 300M model has a 4x longer context window (2048 vs 512 tokens) and a lower carbon footprint, which is good.
Both models work great for Bulgarian but have different strengths depending on what you need.
Blog article about the usage: https://huggingface.co/blog/embeddinggemma
PS: Don't forget to use the recommended library versions :D
```
pip install git+https://github.com/huggingface/[email protected]
pip install "sentence-transformers>=5.0.0"
```
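As a quick sanity check of the figures quoted in the post, the arithmetic can be sketched in plain Python. The dictionary layout and the helper names `score_gap` and `context_ratio` are illustrative assumptions, not part of MTEB or any library; the numbers are the ones reported in the post.

```python
# Figures as quoted in the post: MTEB average on Bulgarian tasks (percent)
# and maximum context length (tokens). The structure is hypothetical.
reported = {
    "EmbeddingGemma-300M": {"avg_score": 71.6, "max_tokens": 2048},
    "Multilingual-E5-Large": {"avg_score": 75.9, "max_tokens": 512},
}


def score_gap(model_a: str, model_b: str) -> float:
    """Difference in average score (percentage points), model_a minus model_b."""
    diff = reported[model_a]["avg_score"] - reported[model_b]["avg_score"]
    return round(diff, 1)  # round away floating-point noise


def context_ratio(model_a: str, model_b: str) -> float:
    """How many times longer model_a's context window is than model_b's."""
    return reported[model_a]["max_tokens"] / reported[model_b]["max_tokens"]


gap = score_gap("Multilingual-E5-Large", "EmbeddingGemma-300M")
ratio = context_ratio("EmbeddingGemma-300M", "Multilingual-E5-Large")
print(f"E5-Large leads by {gap} points; EmbeddingGemma context is {ratio:.0f}x longer")
```

This only restates the reported numbers: a gap of a few points in favor of E5-Large on the average, against a fourfold longer context window for EmbeddingGemma.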
Training with Prompts
See the Training with Prompts documentation for more details: https://sbert.net/examples/training/prompts/README.html
Reranker Models for MS MARCO
State-of-the-Art NER models - Biomedical domain
State-of-the-Art NER models - Keyphrases
State-of-the-Art NER models - Organizations
- nbroad/span-marker-roberta-large-orgs-v1: Token Classification • 0.4B • Updated • 3 • 2
- tomaarsen/span-marker-bert-base-orgs: Token Classification • Updated • 10 • 1
- nbroad/span-marker-xdistil-l12-h384-orgs-v3: Token Classification • 0.0B • Updated • 1
- tomaarsen/span-marker-bert-small-orgs: Token Classification • Updated • 4
SetFit models
Reranker Models for GooAQ
https://huggingface.co/blog/train-reranker
- tomaarsen/reranker-ModernBERT-large-gooaq-bce: Text Ranking • 0.4B • Updated • 1.3k • 8
- tomaarsen/reranker-NeoBERT-gooaq-bce: Text Ranking • 0.2B • Updated • 6 • 2
- tomaarsen/reranker-ModernBERT-base-gooaq-bce: Text Ranking • 0.1B • Updated • 1.52k • 2
- tomaarsen/reranker-MiniLM-L12-gooaq-bce: Text Ranking • 0.0B • Updated • 16 • 1
Matryoshka Embedding Models
https://huggingface.co/blog/matryoshka
- BEE-spoke-data/bert-plus-L8-v1.0-syntheticSTS-4k: Sentence Similarity • 0.1B • Updated • 6 • 5
- aspire/acge_text_embedding: Sentence Similarity • 0.3B • Updated • 1.14k • 147
- dunzhang/stella-mrl-large-zh-v3.5-1792d: Sentence Similarity • 0.3B • Updated • 1.01k • 51
- NeuML/pubmedbert-base-embeddings-matryoshka: Sentence Similarity • 0.1B • Updated • 1.94k • 23
State-of-the-Art NER models - General purpose
- tomaarsen/span-marker-bert-base-fewnerd-fine-super: Token Classification • 0.1B • Updated • 1.89k • 13
- tomaarsen/span-marker-roberta-large-fewnerd-fine-super: Token Classification • 0.4B • Updated • 27 • 14
- tomaarsen/span-marker-mbert-base-multinerd: Token Classification • 0.2B • Updated • 819 • 64
- tomaarsen/span-marker-roberta-large-ontonotes5: Token Classification • 0.4B • Updated • 6.06k • 13
State-of-the-Art NER models - Acronyms
State-of-the-Art NER models - Tagalog
SpanMarker NER Models
SpanMarker NER models for various domains
SetFitABSA models
- tomaarsen/setfit-absa-bge-small-en-v1.5-restaurants-aspect: Text Classification • Updated • 227 • 4
- tomaarsen/setfit-absa-bge-small-en-v1.5-restaurants-polarity: Text Classification • Updated • 222
- tomaarsen/setfit-absa-paraphrase-mpnet-base-v2-restaurants-aspect: Text Classification • 0.1B • Updated • 30 • 1
- tomaarsen/setfit-absa-paraphrase-mpnet-base-v2-restaurants-polarity: Text Classification • 0.1B • Updated • 19
Qwen3 Rerankers converted to Sequence Classification