Rapido NER + Entity Embedding
Rapido NER is a strongly multilingual named-entity recogniser and entity embedding model, named in memory of Rapido, the cat who used to smooth out the loss spikes of my training runs.
The model wraps an encoder, attention-based mention pooling, a per-type projection layer for downstream entity retrieval, and CRF decoding for sequence predictions.
Model context length: 4096 tokens.
This model aims to address the following common issues in NER:
- Strong multilingual NER performance: Robust NER model that can generalise across many languages and domains, leveraging both high-quality annotated/synthetic datasets and weakly supervised linked mentions.
- Entity clustering and retrieval: Obtain meaningful entity representations that can be used for clustering, linking, or retrieval tasks, especially in a multilingual setting. Each entity is projected to a 768-dimensional L2-normalised vector space, with type-aware conditioning to separate coarse types (PER/ORG/LOC); see the retrieval sketch after this list.
- Within-document clustering: Cluster mentions of the same entity within a single document across languages (e.g., "Cologne" and "Köln").
- Long context handling: Most NER models are limited to 512 tokens, which can be insufficient for documents with multiple entities or complex structures. This model was trained with a context of 4096 tokens.
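As a minimal illustration of the retrieval use case, nearest-neighbour lookup over the L2-normalised 768-dimensional vectors reduces to a dot product. The vectors below are random placeholders standing in for mention embeddings produced as in the Usage section further down:

```python
import torch
import torch.nn.functional as F

# Placeholder inputs: one query mention vector and a small index of candidate
# entity vectors. In practice these would be the L2-normalised 768-d embeddings
# produced by the model (see the Usage section below).
query = F.normalize(torch.randn(768), dim=-1)
index = F.normalize(torch.randn(1000, 768), dim=-1)

# Cosine similarity of L2-normalised vectors is just a dot product.
scores = index @ query
top_scores, top_idx = scores.topk(5)
print(list(zip(top_idx.tolist(), [round(s, 3) for s in top_scores.tolist()])))
```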
Want to quickly check the model's performance? Use the space: https://huggingface.co/spaces/pierre-tassel/rapido-ner-space
Model Overview
- Architecture: Fully finetuned MLM encoder backbone (Alibaba-NLP/gte-multilingual-mlm-base) + token-classification head + attention pooling + per entity-type projection head + CRF
- Objectives: Span-level NER (BIO) and contrastive entity embedding
- Output: Token logits, CRF decode, pooled span embeddings
- Modalities: Text only
- License: Apache 2.0
Key Features
- Entity embeddings out of the box: pass a `mention_mask` to obtain L2-normalised 768-dimensional span vectors suitable for clustering, retrieval, and linking tasks.
- Type-aware projection: FiLM conditioning aligns entity representations with the coarse types (PER/ORG/LOC) learned during training (a rough sketch follows this list).
- CRF decoding: a first-order CRF with BIO constraints for sharper sequence predictions.
- Multilingual coverage: the training corpus spans 55 languages, with linked-entity supervision.
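To make the type-aware projection concrete, here is a rough sketch of FiLM-style conditioning. The module and parameter names are hypothetical and do not correspond to the checkpoint's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TypeConditionedProjection(nn.Module):
    """Illustrative FiLM conditioning: a coarse-type embedding produces a
    per-dimension scale and shift that modulates the pooled mention vector
    before the final projection and L2 normalisation."""

    def __init__(self, hidden_dim: int = 768, proj_dim: int = 768, num_types: int = 3):
        super().__init__()
        self.type_embed = nn.Embedding(num_types, hidden_dim)  # PER / ORG / LOC
        self.film = nn.Linear(hidden_dim, 2 * hidden_dim)      # -> (gamma, beta)
        self.proj = nn.Linear(hidden_dim, proj_dim)

    def forward(self, pooled: torch.Tensor, type_ids: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(self.type_embed(type_ids)).chunk(2, dim=-1)
        conditioned = gamma * pooled + beta                     # FiLM modulation
        return F.normalize(self.proj(conditioned), dim=-1)
```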
Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "pierre-tassel/rapido-ner-entity"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

texts = [
    "I'm traveling to Cologne next week.",
    "Ich fahre nächste Woche nach Köln.",
    "Je vais à Cologne la semaine prochaine.",
    "Apple opened a lab in Paris.",
    "Microsoft acquired a startup in Berlin.",
    # one document with multiple mentions to show within-doc clustering (two Cologne LOC)
    "The Cologne office met with Köln University and Microsoft in Cologne.",
]

enc = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
input_ids = enc["input_ids"]
attention_mask = enc["attention_mask"].bool()

with torch.no_grad():
    hidden = model.encode_tokens(input_ids, attention_mask)
    logits = model.ner_head(hidden)
    decoded = model.crf_decode(logits, attention_mask)

id2label = {int(k): v for k, v in model.config.id2label.items()}


def spans_from_bio(tag_ids: list[int]) -> list[tuple[int, int, str]]:
    spans = []
    start, cur_type = None, None
    for idx, tag_id in enumerate(tag_ids):
        tag = id2label[tag_id]
        if tag == "O":
            if start is not None:
                spans.append((start, idx, cur_type))
                start, cur_type = None, None
            continue
        prefix, etype = tag.split("-", 1) if "-" in tag else ("B", tag)
        if prefix == "B" or etype != cur_type:
            if start is not None:
                spans.append((start, idx, cur_type))
            start, cur_type = idx, etype
    if start is not None:
        spans.append((start, len(tag_ids), cur_type))
    return spans


results = []
for i, text in enumerate(texts):
    seq_len = int(attention_mask[i].sum().item())
    tokens = tokenizer.convert_ids_to_tokens(input_ids[i, :seq_len])
    tag_ids = decoded[i][:seq_len]
    spans = spans_from_bio(tag_ids)

    # Print predictions per text
    tag_labels = [id2label[t] for t in tag_ids]
    print(f"\nPrediction for text {i + 1}: {text}")
    print("Tokens with tags:")
    print(list(zip(tokens, tag_labels)))  # e.g., ... ('▁Colo', 'B-LOC'), ('gne', 'I-LOC'), ...
    if spans:
        ents_readable = [
            f"{tokenizer.decode(input_ids[i, s:e]).strip()} [{typ}]"
            for (s, e, typ) in spans
        ]
        print("Entities:", ents_readable)  # e.g., ['Apple [ORG]', 'Paris [LOC]']
    else:
        print("Entities: []")

    mention_vectors = []
    mention_surfaces = []
    mention_types = []
    if spans:
        mention_mask = torch.zeros(
            (1, len(spans), input_ids.size(1)), dtype=torch.bool
        )
        for m, (s, e, _) in enumerate(spans):
            mention_mask[0, m, s:e] = True
        with torch.no_grad():
            pooled = model.encode_mentions_with_attention(
                hidden[i : i + 1], mention_mask
            )
            projected = model.project_mentions(pooled).squeeze(0)
        mention_vectors = F.normalize(projected, dim=-1)
        for (s, e, typ) in spans:
            surface = tokenizer.decode(input_ids[i, s:e])
            mention_surfaces.append(surface.strip())
            mention_types.append(typ)

    results.append(
        {
            "text": text,
            "tokens": tokens,
            "tags": tag_labels,
            "spans": spans,
            "mention_surfaces": mention_surfaces,
            "mention_types": mention_types,
            "embeddings": mention_vectors,  # tensor or []
        }
    )

# Global pairwise similarities
all_embeds = []
all_labels = []
for r in results:
    for surface, typ, emb in zip(
        r["mention_surfaces"], r["mention_types"], r["embeddings"]
    ):
        all_embeds.append(emb)
        all_labels.append(f"{surface} [{typ}]")

if all_embeds:
    all_embeds = torch.stack(all_embeds)
    sim = all_embeds @ all_embeds.T
    # e.g., Cologne [LOC] vs Köln [LOC] ≈ 0.818; Apple [ORG] vs Microsoft [ORG] ≈ 0.400;
    # Paris [LOC] vs Berlin [LOC] ≈ 0.491
    print("\nGlobal cosine similarities:")
    for i, li in enumerate(all_labels):
        for j, lj in enumerate(all_labels):
            print(f"{li:20s} vs {lj:20s}: {sim[i, j]:.3f}")
        print()

# Within-document similarity matrices
# e.g., Doc 6: Cologne vs Cologne = 0.986
print("\nPer-document similarity (only if >=2 entities predicted):")
for r in results:
    embs = r["embeddings"]
    if len(embs) < 2:
        continue
    sim_doc = embs @ embs.T
    print(f"\nText: {r['text']}")
    for i, label_i in enumerate(r["mention_surfaces"]):
        for j, label_j in enumerate(r["mention_surfaces"]):
            print(f"  {label_i:12s} vs {label_j:12s}: {sim_doc[i, j]:.3f}")
```
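Building on the `all_embeds` and `all_labels` produced by the script above, here is a minimal clustering sketch using scikit-learn's AgglomerativeClustering. The 0.3 cosine-distance threshold is an assumption and should be tuned per domain:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # requires scikit-learn >= 1.2 for `metric`

if len(all_labels) >= 2:
    X = all_embeds.cpu().numpy()
    clustering = AgglomerativeClustering(
        n_clusters=None,
        metric="cosine",
        linkage="average",
        distance_threshold=0.3,  # assumption: cosine similarity >= 0.7 within a cluster
    ).fit(X)
    for cluster_id in np.unique(clustering.labels_):
        members = [all_labels[k] for k in np.where(clustering.labels_ == cluster_id)[0]]
        print(f"Cluster {cluster_id}: {members}")
```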
Evaluation Summary
Dataset | Split | Precision | Recall | F1 | Support |
---|---|---|---|---|---|
CoNLL 2003 (en) | test | 0.9111 | 0.9551 | 0.9326 | 4,946 |
CoNLL 2002 (es) | test | 0.7515 | 0.8568 | 0.8007 | 3,219 |
GermEval 2014 (de) | test | 0.8290 | 0.8663 | 0.8473 | 3,157 |
GermEval 2014 per-type (test)
- LOC: P 85.21%, R 88.25%, F1 86.70% (support 1,123)
- ORG: P 70.40%, R 75.46%, F1 72.84% (support 933)
- PER: P 91.55%, R 94.46%, F1 92.99% (support 1,101)
- Token accuracy: 98.24%
- Macro F1: 84.73%
Note: The released checkpoint predicts the coarse types {PER, ORG, LOC}. Historical "MISC" mentions are mapped to `O` during decoding; evaluation scripts should ignore that label when computing metrics.
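For example, when scoring against CoNLL-style gold data with seqeval, one simple approach is to remap MISC to O on the gold side before computing metrics. The gold/pred sequences below are placeholders for your own tag lists:

```python
from seqeval.metrics import classification_report

def drop_misc(tags: list[str]) -> list[str]:
    # Map B-MISC / I-MISC to "O" so the coarse PER/ORG/LOC checkpoint
    # is not penalised for a type it never predicts.
    return ["O" if tag.endswith("MISC") else tag for tag in tags]

# Placeholder BIO tag sequences, one list per sentence.
gold_sequences = [["B-PER", "I-PER", "O", "B-MISC"]]
pred_sequences = [["B-PER", "I-PER", "O", "O"]]

gold = [drop_misc(seq) for seq in gold_sequences]
print(classification_report(gold, pred_sequences, digits=4))
```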
Dataset Statistics
- Documents: 183,095
- Mentions: 473,054 (66.61% of docs contain at least one mention)
- Linked mentions: 45.66%
- Avg. mentions per document: 2.584
- Languages covered: 55
Language Distribution
Language | Docs |
---|---|
en | 68,809 |
zh | 12,926 |
pt | 12,258 |
sk | 11,361 |
hr | 9,842 |
sv | 8,133 |
da | 6,537 |
sr | 5,192 |
de | 3,080 |
fr | 2,584 |
ru | 2,448 |
es | 2,413 |
it | 1,827 |
ja | 1,579 |
ko | 1,454 |
nl | 1,317 |
ar | 1,290 |
pl | 1,228 |
tl | 1,192 |
cs | 1,161 |
tr | 1,156 |
no | 1,115 |
uk | 1,100 |
fi | 1,062 |
vi | 1,018 |
ro | 1,017 |
id | 977 |
lv | 950 |
ms | 928 |
el | 905 |
bg | 901 |
bn | 893 |
fa | 876 |
pa | 845 |
ta | 845 |
th | 834 |
sl | 832 |
hu | 829 |
he | 822 |
te | 791 |
hi | 780 |
mr | 774 |
lt | 765 |
ur | 745 |
et | 734 |
ml | 712 |
gu | 629 |
ca | 554 |
jv | 534 |
sw | 465 |
my | 450 |
az | 436 |
ceb | 188 |
Corpus Splits
- Train: 151,754 documents
- Validation: 11,911 documents
- Test: 19,430 documents
Mention Type Counts
- LOC: 166,358
- ORG: 124,232
- PER: 112,803
Intended Use & Limitations
- Intended: General-purpose NER, multilingual entity clustering, retrieval-augmented generation with type-aware filtering.
- Not intended: Fine-grained entity typing beyond PER/ORG/LOC, or high-stakes decision-making without human oversight.
- Bias considerations: Training sources blend news, encyclopaedia content, and linked corpora. Regional under-representation (e.g., of low-resource languages) can lead to uneven recall. Linked mentions cover less than half of the spans, so entity embedding performance varies by domain. As with any machine learning model, it can make mistakes; benchmark it on your own domain before deployment.
Training Details
Training followed a two-stage curriculum (NER-only warmup, then joint NER + contrastive objectives) on a single H200 machine.
The training code is closed-source and not intended for public release.
Commercial Usage
You are allowed to use the model for commercial purposes. If this model proves useful in your business, I would like you to donate to the animal shelter of your choice. While appreciated, donations are not a condition of using the model; they are simply a suggestion. If you choose to donate, you can contact me; it will make me very happy.
Acknowledgements
Dedicated to Rapido, my cat, whose kindness, affectionate company, and lightning-fast speed inspired this model's name. He was loved and loving, a wonderful companion who will be deeply missed. Rapido loved to work with me, often sitting on my lap or beside me as I coded. He had a knack for sensing when I needed a break. His presence during long coding sessions was a source of comfort and joy. Rapido passed away in August 2025, but his memory will stay with our family forever.
Contact
Pierre Tassel: pierre.tassel[at]outlook.fr