Rapido NER + Entity Embedding
Rapido NER is a strongly multilingual named-entity recogniser and entity embedding model, named in memory of Rapido, the cat who used to smooth out the loss spikes of my training runs.
The model wraps an encoder, attention-based mention pooling, a per-type projection layer for downstream entity retrieval, and CRF decoding for sequence predictions.
Model context length: 4096 tokens.
This model aims to address the following common issues in NER:
- Strong multilingual NER performance: Robust NER model that can generalise across many languages and domains, leveraging both high-quality annotated/synthetic datasets and weakly supervised linked mentions.
- Entity clustering and retrieval: Obtain meaningful entity representations that can be used for clustering, linking, or retrieval tasks, especially in a multilingual setting. Each entity is projected to a 768-dimensional L2-normalised vector space, with type-aware conditioning to separate coarse types (PER/ORG/LOC); see the retrieval sketch after this list.
- Within-document clustering: Cluster mentions of the same entity within a single document across languages (e.g., "Cologne" and "Köln").
- Long context handling: Most NER models are limited to 512 tokens, which can be insufficient for documents with multiple entities or complex structures. This model was trained with a context of 4096 tokens.
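As a minimal illustration of the retrieval use case, nearest-neighbour lookup over the L2-normalised 768-dimensional vectors reduces to a dot product. The vectors below are random placeholders standing in for mention embeddings produced as in the Usage section further down:

```python
import torch
import torch.nn.functional as F

# Placeholder inputs: one query mention vector and a small index of candidate
# entity vectors. In practice these would be the L2-normalised 768-d embeddings
# produced by the model (see the Usage section below).
query = F.normalize(torch.randn(768), dim=-1)
index = F.normalize(torch.randn(1000, 768), dim=-1)

# Cosine similarity of L2-normalised vectors is just a dot product.
scores = index @ query
top_scores, top_idx = scores.topk(5)
print(list(zip(top_idx.tolist(), [round(s, 3) for s in top_scores.tolist()])))
```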
Want to quickly check the model's performance? Use the space: https://huggingface.co/spaces/pierre-tassel/rapido-ner-space
Model Overview
- Architecture: Fully finetuned MLM encoder backbone (Alibaba-NLP/gte-multilingual-mlm-base) + token-classification head + attention pooling + per entity-type projection head + CRF
- Objectives: Span-level NER (BIO) and contrastive entity embedding
- Output: Token logits, CRF decode, pooled span embeddings
- Modalities: Text only
- License: Apache 2.0
Key Features
- Entity embeddings out of the box: pass a `mention_mask` to obtain L2-normalised 768-dimensional span vectors suitable for clustering, retrieval, and linking tasks.
- Type-aware projection: FiLM conditioning aligns entity representations with the coarse types (PER/ORG/LOC) learned during training (a rough sketch follows this list).
- CRF decoding: a first-order CRF with BIO constraints for sharper sequence predictions.
- Multilingual coverage: the training corpus spans 55 languages, with linked-entity supervision.
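To make the type-aware projection concrete, here is a rough sketch of FiLM-style conditioning. The module and parameter names are hypothetical and do not correspond to the checkpoint's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TypeConditionedProjection(nn.Module):
    """Illustrative FiLM conditioning: a coarse-type embedding produces a
    per-dimension scale and shift that modulates the pooled mention vector
    before the final projection and L2 normalisation."""

    def __init__(self, hidden_dim: int = 768, proj_dim: int = 768, num_types: int = 3):
        super().__init__()
        self.type_embed = nn.Embedding(num_types, hidden_dim)  # PER / ORG / LOC
        self.film = nn.Linear(hidden_dim, 2 * hidden_dim)      # -> (gamma, beta)
        self.proj = nn.Linear(hidden_dim, proj_dim)

    def forward(self, pooled: torch.Tensor, type_ids: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(self.type_embed(type_ids)).chunk(2, dim=-1)
        conditioned = gamma * pooled + beta                     # FiLM modulation
        return F.normalize(self.proj(conditioned), dim=-1)
```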
Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "pierre-tassel/rapido-ner-entity"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

texts = [
    "I'm traveling to Cologne next week.",
    "Ich fahre nächste Woche nach Köln.",
    "Je vais à Cologne la semaine prochaine.",
    "Apple opened a lab in Paris.",
    "Microsoft acquired a startup in Berlin.",
    # one document with multiple mentions to show within-doc clustering (two Cologne LOC)
    "The Cologne office met with Köln University and Microsoft in Cologne.",
]

enc = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
input_ids = enc["input_ids"]
attention_mask = enc["attention_mask"].bool()

with torch.no_grad():
    hidden = model.encode_tokens(input_ids, attention_mask)
    logits = model.ner_head(hidden)
    decoded = model.crf_decode(logits, attention_mask)

id2label = {int(k): v for k, v in model.config.id2label.items()}


def spans_from_bio(tag_ids: list[int]) -> list[tuple[int, int, str]]:
    spans = []
    start, cur_type = None, None
    for idx, tag_id in enumerate(tag_ids):
        tag = id2label[tag_id]
        if tag == "O":
            if start is not None:
                spans.append((start, idx, cur_type))
                start, cur_type = None, None
            continue
        prefix, etype = tag.split("-", 1) if "-" in tag else ("B", tag)
        if prefix == "B" or etype != cur_type:
            if start is not None:
                spans.append((start, idx, cur_type))
            start, cur_type = idx, etype
    if start is not None:
        spans.append((start, len(tag_ids), cur_type))
    return spans


results = []
for i, text in enumerate(texts):
    seq_len = int(attention_mask[i].sum().item())
    tokens = tokenizer.convert_ids_to_tokens(input_ids[i, :seq_len])
    tag_ids = decoded[i][:seq_len]
    spans = spans_from_bio(tag_ids)

    # Print predictions per text
    tag_labels = [id2label[t] for t in tag_ids]
    print(f"\nPrediction for text {i + 1}: {text}")
    print("Tokens with tags:")
    print(list(zip(tokens, tag_labels)))  # e.g., ... ('▁Colo', 'B-LOC'), ('gne', 'I-LOC'), ...
    if spans:
        ents_readable = [
            f"{tokenizer.decode(input_ids[i, s:e]).strip()} [{typ}]"
            for (s, e, typ) in spans
        ]
        print("Entities:", ents_readable)  # e.g., ['Apple [ORG]', 'Paris [LOC]']
    else:
        print("Entities: []")

    mention_vectors = []
    mention_surfaces = []
    mention_types = []
    if spans:
        mention_mask = torch.zeros(
            (1, len(spans), input_ids.size(1)), dtype=torch.bool
        )
        for m, (s, e, _) in enumerate(spans):
            mention_mask[0, m, s:e] = True
        with torch.no_grad():
            pooled = model.encode_mentions_with_attention(
                hidden[i : i + 1], mention_mask
            )
            projected = model.project_mentions(pooled).squeeze(0)
        mention_vectors = F.normalize(projected, dim=-1)
        for (s, e, typ) in spans:
            surface = tokenizer.decode(input_ids[i, s:e])
            mention_surfaces.append(surface.strip())
            mention_types.append(typ)

    results.append(
        {
            "text": text,
            "tokens": tokens,
            "tags": tag_labels,
            "spans": spans,
            "mention_surfaces": mention_surfaces,
            "mention_types": mention_types,
            "embeddings": mention_vectors,  # tensor or []
        }
    )

# Global pairwise similarities
all_embeds = []
all_labels = []
for r in results:
    for surface, typ, emb in zip(
        r["mention_surfaces"], r["mention_types"], r["embeddings"]
    ):
        all_embeds.append(emb)
        all_labels.append(f"{surface} [{typ}]")

if all_embeds:
    all_embeds = torch.stack(all_embeds)
    sim = all_embeds @ all_embeds.T
    # e.g., Cologne [LOC] vs Köln [LOC] ≈ 0.818; Apple [ORG] vs Microsoft [ORG] ≈ 0.400;
    # Paris [LOC] vs Berlin [LOC] ≈ 0.491
    print("\nGlobal cosine similarities:")
    for i, li in enumerate(all_labels):
        for j, lj in enumerate(all_labels):
            print(f"{li:20s} vs {lj:20s}: {sim[i, j]:.3f}")
        print()

# Within-document similarity matrices
# e.g., Doc 6: Cologne vs Cologne = 0.986
print("\nPer-document similarity (only if >=2 entities predicted):")
for r in results:
    embs = r["embeddings"]
    if len(embs) < 2:
        continue
    sim_doc = embs @ embs.T
    print(f"\nText: {r['text']}")
    for i, label_i in enumerate(r["mention_surfaces"]):
        for j, label_j in enumerate(r["mention_surfaces"]):
            print(f"  {label_i:12s} vs {label_j:12s}: {sim_doc[i, j]:.3f}")
```
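Building on the `all_embeds` and `all_labels` produced by the script above, here is a minimal clustering sketch using scikit-learn's AgglomerativeClustering. The 0.3 cosine-distance threshold is an assumption and should be tuned per domain:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # requires scikit-learn >= 1.2 for `metric`

if len(all_labels) >= 2:
    X = all_embeds.cpu().numpy()
    clustering = AgglomerativeClustering(
        n_clusters=None,
        metric="cosine",
        linkage="average",
        distance_threshold=0.3,  # assumption: cosine similarity >= 0.7 within a cluster
    ).fit(X)
    for cluster_id in np.unique(clustering.labels_):
        members = [all_labels[k] for k in np.where(clustering.labels_ == cluster_id)[0]]
        print(f"Cluster {cluster_id}: {members}")
```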
Evaluation Summary
Dataset | Split | Precision | Recall | F1 | Support |
---|---|---|---|---|---|
CoNLL 2003 (en) | test | 0.9111 | 0.9551 | 0.9326 | 4,946 |
CoNLL 2002 (es) | test | 0.7515 | 0.8568 | 0.8007 | 3,219 |
GermEval 2014 (de) | test | 0.8290 | 0.8663 | 0.8473 | 3,157 |
GermEval 2014 per-type (test)
- LOC: P 85.21%, R 88.25%, F1 86.70% (support 1,123)
- ORG: P 70.40%, R 75.46%, F1 72.84% (support 933)
- PER: P 91.55%, R 94.46%, F1 92.99% (support 1,101)
- Token accuracy: 98.24%
- Macro F1: 84.73%
Note: The released checkpoint predicts the coarse types {PER, ORG, LOC}. Historical "MISC" mentions are mapped to `O` during decoding; evaluation scripts should ignore that label when computing metrics.
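For example, when scoring against CoNLL-style gold data with seqeval, one simple approach is to remap MISC to O on the gold side before computing metrics. The gold/pred sequences below are placeholders for your own tag lists:

```python
from seqeval.metrics import classification_report

def drop_misc(tags: list[str]) -> list[str]:
    # Map B-MISC / I-MISC to "O" so the coarse PER/ORG/LOC checkpoint
    # is not penalised for a type it never predicts.
    return ["O" if tag.endswith("MISC") else tag for tag in tags]

# Placeholder BIO tag sequences, one list per sentence.
gold_sequences = [["B-PER", "I-PER", "O", "B-MISC"]]
pred_sequences = [["B-PER", "I-PER", "O", "O"]]

gold = [drop_misc(seq) for seq in gold_sequences]
print(classification_report(gold, pred_sequences, digits=4))
```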
Dataset Statistics
- Documents: 183,095
- Mentions: 473,054 (66.61% of docs contain at least one mention)
- Linked mentions: 45.66%
- Avg. mentions per document: 2.584
- Languages covered: 55
Language Distribution
Language | Docs |
---|---|
en | 68,809 |
zh | 12,926 |
pt | 12,258 |
sk | 11,361 |
hr | 9,842 |
sv | 8,133 |
da | 6,537 |
sr | 5,192 |
de | 3,080 |
fr | 2,584 |
ru | 2,448 |
es | 2,413 |
it | 1,827 |
ja | 1,579 |
ko | 1,454 |
nl | 1,317 |
ar | 1,290 |
pl | 1,228 |
tl | 1,192 |
cs | 1,161 |
tr | 1,156 |
no | 1,115 |
uk | 1,100 |
fi | 1,062 |
vi | 1,018 |
ro | 1,017 |
id | 977 |
lv | 950 |
ms | 928 |
el | 905 |
bg | 901 |
bn | 893 |
fa | 876 |
pa | 845 |
ta | 845 |
th | 834 |
sl | 832 |
hu | 829 |
he | 822 |
te | 791 |
hi | 780 |
mr | 774 |
lt | 765 |
ur | 745 |
et | 734 |
ml | 712 |
gu | 629 |
ca | 554 |
jv | 534 |
sw | 465 |
my | 450 |
az | 436 |
ceb | 188 |
Corpus Splits
- Train: 151,754 documents
- Validation: 11,911 documents
- Test: 19,430 documents
Mention Type Counts
- LOC: 166,358
- ORG: 124,232
- PER: 112,803
Intended Use & Limitations
- Intended: General-purpose NER, multilingual entity clustering, retrieval-augmented generation with type-aware filtering.
- Not intended: Fine-grained entity typing beyond PER/ORG/LOC, or high-stakes decision-making without human oversight.
- Bias considerations: Training sources blend news, encyclopaedia content, and linked corpora. Regional under-representation (e.g., of low-resource languages) can lead to uneven recall. Linked mentions cover less than half of the spans, so entity embedding performance varies by domain. As with any machine learning model, it can make mistakes; benchmark it on your own domain before deployment.
Training Details
Training followed a two-stage curriculum (NER-only warmup, then joint NER + contrastive objectives) on a single H200 machine.
The training code is closed-source and not intended for public release.
Commercial Usage
You are allowed to use the model for commercial purposes. If this model proves useful in your business, I would like you to donate to the animal shelter of your choice. While appreciated, donations are not a condition of using the model; they are simply a suggestion. If you choose to donate, you can contact me; it will make me very happy.
Acknowledgements
Dedicated to Rapido, my cat, whose kindness, affectionate company, and lightning-fast speed inspired this model's name. He was loved and loving, a wonderful companion who will be deeply missed. Rapido loved to work with me, often sitting on my lap or beside me as I coded. He had a knack for sensing when I needed a break. His presence during long coding sessions was a source of comfort and joy. Rapido passed away in August 2025, but his memory will stay with our family forever.
Contact
Pierre Tassel: pierre.tassel[at]outlook.fr