Rapido NER + Entity Embedding

[Image: Rapido the tuxedo cat in profile, bright green eyes. Caption: Rapido, the model mascot]

Rapido NER is a strongly multilingual named-entity recogniser and entity embedding model, named in memory of Rapido, the cat who used to smooth out the loss spikes of my training runs.
The model wraps an encoder, attention-based mention pooling, a per-type projection layer for downstream entity retrieval, and CRF decoding for sequence predictions. Model context length: 4096 tokens.
This model aims to solve the following common problems in NER:

  • Strong multilingual NER performance: Robust NER model that can generalise across many languages and domains, leveraging both high-quality annotated/synthetic datasets and weakly supervised linked mentions.
  • Entity clustering and retrieval: Obtain meaningful entity representations that can be used for clustering, linking, or retrieval tasks, especially in a multilingual setting. Each entity is projected to a 768-dimensional L2-normalised vector space, with type-aware conditioning to separate coarse types (PER/ORG/LOC).
  • Within-document clustering: Cluster mentions of the same entity within a document, even when they appear in different languages (e.g., "Cologne" and "Köln").
  • Long context handling: Most NER models are limited to 512 tokens, which can be insufficient for documents with multiple entities or complex structures. This model was trained with a context of 4096 tokens.
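For documents that exceed the 4096-token context, one common workaround is to split the token sequence into overlapping windows and run the model on each window. The sketch below is only an illustration of that idea: the function name `sliding_windows` and the default `stride` of 512 are assumptions, not part of the released model, and the sketch ignores any special tokens the tokenizer adds.

```python
def sliding_windows(token_ids, max_len=4096, stride=512):
    """Split token_ids into overlapping windows of at most max_len tokens.

    Consecutive windows share `stride` tokens, so an entity cut by one
    window boundary is fully visible in the neighbouring window.
    Returns a list of (start_offset, window) pairs.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    windows = []
    start = 0
    while True:
        end = min(start + max_len, len(token_ids))
        windows.append((start, token_ids[start:end]))
        if end == len(token_ids):
            break
        start = end - stride  # step back so the windows overlap
    return windows


# Toy example with a tiny window so the overlap is easy to see
for offset, window in sliding_windows(list(range(10)), max_len=4, stride=1):
    print(offset, window)
```

The start offsets make it possible to map spans predicted inside a window back to positions in the full document; how to merge duplicate predictions from the overlap is left open here.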

Want to quickly check the model's performance? Use the space: https://huggingface.co/spaces/pierre-tassel/rapido-ner-space

[Figure: HuggingFace Space example of the NER + entity linking model in use]

Model Overview

  • Architecture: Fully fine-tuned MLM encoder backbone (Alibaba-NLP/gte-multilingual-mlm-base) + token-classification head + attention pooling + per-entity-type projection head + CRF
  • Objectives: Span-level NER (BIO) and contrastive entity embedding
  • Output: Token logits, CRF decode, pooled span embeddings
  • Modalities: Text only
  • License: Apache 2.0
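To illustrate what CRF decoding with BIO constraints adds over a per-token argmax, here is a minimal Viterbi sketch in plain Python. It is not the model's own decoder (that is `model.crf_decode`, which uses learned transition scores). The label order in `LABELS` and the zero transition scores are assumptions made for the example; the real label mapping lives in `model.config.id2label`.

```python
import math

# Assumed label inventory, for illustration only
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def allowed(prev, cur):
    """BIO rule: I-X may only follow B-X or I-X (prev=None means sequence start)."""
    if not cur.startswith("I-"):
        return True
    return prev is not None and prev[2:] == cur[2:]


def viterbi_bio(emissions, transitions=None):
    """Return the best-scoring label sequence that respects the BIO rule.

    emissions: one list of per-label scores for each token.
    transitions: optional [prev][cur] score matrix (all zeros if omitted).
    """
    n = len(LABELS)
    if transitions is None:
        transitions = [[0.0] * n for _ in range(n)]
    if not emissions:
        return []
    score = [
        emissions[0][k] if allowed(None, LABELS[k]) else -math.inf
        for k in range(n)
    ]
    back = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n):
            best_prev, best = 0, -math.inf
            for prev in range(n):
                if not allowed(LABELS[prev], LABELS[cur]):
                    continue
                cand = score[prev] + transitions[prev][cur]
                if cand > best:
                    best_prev, best = prev, cand
            new_score.append(best + emit[cur])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final label
    path = [max(range(n), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return [LABELS[k] for k in reversed(path)]


# A per-token argmax would give the invalid sequence ["O", "I-PER"] here
emissions = [
    [2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0],
]
print(viterbi_bio(emissions))
```

In this toy input the constrained decoder picks B-PER for the first token, because that is the only way to license the strongly scored I-PER on the second one.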

Key Features

  1. Entity embeddings out of the box: pass a mention_mask to obtain L2-normalised 768-dimensional span vectors suitable for clustering, retrieval, and linking tasks.
  2. Type-aware projection: FiLM conditioning aligns entity representations with the coarse types (PER/ORG/LOC) learned during training.
  3. CRF decoding: a first-order CRF with BIO constraints for sharper sequence predictions.
  4. Multilingual coverage: the training corpus spans 55 languages, with linked-entity supervision.
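FiLM (feature-wise linear modulation) conditions a vector by scaling and shifting each feature with parameters chosen by a conditioning signal, here the entity type. The toy sketch below shows the general idea in a 4-dimensional space. The parameter values in `FILM_PARAMS` and the helper names are hypothetical, and the model's actual projection head may be parameterised differently.

```python
import math


def film(x, gamma, beta):
    """Feature-wise linear modulation: scale each feature by gamma, then shift by beta."""
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]


def l2_normalise(v):
    """Rescale v to unit Euclidean length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)


# Hypothetical per-type (gamma, beta) pairs; a trained model would learn these
FILM_PARAMS = {
    "PER": ([1.0, 1.0, 0.5, 0.5], [0.2, 0.0, 0.0, 0.0]),
    "ORG": ([0.5, 1.0, 1.0, 0.5], [0.0, 0.2, 0.0, 0.0]),
    "LOC": ([0.5, 0.5, 1.0, 1.0], [0.0, 0.0, 0.2, 0.0]),
}


def project_mention(pooled, entity_type):
    """Apply type-specific FiLM to a pooled mention vector, then L2-normalise."""
    gamma, beta = FILM_PARAMS[entity_type]
    return l2_normalise(film(pooled, gamma, beta))


pooled = [1.0, 2.0, 3.0, 4.0]
for entity_type in FILM_PARAMS:
    print(entity_type, [round(c, 3) for c in project_mention(pooled, entity_type)])
```

The same pooled vector lands in a different region of the unit sphere for each type, which is the effect that type-aware conditioning aims for.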

Usage

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "pierre-tassel/rapido-ner-entity"

model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

texts = [
    "I'm traveling to Cologne next week.",
    "Ich fahre nächste Woche nach Köln.",
    "Je vais à Cologne la semaine prochaine.",
    "Apple opened a lab in Paris.",
    "Microsoft acquired a startup in Berlin.",
    # one document with multiple mentions to show within-doc clustering (two Cologne LOC)
    "The Cologne office met with Köln University and Microsoft in Cologne.",
]

enc = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
input_ids = enc["input_ids"]
attention_mask = enc["attention_mask"].bool()

with torch.no_grad():
    hidden = model.encode_tokens(input_ids, attention_mask)
    logits = model.ner_head(hidden)

decoded = model.crf_decode(logits, attention_mask)

id2label = {int(k): v for k, v in model.config.id2label.items()}


def spans_from_bio(tag_ids: list[int]) -> list[tuple[int, int, str]]:
    # Convert a BIO tag-id sequence into (start, end, type) spans, end exclusive
    spans = []
    start, cur_type = None, None
    for idx, tag_id in enumerate(tag_ids):
        tag = id2label[tag_id]
        if tag == "O":
            if start is not None:
                spans.append((start, idx, cur_type))
                start, cur_type = None, None
            continue
        prefix, etype = tag.split("-", 1) if "-" in tag else ("B", tag)
        if prefix == "B" or etype != cur_type:
            if start is not None:
                spans.append((start, idx, cur_type))
            start, cur_type = idx, etype
    if start is not None:
        spans.append((start, len(tag_ids), cur_type))
    return spans


results = []

for i, text in enumerate(texts):
    seq_len = int(attention_mask[i].sum().item())
    tokens = tokenizer.convert_ids_to_tokens(input_ids[i, :seq_len])
    tag_ids = decoded[i][:seq_len]
    spans = spans_from_bio(tag_ids)

    # Print predictions per text
    tag_labels = [id2label[t] for t in tag_ids]
    print(f"\nPrediction for text {i + 1}: {text}")
    print("Tokens with tags:")
    print(list(zip(tokens, tag_labels)))  # e.g., ... ('▁Colo','B-LOC'), ('gne','I-LOC'), ...
    if spans:
        ents_readable = [
            f"{tokenizer.decode(input_ids[i, s:e]).strip()} [{typ}]"
            for (s, e, typ) in spans
        ]
        print("Entities:", ents_readable)  # e.g., ['Apple [ORG]', 'Paris [LOC]']
    else:
        print("Entities: []")

    mention_vectors = []
    mention_surfaces = []
    mention_types = []

    if spans:
        # One boolean mask row per predicted mention, marking its tokens
        mention_mask = torch.zeros(
            (1, len(spans), input_ids.size(1)), dtype=torch.bool
        )
        for m, (s, e, _) in enumerate(spans):
            mention_mask[0, m, s:e] = True

        with torch.no_grad():
            pooled = model.encode_mentions_with_attention(
                hidden[i: i + 1], mention_mask
            )
            projected = model.project_mentions(pooled).squeeze(0)

        mention_vectors = F.normalize(projected, dim=-1)
        for (s, e, typ) in spans:
            surface = tokenizer.decode(input_ids[i, s:e])
            mention_surfaces.append(surface.strip())
            mention_types.append(typ)

    results.append(
        {
            "text": text,
            "tokens": tokens,
            "tags": tag_labels,
            "spans": spans,
            "mention_surfaces": mention_surfaces,
            "mention_types": mention_types,
            "embeddings": mention_vectors,  # tensor or []
        }
    )

# Global pairwise similarities
all_embeds = []
all_labels = []
for r in results:
    for surface, typ, emb in zip(
            r["mention_surfaces"], r["mention_types"], r["embeddings"]
    ):
        all_embeds.append(emb)
        all_labels.append(f"{surface} [{typ}]")

if all_embeds:
    all_embeds = torch.stack(all_embeds)
    sim = all_embeds @ all_embeds.T
    # e.g., Cologne [LOC] vs Köln [LOC] ≈ 0.818; Apple [ORG] vs Microsoft [ORG] ≈ 0.400; Paris [LOC] vs Berlin [LOC] ≈ 0.491
    print("\nGlobal cosine similarities:")
    for i, li in enumerate(all_labels):
        for j, lj in enumerate(all_labels):
            print(f"{li:20s} vs {lj:20s}: {sim[i, j]:.3f}")
        print()

# Within-document similarity matrices
print("\nPer-document similarity (only if >=2 entities predicted):")  # e.g., Doc 6: Cologne vs Cologne = 0.986
for r in results:
    embs = r["embeddings"]
    if len(embs) < 2:
        continue
    sim_doc = embs @ embs.T
    print(f"\nText: {r['text']}")
    for i, label_i in enumerate(r["mention_surfaces"]):
        for j, label_j in enumerate(r["mention_surfaces"]):
            print(f"  {label_i:12s} vs {label_j:12s}: {sim_doc[i, j]:.3f}")
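The similarity matrices printed by the usage example can be turned into explicit clusters. The sketch below is one simple option, single-link clustering with a fixed threshold, written in plain Python so that it runs without the model. The function name and the threshold of 0.8 are assumptions chosen for illustration and should be tuned on your own data. With torch tensors, pass `embeddings.tolist()`.

```python
def cluster_by_similarity(vectors, threshold=0.8):
    """Single-link clustering of L2-normalised vectors.

    Two mentions are linked when their dot product (cosine similarity for
    unit vectors) is >= threshold; linked mentions share a cluster.
    Returns a sorted list of clusters, each a list of mention indices.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            if sim >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy 2-D unit vectors: the first two point in similar directions
toy = [[1.0, 0.0], [0.9, 0.436], [0.0, 1.0]]
print(cluster_by_similarity(toy))
```

Single-link clustering can chain loosely related mentions together through intermediate ones, so a stricter method may be preferable for long documents with many entities.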

Evaluation Summary

Dataset            | Split | Precision | Recall | F1     | Support
CoNLL 2003 (en)    | test  | 0.9111    | 0.9551 | 0.9326 | 4,946
CoNLL 2002 (es)    | test  | 0.7515    | 0.8568 | 0.8007 | 3,219
GermEval 2014 (de) | test  | 0.8290    | 0.8663 | 0.8473 | 3,157

GermEval 2014 per-type (test)

  • LOC: P 85.21%, R 88.25%, F1 86.70% (support 1,123)
  • ORG: P 70.40%, R 75.46%, F1 72.84% (support 933)
  • PER: P 91.55%, R 94.46%, F1 92.99% (support 1,101)
  • Token accuracy: 98.24%
  • Overall F1 (micro): 84.73%
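The F1 scores above are the harmonic mean of precision and recall, so they can be recomputed from the reported figures as a sanity check. The short script below does this; the numbers are copied from the tables in this section, and small differences in the last digit are expected because the published precision and recall are themselves rounded.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, reported F1), copied from the evaluation summary
overall = {
    "CoNLL 2003 (en)": (0.9111, 0.9551, 0.9326),
    "CoNLL 2002 (es)": (0.7515, 0.8568, 0.8007),
    "GermEval 2014 (de)": (0.8290, 0.8663, 0.8473),
}
# Same layout, in percent, for the GermEval 2014 per-type results
germeval_per_type = {
    "LOC": (85.21, 88.25, 86.70),
    "ORG": (70.40, 75.46, 72.84),
    "PER": (91.55, 94.46, 92.99),
}

for name, (p, r, reported) in {**overall, **germeval_per_type}.items():
    print(f"{name:20s} recomputed F1 = {f1(p, r):.4f} (reported {reported})")

# Unweighted mean of the three per-type F1 scores
per_type_mean = sum(
    f1(p, r) for p, r, _ in germeval_per_type.values()
) / len(germeval_per_type)
print(f"Mean of per-type F1: {per_type_mean:.2f}%")
```

The unweighted mean of the three per-type F1 scores comes out at about 84.2%, while the 84.73% figure matches the overall GermEval 2014 F1 in the summary table.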

Note: The released checkpoint predicts the coarse types {PER, ORG, LOC}. Historical "MISC" mentions are mapped to O during decoding; evaluation scripts should ignore that label when computing metrics.
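When scoring against a corpus whose gold annotations still contain MISC, the gold tags can be remapped before computing metrics so that gold and predicted labels share the PER/ORG/LOC label space. A minimal sketch, assuming gold tags are CoNLL-style BIO strings (the helper name `drop_misc` is hypothetical):

```python
def drop_misc(tags):
    """Map B-MISC / I-MISC tags to O; leave all other tags untouched."""
    return ["O" if tag.endswith("-MISC") else tag for tag in tags]


gold = ["B-PER", "I-PER", "O", "B-MISC", "I-MISC", "B-LOC"]
print(drop_misc(gold))
```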

Dataset Statistics

  • Documents: 183,095
  • Mentions: 473,054 (66.61% of docs contain at least one mention)
  • Linked mentions: 45.66%
  • Avg. mentions per document: 2.584
  • Languages covered: 55

Language Distribution (53 languages listed, sorted by document count)

Language Docs
en 68,809
zh 12,926
pt 12,258
sk 11,361
hr 9,842
sv 8,133
da 6,537
sr 5,192
de 3,080
fr 2,584
ru 2,448
es 2,413
it 1,827
ja 1,579
ko 1,454
nl 1,317
ar 1,290
pl 1,228
tl 1,192
cs 1,161
tr 1,156
no 1,115
uk 1,100
fi 1,062
vi 1,018
ro 1,017
id 977
lv 950
ms 928
el 905
bg 901
bn 893
fa 876
pa 845
ta 845
th 834
sl 832
hu 829
he 822
te 791
hi 780
mr 774
lt 765
ur 745
et 734
ml 712
gu 629
ca 554
jv 534
sw 465
my 450
az 436
ceb 188

Corpus Splits

  • Train: 151,754 documents
  • Validation: 11,911 documents
  • Test: 19,430 documents

Mention Type Counts

  • LOC: 166,358
  • ORG: 124,232
  • PER: 112,803

Intended Use & Limitations

  • Intended: General-purpose NER, multilingual entity clustering, retrieval-augmented generation with type-aware filtering.
  • Not intended: Fine-grained entity typing beyond PER/ORG/LOC, or high-stakes decision-making without human oversight.
  • Bias considerations: Training sources blend news, encyclopaedia content, and linked corpora. Regional under-representation (e.g., low-resource languages) can lead to uneven recall. Linked mentions cover less than half the spans, so entity embedding performance varies by domain. Like any machine learning model, it can make mistakes; always benchmark it on your own domain before deployment.
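As an illustration of retrieval with type-aware filtering, the sketch below restricts candidates to the query's coarse type before ranking them by dot product. Everything in it is a toy assumption: the index layout, the `retrieve` helper, and the 3-dimensional vectors stand in for real 768-dimensional embeddings produced by the model.

```python
def retrieve(query_vec, query_type, index, top_k=3):
    """Rank index entries of the same coarse type by dot product with the query."""
    candidates = [
        (sum(q * c for q, c in zip(query_vec, entry["vector"])), entry["name"])
        for entry in index
        if entry["type"] == query_type  # type-aware filter
    ]
    candidates.sort(reverse=True)
    return candidates[:top_k]


# Toy index with hand-written unit vectors (not real model output)
index = [
    {"name": "Cologne", "type": "LOC", "vector": [1.0, 0.0, 0.0]},
    {"name": "Paris", "type": "LOC", "vector": [0.0, 1.0, 0.0]},
    {"name": "Apple", "type": "ORG", "vector": [1.0, 0.0, 0.0]},
    {"name": "Microsoft", "type": "ORG", "vector": [0.0, 0.0, 1.0]},
]

# A LOC query never matches ORG entries, even when the vectors coincide
print(retrieve([0.8, 0.6, 0.0], "LOC", index))
```

Filtering by type first keeps an organisation from being returned for a location query when the two share a surface form.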

Training Details

Training followed a two-stage curriculum (NER-only warm-up, then joint NER + contrastive objectives) on a single H200 machine.
The training code is closed-source and is not intended for public release.

Commercial Usage

You are allowed to use the model for commercial purposes. If this model was useful to your business, I would appreciate a donation to the animal shelter of your choice. Donations are suggested, not a condition of using the model. If you do donate, feel free to contact me; it will make me very happy.

Acknowledgements

Dedicated to Rapido, my cat, whose kindness, affectionate company, and lightning-fast speed inspired this model's name. He was loved and loving, a wonderful companion who will be deeply missed. Rapido loved to work with me, often sitting on my lap or beside me as I coded. He had a knack for sensing when I needed a break. His presence during long coding sessions was a source of comfort and joy. Rapido passed away in August 2025, but his memory will stay with our family forever.

Contact

Pierre Tassel: pierre.tassel[at]outlook.fr

Model size: 306M params (Safetensors; tensor types: F32, BOOL)