Scandi NER Model 🏔️

A multilingual Named Entity Recognition (NER) model trained on multiple Scandinavian NER datasets plus English. The model identifies Person (PER), Organization (ORG), and Location (LOC) entities.

Model Description

This model is based on jhu-clsp/mmBERT-base and has been fine-tuned for token classification on a combined set of Scandinavian and English NER corpora. It supports:

  • 🇩🇰 Danish - DaNE and DANSK
  • 🇸🇪 Swedish - SUCX 3.0
  • 🇳🇴 Norwegian - NorNE (Bokmål and Nynorsk)
  • 🇬🇧 English - CoNLL-2003

Performance

The model achieves the following performance on the held-out test set:

| Metric    | Score  |
|-----------|--------|
| F1 Score  | 0.9834 |
| Precision | 0.9836 |
| Recall    | 0.9846 |
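
Entity-level metrics like these are typically computed with seqeval; the snippet below is a minimal sketch of such an evaluation, assuming gold and predicted BIO tag sequences are already in hand (the card does not specify the exact evaluation script):

```python
# Minimal sketch: entity-level scoring with seqeval.
from seqeval.metrics import f1_score, precision_score, recall_score

# Hypothetical gold and predicted BIO sequences for two sentences.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]

print(f"F1:        {f1_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
```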

Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("MediaCatch/mmBERT-base-scandi-ner")
model = AutoModelForTokenClassification.from_pretrained("MediaCatch/mmBERT-base-scandi-ner")

# Create NER pipeline that merges subword tokens into whole-entity spans
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example usage (Swedish: "Barack Obama visited Stockholm and met Stefan Löfven.")
text = "Barack Obama besökte Stockholm och träffade Stefan Löfven."
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.3f})")
```

Supported Entity Types

The model predicts the following entity types using BIO tagging:

PER (Person): Names of people
ORG (Organization): Companies, institutions, organizations
LOC (Location): Geographic locations, places
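
The full tag inventory can be read off the model config; a quick check, assuming model was loaded as in the Quick Start (the index order shown is an assumption):

```python
# Inspect the BIO label inventory (7 labels, matching the
# label distribution reported under Dataset Statistics).
print(model.config.id2label)
# e.g. {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC'}
```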

Training Data

The model was trained on a combination of the following datasets:
- **eriktks/conll2003**: 20,682 examples
- **NbAiLab/norne_bokmaal**: 20,044 examples
- **NbAiLab/norne_nynorsk**: 17,575 examples
- **KBLab/sucx3_ner_original_lower**: 71,915 examples
- **alexandrainst/dane**: 5,508 examples
- **ljos/norwegian_ner_nynorsk**: 17,575 examples
- **ljos/norwegian_ner_bokmaal**: 20,044 examples
- **chcaa/dansk-ner**: 14,651 examples
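
These corpora use different label schemes (NorNE and DANSK, for instance, annotate more fine-grained entity types), so combining them requires remapping every tag set onto the shared PER/ORG/LOC inventory. Below is a minimal sketch of that kind of merge with the datasets library; the column names and remapping are assumptions for illustration, not the card's actual preprocessing:

```python
# Sketch: merge two of the corpora onto a shared BIO inventory.
# Assumes each corpus exposes "tokens" and "ner_tags" columns.
from datasets import load_dataset, concatenate_datasets

SHARED_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

def remap_to_shared(example, source_names):
    """Map corpus-specific tags onto SHARED_LABELS; anything outside
    PER/ORG/LOC (e.g. MISC) is collapsed to 'O'."""
    tags = [source_names[i] for i in example["ner_tags"]]
    example["labels"] = [
        SHARED_LABELS.index(t) if t in SHARED_LABELS else 0 for t in tags
    ]
    return example

dane = load_dataset("alexandrainst/dane", split="train")
conll = load_dataset("eriktks/conll2003", split="train")

combined = concatenate_datasets([
    dane.map(lambda ex: remap_to_shared(ex, dane.features["ner_tags"].feature.names))
        .select_columns(["tokens", "labels"]),
    conll.map(lambda ex: remap_to_shared(ex, conll.features["ner_tags"].feature.names))
         .select_columns(["tokens", "labels"]),
])
```
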
Dataset Statistics

Total examples: 187,994
Average sequence length: 13.8 tokens
Languages: en, no, sv, da, unknown
Label distribution:
  - O: 2,523,693 (97.1%)
  - B-PER: 27,352 (1.1%)
  - I-PER: 15,165 (0.6%)
  - B-LOC: 12,668 (0.5%)
  - B-ORG: 11,827 (0.5%)
  - I-ORG: 6,179 (0.2%)
  - I-LOC: 1,987 (0.1%)

Training Details

Training Hyperparameters

Base model: jhu-clsp/mmBERT-base
Training epochs: 3
Batch size: 16
Learning rate: 2e-05
Warmup steps: 5000
Weight decay: 0.01

Training Infrastructure

Mixed precision: False
Gradient accumulation: 1
Early stopping: Enabled with patience=3
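
Taken together, these settings map onto a standard transformers Trainer setup. The sketch below is an illustration of the listed hyperparameters, not the card's actual training script; train_ds and eval_ds are assumed to be tokenized datasets with labels aligned to subwords:

```python
# Sketch: the hyperparameters above as a Trainer configuration.
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModelForTokenClassification.from_pretrained(
    "jhu-clsp/mmBERT-base", num_labels=7
)

args = TrainingArguments(
    output_dir="mmbert-scandi-ner",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=5000,
    weight_decay=0.01,
    fp16=False,                   # mixed precision disabled
    gradient_accumulation_steps=1,
    eval_strategy="epoch",        # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # assumed: tokenized, labels aligned to subwords
    eval_dataset=eval_ds,    # assumed
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```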

Usage Examples

Basic NER Tagging

text = "Olof Palme var Sveriges statsminister."
entities = ner_pipeline(text)
# Output: [{'entity_group': 'PER', 'word': 'Olof Palme', 'start': 0, 'end': 10, 'score': 0.999}]

Batch Processing

```python
texts = [
    "Microsoft was founded by Bill Gates.",
    "Angela Merkel var förbundskansler i Tyskland.",  # Swedish: "Angela Merkel was chancellor of Germany."
    "Universitetet i Oslo ligger i Norge.",           # Norwegian: "The University of Oslo is in Norway."
]

for text in texts:
    entities = ner_pipeline(text)
    print(f"Text: {text}")
    for entity in entities:
        print(f"  {entity['word']} -> {entity['entity_group']}")
```

Limitations and Considerations

Domain: Primarily trained on news and Wikipedia text; performance may vary on other domains
Subword handling: The model uses subword tokenization; make sure predictions are aggregated back into whole-entity spans (see the snippet after this list)
Language mixing: While multilingual, performance is best when languages don't mix within sentences
Entity coverage: Limited to PER, ORG, LOC; doesn't detect MISC, DATE, or other entity types
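
On the subword point above: the token-classification pipeline offers several aggregation strategies. "simple" (used throughout this card) merges subword pieces into entity spans, while "none" returns raw per-token predictions:

```python
# Assumes model, tokenizer, and pipeline from the Quick Start above.
raw = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="none")
merged = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Stefan Löfven bor i Stockholm."  # Swedish: "Stefan Löfven lives in Stockholm."
print(raw(text))     # per-token BIO predictions, possibly split across subwords
print(merged(text))  # merged spans with entity_group labels
```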