# Scandi NER Model 🏔️
A multilingual Named Entity Recognition model trained on multiple Scandinavian NER datasets plus English. The model identifies Person (PER), Organization (ORG), and Location (LOC) entities.
## Model Description
This model is based on jhu-clsp/mmBERT-base and has been fine-tuned for token classification on a combined dataset of Scandinavian NER corpora. It supports:
- 🇩🇰 Danish - DaNE and DANSK
- 🇸🇪 Swedish - SUC 3.0
- 🇳🇴 Norwegian - NorNE (Bokmål and Nynorsk)
- 🇬🇧 English - CoNLL-2003
## Performance
The model achieves the following performance on the held-out test set:
| Metric    | Score  |
|-----------|--------|
| F1 Score  | 0.9834 |
| Precision | 0.9836 |
| Recall    | 0.9846 |
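For reference, entity-level precision, recall, and F1 of this kind are typically computed with the `seqeval` library. The sketch below shows the mechanics on made-up tag sequences (placeholder data, not actual model output):

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Placeholder gold and predicted BIO sequences for illustration only.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "B-LOC"]]

print(f"F1:        {f1_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
```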
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("MediaCatch/mmBERT-base-scandi-ner")
model = AutoModelForTokenClassification.from_pretrained("MediaCatch/mmBERT-base-scandi-ner")

# Create NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example usage (Swedish: "Barack Obama visited Stockholm and met Stefan Löfven.")
text = "Barack Obama besökte Stockholm och träffade Stefan Löfven."
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.3f})")
```
## Supported Entity Types
The model predicts the following entity types using BIO tagging:
- **PER** (Person): Names of people
- **ORG** (Organization): Companies, institutions, organizations
- **LOC** (Location): Geographic locations, places
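To see the raw BIO labels rather than the aggregated spans returned by the pipeline, the model can be run directly. The sketch below assumes the `model` and `tokenizer` from the Quick Start section are already loaded; the Danish example sentence is illustrative:

```python
import torch

# Danish: "Mette Frederiksen lives in Copenhagen." (illustrative example)
text = "Mette Frederiksen bor i København."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each subword token to its predicted BIO label.
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(f"{token:>15} {model.config.id2label[pred.item()]}")
```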
## Training Data
The model was trained on a combination of the following datasets:
- **eriktks/conll2003**: 20,682 examples
- **NbAiLab/norne_bokmaal**: 20,044 examples
- **NbAiLab/norne_nynorsk**: 17,575 examples
- **KBLab/sucx3_ner_original_lower**: 71,915 examples
- **alexandrainst/dane**: 5,508 examples
- **ljos/norwegian_ner_nynorsk**: 17,575 examples
- **ljos/norwegian_ner_bokmaal**: 20,044 examples
- **chcaa/dansk-ner**: 14,651 examples
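As a hedged sketch of how such corpora can be inspected before merging, the 🤗 `datasets` library loads each one by name. Note that label schemes differ across these corpora (DaNE, for instance, also annotates MISC), so tags outside PER/ORG/LOC would presumably be remapped to O before combining. The column names below follow the common `tokens`/`ner_tags` convention and may differ per dataset:

```python
from datasets import load_dataset

# Load one constituent corpus and inspect its label scheme.
dane = load_dataset("alexandrainst/dane", split="train")
print(dane.features["ner_tags"].feature.names)

# Labels outside PER/ORG/LOC (e.g. MISC variants) would be collapsed
# to O before merging the corpora into one training set.
```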
### Dataset Statistics
- **Total examples:** 187,994
- **Average sequence length:** 13.8 tokens
- **Languages:** en, no, sv, da, unknown

Label distribution:
- O: 2,523,693 (97.1%)
- B-PER: 27,352 (1.1%)
- I-PER: 15,165 (0.6%)
- B-LOC: 12,668 (0.5%)
- I-LOC: 1,987 (0.1%)
- B-ORG: 11,827 (0.5%)
- I-ORG: 6,179 (0.2%)
## Training Details

### Training Hyperparameters
- **Base model:** jhu-clsp/mmBERT-base
- **Training epochs:** 3
- **Batch size:** 16
- **Learning rate:** 2e-05
- **Warmup steps:** 5000
- **Weight decay:** 0.01
### Training Infrastructure
- **Mixed precision:** False
- **Gradient accumulation:** 1
- **Early stopping:** Enabled with patience=3
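As a non-authoritative sketch, these settings map onto the 🤗 `Trainer` API roughly as follows; `output_dir`, the evaluation/save cadence, and the best-model metric are assumptions, not part of the published configuration:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="scandi-ner",          # assumption: output path not published
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=5000,
    weight_decay=0.01,
    fp16=False,                       # mixed precision disabled
    gradient_accumulation_steps=1,
    eval_strategy="epoch",            # assumption
    save_strategy="epoch",            # assumption
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="f1",       # assumption
)

# Early stopping with patience=3, as listed above.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```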
## Usage Examples

### Basic NER Tagging
text = "Olof Palme var Sveriges statsminister."
entities = ner_pipeline(text)
# Output: [{'entity_group': 'PER', 'word': 'Olof Palme', 'start': 0, 'end': 10, 'score': 0.999}]
### Batch Processing
```python
texts = [
    "Microsoft was founded by Bill Gates.",
    "Angela Merkel var förbundskansler i Tyskland.",  # Swedish: "Angela Merkel was chancellor of Germany."
    "Universitetet i Oslo ligger i Norge.",           # Norwegian: "The University of Oslo is in Norway."
]

for text in texts:
    entities = ner_pipeline(text)
    print(f"Text: {text}")
    for entity in entities:
        print(f"  {entity['word']} -> {entity['entity_group']}")
```
## Limitations and Considerations
- **Domain:** Primarily trained on news and Wikipedia text; performance may vary on other domains
- **Subword handling:** The model uses subword tokenization; make sure predictions are aggregated back to word level (see the sketch below)
- **Language mixing:** While the model is multilingual, performance is best when languages are not mixed within a sentence
- **Entity coverage:** Limited to PER, ORG, and LOC; the model does not detect MISC, DATE, or other entity types
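To illustrate the subword point: without an aggregation strategy, the pipeline reports one prediction per subword piece. The sketch below assumes the `model` and `tokenizer` from the Quick Start are already loaded:

```python
from transformers import pipeline

# Without aggregation_strategy, each subword piece is reported separately
# with its own B-/I- tag instead of one merged entity span.
raw_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)
print(raw_pipeline("Stefan Löfven"))
# Compare with aggregation_strategy="simple", which merges the pieces
# into a single PER span covering "Stefan Löfven".
```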