Model Card: BERT for Named Entity Recognition (NER)

Model Overview

This model, bert-conll-ner, is a version of bert-base-uncased fine-tuned for Named Entity Recognition (NER) on the CoNLL-2003 dataset. It identifies and classifies entities in text: person names (PER), organizations (ORG), locations (LOC), and miscellaneous entities (MISC).

Model Architecture

  • Base Model: BERT (Bidirectional Encoder Representations from Transformers) with the bert-base-uncased architecture.
  • Task: Token Classification (NER).

Training Dataset

  • Dataset: CoNLL-2003, a standard dataset for NER tasks containing sentences annotated with named entity spans.
  • Classes:
    • PER (Person)
    • ORG (Organization)
    • LOC (Location)
    • MISC (Miscellaneous)
    • O (Outside of any entity span; the full BIO tag set is sketched below)
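
Because the model uses BIO (begin/inside/outside) tagging, each class other than O is split into a B- and an I- tag, giving nine labels in total. A minimal sketch of the label set follows; the index order shown here follows the Hugging Face conll2003 dataset and is an assumption, so check model.config.id2label for the authoritative mapping.

# Nine BIO labels for CoNLL-2003 NER.
# Index order is an assumption; verify against model.config.id2label.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}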

Performance Metrics

The model achieves the following results on the CoNLL-2003 evaluation set:

  • Loss: 0.0649
  • Precision: 93.59%
  • Recall: 95.07%
  • F1 Score: 94.32%
  • Accuracy: 98.79%

These scores indicate that the model reliably identifies and classifies entities on text similar to its training data.
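
The card does not state how these figures were computed; for CoNLL-style NER, precision, recall, and F1 are conventionally entity-level scores, e.g. as produced by the seqeval package (an assumption here). A minimal sketch of entity-level scoring on BIO tag sequences:

from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Gold and predicted BIO tag sequences for one sentence.
# The predicted LOC span is shorter than the gold span, so it counts as a
# wrong entity even though most tokens match -- that is what "entity-level" means.
references  = [["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]]
predictions = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O",     "O"]]

print("precision:", precision_score(references, predictions))  # 0.5
print("recall:   ", recall_score(references, predictions))     # 0.5
print("f1:       ", f1_score(references, predictions))         # 0.5
print("accuracy: ", accuracy_score(references, predictions))   # token-level, ~0.857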

Training Details

  • Optimizer: AdamW (Adam with weight decay)
  • Learning Rate: 2e-5
  • Batch Size: 8
  • Number of Epochs: 3
  • Scheduler: Linear scheduler with warm-up steps
  • Loss Function: Cross-entropy loss with ignore index -100 for padding and other unlabeled positions (see the fine-tuning sketch below)
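
A minimal fine-tuning sketch consistent with the settings above is shown here. The warm-up step count and weight-decay value are assumptions (the card does not state them), and the load_dataset call may need adjustment depending on your datasets version; the label-alignment function shows where the -100 ignore index comes from.

from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

# Load CoNLL-2003; loading details may vary with your version of `datasets`.
raw = load_dataset("conll2003")
labels = raw["train"].features["ner_tags"].feature.names  # ["O", "B-PER", "I-PER", ...]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

def tokenize_and_align(batch):
    # Tokenize pre-split words; each word-level tag is copied to the first word
    # piece of its word, while special tokens and subword continuations get -100
    # so the cross-entropy loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, aligned = None, []
        for wid in enc.word_ids(batch_index=i):
            aligned.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(aligned)
    return enc

tokenized = raw.map(tokenize_and_align, batched=True,
                    remove_columns=raw["train"].column_names)

args = TrainingArguments(
    output_dir="bert-conll-ner",
    learning_rate=2e-5,             # from the card
    per_device_train_batch_size=8,  # from the card
    num_train_epochs=3,             # from the card
    lr_scheduler_type="linear",     # linear decay with warm-up
    warmup_steps=500,               # assumption: not stated in the card
    weight_decay=0.01,              # assumption: AdamW weight-decay value not stated
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),  # pads labels with -100
)
trainer.train()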

Model Input/Output

  • Input Format: Tokenized text with special tokens [CLS] and [SEP].
  • Output Format: Token-level predictions with corresponding labels from the NER tag set (B-PER, I-PER, etc.), as in the sketch below.
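
To make the raw input/output format concrete, here is a minimal sketch of a manual forward pass that prints one predicted tag per word piece. The repository id matches the loading example below, and the printed tags are illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/bert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/bert-conll-ner")

# Input: tokenized text, with [CLS] and [SEP] added automatically by the tokenizer.
inputs = tokenizer("John lives in New York City.", return_tensors="pt")

# Output: one logit vector per token; argmax gives the predicted label id.
with torch.no_grad():
    logits = model(**inputs).logits            # shape (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, pred_ids):
    print(f"{token:>10}  {model.config.id2label[pred_id.item()]}")  # e.g. "john  B-PER"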

How to Use the Model

Installation

pip install transformers torch

Loading the Model

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/bert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/bert-conll-ner")

Running Inference

from transformers import pipeline

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "John lives in New York City."
result = nlp(text)
print(result)
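
# Example output: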
[{'entity_group': 'PER',
  'score': 0.99912304,
  'word': 'john',
  'start': 0,
  'end': 4},
 {'entity_group': 'LOC',
  'score': 0.9993351,
  'word': 'new york city',
  'start': 14,
  'end': 27}]

Limitations

  1. Domain-Specific Adaptability: Performance might drop on domain-specific texts (e.g., legal or medical) not covered by the CoNLL-2003 dataset.
  2. Ambiguity: Ambiguous entities or overlapping spans are not explicitly handled.

Recommendations

  • For domain-specific tasks, consider fine-tuning this model further on a relevant dataset.
  • Use a pre-processing pipeline to handle long texts by splitting them into smaller segments, as sketched below.
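
A minimal sketch of that pre-processing step, assuming the nlp pipeline from the inference example above: the text is split at sentence-like boundaries, each chunk is run through the pipeline, and the character offsets are shifted back into the original text. The regex-based sentence splitter is an assumption for this sketch; a dedicated segmenter (e.g. nltk or spaCy) would be more robust.

import re

def ner_long_text(nlp, text, max_chars=1000):
    """Run a token-classification pipeline over a long text chunk by chunk."""
    # Sentence-like cut points (naive regex split; an assumption for this sketch).
    cuts = [0] + [m.end() for m in re.finditer(r"(?<=[.!?])\s+", text)] + [len(text)]

    entities, chunk_start = [], 0
    for i, cut in enumerate(cuts[1:], start=1):
        last = i == len(cuts) - 1
        # Close the chunk at the first boundary at or past max_chars, or at the end.
        if (cut - chunk_start >= max_chars or last) and cut > chunk_start:
            for ent in nlp(text[chunk_start:cut]):
                ent["start"] += chunk_start   # shift spans back into the full text
                ent["end"] += chunk_start
                entities.append(ent)
            chunk_start = cut
    return entities

# Usage: entities = ner_long_text(nlp, long_document)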

Acknowledgements

  • Transformers Library: Hugging Face
  • Dataset: CoNLL-2003
  • Base Model: bert-base-uncased by Google