Model Card: BERT for Named Entity Recognition (NER)

Model Overview

This model, bert-conll-ner, is a version of bert-base-uncased fine-tuned for Named Entity Recognition (NER) on the CoNLL-2003 dataset. It identifies and classifies entities in text: person names (PER), organizations (ORG), locations (LOC), and miscellaneous entities (MISC).

Model Architecture

  • Base Model: BERT (Bidirectional Encoder Representations from Transformers) with the bert-base-uncased architecture.
  • Task: Token Classification (NER).

Training Dataset

  • Dataset: CoNLL-2003, a standard dataset for NER tasks containing sentences annotated with named entity spans.
  • Classes:
    • PER (Person)
    • ORG (Organization)
    • LOC (Location)
    • MISC (Miscellaneous)
    • O (Outside of any entity span; the full BIO tag set is sketched below)
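
Because the model uses BIO (begin/inside/outside) tagging, each class other than O is split into a B- and an I- tag, giving nine labels in total. A minimal sketch of the label set follows; the index order shown here follows the Hugging Face conll2003 dataset and is an assumption, so check model.config.id2label for the authoritative mapping.

# Nine BIO labels for CoNLL-2003 NER.
# Index order is an assumption; verify against model.config.id2label.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}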

Performance Metrics

The model achieves the following results on the CoNLL-2003 evaluation set:

  • Loss: 0.0649
  • Precision: 93.59%
  • Recall: 95.07%
  • F1 Score: 94.32%
  • Accuracy: 98.79%

These scores indicate that the model reliably identifies and classifies entities on text similar to its training data.
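
The card does not state how these figures were computed; for CoNLL-style NER, precision, recall, and F1 are conventionally entity-level scores, e.g. as produced by the seqeval package (an assumption here). A minimal sketch of entity-level scoring on BIO tag sequences:

from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Gold and predicted BIO tag sequences for one sentence.
# The predicted LOC span is shorter than the gold span, so it counts as a
# wrong entity even though most tokens match -- that is what "entity-level" means.
references  = [["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]]
predictions = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O",     "O"]]

print("precision:", precision_score(references, predictions))  # 0.5
print("recall:   ", recall_score(references, predictions))     # 0.5
print("f1:       ", f1_score(references, predictions))         # 0.5
print("accuracy: ", accuracy_score(references, predictions))   # token-level, ~0.857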

Training Details

  • Optimizer: AdamW (Adam with weight decay)
  • Learning Rate: 2e-5
  • Batch Size: 8
  • Number of Epochs: 3
  • Scheduler: Linear scheduler with warm-up steps
  • Loss Function: Cross-entropy loss with ignore index -100 for padding and other unlabeled positions (see the fine-tuning sketch below)
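
A minimal fine-tuning sketch consistent with the settings above is shown here. The warm-up step count and weight-decay value are assumptions (the card does not state them), and the load_dataset call may need adjustment depending on your datasets version; the label-alignment function shows where the -100 ignore index comes from.

from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

# Load CoNLL-2003; loading details may vary with your version of `datasets`.
raw = load_dataset("conll2003")
labels = raw["train"].features["ner_tags"].feature.names  # ["O", "B-PER", "I-PER", ...]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

def tokenize_and_align(batch):
    # Tokenize pre-split words; each word-level tag is copied to the first word
    # piece of its word, while special tokens and subword continuations get -100
    # so the cross-entropy loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, aligned = None, []
        for wid in enc.word_ids(batch_index=i):
            aligned.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(aligned)
    return enc

tokenized = raw.map(tokenize_and_align, batched=True,
                    remove_columns=raw["train"].column_names)

args = TrainingArguments(
    output_dir="bert-conll-ner",
    learning_rate=2e-5,             # from the card
    per_device_train_batch_size=8,  # from the card
    num_train_epochs=3,             # from the card
    lr_scheduler_type="linear",     # linear decay with warm-up
    warmup_steps=500,               # assumption: not stated in the card
    weight_decay=0.01,              # assumption: AdamW weight-decay value not stated
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),  # pads labels with -100
)
trainer.train()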

Model Input/Output

  • Input Format: Tokenized text with special tokens [CLS] and [SEP].
  • Output Format: Token-level predictions with corresponding labels from the NER tag set (B-PER, I-PER, etc.), as in the sketch below.
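
To make the raw input/output format concrete, here is a minimal sketch of a manual forward pass that prints one predicted tag per word piece. The repository id matches the loading example below, and the printed tags are illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/bert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/bert-conll-ner")

# Input: tokenized text, with [CLS] and [SEP] added automatically by the tokenizer.
inputs = tokenizer("John lives in New York City.", return_tensors="pt")

# Output: one logit vector per token; argmax gives the predicted label id.
with torch.no_grad():
    logits = model(**inputs).logits            # shape (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, pred_ids):
    print(f"{token:>10}  {model.config.id2label[pred_id.item()]}")  # e.g. "john  B-PER"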

How to Use the Model

Installation

pip install transformers torch

Loading the Model

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/bert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/bert-conll-ner")

Running Inference

from transformers import pipeline

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "John lives in New York City."
result = nlp(text)
print(result)
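
# Example output: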
[{'entity_group': 'PER',
  'score': 0.99912304,
  'word': 'john',
  'start': 0,
  'end': 4},
 {'entity_group': 'LOC',
  'score': 0.9993351,
  'word': 'new york city',
  'start': 14,
  'end': 27}]

Limitations

  1. Domain-Specific Adaptability: Performance might drop on domain-specific texts (e.g., legal or medical) not covered by the CoNLL-2003 dataset.
  2. Ambiguity: Ambiguous entities or overlapping spans are not explicitly handled.

Recommendations

  • For domain-specific tasks, consider fine-tuning this model further on a relevant dataset.
  • Use a pre-processing pipeline to handle long texts by splitting them into smaller segments, as sketched below.
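
A minimal sketch of that pre-processing step, assuming the nlp pipeline from the inference example above: the text is split at sentence-like boundaries, each chunk is run through the pipeline, and the character offsets are shifted back into the original text. The regex-based sentence splitter is an assumption for this sketch; a dedicated segmenter (e.g. nltk or spaCy) would be more robust.

import re

def ner_long_text(nlp, text, max_chars=1000):
    """Run a token-classification pipeline over a long text chunk by chunk."""
    # Sentence-like cut points (naive regex split; an assumption for this sketch).
    cuts = [0] + [m.end() for m in re.finditer(r"(?<=[.!?])\s+", text)] + [len(text)]

    entities, chunk_start = [], 0
    for i, cut in enumerate(cuts[1:], start=1):
        last = i == len(cuts) - 1
        # Close the chunk at the first boundary at or past max_chars, or at the end.
        if (cut - chunk_start >= max_chars or last) and cut > chunk_start:
            for ent in nlp(text[chunk_start:cut]):
                ent["start"] += chunk_start   # shift spans back into the full text
                ent["end"] += chunk_start
                entities.append(ent)
            chunk_start = cut
    return entities

# Usage: entities = ner_long_text(nlp, long_document)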

Acknowledgements

  • Transformers Library: Hugging Face
  • Dataset: CoNLL-2003
  • Base Model: bert-base-uncased by Google