NERC Extraction – Stage 2 Models

This repository contains two small neural models for Named Entity Recognition (NER) that have been trained using different annotation sources:

  • model_llm_pure: Trained solely on low-quality annotations generated by a Large Language Model (LLM).
  • primary_model: Fine-tuned on the original, ground-truth annotations from the CoNLL2003 dataset.

Both models use a hybrid architecture combining a pre-trained BERT model for contextualized word embeddings, a bidirectional LSTM layer to capture sequence dependencies, and a linear classifier to predict NER tags. The models are evaluated using an entity-level evaluation strategy that measures the correctness of entire entities (including boundaries and labels) using the seqeval library.


Model Architecture

Core Components:

  1. Pre-trained BERT Encoder:
    Uses bert-base-cased to generate high-quality contextualized embeddings for input tokens.

  2. Bidirectional LSTM (BiLSTM):
    Processes the sequence of BERT embeddings to capture sequential dependencies, ensuring that both left and right contexts are taken into account.

  3. Linear Classification Layer:
    Maps the output of the BiLSTM to the set of NER tags defined in the project.
    The tag set includes standard BIO tags for Person, Organization, Location, and Miscellaneous, plus the special tokens [CLS], [SEP], and X. A minimal sketch of the full stack follows this list.
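
The three components above can be wired together roughly as follows. This is an illustrative sketch, not the project's actual implementation: the class name BiLSTMTagger, the LSTM hidden size, and the exact tag ordering are assumptions.

import torch.nn as nn
from transformers import BertModel

# Illustrative tag set: BIO tags for the four entity types plus the special tokens mentioned above.
TAGS = ["O",
        "B-PER", "I-PER", "B-ORG", "I-ORG",
        "B-LOC", "I-LOC", "B-MISC", "I-MISC",
        "[CLS]", "[SEP]", "X"]

class BiLSTMTagger(nn.Module):
    def __init__(self, num_tags=len(TAGS), lstm_hidden=256):
        super().__init__()
        # 1. Pre-trained BERT encoder producing contextualized embeddings
        self.bert = BertModel.from_pretrained("bert-base-cased")
        # 2. BiLSTM over the BERT outputs (hidden size is an assumption)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        # 3. Linear layer mapping BiLSTM states to tag logits
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        embeddings = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(embeddings)
        return self.classifier(lstm_out)  # shape: (batch, seq_len, num_tags)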


Training Data & Annotation Sources

  • Low-Quality (LLM) Annotations:
    The model_llm_pure was trained on a dataset generated using the best method from the first stage of the project. This dataset contains approximately 1,000 sentences with LLM-generated annotations.

  • Ground-Truth Annotations (CoNLL2003):
    The primary_model was trained on the original expert annotations from the CoNLL2003 dataset (approximately 14,000 sentences); a quick way to inspect this dataset is sketched after this list.
    As a result, primary_model exhibits significantly improved performance over model_llm_pure.
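
For reference, the ground-truth side of the data can be inspected with the Hugging Face datasets library. This is only an illustration; the project's own data-loading code may differ.

from datasets import load_dataset

conll = load_dataset("conll2003")                            # CoNLL2003 as hosted on the Hugging Face Hub
print(conll["train"].num_rows)                               # roughly 14,000 training sentences
print(conll["train"].features["ner_tags"].feature.names)     # BIO label names (O, B-PER, I-PER, ...)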


Evaluation Metrics

Our evaluation strategy is based on an entity-level approach:

  1. Entity-Level Evaluation Module:

    • Prediction Collection: For each sentence, predicted and true labels are collected in a list-of-lists format.
    • Seqeval Accuracy: Measures the overall accuracy at the entity level.
    • F1-Score: Calculated as the harmonic mean of precision and recall for entire entities. A correct prediction requires that the full entity (with correct boundaries and label) is identified.
    • Classification Report: Provides detailed precision, recall, and F1-scores for each entity type.
  2. Results Comparison:

Model            Validation Loss   Seqeval Accuracy   F1-Score
model_llm_pure   0.53443           0.85185            0.47493
primary_model    0.09430           0.97955            0.88959

These results demonstrate that primary_model (trained on ground-truth CoNLL2003 data) achieves significantly better performance compared to model_llm_pure, reflecting the importance of high-quality annotations in NER.


Usage

Inference

You can load any of the models using the Hugging Face from_pretrained API. For example, to load the primary model:

import torch
from transformers import BertTokenizer
from your_model_module import NERSmall  # make sure the project's NERSmall class is importable

device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model_primary = NERSmall.from_pretrained("estnafinema0/nerc-extraction", revision="model_primary").to(device)

Similarly, to load the LLM-based model:

model_llm_pure = NERSmall.from_pretrained("estnafinema0/nerc-extraction", revision="main").to(device)
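
Once a model and tokenizer are loaded, inference on a single sentence might look like the following. This is a hedged sketch: the tokenizer call and the model's forward signature (input_ids plus attention_mask returning per-token tag logits) are assumptions about the NERSmall interface.

import torch

sentence = "Angela Merkel visited Paris ."
encoding = tokenizer(sentence.split(), is_split_into_words=True, return_tensors="pt").to(device)

model_primary.eval()
with torch.no_grad():
    logits = model_primary(encoding["input_ids"], encoding["attention_mask"])  # assumed signature

predicted_ids = logits.argmax(dim=-1)[0].tolist()  # one predicted tag id per wordpiece
# Map the ids back to tag names with the id-to-tag mapping used during training.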

Fine-tuning & Active Learning

This repository also serves as the basis for further active learning experiments. Evaluation there uses the same entity-level strategy, so only complete entities (with correct boundaries and labels) count as correct. Our active learning experiments (described in additional documentation) have shown that adding high-quality expert examples significantly improves the F1-score.


Training & Evaluation

Training Environment:

  • Optimizer: Stochastic Gradient Descent (SGD) with learning rate 0.001 and momentum 0.9.
  • Batch Size: 32
  • Epochs: Models are trained for 5 epochs during initial training (with further fine-tuning as part of active learning experiments). A minimal training loop under these settings is sketched below.
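
The following sketch shows such a training loop under the settings above. It is illustrative only: the model object, the train_loader (batches of 32), and the use of -100 as the ignored label index are assumptions, not the project's exact code.

import torch
import torch.nn as nn

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss(ignore_index=-100)    # assumes padded/special positions are labelled -100

model.train()
for epoch in range(5):
    for batch in train_loader:                        # assumed DataLoader yielding dict-style batches
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"])
        loss = criterion(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
        loss.backward()
        optimizer.step()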

Evaluation Function:

Our evaluation function computes entity-level metrics (F1-score, seqeval accuracy, and validation loss) by processing batches and collecting predictions and gold labels in a list-of-lists format, so that only correctly identified complete entities contribute to the final score. The seqeval conventions this relies on are illustrated below.
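
For illustration, seqeval expects one inner list of tags per sentence, with predictions and gold labels aligned. The tag sequences below are made up purely to show the format.

from seqeval.metrics import accuracy_score, classification_report, f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-MISC", "O"]]

print("seqeval accuracy:", accuracy_score(y_true, y_pred))
print("entity-level F1:", f1_score(y_true, y_pred))           # requires exact boundaries and label
print(classification_report(y_true, y_pred))                  # per-entity-type precision/recall/F1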


Additional Information

  • Repository: All models and intermediate checkpoints are stored in separate branches of the repository. For instance, primary_model is available in the branch model_primary, while other models (from active learning experiments) are stored in branches with names indicating the iteration and percentage of added expert data (e.g., active_iter_1_added_20).