Dhivehi Named Entity Recognition (NER) Model

This is a BERT-based Named Entity Recognition (NER) model trained specifically for the Dhivehi language. The model identifies and classifies named entities in Dhivehi text into four categories: Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC).

Model Details

  • Model Name: alakxender/bert-dhivehi-ner-model
  • Base Model: BERT Multilingual Cased
  • Tokenizer: alakxender/bert-fast-dhivehi-tokenizer-extended
  • Task: Named Entity Recognition (NER)
  • Language: Dhivehi (dv)
  • Model Size: ~291M parameters (F32, Safetensors)

Entity Types

The model can identify the following entity types:

  • PER: Person names
  • ORG: Organization names
  • LOC: Location names
  • MISC: Miscellaneous named entities

Entity labels follow the standard BIO (Beginning, Inside, Outside) tagging scheme (a short example follows the list):

  • B-: Marks the beginning of an entity
  • I-: Marks the continuation (inside) of an entity
  • O: Marks tokens that are not part of any entity
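
For illustration, the fragment "ސްރީ ލަންކާގެ ކުންފުންޏެއް" ("a Sri Lankan company") from the usage example below could be tagged as follows. The tags shown are a hypothetical illustration of the scheme, not actual model output:

# Hypothetical BIO tags for a short Dhivehi fragment (illustration only, not model output)
tokens = ["ސްރީ", "ލަންކާގެ", "ކުންފުންޏެއް"]  # "a Sri Lankan company"
tags = ["B-LOC", "I-LOC", "O"]  # "Sri Lanka" spans two tokens; "company" is not an entity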

Usage

Here's how to use the model with the Transformers library:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "alakxender/bert-dhivehi-ner-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text (in Dhivehi)
text = "ރަސްމާލެ ނަމުގައި ތަރައްގީކުރާ ކ. ފުށިދިއްގަރު ފަޅުން ބިން ހިއްކުމަށް ސްރީ ލަންކާގެ ކުންފުންޏެއް"

# Get predictions
entities = ner(text)

# Print results
for entity in entities:
    print(f"Entity: {entity['word']}")
    print(f"Type: {entity['entity_group']}")
    print(f"Confidence: {entity['score']:.4f}")
    print("---")

Model Performance

The model was trained for 10 epochs with the following training parameters (a sketch of these settings follows the list):

  • Learning rate: 5e-5
  • Batch size: 16
  • Weight decay: 0.01
  • Max sequence length: 128 tokens
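
As a minimal sketch, the reported hyperparameters map onto the Hugging Face Trainer API roughly as shown below. The output directory is a placeholder, and the dataset, data collator, and Trainer setup used for the actual training run are not part of this model card:

from transformers import TrainingArguments

# Reported hyperparameters expressed as TrainingArguments
# (output_dir is a placeholder; dataset, collator and Trainer setup are omitted)
training_args = TrainingArguments(
    output_dir="bert-dhivehi-ner",  # placeholder path
    num_train_epochs=10,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
)

# The 128-token limit is applied at tokenization time, e.g.:
# tokenizer(text, truncation=True, max_length=128)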

Final training metrics:

  • Training loss: 0.3016
  • Training runtime: ~27 hours
  • Training samples per second: 37.96

Limitations

  • The model works best with properly formatted Dhivehi text
  • Maximum sequence length is 128 tokens (see the note after this list for handling longer inputs)
  • Performance may vary for highly technical or domain-specific text
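
Because the model was trained with sequences capped at 128 tokens, very long documents are best processed in smaller pieces. A minimal sketch, assuming the text can be split on "." and reusing the ner pipeline created above (substitute a proper Dhivehi sentence splitter if you have one):

# Naive chunking: run the pipeline piece by piece instead of on one long string
long_text = "..."  # placeholder for a long Dhivehi document
all_entities = []
for piece in long_text.split("."):
    piece = piece.strip()
    if piece:
        all_entities.extend(ner(piece))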