Tamil Named Entity Recognition

This model is bert-base-multilingual-cased fine-tuned on the WikiANN dataset for named entity recognition (NER) in Tamil.

Label IDs and their corresponding label names

Label ID  Label Name
0         O
1         B-PER
2         I-PER
3         B-ORG
4         I-ORG
5         B-LOC
6         I-LOC
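
The label table above can be written as a lookup dictionary for decoding predicted label IDs. The following is a minimal sketch: the names ID2LABEL, LABEL2ID, decode_labels and extract_spans are illustrative and are not taken from the model's own code, and the input ID sequence is made up for demonstration.

```python
# Label mapping copied from the table above.
ID2LABEL = {
    0: "O",
    1: "B-PER",
    2: "I-PER",
    3: "B-ORG",
    4: "I-ORG",
    5: "B-LOC",
    6: "I-LOC",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_labels(label_ids):
    """Convert a sequence of integer label IDs to their label names."""
    return [ID2LABEL[i] for i in label_ids]


def extract_spans(labels):
    """Group BIO labels into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        # An I- tag continues the open span only if the entity type matches.
        if label.startswith("I-") and current == label[2:]:
            continue
        # Any other label closes the open span, if there is one.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # A B- tag (or a stray I- tag) opens a new span.
        if label != "O":
            start, current = i, label[2:]
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


# Made-up ID sequence, for illustration only.
labels = decode_labels([1, 2, 0, 5, 6, 6, 0, 3])
print(labels)
print(extract_spans(labels))
```

In the BIO scheme, a B- tag marks the first token of an entity and an I- tag marks a continuation, so consecutive tokens tagged B-LOC, I-LOC, I-LOC form a single location span.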

Results

Step  Training Loss  Validation Loss  Overall Precision  Overall Recall  Overall F1  Overall Accuracy  Loc F1    Org F1    Per F1
1000  0.386900       0.300006         0.833469           0.824748        0.829086    0.912857          0.835343  0.781625  0.867752
2000  0.210200       0.251389         0.845455           0.842052        0.843750    0.924861          0.851711  0.790198  0.886515
3000  0.140000       0.264964         0.866952           0.856137        0.861510    0.930141          0.874872  0.818150  0.885203
4000  0.095400       0.298542         0.860871           0.882696        0.871647    0.935692          0.881348  0.829285  0.899245
5000  0.062200       0.296011         0.871805           0.878471        0.875125    0.938806          0.875434  0.850889  0.898148
6000  0.042200       0.320418         0.868416           0.879074        0.873713    0.937497          0.877588  0.833611  0.907737
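
The Overall F1 column is consistent with the harmonic mean of Overall Precision and Overall Recall. The sketch below recomputes it from the table and picks out the checkpoints with the highest overall F1 and the lowest validation loss. ROWS and f1_score are illustrative names, and nothing here is claimed to be the evaluation code used during training.

```python
# (step, validation_loss, precision, recall, reported_f1), copied from the table above.
ROWS = [
    (1000, 0.300006, 0.833469, 0.824748, 0.829086),
    (2000, 0.251389, 0.845455, 0.842052, 0.843750),
    (3000, 0.264964, 0.866952, 0.856137, 0.861510),
    (4000, 0.298542, 0.860871, 0.882696, 0.871647),
    (5000, 0.296011, 0.871805, 0.878471, 0.875125),
    (6000, 0.320418, 0.868416, 0.879074, 0.873713),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Each reported F1 should match the value recomputed from precision and recall
# (up to rounding of the six-decimal table entries).
for step, _, precision, recall, reported in ROWS:
    assert abs(f1_score(precision, recall) - reported) < 1e-5, step

best_f1_step = max(ROWS, key=lambda row: row[4])[0]
lowest_loss_step = min(ROWS, key=lambda row: row[1])[0]
print("Highest overall F1 at step", best_f1_step)         # 5000
print("Lowest validation loss at step", lowest_loss_step)  # 2000
```

Validation loss is lowest at step 2000 while overall F1 keeps improving until step 5000, so the two criteria select different checkpoints.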

Example

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("Ambareeshkumar/BERT-Tamil")
model = AutoModelForTokenClassification.from_pretrained("Ambareeshkumar/BERT-Tamil")

# Build a token-classification (NER) pipeline.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Run NER on a Tamil input string and print the token-level predictions.
example = "இந்திய"
ner_results = nlp(example)
print(ner_results)
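
Without an aggregation option, a token-classification pipeline returns one prediction per (sub)word token, each with an "entity" label and "start"/"end" character offsets. The sketch below shows one way such predictions might be merged into entity spans. The function group_entities is a hypothetical helper, and the sample predictions are hand-written for illustration; they are not actual output of this model. It also assumes the model's config maps label IDs to the label names in the table above.

```python
def group_entities(text, token_predictions):
    """Merge token-level predictions into (entity_type, surface_text) spans.

    Each prediction is a dict with "entity", "start" and "end" keys.
    """
    spans = []
    for pred in token_predictions:
        prefix, _, entity_type = pred["entity"].partition("-")
        if prefix == "O":
            continue
        last = spans[-1] if spans else None
        # Extend the previous span when an I- tag of the same type follows it
        # directly (adjacent, or separated by a single space).
        if (
            prefix == "I"
            and last is not None
            and last["type"] == entity_type
            and pred["start"] - last["end"] <= 1
        ):
            last["end"] = pred["end"]
        else:
            spans.append(
                {"type": entity_type, "start": pred["start"], "end": pred["end"]}
            )
    return [(s["type"], text[s["start"]:s["end"]]) for s in spans]


# Hand-written sample input (NOT real model output): two place names, with the
# first one split into two sub-word pieces.
text = "சென்னை இந்தியா"
first, second = text.split()
second_start = text.index(second)
sample_predictions = [
    {"entity": "B-LOC", "start": 0, "end": 3},
    {"entity": "I-LOC", "start": 3, "end": len(first)},
    {"entity": "B-LOC", "start": second_start, "end": len(text)},
]
print(group_entities(text, sample_predictions))
```

The transformers pipeline also accepts an aggregation_strategy argument (for example "simple") that performs a similar grouping internally, which may be preferable to hand-rolled post-processing.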
