bert-base-multilingual-cased-classification-ner

This model classifies place named entities recognized in geographic encyclopedia articles. It is a fine-tuned version of the bert-base-multilingual-cased model, trained on GeoEDdA-TopoRel, a manually annotated subset of the French Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers, par une société de gens de lettres (1751-1772), edited by Diderot and d'Alembert and provided by the ARTFL Encyclopédie Project.

Model Description

Class labels

The tagset is as follows:

  • City
  • Country
  • Human-made
  • Island
  • Lake
  • Mountain
  • Other
  • Region
  • River
  • Sea
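
The class labels ship with the model configuration, so they can be inspected programmatically. A minimal sketch, assuming the repository exposes the standard id2label mapping of transformers classification models:

from transformers import AutoConfig

# Read the label set from the model configuration on the Hub.
config = AutoConfig.from_pretrained("GEODE/bert-base-multilingual-cased-classification-ner")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)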

Dataset

The model was trained using the GeoEDdA-TopoRel dataset. The dataset is split into train, validation, and test sets with the following distribution of entries per class:

Class        Train   Validation   Test
City         2,657          276    277
Country      1,544          239    169
Human-made     104            7      7
Island         554           81    109
Lake            69           15     11
Mountain       232           76     70
Other          235           47     39
Region       2,706          424    440
River          128          944    125
Sea            196           37     57
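
If the dataset is available on the Hugging Face Hub, the split sizes above can be recomputed along these lines. Both the hub identifier GEODE/GeoEDdA-TopoRel and the label column name are assumptions, not confirmed by this card:

from collections import Counter
from datasets import load_dataset

# Hypothetical hub ID and column name; adjust to the actual dataset layout.
dataset = load_dataset("GEODE/GeoEDdA-TopoRel")
for split in ("train", "validation", "test"):
    counts = Counter(example["label"] for example in dataset[split])
    print(split, dict(counts))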

Evaluation

  • Overall weighted-average model performance (test set):

Precision   Recall   F-score
     0.84     0.84      0.84

  • Per-class model performance (test set):

Class        Precision   Recall   F-score   Support
City              0.82     0.88      0.85       277
Country           0.80     0.91      0.85       169
Human-made        0.50     0.71      0.59         7
Island            0.79     0.76      0.78       109
Lake              1.00     0.64      0.78        11
Mountain          0.81     0.73      0.77        70
Other             0.68     0.49      0.57        39
Region            0.89     0.85      0.87       440
River             0.87     0.90      0.88       125
Sea               0.96     0.93      0.95        57
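
These figures follow the usual scikit-learn classification_report layout, so they can be reproduced once test-set predictions are available. A minimal sketch; gold and predicted below are placeholders for the real test labels and model outputs:

from sklearn.metrics import classification_report

# Placeholders: substitute the gold test labels and the classifier's predictions.
gold = ["City", "Region", "River", "Sea"]
predicted = ["City", "Region", "City", "Sea"]

# The per-class rows and the "weighted avg" row of this report correspond
# to the two tables above.
print(classification_report(gold, predicted, digits=2))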

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import pipeline

# Prefer Apple MPS, then CUDA, then fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

# Span detector: finds named-entity spans in the article text.
ner = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)
# Place-name classifier: assigns one of the ten classes above to a span in context.
placename_classifier = pipeline("text-classification", model="GEODE/bert-base-multilingual-cased-classification-ner", truncation=True, device=device)

def get_context(text, span, ngram_context_size=5):
    """Build the classifier input: the span plus up to `ngram_context_size` words on each side."""
    word = span["word"]
    start = span["start"]
    end = span["end"]
    label = span["entity_group"]

    # Extract the words surrounding the span
    previous_text = text[:start].strip()
    next_text = text[end:].strip()
    previous_words = previous_text.split()[-ngram_context_size:]
    next_words = next_text.split()[:ngram_context_size]

    # Build the context string: "[<span>]: <left context> <span> <right context>"
    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
    return word, context, label

content = "WINCHESTER, (Géog. mod.) ou plutôt Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury, & à soixante sud-ouest de Londres. Long. 16. 20. latit. 51. 3."

spans = ner(content)
for span in spans:
    if span["entity_group"] == "NP_Spatial":
        word, context, ner_label = get_context(content, span, ngram_context_size=5)
        print(f"Place name: {word}")

        prediction = placename_classifier(context)
        print(f"Predicted label: {prediction}")


# Output
Place name: Wintchester
Predicted label: [{'label': 'City', 'score': 0.9968810081481934}]
Place name: Angleterre
Predicted label: [{'label': 'Country', 'score': 0.9953059554100037}]
Place name: Hampshire
Predicted label: [{'label': 'Region', 'score': 0.9967537522315979}]
Place name: Itching
Predicted label: [{'label': 'River', 'score': 0.9929990768432617}]
Place name: Salisbury
Predicted label: [{'label': 'City', 'score': 0.9969013929367065}]
Place name: Londres
Predicted label: [{'label': 'City', 'score': 0.9969471096992493}]
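
For processing many articles, the classifier pipeline also accepts a list of context strings, which avoids one call per span. A minimal sketch reusing the objects defined above (batch_size is a standard pipeline argument; the value here is only an example):

# Classify all detected place names in one batched call.
contexts = [get_context(content, span)[1]
            for span in spans if span["entity_group"] == "NP_Spatial"]
predictions = placename_classifier(contexts, batch_size=8)
for context, prediction in zip(contexts, predictions):
    print(context, "->", prediction["label"], round(prediction["score"], 3))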

Bias, Risks, and Limitations

This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or other corpora.

Acknowledgement

The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.
