bert-base-multilingual-cased-place-entry-classification

This model classifies geographic encyclopedia articles describing places. It is a fine-tuned version of bert-base-multilingual-cased, trained on GeoEDdA-TopoRel, a manually annotated subset of the French Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers, par une société de gens de lettres (1751-1772), edited by Diderot and d'Alembert and provided by the ARTFL Encyclopédie Project.

Model Description

Class labels

The tagset is as follows (with examples from the dataset):

  • City: villes, bourgs, villages, etc.
  • Island: îles, presqu'îles, etc.
  • Region: régions, contrées, provinces, cercles, etc.
  • River: rivières, fleuves, etc.
  • Mountain: montagnes, vallées, etc.
  • Country: pays, royaumes, etc.
  • Sea: mer, golphe, baie, etc.
  • Other: promontoires, caps, rivages, déserts, etc.
  • Human-made: ports, châteaux, forteresses, abbayes, etc.
  • Lake: lacs, étangs, marais, etc.
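
The label names should also be recoverable from the checkpoint itself, assuming the fine-tuned model stores them in its configuration (the usual transformers id2label convention); a minimal sketch:

from transformers import AutoConfig

# Inspect the class labels shipped with the checkpoint (assumes the usual
# transformers convention of storing them in config.id2label).
config = AutoConfig.from_pretrained("GEODE/bert-base-multilingual-cased-place-entry-classification")
for idx in sorted(config.id2label):
    print(idx, config.id2label[idx])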

Dataset

The model was trained on the GeoEDdA-TopoRel dataset, which is split into train, validation, and test sets with the following distribution of entries per class:

Class        Train   Validation   Test
City         921     33           40
Island       216     20           27
Region       138     40           28
River        133     20           28
Mountain     63      29           22
Human-made   38      10           9
Other        27      12           12
Sea          26      13           12
Lake         22      9            9
Country      16      14           13
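
As a sanity check, a distribution like the one above can be recomputed from the dataset splits. The sketch below assumes the dataset is published on the Hugging Face Hub as GEODE/GeoEDdA-TopoRel with a label field containing the class name; both the Hub id and the field name are assumptions, not confirmed by this card.

from collections import Counter
from datasets import load_dataset

# Hypothetical Hub id and field name; adjust to the actual GeoEDdA-TopoRel release.
dataset = load_dataset("GEODE/GeoEDdA-TopoRel")
for split in ("train", "validation", "test"):
    counts = Counter(dataset[split]["label"])
    print(split, dict(counts.most_common()))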

Evaluation

  • Overall macro-average performance

    Precision   Recall   F-score
    0.95        0.92     0.93

  • Overall weighted-average performance

    Precision   Recall   F-score
    0.94        0.94     0.94

  • Per-class performance (test set)

    Class        Precision   Recall   F-score   Support
    City         0.91        1.00     0.95      40
    Island       0.96        0.96     0.96      27
    River        0.97        1.00     0.98      28
    Region       0.86        0.89     0.88      28
    Mountain     1.00        0.95     0.98      22
    Country      1.00        0.85     0.92      13
    Sea          1.00        0.92     0.96      12
    Other        0.90        0.75     0.82      12
    Human-made   0.90        1.00     0.95      9
    Lake         1.00        0.89     0.94      9
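
The averaged and per-class figures above are standard precision/recall/F-score metrics; a minimal sketch of how they can be reproduced from test-set predictions with scikit-learn (y_true and y_pred are placeholders for the gold and predicted labels, not the actual test data):

from sklearn.metrics import classification_report

# Placeholders: replace with the gold and predicted labels for the full test set.
y_true = ["City", "Region", "River", "City"]
y_pred = ["City", "Region", "City", "City"]

# Prints per-class precision, recall, F-score, and support,
# plus macro and weighted averages.
print(classification_report(y_true, y_pred, digits=2))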

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Pick the best available device: Apple MPS, then CUDA, then CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("GEODE/bert-base-multilingual-cased-place-entry-classification")
model = AutoModelForSequenceClassification.from_pretrained("GEODE/bert-base-multilingual-cased-place-entry-classification")

# Text-classification pipeline; truncation=True keeps long entries within BERT's 512-token limit.
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, truncation=True, device=device)

samples = [
    "* ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44.",
    "* ARCALU (Principauté d') petit état des Tartares-Monguls, sur la riviere d'Hoamko, où commence  la grande muraille de la Chine, sous le 122e degré de longitude & le 42e de latitude septentrionale."
]


for sample in samples:
    print(pipe(sample))


# Output

[{'label': 'City', 'score': 0.9969543218612671}]
[{'label': 'Region', 'score': 0.9811353087425232}]
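
On recent transformers versions, the same pipeline can also return the full score distribution instead of only the top label by passing top_k=None (older releases use return_all_scores=True instead); a short sketch reusing pipe and samples from above:

for sample in samples:
    # Return scores for every class rather than only the best one,
    # then sort defensively by score and show the three most likely classes.
    ranked = sorted(pipe(sample, top_k=None), key=lambda s: s["score"], reverse=True)
    print([(s["label"], round(s["score"], 3)) for s in ranked[:3]])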

Bias, Risks, and Limitations

This model was trained entirely on French encyclopaedic entries classified as Geography and describing places, and will likely not perform well on text in other languages or from other corpora.

Acknowledgement

The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.
