bert-base-multilingual-cased-classification-ner

This model classifies place named entities recognized in geographic encyclopedia articles. It is a fine-tuned version of the bert-base-multilingual-cased model, trained on GeoEDdA-TopoRel, a manually annotated subset of the French Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers, par une société de gens de lettres (1751-1772), edited by Diderot and d'Alembert and provided by the ARTFL Encyclopédie Project.

Model Description

Class labels

The tagset is as follows:

  • City
  • Country
  • Human-made
  • Island
  • Lake
  • Mountain
  • Other
  • Region
  • River
  • Sea
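
The class labels ship with the model configuration, so they can be inspected programmatically. A minimal sketch, assuming the repository exposes the standard id2label mapping of transformers classification models:

from transformers import AutoConfig

# Read the label set from the model configuration on the Hub.
config = AutoConfig.from_pretrained("GEODE/bert-base-multilingual-cased-classification-ner")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)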

Dataset

The model was trained using the GeoEDdA-TopoRel dataset. The dataset is split into train, validation, and test sets with the following distribution of entries per class:

Class        Train   Validation   Test
City         2,657          276    277
Country      1,544          239    169
Human-made     104            7      7
Island         554           81    109
Lake            69           15     11
Mountain       232           76     70
Other          235           47     39
Region       2,706          424    440
River          128          944    125
Sea            196           37     57
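
If the dataset is available on the Hugging Face Hub, the split sizes above can be recomputed along these lines. Both the hub identifier GEODE/GeoEDdA-TopoRel and the label column name are assumptions, not confirmed by this card:

from collections import Counter
from datasets import load_dataset

# Hypothetical hub ID and column name; adjust to the actual dataset layout.
dataset = load_dataset("GEODE/GeoEDdA-TopoRel")
for split in ("train", "validation", "test"):
    counts = Counter(example["label"] for example in dataset[split])
    print(split, dict(counts))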

Evaluation

  • Overall weighted-average model performance (test set):

Precision   Recall   F-score
     0.84     0.84      0.84

  • Per-class model performance (test set):

Class        Precision   Recall   F-score   Support
City              0.82     0.88      0.85       277
Country           0.80     0.91      0.85       169
Human-made        0.50     0.71      0.59         7
Island            0.79     0.76      0.78       109
Lake              1.00     0.64      0.78        11
Mountain          0.81     0.73      0.77        70
Other             0.68     0.49      0.57        39
Region            0.89     0.85      0.87       440
River             0.87     0.90      0.88       125
Sea               0.96     0.93      0.95        57
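
These figures follow the usual scikit-learn classification_report layout, so they can be reproduced once test-set predictions are available. A minimal sketch; gold and predicted below are placeholders for the real test labels and model outputs:

from sklearn.metrics import classification_report

# Placeholders: substitute the gold test labels and the classifier's predictions.
gold = ["City", "Region", "River", "Sea"]
predicted = ["City", "Region", "City", "Sea"]

# The per-class rows and the "weighted avg" row of this report correspond
# to the two tables above.
print(classification_report(gold, predicted, digits=2))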

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import pipeline

# Prefer Apple MPS, then CUDA, then fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

# Span detector: finds named-entity spans in the article text.
ner = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)
# Place-name classifier: assigns one of the ten classes above to a span in context.
placename_classifier = pipeline("text-classification", model="GEODE/bert-base-multilingual-cased-classification-ner", truncation=True, device=device)

def get_context(text, span, ngram_context_size=5):
    """Build the classifier input: the span plus up to `ngram_context_size` words on each side."""
    word = span["word"]
    start = span["start"]
    end = span["end"]
    label = span["entity_group"]

    # Extract the words surrounding the span
    previous_text = text[:start].strip()
    next_text = text[end:].strip()
    previous_words = previous_text.split()[-ngram_context_size:]
    next_words = next_text.split()[:ngram_context_size]

    # Build the context string: "[<span>]: <left context> <span> <right context>"
    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
    return word, context, label

content = "WINCHESTER, (Géog. mod.) ou plutôt Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury, & à soixante sud-ouest de Londres. Long. 16. 20. latit. 51. 3."

spans = ner(content)
for span in spans:
    if span["entity_group"] == "NP_Spatial":
        word, context, ner_label = get_context(content, span, ngram_context_size=5)
        print(f"Place name: {word}")

        prediction = placename_classifier(context)
        print(f"Predicted label: {prediction}")


# Output
Place name: Wintchester
Predicted label: [{'label': 'City', 'score': 0.9968810081481934}]
Place name: Angleterre
Predicted label: [{'label': 'Country', 'score': 0.9953059554100037}]
Place name: Hampshire
Predicted label: [{'label': 'Region', 'score': 0.9967537522315979}]
Place name: Itching
Predicted label: [{'label': 'River', 'score': 0.9929990768432617}]
Place name: Salisbury
Predicted label: [{'label': 'City', 'score': 0.9969013929367065}]
Place name: Londres
Predicted label: [{'label': 'City', 'score': 0.9969471096992493}]
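
For processing many articles, the classifier pipeline also accepts a list of context strings, which avoids one call per span. A minimal sketch reusing the objects defined above (batch_size is a standard pipeline argument; the value here is only an example):

# Classify all detected place names in one batched call.
contexts = [get_context(content, span)[1]
            for span in spans if span["entity_group"] == "NP_Spatial"]
predictions = placename_classifier(contexts, batch_size=8)
for context, prediction in zip(contexts, predictions):
    print(context, "->", prediction["label"], round(prediction["score"], 3))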

Bias, Risks, and Limitations

This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or other corpora.

Acknowledgement

The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.
