---
license: mit
language:
  - fr
base_model:
  - cmarkea/distilcamembert-base-ner
datasets:
  - Crysy-rthomas/T-AIA-NER-DATASET
---

## Model Overview

This model is a fine-tuned version of [cmarkea/distilcamembert-base-ner](https://huggingface.co/cmarkea/distilcamembert-base-ner), adapted for Named Entity Recognition (NER) on French datasets. The base model is a lighter, distilled variant of CamemBERT, optimized for NER tasks involving entities such as locations, organizations, persons, and miscellaneous entities in French text.

## Model Type

- **Architecture:** CamembertForTokenClassification
- **Base Model:** DistilCamemBERT
- **Hidden Layers:** 6
- **Attention Heads:** 12
- **Tokenizer:** based on CamemBERT's tokenizer
- **Vocabulary Size:** 32,005 tokens
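
These values can be checked against the published model configuration. A minimal sketch using `AutoConfig` (the commented values are the ones listed above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
print(config.architectures)        # ['CamembertForTokenClassification']
print(config.num_hidden_layers)    # 6
print(config.num_attention_heads)  # 12
print(config.vocab_size)           # 32005
```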

## Intended Use

This model is fine-tuned for Named Entity Recognition (NER) tasks, identifying and classifying entities such as:

- **LOC** (Location)
- **PER** (Person)
- **ORG** (Organization)
- **MISC** (Miscellaneous)

It can also identify the starting city and the ending city of a trip.

### Example Use Case

Given a sentence such as "Je veux aller de Paris à Lyon", the model detects and labels:

- **Paris** as the starting city (`B-START`)
- **Lyon** as the ending city (`B-END`)

### Limitations

- **Language:** the model is designed primarily for French text.
- **Performance:** results may degrade on non-French text or on tasks outside NER.

## Labels and Tokens

The model uses the following entity labels (a worked tagging sketch follows the list):

- **O**: outside any named entity
- **B-START**: beginning of a start-location entity
- **I-START**: inside a start-location entity
- **B-END**: beginning of an end-location entity
- **I-END**: inside an end-location entity
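
As an illustration, the example sentence tags word by word as follows. This is a hand-written sketch of the BIO scheme, not actual model output; real predictions operate on SentencePiece subword tokens:

```python
# Hypothetical word-level BIO tagging for the example sentence
words  = ["Je", "veux", "aller", "de", "Paris",   "à", "Lyon"]
labels = ["O",  "O",    "O",     "O",  "B-START", "O", "B-END"]

for word, label in zip(words, labels):
    print(f"{word:>6} -> {label}")
```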

## Training Data

The model was fine-tuned on a French NER dataset of travel queries, including phrases like "Je veux aller de Paris à Lyon", to simulate common transportation-related interactions. The dataset contains named-entity labels for city and station names.
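
The dataset is published as [Crysy-rthomas/T-AIA-NER-DATASET](https://huggingface.co/datasets/Crysy-rthomas/T-AIA-NER-DATASET) and can be loaded with the datasets library. The split name below is an assumption; inspect the output of `load_dataset` to see what is actually available:

```python
from datasets import load_dataset

ds = load_dataset("Crysy-rthomas/T-AIA-NER-DATASET")
print(ds)              # shows the available splits and features
print(ds["train"][0])  # assumes a "train" split exists
```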

### Hyperparameters and Fine-Tuning

- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3
- **Evaluation Strategy:** per epoch
- **Optimizer:** AdamW
- **Early Stopping:** used to prevent overfitting
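
The original training script is not included in this card; a minimal sketch reproducing the settings above with the transformers `Trainer` might look like the following. `train_ds` and `eval_ds` stand for pre-tokenized, label-aligned datasets and are assumptions:

```python
from transformers import (
    AutoModelForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Label set from the "Labels and Tokens" section
label_list = ["O", "B-START", "I-START", "B-END", "I-END"]

model = AutoModelForTokenClassification.from_pretrained(
    "cmarkea/distilcamembert-base-ner",
    num_labels=len(label_list),
    ignore_mismatched_sizes=True,  # the base model ships a different label head
)

args = TrainingArguments(
    output_dir="camembert-ner-travel",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",        # named evaluation_strategy in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # assumed: tokenized dataset with aligned labels
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```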

## Tokenizer

The tokenizer is based on the pre-trained CamemBERT tokenizer, adapted for this entity-labeling task. It uses SentencePiece subword tokenization, a BPE-style (Byte-Pair Encoding) approach that splits words into smaller units.

Tokenizer settings (see the encoding sketch after this list):

- **Max Length:** 128
- **Padding:** right-padded to 128 tokens
- **Truncation:** longest-first strategy, dropping tokens beyond 128
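
For illustration, these settings correspond to an encoding call like the following sketch; the exact arguments used during training are not documented in this card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

encoded = tokenizer(
    "Je veux aller de Paris à Lyon",
    max_length=128,
    padding="max_length",        # right-pad to 128 tokens
    truncation="longest_first",  # drop anything beyond 128 tokens
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 128])
```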

## How to Use

You can load and use this model with Hugging Face's transformers library as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
model = AutoModelForTokenClassification.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

text = "Je veux aller de Paris à Lyon"
tokens = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokens)

# Map each subword token to its highest-scoring label
predictions = outputs.logits.argmax(dim=-1)[0]
for token, pred in zip(tokenizer.convert_ids_to_tokens(tokens["input_ids"][0]), predictions):
    print(token, model.config.id2label[pred.item()])
```
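
Alternatively, the token-classification pipeline bundles tokenization, prediction, and entity aggregation into one call. This is a sketch; the aggregated output depends on the model's label configuration:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="Crysy-rthomas/T-AIA-CamemBERT-NER-V2",
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)
print(ner("Je veux aller de Paris à Lyon"))
# Expected: a START span for "Paris" and an END span for "Lyon"
```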

## Limitations and Bias

- The model may not generalize well beyond French text.
- Results may be biased toward named entities that appear frequently in the training data (such as city names).

## License

This model is released under the MIT License.