---
license: mit
language:
  - fr
base_model:
  - cmarkea/distilcamembert-base-ner
datasets:
  - Crysy-rthomas/T-AIA-NER-DATASET
---

## Model Overview

This model is a fine-tuned version of [cmarkea/distilcamembert-base-ner](https://huggingface.co/cmarkea/distilcamembert-base-ner), adapted for Named Entity Recognition (NER) on French datasets. The base model is a lighter, distilled variant of CamemBERT, optimized for NER tasks involving entities such as locations, organizations, persons, and miscellaneous entities in French text.

## Model Type

- **Architecture:** CamembertForTokenClassification
- **Base Model:** DistilCamemBERT
- **Hidden Layers:** 6
- **Attention Heads:** 12
- **Tokenizer:** based on CamemBERT's tokenizer
- **Vocabulary Size:** 32,005 tokens
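
These values can be checked against the published model configuration. A minimal sketch using `AutoConfig` (the commented values are the ones listed above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
print(config.architectures)        # ['CamembertForTokenClassification']
print(config.num_hidden_layers)    # 6
print(config.num_attention_heads)  # 12
print(config.vocab_size)           # 32005
```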

## Intended Use

This model is fine-tuned for Named Entity Recognition (NER) tasks, identifying and classifying entities such as:

- **LOC** (Location)
- **PER** (Person)
- **ORG** (Organization)
- **MISC** (Miscellaneous)

It can also identify the starting city and the ending city of a trip.

### Example Use Case

Given a sentence such as "Je veux aller de Paris à Lyon", the model detects and labels:

- **Paris** as the starting city (`B-START`)
- **Lyon** as the ending city (`B-END`)

### Limitations

- **Language:** the model is designed primarily for French text.
- **Performance:** results may degrade on non-French text or on tasks outside NER.

## Labels and Tokens

The model uses the following entity labels (a worked tagging sketch follows the list):

- **O**: outside any named entity
- **B-START**: beginning of a start-location entity
- **I-START**: inside a start-location entity
- **B-END**: beginning of an end-location entity
- **I-END**: inside an end-location entity
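
As an illustration, the example sentence tags word by word as follows. This is a hand-written sketch of the BIO scheme, not actual model output; real predictions operate on SentencePiece subword tokens:

```python
# Hypothetical word-level BIO tagging for the example sentence
words  = ["Je", "veux", "aller", "de", "Paris",   "à", "Lyon"]
labels = ["O",  "O",    "O",     "O",  "B-START", "O", "B-END"]

for word, label in zip(words, labels):
    print(f"{word:>6} -> {label}")
```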

## Training Data

The model was fine-tuned on a French NER dataset of travel queries, including phrases like "Je veux aller de Paris à Lyon", to simulate common transportation-related interactions. The dataset contains named-entity labels for city and station names.
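
The dataset is published as [Crysy-rthomas/T-AIA-NER-DATASET](https://huggingface.co/datasets/Crysy-rthomas/T-AIA-NER-DATASET) and can be loaded with the datasets library. The split name below is an assumption; inspect the output of `load_dataset` to see what is actually available:

```python
from datasets import load_dataset

ds = load_dataset("Crysy-rthomas/T-AIA-NER-DATASET")
print(ds)              # shows the available splits and features
print(ds["train"][0])  # assumes a "train" split exists
```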

### Hyperparameters and Fine-Tuning

- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3
- **Evaluation Strategy:** per epoch
- **Optimizer:** AdamW
- **Early Stopping:** used to prevent overfitting
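
The original training script is not included in this card; a minimal sketch reproducing the settings above with the transformers `Trainer` might look like the following. `train_ds` and `eval_ds` stand for pre-tokenized, label-aligned datasets and are assumptions:

```python
from transformers import (
    AutoModelForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Label set from the "Labels and Tokens" section
label_list = ["O", "B-START", "I-START", "B-END", "I-END"]

model = AutoModelForTokenClassification.from_pretrained(
    "cmarkea/distilcamembert-base-ner",
    num_labels=len(label_list),
    ignore_mismatched_sizes=True,  # the base model ships a different label head
)

args = TrainingArguments(
    output_dir="camembert-ner-travel",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",        # named evaluation_strategy in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # assumed: tokenized dataset with aligned labels
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```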

## Tokenizer

The tokenizer is based on the pre-trained CamemBERT tokenizer, adapted for this entity-labeling task. It uses SentencePiece subword tokenization, a BPE-style (Byte-Pair Encoding) approach that splits words into smaller units.

Tokenizer settings (see the encoding sketch after this list):

- **Max Length:** 128
- **Padding:** right-padded to 128 tokens
- **Truncation:** longest-first strategy, dropping tokens beyond 128
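
For illustration, these settings correspond to an encoding call like the following sketch; the exact arguments used during training are not documented in this card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

encoded = tokenizer(
    "Je veux aller de Paris à Lyon",
    max_length=128,
    padding="max_length",        # right-pad to 128 tokens
    truncation="longest_first",  # drop anything beyond 128 tokens
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 128])
```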

## How to Use

You can load and use this model with Hugging Face's transformers library as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
model = AutoModelForTokenClassification.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

text = "Je veux aller de Paris à Lyon"
tokens = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokens)

# Map each subword token to its highest-scoring label
predictions = outputs.logits.argmax(dim=-1)[0]
for token, pred in zip(tokenizer.convert_ids_to_tokens(tokens["input_ids"][0]), predictions):
    print(token, model.config.id2label[pred.item()])
```
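
Alternatively, the token-classification pipeline bundles tokenization, prediction, and entity aggregation into one call. This is a sketch; the aggregated output depends on the model's label configuration:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="Crysy-rthomas/T-AIA-CamemBERT-NER-V2",
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)
print(ner("Je veux aller de Paris à Lyon"))
# Expected: a START span for "Paris" and an END span for "Lyon"
```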

## Limitations and Bias

- The model may not generalize well beyond French text.
- Results may be biased toward named entities that appear frequently in the training data (such as city names).

## License

This model is released under the MIT License.