---
license: mit
language:
- fr
base_model:
- cmarkea/distilcamembert-base-ner
datasets:
- Crysy-rthomas/T-AIA-NER-DATASET
---

## Model Overview

This model is a fine-tuned version of **[cmarkea/distilcamembert-base-ner](https://huggingface.co/cmarkea/distilcamembert-base-ner)**, adapted for Named Entity Recognition (NER) on French travel queries. The base model is a lighter variant of CamemBERT, optimized for NER tasks involving **locations, organizations, persons**, and miscellaneous entities in French text.

### Model Type

- **Architecture**: `CamembertForTokenClassification`
- **Base Model**: DistilCamemBERT
- **Number of Layers**: 6 hidden layers, 12 attention heads
- **Tokenizer**: based on CamemBERT's tokenizer
- **Vocab Size**: 32,005 tokens

## Intended Use

The base model identifies and classifies standard NER entities such as:

- **LOC** (Location)
- **PER** (Person)
- **ORG** (Organization)
- **MISC** (Miscellaneous)

This fine-tuned version specializes further: it identifies the starting city and the ending city of a trip.

### Example Use Case

Given a sentence such as "Je veux aller de Paris à Lyon" ("I want to go from Paris to Lyon"), the model will detect and label:

- `Paris` as the start location (`B-START`)
- `Lyon` as the end location (`B-END`)

### Limitations

- **Language**: the model is designed primarily for French text.
- **Performance**: quality may degrade on non-French text or on tasks other than NER.

## Labels and Tokens

The model uses the following entity labels:

- `O`: outside any named entity
- `B-START`: beginning of a start-location entity
- `I-START`: inside a start-location entity
- `B-END`: beginning of an end-location entity
- `I-END`: inside an end-location entity
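As a minimal illustration of this scheme, here is a hand-written word/label alignment for the example sentence (a sketch, not actual model output; real predictions are made at the subword level):

```python
# Hand-written alignment for "Je veux aller de Paris à Lyon"
# ("I want to go from Paris to Lyon"); illustrative only.
words  = ["Je", "veux", "aller", "de", "Paris",   "à", "Lyon"]
labels = ["O",  "O",    "O",     "O",  "B-START", "O", "B-END"]

for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```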
## Training Data

The model was fine-tuned on a French NER dataset of travel queries, including phrases like "Je veux aller de Paris à Lyon", to simulate common transportation-related interactions. The dataset contains named entity labels for city and station names.

## Hyperparameters and Fine-Tuning

- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Epochs**: 3
- **Evaluation Strategy**: epoch-based
- **Optimizer**: AdamW
- **Early Stopping**: used to prevent overfitting

A minimal training sketch using these settings follows the list.
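The sketch below wires the hyperparameters above into the `transformers` `Trainer`. It is an approximation, not the exact training script: the `output_dir`, the early-stopping patience, and the `train_dataset`/`eval_dataset` variables are assumptions.

```python
from transformers import (
    AutoModelForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

label_list = ["O", "B-START", "I-START", "B-END", "I-END"]

# Re-initialize the base model's token-classification head for the five
# travel labels; ignore_mismatched_sizes discards the old head weights.
model = AutoModelForTokenClassification.from_pretrained(
    "cmarkea/distilcamembert-base-ner",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
    ignore_mismatched_sizes=True,
)

training_args = TrainingArguments(
    output_dir="t-aia-camembert-ner",  # assumed output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",  # "eval_strategy" in newer transformers releases
    save_strategy="epoch",        # needed so the best checkpoint can be restored
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: tokenized split with aligned labels
    eval_dataset=eval_dataset,    # assumed: held-out split of the same dataset
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience assumed
)
trainer.train()
```

Note that `Trainer` uses AdamW by default, which matches the optimizer listed above.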
## Tokenizer

The tokenizer is the pre-trained CamemBERT tokenizer, reused as-is for the entity-labeling task. It relies on SentencePiece subword tokenization, a BPE-style (Byte-Pair Encoding) approach that splits words into smaller units.

Tokenizer settings:

- **Max Length**: 128
- **Padding**: right-padded to 128 tokens
- **Truncation**: longest-first strategy; tokens beyond 128 are dropped

These settings map directly onto `tokenizer(...)` arguments, as sketched after the list.
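A minimal sketch of those settings in code (the repository name is taken from the usage section below; the printed output is indicative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

encoded = tokenizer(
    "Je veux aller de Paris à Lyon",
    max_length=128,
    padding="max_length",        # right-pad every sequence to 128 tokens
    truncation="longest_first",  # drop tokens beyond the 128-token limit
)
print(len(encoded["input_ids"]))  # 128
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:8])
```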
## How to Use

You can load and use this model with Hugging Face's `transformers` library as follows:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
model = AutoModelForTokenClassification.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

text = "Je veux aller de Paris à Lyon"
tokens = tokenizer(text, return_tensors="pt")  # encode as PyTorch tensors
outputs = model(**tokens)                      # per-token logits over the label set
```
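Continuing from the snippet above, the raw logits can be turned into per-token labels with an argmax over the label dimension (a sketch; it assumes the repository's config carries the `id2label` mapping from the Labels section):

```python
# Pick the highest-scoring label id for each subword token.
pred_ids = outputs.logits.argmax(dim=-1)[0]

subwords = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
for subword, pred in zip(subwords, pred_ids):
    print(subword, model.config.id2label[pred.item()])
```

For end-to-end extraction, the `transformers` `pipeline("token-classification", ...)` helper can group subwords back into full entity spans.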
## Limitations and Bias

- The model may not generalize well beyond French text.
- Results may be biased towards named entities that appear frequently in the training data (such as city names).
## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).