---
license: mit
language:
- fr
base_model:
- cmarkea/distilcamembert-base-ner
datasets:
- Crysy-rthomas/T-AIA-NER-DATASET
---

## Model Overview

This model is a fine-tuned version of **[cmarkea/distilcamembert-base-ner](https://huggingface.co/cmarkea/distilcamembert-base-ner)**, adapted for Named Entity Recognition (NER) on French travel queries. The base model is a lighter variant of CamemBERT, optimized for NER tasks involving **locations, organizations, persons**, and miscellaneous entities in French text.

### Model Type

- **Architecture**: `CamembertForTokenClassification`
- **Base Model**: DistilCamemBERT
- **Number of Layers**: 6 hidden layers, 12 attention heads
- **Tokenizer**: based on CamemBERT's tokenizer
- **Vocab Size**: 32,005 tokens

## Intended Use

The base model identifies and classifies standard NER entities such as:

- **LOC** (Location)
- **PER** (Person)
- **ORG** (Organization)
- **MISC** (Miscellaneous)

This fine-tuned version specializes further: it identifies the starting city and the ending city of a trip.

### Example Use Case

Given a sentence such as "Je veux aller de Paris à Lyon" ("I want to go from Paris to Lyon"), the model will detect and label:

- `Paris` as the start location (`B-START`)
- `Lyon` as the end location (`B-END`)

### Limitations

- **Language**: the model is designed primarily for French text.
- **Performance**: quality may degrade on non-French text or on tasks other than NER.

## Labels and Tokens

The model uses the following entity labels:

- `O`: outside any named entity
- `B-START`: beginning of a start-location entity
- `I-START`: inside a start-location entity
- `B-END`: beginning of an end-location entity
- `I-END`: inside an end-location entity
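As a minimal illustration of this scheme, here is a hand-written word/label alignment for the example sentence (a sketch, not actual model output; real predictions are made at the subword level):

```python
# Hand-written alignment for "Je veux aller de Paris à Lyon"
# ("I want to go from Paris to Lyon"); illustrative only.
words  = ["Je", "veux", "aller", "de", "Paris",   "à", "Lyon"]
labels = ["O",  "O",    "O",     "O",  "B-START", "O", "B-END"]

for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```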
## Training Data

The model was fine-tuned on a French NER dataset of travel queries, including phrases like "Je veux aller de Paris à Lyon", to simulate common transportation-related interactions. The dataset contains named entity labels for city and station names.

## Hyperparameters and Fine-Tuning

- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Epochs**: 3
- **Evaluation Strategy**: epoch-based
- **Optimizer**: AdamW
- **Early Stopping**: used to prevent overfitting

A minimal training sketch using these settings follows the list.
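The sketch below wires the hyperparameters above into the `transformers` `Trainer`. It is an approximation, not the exact training script: the `output_dir`, the early-stopping patience, and the `train_dataset`/`eval_dataset` variables are assumptions.

```python
from transformers import (
    AutoModelForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

label_list = ["O", "B-START", "I-START", "B-END", "I-END"]

# Re-initialize the base model's token-classification head for the five
# travel labels; ignore_mismatched_sizes discards the old head weights.
model = AutoModelForTokenClassification.from_pretrained(
    "cmarkea/distilcamembert-base-ner",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
    ignore_mismatched_sizes=True,
)

training_args = TrainingArguments(
    output_dir="t-aia-camembert-ner",  # assumed output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",  # "eval_strategy" in newer transformers releases
    save_strategy="epoch",        # needed so the best checkpoint can be restored
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: tokenized split with aligned labels
    eval_dataset=eval_dataset,    # assumed: held-out split of the same dataset
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience assumed
)
trainer.train()
```

Note that `Trainer` uses AdamW by default, which matches the optimizer listed above.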
## Tokenizer

The tokenizer is the pre-trained CamemBERT tokenizer, reused as-is for the entity-labeling task. It relies on SentencePiece subword tokenization, a BPE-style (Byte-Pair Encoding) approach that splits words into smaller units.

Tokenizer settings:

- **Max Length**: 128
- **Padding**: right-padded to 128 tokens
- **Truncation**: longest-first strategy; tokens beyond 128 are dropped

These settings map directly onto `tokenizer(...)` arguments, as sketched after the list.
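A minimal sketch of those settings in code (the repository name is taken from the usage section below; the printed output is indicative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

encoded = tokenizer(
    "Je veux aller de Paris à Lyon",
    max_length=128,
    padding="max_length",        # right-pad every sequence to 128 tokens
    truncation="longest_first",  # drop tokens beyond the 128-token limit
)
print(len(encoded["input_ids"]))  # 128
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:8])
```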
## How to Use

You can load and use this model with Hugging Face's `transformers` library as follows:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
model = AutoModelForTokenClassification.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

text = "Je veux aller de Paris à Lyon"
tokens = tokenizer(text, return_tensors="pt")  # encode as PyTorch tensors
outputs = model(**tokens)                      # per-token logits over the label set
```
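Continuing from the snippet above, the raw logits can be turned into per-token labels with an argmax over the label dimension (a sketch; it assumes the repository's config carries the `id2label` mapping from the Labels section):

```python
# Pick the highest-scoring label id for each subword token.
pred_ids = outputs.logits.argmax(dim=-1)[0]

subwords = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
for subword, pred in zip(subwords, pred_ids):
    print(subword, model.config.id2label[pred.item()])
```

For end-to-end extraction, the `transformers` `pipeline("token-classification", ...)` helper can group subwords back into full entity spans.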
## Limitations and Bias

- The model may not generalize well beyond French text.
- Results may be biased towards named entities that appear frequently in the training data (such as city names).
## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).