Lemone-Router: A Series of Fine-Tuned Classification Models for French Taxation

Lemone-router is a series of classification models designed to produce an optimal multi-agent system for different branches of tax law. Trained on a base of 49k lines comprising a set of synthetic questions generated by GPT-4 Turbo and Llama 3.1 70B, which have been further refined through evol-instruction tuning and manual curation and authority documents, these models are based on an 8-category decomposition of the classification scheme derived from the Bulletin officiel des finances publiques - impôts :

label2id = {
    "Bénéfices professionnels": 0,
    "Contrôle et contentieux": 1,
    "Dispositifs transversaux": 2,
    "Fiscalité des entreprises": 3,
    "Patrimoine et enregistrement": 4,
    "Revenus particuliers": 5,
    "Revenus patrimoniaux": 6,
    "Taxes sur la consommation": 7
}
    
id2label = {
    0: "Bénéfices professionnels",
    1: "Contrôle et contentieux",
    2: "Dispositifs transversaux",
    3: "Fiscalité des entreprises",
    4: "Patrimoine et enregistrement",
    5: "Revenus particuliers",
    6: "Revenus patrimoniaux",
    7: "Taxes sur la consommation"
}

This model is a fine-tuned version of intfloat/multilingual-e5-base. It achieves the following results on the evaluation set of 5000 texts:

  • Loss: 0.4096
  • Accuracy: 0.9265

Usage

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("louisbrulenaudet/lemone-router-m")
model = AutoModelForSequenceClassification.from_pretrained("louisbrulenaudet/lemone-router-m")

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 4.099463734610582e-05
  • train_batch_size: 16
  • eval_batch_size: 64
  • seed: 23
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 5

Training results

Training Loss Epoch Step Validation Loss Accuracy
0.5371 1.0 2809 0.4147 0.8680
0.3154 2.0 5618 0.3470 0.8914
0.2241 3.0 8427 0.3345 0.9147
0.1273 4.0 11236 0.3788 0.9187
0.0525 5.0 14045 0.4096 0.9265

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA H100 NVL
  • CPU Model: AMD EPYC 9V84 96-Core Processor

Framework versions

  • Transformers 4.45.2
  • Pytorch 2.4.1+cu121
  • Datasets 2.21.0
  • Tokenizers 0.20.1

Citation

If you use this code in your research, please use the following BibTeX entry.

@misc{louisbrulenaudet2024,
  author =       {Louis Brulé Naudet},
  title =        {Lemone-Router: A Series of Fine-Tuned Classification Models for French Taxation},
  year =         {2024}
  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/lemone-router-m}},
}

Feedback

If you have any feedback, please reach out at [email protected].

Downloads last month
12
Safetensors
Model size
278M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for louisbrulenaudet/lemone-router-m

Finetuned
(39)
this model

Datasets used to train louisbrulenaudet/lemone-router-m

Collection including louisbrulenaudet/lemone-router-m