# deepseek-r1-distill-llama-8B-finetuned-nfe-detection

Fine-tuned DeepSeek R1 Distill Llama 8B for detecting suppliers under federal sanctions (CGU/CEIS) in Brazilian NF‑e documents.
## TL;DR

- **Task**: binary text classification (`0` = ordinary purchase, `1` = purchase from sanctioned supplier)
- **Base model**: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` (8 B params)
- **Training data**: 40 000 NF‑e records (70 % train, 30 % test)
- **Best epoch**: 5 / 5 (early stopping on validation loss)
- **Performance**: Accuracy 0.952 | F1 0.953 | ROC‑AUC 0.981
- **License**: Apache 2.0 (weights, code & dataset)
## Motivation

Brazil's federal administration issued 1.76 M invoices in 2023. Detecting suppliers already punished by regulators is tedious and error‑prone. This model automates the first triage step, highlighting suspicious transactions for auditors.

This work is part of the master's dissertation *Detection of Potentially Untrustworthy Companies through Government Procurement Extracts* (UFBA, 2025).
## Quick start

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "CleitonOERocha/deepseek-r1-distill-llama-8B-finetuned-nfe-detection"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

clf = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,
    device="cuda",  # "cpu" also works
)

print(clf("[CLS] Destinatario: XXX BATALHAO LOG [SEP] Municipio emitente: SÃO PAULO [SEP] Descricao do produto: MICROFONE LAPELA BY-M1 PRETO P2 [SEP] Qtd: 2 [SEP] Total: 649.78"))
```

Example output:

```json
[{"label": "LABEL_1", "score": 0.9752885699272156}, {"label": "LABEL_0", "score": 0.024711500853300095}]
```
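The pipeline returns generic `LABEL_0` / `LABEL_1` names. A small post-processing helper can rename them following the card's convention (0 = ordinary purchase, 1 = sanctioned supplier); the readable strings and the threshold below are illustrative choices, not part of the released model:

```python
# Map generic pipeline labels to readable names (mapping per the card's
# convention; the exact strings here are an assumption for illustration).
LABEL_NAMES = {"LABEL_0": "ordinary purchase", "LABEL_1": "sanctioned supplier"}

def readable(scores, threshold=0.5):
    """Pick the top-scoring label from a pipeline output and rename it."""
    top = max(scores, key=lambda s: s["score"])
    name = LABEL_NAMES[top["label"]]
    flagged = top["label"] == "LABEL_1" and top["score"] >= threshold
    return name, top["score"], flagged

# The example output from above:
example = [{"label": "LABEL_1", "score": 0.9752885699272156},
           {"label": "LABEL_0", "score": 0.024711500853300095}]
print(readable(example))  # → ('sanctioned supplier', 0.9752885699272156, True)
```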
## Dataset creation

- Crawled NF‑e ZIP archives from Portal da Transparência
- Merged with the sanction list from CGU/CEIS
- Filtering → deduplication → text normalisation → label propagation
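The deduplication and label-propagation steps can be sketched in plain Python. The field names and toy rows below are illustrative assumptions (the real NF‑e and CEIS files use Portuguese headers that vary by release); the key idea is normalising supplier CNPJs so invoice and sanction records join on the same key:

```python
import re

# Toy NF-e rows and a toy CEIS sanction set (illustrative field names).
nfe_rows = [
    {"cnpj_emitente": "11.111.111/0001-01", "descricao": "MICROFONE LAPELA"},
    {"cnpj_emitente": "11.111.111/0001-01", "descricao": "MICROFONE LAPELA"},  # duplicate
    {"cnpj_emitente": "22.222.222/0001-02", "descricao": "CANETA AZUL"},
]
ceis_cnpjs = {"11111111000101"}

def normalise(cnpj: str) -> str:
    """Keep digits only so NF-e and CEIS keys compare equal."""
    return re.sub(r"\D", "", cnpj)

# Deduplicate, then propagate the sanction label by supplier CNPJ.
seen, dataset = set(), []
for row in nfe_rows:
    key = (normalise(row["cnpj_emitente"]), row["descricao"])
    if key in seen:
        continue
    seen.add(key)
    dataset.append({**row, "label": int(key[0] in ceis_cnpjs)})

print([r["label"] for r in dataset])  # → [1, 0]
```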
## Model details

| Item | Value |
|---|---|
| Base | DeepSeek R1 Distill Llama 8B |
| Parameters | 8 B |
| Architecture | Decoder‑only Transformer |
| Max sequence length | 4 096 |
| Fine‑tuned epochs | 5 |
| Learning rate | 2 × 10⁻⁵ |
| Optimizer | AdamW |
| Loss | Cross‑entropy |

Dataset: 40 000 NF‑e lines (28 000 train | 12 000 test)
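The 70/30 split above can be reproduced with a simple shuffled cut. The seed and shuffling scheme here are assumptions for illustration; the card does not specify the dissertation's exact splitting procedure:

```python
import random

# Reproducible 70/30 split yielding the card's 28 000 / 12 000 counts.
records = list(range(40_000))   # stand-in for the 40 000 NF-e rows
rng = random.Random(42)         # assumed seed, chosen for reproducibility
rng.shuffle(records)

cut = int(len(records) * 0.70)
train, test = records[:cut], records[cut:]
print(len(train), len(test))  # → 28000 12000
```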
## Evaluation metrics

| Metric | Value |
|---|---|
| Accuracy | 0.9519 |
| Precision | 0.9457 |
| Recall | 0.9599 |
| F1‑score | 0.9527 |
| ROC‑AUC | 0.9812 |
### Confusion matrix (test set)

| | Pred 0 | Pred 1 |
|---|---|---|
| **True 0** | 5 603 | 334 |
| **True 1** | 243 | 5 820 |
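The headline metrics can be re-derived directly from these four counts, with the sanctioned-supplier class (1) as the positive class:

```python
# Counts from the confusion matrix above (positive class = 1).
tn, fp = 5603, 334
fn, tp = 243, 5820

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 11423 / 12000
precision = tp / (tp + fp)                    # 5820 / 6154
recall = tp / (tp + fn)                       # 5820 / 6063
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# → 0.9519 0.9457 0.9599 0.9528
```

Note that F1 from these counts rounds to 0.9528, a hair above the table's 0.9527; last-digit differences like this typically come from rounding versus truncation.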
## Limitations & Biases
- Relies only on free‑text invoice fields; numeric anomalies (e.g., price outliers) are out of scope.
- Trained on 2023 federal data; state/municipal or older invoices may need adaptation.
- False positives are expected; always corroborate with additional data.
## Ethical considerations
Use this model as an assistant, not a final verdict. Always corroborate predictions with official sanction registries.
## License
Released under the Apache License 2.0. This applies to weights, code and dataset scripts.
## Resources

- **GitHub** — source code, data‑processing notebooks and training logs: https://github.com/CleitonOERocha/Mestrado
- **Hugging Face** — model hub page: https://huggingface.co/CleitonOERocha/deepseek-r1-distill-llama-8B-finetuned-nfe-detection
## Citation

```bibtex
@mastersthesis{rocha2025nfe,
  author  = {Cleiton Otavio da Exaltação Rocha and Gecynalda Soares da Silva Gomes},
  title   = {Detection of Potentially Untrustworthy Companies through Government Procurement Extracts},
  school  = {Universidade Federal da Bahia},
  year    = {2025},
  address = {Salvador, Brasil}
}
```
## Contact
Open an issue on the GitHub repo or tag @CleitonOERocha on the 🤗 Hub.