---
license: apache-2.0
language:
- it
- en
pipeline_tag: token-classification
tags:
- legal
- finance
- medical
- privacy
- named-entity-recognition
---

---

**Found this resource helpful?** Creating and maintaining open-source AI models and datasets requires significant computational resources. If this work has been valuable to you, consider [supporting my research](https://buymeacoffee.com/michele.montebovi) to help me continue building tools that benefit the entire AI community. Every contribution directly funds more open-source innovation!

---

# Italian_NER_XXL_v2

## Model Overview

Italian_NER_XXL_v2 is the second generation of this Named Entity Recognition (NER) model for Italian text. It reaches an **accuracy of 87.5%** and an **F1 score of 89.2%**, an accuracy gain of 8.5 percentage points over the previous version (79%).

## Key Improvements

- **Enhanced Accuracy**: From 79% to 87.5%
- **Better Context Understanding**: Improved recognition of entities in complex sentences
- **Reduced False Positives**: More precise identification of sensitive information
- **Expanded Training Data**: Trained on a more diverse corpus of Italian text

## Market Leadership

Italian_NER_XXL_v2 remains the only model in Italy capable of identifying a comprehensive range of **52** different entity categories, maintaining our unique position in the Italian NLP landscape. This unparalleled breadth of entity recognition makes our model the premier choice for privacy, legal, and financial applications.

## Recognized Categories

Our model identifies an extensive range of entities across multiple domains:

### Personal Information

- **NOME**: First name of a person
- **COGNOME**: Last name of a person
- **DATA_NASCITA**: Date of birth
- **DATA_MORTE**: Date of death
- **ETA**: Age of a person
- **CODICE_FISCALE**: Italian tax code
- **PROFESSIONE**: Occupation or profession
- **STATO_CIVILE**: Marital status

### Contact Information

- **INDIRIZZO**: Physical address
- **NUMERO_TELEFONO**: Phone number
- **EMAIL**: Email address
- **CODICE_POSTALE**: Postal code

### Financial Information

- **VALUTA**: Currency
- **IMPORTO**: Monetary amount
- **NUMERO_CARTA**: Credit/debit card number
- **CVV**: Card security code
- **NUMERO_CONTO**: Bank account number
- **IBAN**: International bank account number
- **BIC**: Bank identifier code
- **P_IVA**: VAT number
- **TASSO_MUTUO**: Mortgage rate
- **NUM_ASSEGNO_BANCARIO**: Bank check number
- **BANCA**: Bank name

### Legal Entities

- **RAGIONE_SOCIALE**: Company legal name
- **TRIBUNALE**: Court identifier
- **LEGGE**: Law reference
- **N_SENTENZA**: Court judgment (ruling) number
- **N_LICENZA**: License number
- **AVV_NOTAIO**: Lawyer or notary reference
- **REGIME_PATRIMONIALE**: Marital property regime

### Medical Information

- **CARTELLA_CLINICA**: Medical record
- **MALATTIA**: Disease or medical condition
- **MEDICINA**: Medicine or medical treatment
- **STORIA_CLINICA**: Clinical history
- **STRENGTH**: Medicine strength
- **FREQUENZA**: Treatment frequency
- **DURATION**: Duration of treatment
- **DOSAGGIO**: Medicine dosage
- **FORM**: Medicine form (e.g., tablet)

### Technical Information

- **IP**: IP address
- **IPV6_1**: IPv6 address
- **MAC**: MAC address
- **USER_AGENT**: Browser user agent
- **IMEI**: Mobile device identifier

### Geographic and Temporal Data

- **STATO**: Country or nation
- **LUOGO**: Geographic location
- **ORARIO**: Specific time
- **DATA**: Generic date

### Document and Vehicle Information

- **NUMERO_DOCUMENTO**: Document number
- **TARGA_VEICOLO**: Vehicle license plate
- **FOGLIO**: Document sheet reference
- **PARTICELLA**: Land registry parcel
- **MAPPALE**: Land registry map reference
- **SUBALTERNO**: Land registry sub-unit reference

### Web and Security

- **URL**: Web address
- **PASSWORD**: Password
- **PIN**: Personal identification number
- **BRAND**: Commercial brand or trademark
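
To check which entity types a loaded checkpoint actually exposes, one option is to read the label map from the model configuration (`model.config.id2label` in Transformers). The sketch below assumes the labels follow a BIO scheme (`B-NOME`, `I-NOME`, `O`), which this card does not state. The `entity_types` helper and the sample map are hypothetical and only illustrate the idea.

```python
def entity_types(id2label):
    """Return the sorted set of entity types found in an id-to-label map.

    Assumption: labels may carry BIO prefixes ("B-", "I-") and "O" marks
    non-entity tokens. Labels without a prefix are kept unchanged.
    """
    types = set()
    for label in id2label.values():
        if label == "O":
            continue
        if label[:2] in ("B-", "I-"):
            label = label[2:]
        types.add(label)
    return sorted(types)


# Hand-written stand-in for model.config.id2label (not the real label map)
sample_id2label = {0: "O", 1: "B-NOME", 2: "I-NOME", 3: "B-IBAN", 4: "LUOGO"}
print(entity_types(sample_id2label))
```

With a loaded model, the same helper could be called as `entity_types(model.config.id2label)`.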

## Implementation

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("DeepMount00/Italian_NER_XXL_v2")
model = AutoModelForTokenClassification.from_pretrained("DeepMount00/Italian_NER_XXL_v2")

# Create NER pipeline; "simple" aggregation merges sub-word tokens into whole entities
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
example = """Il commendatore Gianluigi Alberico De Laurentis-Ponti, con residenza legale in Corso Imperatrice 67,
Torino, avente codice fiscale DLNGGL60B01L219P, è amministratore delegato della "De Laurentis Advanced Engineering
Group S.p.A.", che si trova in Piazza Affari 32, Milano (MI); con una partita IVA di 09876543210, la società è stata
recentemente incaricata di sviluppare una nuova linea di componenti aerospaziali per il progetto internazionale
di esplorazione di Marte."""

# Run NER
ner_results = nlp(example)

# Print each detected entity with its label and confidence score
for entity in ner_results:
    print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.4f})")
```
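
With an aggregation strategy set, the pipeline returns a list of dicts carrying `entity_group`, `score`, `word`, `start`, and `end`. A minimal sketch of post-processing that output follows: it drops low-confidence predictions and groups the rest by label. The `group_entities` helper, the 0.80 threshold, and the `sample_results` values are illustrative assumptions, not real model output.

```python
from collections import defaultdict


def group_entities(ner_results, min_score=0.80):
    """Group entity texts by label, keeping only predictions at or above min_score."""
    grouped = defaultdict(list)
    for entity in ner_results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output on "Mario Rossi vive a Torino."
sample_results = [
    {"entity_group": "NOME", "score": 0.99, "word": "Mario", "start": 0, "end": 5},
    {"entity_group": "COGNOME", "score": 0.97, "word": "Rossi", "start": 6, "end": 11},
    {"entity_group": "LUOGO", "score": 0.42, "word": "Torino", "start": 19, "end": 25},
]

print(group_entities(sample_results))
```

The threshold trades recall for precision, so a privacy-focused workflow may prefer a lower value than an analytics one.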

## Use Cases

- **Privacy Compliance**: GDPR data mapping and PII detection
- **Document Anonymization**: Automated redaction of sensitive information
- **Legal Document Analysis**: Extraction of key entities from contracts and legal texts
- **Financial Monitoring**: Detection of financial entities for compliance and fraud prevention
- **Medical Record Processing**: Structured extraction from clinical notes and reports
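
As an illustration of the anonymization use case, the sketch below replaces detected spans with their label, using the `start` and `end` character offsets that the Transformers pipeline returns when an aggregation strategy is set. The `redact` helper, its parameters, and the sample entities are hypothetical. The function assumes the spans do not overlap.

```python
def redact(text, ner_results, labels=None, min_score=0.5):
    """Replace each detected entity span with a "[LABEL]" placeholder.

    labels: optional set of entity types to redact (None redacts all types).
    Assumption: entity spans do not overlap.
    """
    selected = [
        e for e in ner_results
        if e["score"] >= min_score and (labels is None or e["entity_group"] in labels)
    ]
    # Work right to left so earlier offsets stay valid after each replacement
    for e in sorted(selected, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"[{e['entity_group']}]" + text[e["end"]:]
    return text


sample_text = "Mario Rossi vive a Torino."
# Hand-written stand-in for pipeline output, not real model predictions
sample_entities = [
    {"entity_group": "NOME", "score": 0.99, "word": "Mario", "start": 0, "end": 5},
    {"entity_group": "COGNOME", "score": 0.97, "word": "Rossi", "start": 6, "end": 11},
    {"entity_group": "LUOGO", "score": 0.95, "word": "Torino", "start": 19, "end": 25},
]

print(redact(sample_text, sample_entities))
print(redact(sample_text, sample_entities, labels={"NOME", "COGNOME"}))
```

Passing a `labels` set limits redaction to selected categories, for example personal names only while locations stay visible.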

## Future Development

We're committed to continuous improvement of the model:

- Quarterly updates with further accuracy enhancements
- Expansion to include new entity types based on user feedback
- Development of domain-specific variants for specialized applications
- Integration of contextual entity linking capabilities

## Contribution and Contact

Your feedback is essential to improving this model. If you're interested in contributing, have suggestions, or need a customized NER solution, please contact:

Michele Montebovi

Email: [[email protected]](mailto:[email protected])

We welcome collaboration from the Italian NLP community to further enhance this tool and expand its applications across industries.

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{montebovi2025italiannerxxl,
  author = {Montebovi, Michele},
  title = {Italian\_NER\_XXL\_v2: A Comprehensive Named Entity Recognition Model for Italian},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DeepMount00/Italian_NER_XXL_v2}}
}
```