	---
	license: apache-2.0
	language:
	- it
	- en
	pipeline_tag: token-classification
	tags:
	- legal
	- finance
	- medical
	- privacy
	- named-entity-recognition
	---

	---

	💡 Found this resource helpful? Creating and maintaining open source AI models and datasets requires significant computational resources. If this work has been valuable to you, consider [supporting my research](https://buymeacoffee.com/michele.montebovi) to help me continue building tools that benefit the entire AI community. Every contribution directly funds more open source innovation! ☕

	---

	# Italian_NER_XXL_v2

	## 🚀 Model Overview
	Italian_NER_XXL_v2 is the second generation of our Named Entity Recognition model for Italian text. Building on the previous version, it reaches an accuracy of 87.5% and an F1 score of 89.2%, an accuracy improvement of more than 8 percentage points over the previous model (79%).

	## 💡 Key Improvements
	- Enhanced Accuracy: From 79% to 87.5%
	- Better Context Understanding: Improved recognition of entities in complex sentences
	- Reduced False Positives: More precise identification of sensitive information
	- Expanded Training Data: Trained on a more diverse corpus of Italian text

	## 🏆 Market Leadership
	Italian_NER_XXL_v2 remains the only model in Italy capable of identifying a comprehensive range of 52 different entity categories, maintaining our unique position in the Italian NLP landscape. This unparalleled breadth of entity recognition makes our model the premier choice for privacy, legal, and financial applications.

	## 📋 Recognized Categories
	Our model identifies an extensive range of entities across multiple domains:

	### Personal Information
	- NOME: First name of a person
	- COGNOME: Last name of a person
	- DATA_NASCITA: Date of birth
	- DATA_MORTE: Date of death
	- ETA: Age of a person
	- CODICE_FISCALE: Italian tax code
	- PROFESSIONE: Occupation or profession
	- STATO_CIVILE: Marital status

	### Contact Information
	- INDIRIZZO: Physical address
	- NUMERO_TELEFONO: Phone number
	- EMAIL: Email address
	- CODICE_POSTALE: Postal code

	### Financial Information
	- VALUTA: Currency
	- IMPORTO: Monetary amount
	- NUMERO_CARTA: Credit/debit card number
	- CVV: Card security code
	- NUMERO_CONTO: Bank account number
	- IBAN: International bank account number
	- BIC: Bank identifier code
	- P_IVA: VAT number
	- TASSO_MUTUO: Mortgage rate
	- NUM_ASSEGNO_BANCARIO: Bank check number
	- BANCA: Bank name

	### Legal Entities
	- RAGIONE_SOCIALE: Company legal name
	- TRIBUNALE: Court identifier
	- LEGGE: Law reference
	- N_SENTENZA: Court ruling number
	- N_LICENZA: License number
	- AVV_NOTAIO: Lawyer or notary reference
	- REGIME_PATRIMONIALE: Marital property regime

	### Medical Information
	- CARTELLA_CLINICA: Medical record
	- MALATTIA: Disease or medical condition
	- MEDICINA: Medicine or medical treatment
	- STORIA_CLINICA: Clinical history
	- STRENGTH: Medicine strength
	- FREQUENZA: Treatment frequency
	- DURATION: Duration of treatment
	- DOSAGGIO: Medicine dosage
	- FORM: Medicine form (e.g., tablet)

	### Technical Information
	- IP: IP address
	- IPV6_1: IPv6 address
	- MAC: MAC address
	- USER_AGENT: Browser user agent
	- IMEI: Mobile device identifier

	### Geographic and Temporal Data
	- STATO: Country or nation
	- LUOGO: Geographic location
	- ORARIO: Specific time
	- DATA: Generic date

	### Document and Vehicle Information
	- NUMERO_DOCUMENTO: Document number
	- TARGA_VEICOLO: Vehicle license plate
	- FOGLIO: Document sheet reference
	- PARTICELLA: Land registry parcel
	- MAPPALE: Land registry map reference
	- SUBALTERNO: Land registry sub-unit reference

	### Web and Security
	- URL: Web address
	- PASSWORD: Password
	- PIN: Personal identification number
	- BRAND: Commercial brand or trademark
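
The section headings above can double as a coarse filter when an application only cares about some domains. The following is a minimal sketch, assuming the pipeline reports these category names unchanged in its `entity_group` field (worth checking against the model's `config.id2label`). `DOMAIN_LABELS`, `labels_for`, and `keep_domains` are hypothetical helper names, and only three of the groups are spelled out.

```python
# Hypothetical helper: maps a few of the domains listed above to their labels.
# Label names are copied from this README; verify them against
# model.config.id2label before relying on them.
DOMAIN_LABELS = {
    "contact": {"INDIRIZZO", "NUMERO_TELEFONO", "EMAIL", "CODICE_POSTALE"},
    "technical": {"IP", "IPV6_1", "MAC", "USER_AGENT", "IMEI"},
    "web_security": {"URL", "PASSWORD", "PIN", "BRAND"},
}


def labels_for(*domains):
    """Return the union of labels for the requested domains."""
    selected = set()
    for domain in domains:
        selected |= DOMAIN_LABELS[domain]
    return selected


def keep_domains(entities, *domains):
    """Keep only entities whose entity_group belongs to the given domains."""
    wanted = labels_for(*domains)
    return [e for e in entities if e["entity_group"] in wanted]
```

For example, `keep_domains(ner_results, "contact", "technical")` would retain only contact and technical entities from a list of pipeline results.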

	## 💻 Implementation

	```python
	from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("DeepMount00/Italian_NER_XXL_v2")
	model = AutoModelForTokenClassification.from_pretrained("DeepMount00/Italian_NER_XXL_v2")

	# Create NER pipeline
	nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

	# Example text
	example = """Il commendatore Gianluigi Alberico De Laurentis-Ponti, con residenza legale in Corso Imperatrice 67,
	Torino, avente codice fiscale DLNGGL60B01L219P, è amministratore delegato della "De Laurentis Advanced Engineering
	Group S.p.A.", che si trova in Piazza Affari 32, Milano (MI); con una partita IVA di 09876543210, la società è stata
	recentemente incaricata di sviluppare una nuova linea di componenti aerospaziali per il progetto internazionale
	di esplorazione di Marte."""

	# Run NER
	ner_results = nlp(example)

	# Process results
	for entity in ner_results:
	    print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.4f})")
	```
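
With `aggregation_strategy="simple"`, each result is a dictionary containing at least `entity_group`, `word`, and `score`. As a rough sketch of post-processing, the results can be grouped by label while dropping low-confidence spans. The helper name `group_entities` and the `0.5` threshold are illustrative assumptions, not recommendations from the model author, and the sample list is hand-written rather than real model output.

```python
from collections import defaultdict


def group_entities(ner_results, min_score=0.5):
    """Group aggregated pipeline results by label, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for entity in ner_results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (not real model predictions).
sample = [
    {"entity_group": "NOME", "word": "Gianluigi", "score": 0.99},
    {"entity_group": "COGNOME", "word": "De Laurentis-Ponti", "score": 0.97},
    {"entity_group": "LUOGO", "word": "Marte", "score": 0.31},
]
print(group_entities(sample))
```

A suitable threshold depends on the application: redaction workflows may prefer a low threshold to favor recall, while extraction workflows may prefer a higher one.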

	## 🚀 Use Cases
	- Privacy Compliance: GDPR data mapping and PII detection
	- Document Anonymization: Automated redaction of sensitive information
	- Legal Document Analysis: Extraction of key entities from contracts and legal texts
	- Financial Monitoring: Detection of financial entities for compliance and fraud prevention
	- Medical Record Processing: Structured extraction from clinical notes and reports
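
The document anonymization use case can be sketched with plain string handling on top of the pipeline output. This is a minimal illustration, assuming a fast tokenizer so that each aggregated result carries `start` and `end` character offsets. The `redact` helper and the `[LABEL]` placeholder format are hypothetical choices, and the spans below are hand-written rather than real model output.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` follows the aggregated pipeline output format: dicts with
    `entity_group`, `start` and `end` (character offsets into `text`).
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['entity_group']}]"
        text = text[: entity["start"]] + placeholder + text[entity["end"] :]
    return text


# Hand-written spans for illustration (not real model output).
text = "Mario Rossi vive a Torino."
spans = [
    {"entity_group": "NOME", "start": 0, "end": 5},
    {"entity_group": "COGNOME", "start": 6, "end": 11},
    {"entity_group": "LUOGO", "start": 19, "end": 25},
]
print(redact(text, spans))  # [NOME] [COGNOME] vive a [LUOGO].
```

The sketch assumes non-overlapping spans. A production redaction pipeline would also need a review step, since any entity the model misses stays in the text.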

	## 🔮 Future Development
	We're committed to continuous improvement of the model:
	- Quarterly updates with further accuracy enhancements
	- Expansion to include new entity types based on user feedback
	- Development of domain-specific variants for specialized applications
	- Integration of contextual entity linking capabilities

	## 👥 Contribution and Contact
	Your feedback is essential to improving this model. If you're interested in contributing, have suggestions, or need a customized NER solution, please contact:

	Michele Montebovi
	Email: [email protected]

	We welcome collaboration from the Italian NLP community to further enhance this tool and expand its applications across industries.

	## 📝 Citation
	If you use this model in your research or applications, please cite:

	```bibtex
	@misc{montebovi2025italiannerxxl,
	author = {Montebovi, Michele},
	title = {Italian\_NER\_XXL\_v2: A Comprehensive Named Entity Recognition Model for Italian},
	year = {2025},
	publisher = {HuggingFace},
	howpublished = {\url{https://huggingface.co/DeepMount00/Italian_NER_XXL_v2}}
	}
	```