---
license: apache-2.0
language:
- en
library_name: gliner
pipeline_tag: token-classification
tags:
- NER
- GLiNER
- information extraction
- PII
- PHI
- PCI
- entity recognition
- multilingual
---
# GLiNER-PII: Zero-shot PII model
A production-grade open-source model for privacy-focused PII, PHI, and PCI detection with zero-shot entity recognition capabilities.
This model was developed in collaboration between [Wordcab](https://wordcab.com/) and [Knowledgator](https://www.knowledgator.com/). For enterprise-ready, specialized PII/PHI/PCI models, contact us at [email protected].
## 🧠 What is GLiNER?
GLiNER (Generalist and Lightweight Named Entity Recognition) is a bidirectional transformer model that can identify **any entity type** without predefined categories. Unlike traditional NER models that are limited to specific entity classes, GLiNER allows you to specify exactly what entities you want to extract at runtime.
### Key Advantages
- **Zero-shot recognition**: Extract any entity type without retraining
- **Privacy-first**: Process sensitive data locally without API calls
- **Lightweight**: Much faster than large language models for NER tasks
- **Production-ready**: Quantization-aware training with FP16 and UINT8 ONNX models
- **Comprehensive**: 60+ predefined PII categories with custom entity support
### How GLiNER Works
Instead of predicting from a fixed set of entity classes, GLiNER takes both text and a list of desired entity types as input, then identifies spans that match those categories:
```python
text = "John Smith called from 415-555-1234 to discuss his account."
entities = ["name", "phone number", "account number"]
# GLiNER finds: "John Smith" → name, "415-555-1234" → phone number
```
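For orientation, the examples in this README read each prediction as a dictionary with the keys `text`, `label`, `start`, `end`, and `score`. The sketch below builds such records by hand and groups them by label. The scores are made up, and `make_entity` and `group_by_label` are hypothetical helpers for illustration, not part of the `gliner` package.

```python
from collections import defaultdict

text = "John Smith called from 415-555-1234 to discuss his account."

def make_entity(text, span, label, score):
    """Build a record shaped like a prediction (hypothetical helper)."""
    start = text.index(span)  # character offset of the first occurrence
    return {"text": span, "label": label, "start": start,
            "end": start + len(span), "score": score}

# Hand-written stand-ins for model output; the scores are illustrative only.
entities = [
    make_entity(text, "John Smith", "name", 0.95),
    make_entity(text, "415-555-1234", "phone number", 0.92),
]

def group_by_label(entities):
    """Collect the matched strings under each label."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["label"]].append(entity["text"])
    return dict(grouped)

grouped = group_by_label(entities)
print(grouped)
```

The `start` and `end` offsets index into the original string, so `text[start:end]` recovers the matched span.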
## 🐍 Python Implementation
The primary GLiNER implementation provides comprehensive PII detection with 60+ entity categories, fine-tuned specifically for privacy and compliance use cases.
### Installation
```bash
pip install gliner
```
### Quick Start
```python
from gliner import GLiNER
# Load the model (downloads automatically on first use)
model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")
text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
labels = ["name", "phone number", "account number"]
entities = model.predict_entities(text, labels, threshold=0.3)
for entity in entities:
    print(f"{entity['text']} => {entity['label']} (confidence: {entity['score']:.2f})")
```
Output:
```
John Smith => name (confidence: 0.95)
415-555-1234 => phone number (confidence: 0.92)
12345678 => account number (confidence: 0.88)
```
### Comprehensive PII Detection
The model is optimized for 60+ predefined PII categories, organized by domain below. It also works zero-shot, so you can pass any other labels you need:
#### Personal Identifiers
```python
personal_labels = [
    "name",                       # Full names
    "first name",                 # First names
    "last name",                  # Last names
    "name medical professional",  # Healthcare provider names
    "dob",                        # Date of birth
    "age",                        # Age information
    "gender",                     # Gender identifiers
    "marital status"              # Marital status
]
```
#### Contact Information
```python
contact_labels = [
    "email address",     # Email addresses
    "phone number",      # Phone numbers
    "ip address",        # IP addresses
    "url",               # URLs
    "location address",  # Street addresses
    "location street",   # Street names
    "location city",     # City names
    "location state",    # State/province names
    "location country",  # Country names
    "location zip"       # ZIP/postal codes
]
```
#### Financial Information
```python
financial_labels = [
    "account number",          # Account numbers
    "bank account",            # Bank account numbers
    "routing number",          # Routing numbers
    "credit card",             # Credit card numbers
    "credit card expiration",  # Card expiration dates
    "cvv",                     # CVV/security codes
    "ssn",                     # Social Security Numbers
    "money"                    # Monetary amounts
]
```
#### Healthcare Information
```python
healthcare_labels = [
    "condition",                      # Medical conditions
    "medical process",                # Medical procedures
    "drug",                           # Drugs
    "dose",                           # Dosage information
    "blood type",                     # Blood types
    "injury",                         # Injuries
    "organization medical facility",  # Healthcare facility names
    "healthcare number",              # Healthcare numbers
    "medical code"                    # Medical codes
]
```
#### Identification Documents
```python
id_labels = [
    "passport number",  # Passport numbers
    "driver license",   # Driver's license numbers
    "username",         # Usernames
    "password",         # Passwords
    "vehicle id"        # Vehicle IDs
]
```
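In practice you will often want labels from several domains in one call. The following is a minimal sketch of one way to merge them, using a shortened copy of the lists above. The `LABELS_BY_DOMAIN` dictionary and the `build_label_list` and `domain_of` helpers are illustrative names, not part of the `gliner` package.

```python
# A shortened copy of the category lists above, keyed by domain.
LABELS_BY_DOMAIN = {
    "personal": ["name", "dob", "age"],
    "contact": ["email address", "phone number", "location address"],
    "financial": ["account number", "credit card", "ssn"],
    "healthcare": ["condition", "drug", "medical code"],
    "identification": ["passport number", "driver license", "username"],
}

def build_label_list(domains):
    """Merge the label lists for the given domains, keeping order and dropping duplicates."""
    merged = []
    for domain in domains:
        for label in LABELS_BY_DOMAIN[domain]:
            if label not in merged:
                merged.append(label)
    return merged

def domain_of(label):
    """Return the first domain that lists this label, or None for a custom label."""
    for domain, labels in LABELS_BY_DOMAIN.items():
        if label in labels:
            return domain
    return None

labels = build_label_list(["personal", "financial"])
print(labels)
```

The merged list can be passed as the labels argument of `model.predict_entities`, and `domain_of` can route each detected entity to a domain-specific handling policy.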
### Advanced Usage Examples
#### Multi-Category Detection
```python
text = """
Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024
from St. Mary's Hospital. Contact: [email protected], (555) 123-4567.
Insurance policy: POL-789456123.
"""
labels = [
    "name", "dob", "discharge date", "organization medical facility",
    "email address", "phone number", "policy number"
]
entities = model.predict_entities(text, labels, threshold=0.3)
for entity in entities:
    print(f"Found '{entity['text']}' as {entity['label']}")
```
#### Batch Processing for High Throughput
```python
documents = [
"Customer John called about his credit card ending in 4532.",
"Sarah's SSN 123-45-6789 needs verification.",
"Email [email protected] for account 987654321 issues."
]
labels = ["name", "credit card", "ssn", "email address", "account number"]
# Process multiple documents efficiently
results = model.run(documents, labels, threshold=0.3, batch_size=8)
for doc_idx, entities in enumerate(results):
    print(f"\nDocument {doc_idx + 1}:")
    for entity in entities:
        print(f"  {entity['text']} => {entity['label']}")
```
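Transformer models accept a bounded input length, so very long documents may need to be split before prediction (the exact limit for this model is not stated here). Below is a rough sketch of overlapping word windows with offsets shifted back to the full text. The window sizes are arbitrary assumptions, `split_into_windows` and `predict_long` are hypothetical helpers, and `fake_predict` stands in for something like `lambda chunk: model.predict_entities(chunk, labels, threshold=0.3)`.

```python
import re

def split_into_windows(text, max_words=200, overlap=20):
    """Yield (char_offset, chunk) pairs of overlapping word windows over text."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = list(re.finditer(r"\S+", text))
    step = max_words - overlap
    for i in range(0, len(words), step):
        window = words[i:i + max_words]
        start, end = window[0].start(), window[-1].end()
        yield start, text[start:end]
        if i + max_words >= len(words):
            break  # this window already reaches the end of the text

def predict_long(text, predict, max_words=200, overlap=20):
    """Run predict on each window and shift entity offsets back to the full text."""
    seen, merged = set(), []
    for offset, chunk in split_into_windows(text, max_words, overlap):
        for entity in predict(chunk):
            shifted = dict(entity, start=entity["start"] + offset,
                           end=entity["end"] + offset)
            key = (shifted["start"], shifted["end"], shifted["label"])
            if key not in seen:  # the overlap region can report a span twice
                seen.add(key)
                merged.append(shifted)
    return sorted(merged, key=lambda e: e["start"])

# Stand-in for the model so the sketch runs without downloading weights.
def fake_predict(chunk):
    return [{"text": m.group(), "label": "ssn", "start": m.start(),
             "end": m.end(), "score": 1.0}
            for m in re.finditer(r"\d{3}-\d{2}-\d{4}", chunk)]

long_text = " ".join(["filler"] * 30 + ["123-45-6789"] + ["filler"] * 30)
found = predict_long(long_text, fake_predict, max_words=25, overlap=10)
print(found)
```

The overlap reduces the chance that an entity is cut in half at a window boundary. Duplicates from the shared region are dropped by comparing absolute offsets and labels.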
#### Custom Entity Detection
```python
# GLiNER isn't limited to PII - you can detect any entities
text = "The MacBook Pro with M2 chip costs $1,999 at the Apple Store in Manhattan."
custom_labels = ["product", "processor", "price", "store", "location"]
entities = model.predict_entities(text, custom_labels, threshold=0.3)
```
#### Threshold Optimization
```python
# Lower threshold: Higher recall, more false positives
high_recall = model.predict_entities(text, labels, threshold=0.2)
# Higher threshold: Higher precision, fewer false positives
high_precision = model.predict_entities(text, labels, threshold=0.6)
# Recommended starting point for production
balanced = model.predict_entities(text, labels, threshold=0.3)
```
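A single global threshold may not suit every label. One option, sketched here under the assumption that you first predict with a permissive threshold, is to filter the results afterwards with per-label cut-offs. `filter_by_label_threshold` is a hypothetical helper, and the scores and thresholds are illustrative.

```python
def filter_by_label_threshold(entities, thresholds, default=0.3):
    """Keep entities whose score meets the threshold configured for their label."""
    return [e for e in entities
            if e["score"] >= thresholds.get(e["label"], default)]

# Illustrative predictions, as if returned with a permissive threshold such as 0.2.
candidates = [
    {"text": "John Smith", "label": "name", "start": 0, "end": 10, "score": 0.45},
    {"text": "123-45-6789", "label": "ssn", "start": 20, "end": 31, "score": 0.25},
    {"text": "account", "label": "account number", "start": 40, "end": 47, "score": 0.22},
]

# Hypothetical policy: strict on names, permissive on SSNs, default elsewhere.
thresholds = {"name": 0.5, "ssn": 0.2}
kept = filter_by_label_threshold(candidates, thresholds)
print([e["label"] for e in kept])
```

Labels without an explicit entry fall back to `default`, which here matches the 0.3 starting point recommended above.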
## 💡 Use Cases
GLiNER excels in privacy-focused applications where traditional cloud-based NER services pose compliance risks.
### 🎯 **Primary Applications**
#### Privacy-First Voice & Transcription
```python
# Automatically redact PII from voice transcriptions
transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"
pii_labels = ["name", "phone number", "email address", "ssn"]
entities = model.predict_entities(transcription, pii_labels)
# Redact or anonymize detected PII before storage
```
#### Compliance-Ready Document Processing
```python
# Healthcare: HIPAA-compliant note processing
medical_note = "Patient John Doe, MRN 123456, diagnosed with diabetes..."
phi_labels = ["name", "medical record number", "condition", "dob"]
# Finance: PCI-DSS compliant transaction logs
transaction_log = "Card ****4532 charged $299.99 to John Smith"
pci_labels = ["credit card", "money", "name"]
# Legal: Attorney-client privilege protection
legal_doc = "Client Jane Doe vs. Corporation ABC, case #2024-CV-001"
legal_labels = ["name", "organization", "case number"]
```
#### Real-Time Data Anonymization
```python
def anonymize_text(text, entity_types):
    """Anonymize PII in real-time"""
    entities = model.predict_entities(text, entity_types)
    # Sort by position to replace from end to start
    entities.sort(key=lambda x: x['start'], reverse=True)
    anonymized = text
    for entity in entities:
        placeholder = f"<{entity['label'].upper()}>"
        anonymized = anonymized[:entity['start']] + placeholder + anonymized[entity['end']:]
    return anonymized

original = "John Smith's SSN is 123-45-6789"
anonymized = anonymize_text(original, ["name", "ssn"])
print(anonymized)  # "<NAME>'s SSN is <SSN>"
```
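When the same person or number appears several times, it can help to keep placeholders consistent so the anonymized text stays readable. The variant below is a sketch, not the library's method: `pseudonymize` is a hypothetical helper that takes already-predicted entities, numbers placeholders per label, and skips spans that overlap one already replaced. The entities are built with a regex as a stand-in for model output.

```python
import re

def pseudonymize(text, entities):
    """Replace each entity with a numbered placeholder, reusing numbers for repeated values."""
    counters, assigned = {}, {}
    # Assign placeholders in reading order so numbering follows the text.
    for entity in sorted(entities, key=lambda e: e["start"]):
        key = (entity["label"], entity["text"].lower())
        if key not in assigned:
            count = counters.get(entity["label"], 0) + 1
            counters[entity["label"]] = count
            tag = entity["label"].upper().replace(" ", "_")
            assigned[key] = f"<{tag}_{count}>"
    # Substitute from the end so earlier offsets stay valid.
    result, last_start = text, len(text)
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        if entity["end"] > last_start:
            continue  # overlaps a span that was already replaced
        key = (entity["label"], entity["text"].lower())
        result = result[:entity["start"]] + assigned[key] + result[entity["end"]:]
        last_start = entity["start"]
    return result

text = "Ann called Bob. Later Ann emailed Bob."
# Stand-in for model output: every "Ann" or "Bob" tagged as a name.
entities = [{"text": m.group(), "label": "name", "start": m.start(),
             "end": m.end(), "score": 0.9}
            for m in re.finditer(r"Ann|Bob", text)]

masked = pseudonymize(text, entities)
print(masked)  # <NAME_1> called <NAME_2>. Later <NAME_1> emailed <NAME_2>.
```

Matching on the lower-cased text is a simplifying assumption, and real data may need fuzzier matching (for example, "J. Smith" versus "John Smith").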
### 🌟 **Extended Applications**
#### Enhanced Search & Content Understanding
```python
# Extract key entities from user queries for better search
query = "Find restaurants near Stanford University in Palo Alto"
search_entities = ["organization", "location city", "business type"]
# Intelligent document tagging
document = "This quarterly report discusses Microsoft's Azure growth..."
doc_entities = ["organization", "product", "time period"]
```
#### GDPR-Compliant Chatbot Logs
```python
def sanitize_chat_log(message):
    """Remove PII from chat logs per GDPR requirements"""
    sensitive_types = [
        "name", "email address", "phone number", "location address",
        "credit card", "ssn", "passport number"
    ]
    entities = model.predict_entities(message, sensitive_types)
    if entities:
        # Log anonymized version, alert compliance team
        return anonymize_text(message, sensitive_types)
    return message
```
#### Secure Mobile & Edge Processing
```python
# Process sensitive data entirely on-device
def process_locally(user_input):
    """Process PII detection without cloud APIs"""
    pii_types = ["name", "phone number", "email address", "ssn", "credit card"]
    # All processing happens locally - no data leaves device
    detected_pii = model.predict_entities(user_input, pii_types)
    if detected_pii:
        return "⚠️ Sensitive information detected - proceed with caution"
    return "✅ No PII detected - safe to share"
```
## 📊 Performance Benchmarks
### Accuracy Evaluation
The following benchmarks were run on the **synthetic-multi-pii-ner-v1** dataset.
We compare multiple GLiNER-based PII models, including our new **Knowledgator GLiNER PII Edge v1.0**.
| Model Path | Precision | Recall | F1 Score |
| ---------------------------------------------------------------------- | --------- | ------ | ---------- |
| **knowledgator/gliner-pii-edge-v1.0** | 78.96% | 72.34% | **75.50%** |
| **knowledgator/gliner-pii-small-v1.0** | 78.99% | 74.80% | **76.84%** |
| **knowledgator/gliner-pii-base-v1.0** | 79.28% | 82.78% | **80.99%** |
| **knowledgator/gliner-pii-large-v1.0**                                 | 87.42%    | 79.40% | **83.25%** |
| **urchade/gliner\_multi\_pii-v1** | 79.19% | 74.67% | **76.86%** |
| **E3-JSI/gliner-multi-pii-domains-v1** | 78.35% | 74.46% | **76.36%** |
| **gravitee-io/gliner-pii-detection** | 81.27% | 56.76% | **66.84%** |
### Key Takeaways
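* **Large Model** (`knowledgator/gliner-pii-large-v1.0`) achieves the **highest F1 score (83.25%)** and the highest precision (87.42%), while the **Base Model** (`knowledgator/gliner-pii-base-v1.0`) has the highest recall (82.78%) with an F1 score of 80.99%.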
* **Knowledgator Edge Model** (`knowledgator/gliner-pii-edge-v1.0`) is optimized for **edge environments**, trading a slight decrease in recall for lower latency and footprint.
* **Gravitee-io Model** shows strong precision but lower recall, indicating it is tuned for high confidence but misses more entities.
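
As a quick check on how the columns relate, F1 is the harmonic mean of precision and recall. The snippet below recomputes it for the base model row. Because the table shows rounded percentages, recomputing other rows this way can differ from the listed F1 in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall for knowledgator/gliner-pii-base-v1.0 from the table.
base_f1 = f1_score(79.28, 82.78)
print(f"{base_f1:.2f}")  # 80.99
```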
### Comparison with Alternatives
| Solution | Speed | Privacy | Accuracy | Flexibility | Cost |
| --------------------- | ----- | ------- | -------- | ----------- | -------- |
| **GLiNER** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Free |
| Cloud NER APIs | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | \$\$\$ |
| Large Language Models | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | \$\$\$\$ |
| Traditional NER | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | Free |
## 🚀 Alternative Implementations
While Python provides the most comprehensive PII detection capabilities, GLiNER is available across multiple languages for different deployment scenarios.
### 🦀 Rust Implementation (gline-rs)
**Best for**: High-performance backend services, microservices
```toml
[dependencies]
"gline-rs" = "1"
```
```rust
use gline_rs::{GLiNER, TextInput, Parameters, RuntimeParameters};

let model = GLiNER::<TokenMode>::new(
    Parameters::default(),
    RuntimeParameters::default(),
    "tokenizer.json",
    "model.onnx",
)?;

let input = TextInput::from_str(
    &["My name is James Bond."],
    &["person"],
)?;

let output = model.inference(input)?;
```
**Performance**: 4x faster than Python on CPU, 37x faster with GPU acceleration.
### ⚑ C++ Implementation (GLiNER.cpp)
**Best for**: Embedded systems, mobile apps, edge devices
```cpp
#include "GLiNER/model.hpp"
gliner::Config config{12, 512};
gliner::Model model("./model.onnx", "./tokenizer.json", config);
std::vector<std::string> texts = {"John works at Microsoft"};
std::vector<std::string> entities = {"person", "organization"};
auto output = model.inference(texts, entities);
```
### 🌐 JavaScript Implementation (GLiNER.js)
**Best for**: Web applications, browser-based processing
```bash
npm install gliner
```
```javascript
import { Gliner } from 'gliner';

const gliner = new Gliner({
  tokenizerPath: "onnx-community/gliner_small-v2",
  onnxSettings: {
    modelPath: "public/model.onnx",
    executionProvider: "webgpu",
  }
});

await gliner.initialize();

const results = await gliner.inference({
  texts: ["John Smith works at Microsoft"],
  entities: ["person", "organization"],
  threshold: 0.1,
});
```
## 🏗️ Model Architecture & Training
### Quantization-Aware Pretraining
GLiNER models use quantization-aware pretraining, which optimizes performance while maintaining accuracy. This allows efficient inference even with quantized models.
### Available ONNX Formats
| Format | Size | Use Case |
|--------|------|----------|
| **FP16** | 330MB | Balanced performance/accuracy |
| **UINT8** | 197MB | Maximum efficiency |
### Model Conversion
```bash
python convert_to_onnx.py \
    --model_path knowledgator/gliner-pii-base-v1.0 \
    --save_path ./model \
    --quantize True  # For UINT8 quantization
```
## 📄 References
- [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)
- [GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks](https://arxiv.org/abs/2406.12925)
- [Named Entity Recognition as Structured Span Prediction](https://arxiv.org/abs/2212.13415)
## 🙏 Acknowledgments
Special thanks to all GLiNER contributors and the Wordcab team, and additional thanks to the maintainers of the Rust, C++, and JavaScript implementations.
## 📞 Support
- **Hugging Face**: [Ihor/gliner-pii-small](https://huggingface.co/Ihor/gliner-pii-small)
- **GitHub Issues**: [Report bugs and request features](https://github.com/info-wordcab/wordcab-pii)
- **Discord**: [Join community discussions](https://discord.gg/wRF7tuY9)
---
*GLiNER: Open-source privacy-first entity recognition for production applications.*