GLiNER-PII: Zero-shot PII model
A production-grade open-source model for privacy-focused PII, PHI, and PCI detection with zero-shot entity recognition capabilities. This model was developed in collaboration between Wordcab and Knowledgator. For enterprise-ready, specialized PII/PHI/PCI models, contact us at [email protected].
What is GLiNER?
GLiNER (Generalist and Lightweight Named Entity Recognition) is a bidirectional transformer model that can identify any entity type without predefined categories. Unlike traditional NER models that are limited to specific entity classes, GLiNER allows you to specify exactly what entities you want to extract at runtime.
Key Advantages
- Zero-shot recognition: Extract any entity type without retraining
- Privacy-first: Process sensitive data locally without API calls
- Lightweight: Much faster than large language models for NER tasks
- Production-ready: Quantization-aware training with FP16 and UINT8 ONNX models
- Comprehensive: 60+ predefined PII categories with custom entity support
How GLiNER Works
Instead of predicting from a fixed set of entity classes, GLiNER takes both text and a list of desired entity types as input, then identifies spans that match those categories:
text = "John Smith called from 415-555-1234 to discuss his account."
entities = ["name", "phone number", "account number"]
# GLiNER finds: "John Smith" → name, "415-555-1234" → phone number
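To make the input and output shape concrete, here is a minimal sketch of what a result might look like as plain Python data. The dictionary keys (`text`, `label`, `score`, `start`, `end`) mirror those used in the examples later in this document; the helper `make_entity` and the score values are illustrative assumptions, not model output.

```python
text = "John Smith called from 415-555-1234 to discuss his account."
labels = ["name", "phone number", "account number"]


def make_entity(source, surface, label, score):
    """Build a hypothetical entity record with character offsets into `source`."""
    start = source.index(surface)
    return {
        "start": start,
        "end": start + len(surface),
        "text": surface,
        "label": label,
        "score": score,
    }


# Hand-written stand-ins for model predictions (scores are made up).
entities = [
    make_entity(text, "John Smith", "name", 0.95),
    make_entity(text, "415-555-1234", "phone number", 0.92),
]

for e in entities:
    # Offsets are half-open: text[start:end] recovers the matched span.
    assert text[e["start"]:e["end"]] == e["text"]
    print(f'{e["text"]!r} [{e["start"]}:{e["end"]}] -> {e["label"]}')
```

Note that "account number" was requested but the sentence contains no such number, so no entity is returned for that label.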
Python Implementation
The primary GLiNER implementation provides comprehensive PII detection with 60+ entity categories, fine-tuned specifically for privacy and compliance use cases.
Installation
pip install gliner
Quick Start
from gliner import GLiNER
# Load the model (downloads automatically on first use)
model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")
text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
labels = ["name", "phone number", "account number"]
entities = model.predict_entities(text, labels, threshold=0.3)
for entity in entities:
    print(f"{entity['text']} => {entity['label']} (confidence: {entity['score']:.2f})")
Output:
John Smith => name (confidence: 0.95)
415-555-1234 => phone number (confidence: 0.92)
12345678 => account number (confidence: 0.88)
Comprehensive PII Detection
The model is optimized for 60+ predefined PII categories, organized by domain below. It also works zero-shot, so you can pass any additional labels you need:
Personal Identifiers
personal_labels = [
"name", # Full names
"first name", # First names
"last name", # Last names
"name medical professional", # Healthcare provider names
"dob", # Date of birth
"age", # Age information
"gender", # Gender identifiers
"marital status" # Marital status
]
Contact Information
contact_labels = [
"email address", # Email addresses
"phone number", # Phone numbers
"ip address", # IP addresses
"url", # URLs
"location address", # Street addresses
"location street", # Street names
"location city", # City names
"location state", # State/province names
"location country", # Country names
"location zip" # ZIP/postal codes
]
Financial Information
financial_labels = [
"account number", # Account numbers
"bank account", # Bank account numbers
"routing number", # Routing numbers
"credit card", # Credit card numbers
"credit card expiration", # Card expiration dates
"cvv", # CVV/security codes
"ssn", # Social Security Numbers
"money" # Monetary amounts
]
Healthcare Information
healthcare_labels = [
"condition", # Medical conditions
"medical process", # Medical procedures
"drug", # Drugs
"dose", # Dosage information
"blood type", # Blood types
"injury", # Injuries
"organization medical facility",# Healthcare facility names
"healthcare number", # Healthcare numbers
"medical code" # Medical codes
]
Identification Documents
id_labels = [
"passport number", # Passport numbers
"driver license", # Driver's license numbers
"username", # Usernames
"password", # Passwords
"vehicle id" # Vehicle IDs
]
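When several of these lists are used together, it can help to flatten them into one label list for the model call and then map each prediction back to its domain. The sketch below shows one way to do that; the names `LABELS_BY_DOMAIN`, `all_labels`, `domain_of`, and `group_by_domain` are hypothetical helpers, the lists are abbreviated copies of the ones above, and the sample predictions are hand-written.

```python
# Abbreviated copies of the label lists above, keyed by domain.
LABELS_BY_DOMAIN = {
    "personal": ["name", "dob", "age"],
    "contact": ["email address", "phone number", "location city"],
    "financial": ["account number", "credit card", "ssn"],
    "healthcare": ["condition", "drug", "dose"],
    "identification": ["passport number", "driver license", "username"],
}


def all_labels(labels_by_domain):
    """Flatten the per-domain lists into one ordered, de-duplicated list."""
    seen = {}
    for domain_labels in labels_by_domain.values():
        for label in domain_labels:
            seen.setdefault(label, None)
    return list(seen)


def domain_of(label, labels_by_domain):
    """Return the domain a label belongs to, or 'custom' for zero-shot labels."""
    for domain, domain_labels in labels_by_domain.items():
        if label in domain_labels:
            return domain
    return "custom"


def group_by_domain(entities, labels_by_domain):
    """Group entity dicts by the domain of their label."""
    grouped = {}
    for entity in entities:
        domain = domain_of(entity["label"], labels_by_domain)
        grouped.setdefault(domain, []).append(entity)
    return grouped


# Hypothetical predictions, shaped like the entity dicts used in this document.
sample = [
    {"text": "Mary Johnson", "label": "name"},
    {"text": "123-45-6789", "label": "ssn"},
    {"text": "POL-789456123", "label": "policy number"},
]
grouped = group_by_domain(sample, LABELS_BY_DOMAIN)
print({domain: [e["text"] for e in items] for domain, items in grouped.items()})
```

The flattened list from `all_labels` is what would be passed as the label argument; anything outside the predefined lists lands in the `custom` group.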
Advanced Usage Examples
Multi-Category Detection
text = """
Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024
from St. Mary's Hospital. Contact: [email protected], (555) 123-4567.
Insurance policy: POL-789456123.
"""
labels = [
"name", "dob", "discharge date", "organization medical facility",
"email address", "phone number", "policy number"
]
entities = model.predict_entities(text, labels, threshold=0.3)
for entity in entities:
    print(f"Found '{entity['text']}' as {entity['label']}")
Batch Processing for High Throughput
documents = [
"Customer John called about his credit card ending in 4532.",
"Sarah's SSN 123-45-6789 needs verification.",
"Email [email protected] for account 987654321 issues."
]
labels = ["name", "credit card", "ssn", "email address", "account number"]
# Process multiple documents efficiently
results = model.run(documents, labels, threshold=0.3, batch_size=8)
for doc_idx, entities in enumerate(results):
    print(f"\nDocument {doc_idx + 1}:")
    for entity in entities:
        print(f"  {entity['text']} => {entity['label']}")
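For auditing or storage, batched output is often easier to handle as flat records. The following sketch assumes, as the loop above implies, that the batch result is a list with one list of entity dicts per input document. `flatten_results` and `label_counts` are hypothetical helpers, and `sample_results` is hand-written stand-in data rather than real model output.

```python
from collections import Counter


def flatten_results(results):
    """Turn per-document entity lists into flat records tagged with a document index."""
    return [
        {"doc": doc_idx, "label": entity["label"], "text": entity["text"]}
        for doc_idx, entities in enumerate(results)
        for entity in entities
    ]


def label_counts(results):
    """Count how often each label was detected across the whole batch."""
    return Counter(entity["label"] for entities in results for entity in entities)


# Hand-written stand-in for a batch result (third document has no detections).
sample_results = [
    [{"text": "John", "label": "name"}, {"text": "4532", "label": "credit card"}],
    [{"text": "Sarah", "label": "name"}, {"text": "123-45-6789", "label": "ssn"}],
    [],
]

records = flatten_results(sample_results)
counts = label_counts(sample_results)
print(records)
print(counts.most_common())
```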
Custom Entity Detection
# GLiNER isn't limited to PII - you can detect any entities
text = "The MacBook Pro with M2 chip costs $1,999 at the Apple Store in Manhattan."
custom_labels = ["product", "processor", "price", "store", "location"]
entities = model.predict_entities(text, custom_labels, threshold=0.3)
Threshold Optimization
# Lower threshold: Higher recall, more false positives
high_recall = model.predict_entities(text, labels, threshold=0.2)
# Higher threshold: Higher precision, fewer false positives
high_precision = model.predict_entities(text, labels, threshold=0.6)
# Recommended starting point for production
balanced = model.predict_entities(text, labels, threshold=0.3)
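One possible refinement, offered as a sketch rather than a documented library feature: run the model once with a low threshold, then apply stricter or looser cut-offs per label in plain Python. This lets high-risk labels such as `ssn` favor recall while noisier labels favor precision. The function `filter_by_threshold` and the candidate scores below are assumptions for illustration.

```python
def filter_by_threshold(entities, default=0.3, per_label=None):
    """Keep entities whose score meets the cut-off for their label."""
    per_label = per_label or {}
    return [
        entity
        for entity in entities
        if entity["score"] >= per_label.get(entity["label"], default)
    ]


# Hypothetical low-threshold predictions.
candidates = [
    {"text": "John Smith", "label": "name", "score": 0.25},
    {"text": "123-45-6789", "label": "ssn", "score": 0.25},
    {"text": "Acme", "label": "name", "score": 0.55},
]

# Accept SSNs at lower confidence than other labels.
kept = filter_by_threshold(candidates, default=0.3, per_label={"ssn": 0.2})
print([(e["text"], e["label"]) for e in kept])
```

The first name candidate falls below the default cut-off and is dropped, while the SSN with the same score is kept because of its label-specific threshold.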
Use Cases
GLiNER excels in privacy-focused applications where traditional cloud-based NER services pose compliance risks.
Primary Applications
Privacy-First Voice & Transcription
# Automatically redact PII from voice transcriptions
transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"
pii_labels = ["name", "phone number", "email address", "ssn"]
entities = model.predict_entities(transcription, pii_labels)
# Redact or anonymize detected PII before storage
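The redaction step itself is not shown above. A minimal sketch, assuming each detected entity carries `start` and `end` character offsets, is to overwrite the span with a mask character so the transcript keeps its original length (useful when timestamps or other offsets are aligned to the text). `mask_spans` is a hypothetical helper and the `detected` list is hand-built rather than model output.

```python
def mask_spans(text, entities, mask_char="*"):
    """Replace each detected span with mask characters, preserving text length."""
    chars = list(text)
    for entity in entities:
        for i in range(entity["start"], entity["end"]):
            chars[i] = mask_char
    return "".join(chars)


transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"


def span_for(source, surface, label):
    """Build a hypothetical detection for the first occurrence of `surface`."""
    start = source.index(surface)
    return {"start": start, "end": start + len(surface), "label": label}


detected = [
    span_for(transcription, "Sarah Johnson", "name"),
    span_for(transcription, "415-555-0123", "phone number"),
]

masked = mask_spans(transcription, detected)
print(masked)
```

Because masking works character by character, overlapping detections do not shift any offsets.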
Compliance-Ready Document Processing
# Healthcare: HIPAA-compliant note processing
medical_note = "Patient John Doe, MRN 123456, diagnosed with diabetes..."
phi_labels = ["name", "medical record number", "condition", "dob"]
# Finance: PCI-DSS compliant transaction logs
transaction_log = "Card ****4532 charged $299.99 to John Smith"
pci_labels = ["credit card", "money", "name"]
# Legal: Attorney-client privilege protection
legal_doc = "Client Jane Doe vs. Corporation ABC, case #2024-CV-001"
legal_labels = ["name", "organization", "case number"]
Real-Time Data Anonymization
def anonymize_text(text, entity_types):
    """Anonymize PII in real-time"""
    entities = model.predict_entities(text, entity_types)
    # Sort by position to replace from end to start
    entities.sort(key=lambda x: x['start'], reverse=True)
    anonymized = text
    for entity in entities:
        placeholder = f"<{entity['label'].upper()}>"
        anonymized = anonymized[:entity['start']] + placeholder + anonymized[entity['end']:]
    return anonymized
original = "John Smith's SSN is 123-45-6789"
anonymized = anonymize_text(original, ["name", "ssn"])
print(anonymized) # "<NAME>'s SSN is <SSN>"
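A variation on the approach above, sketched under the assumption that detected spans do not overlap: number the placeholders so that repeated mentions of the same value map to the same tag, which keeps anonymized text readable. `pseudonymize` and `find_spans` are hypothetical helpers, and the spans are produced by string matching here instead of the model.

```python
import re


def pseudonymize(text, entities):
    """Replace spans with numbered placeholders; identical surface text reuses its number."""
    placeholders = {}  # (label, surface text) -> placeholder
    counters = {}      # label -> last number handed out

    # Assign numbers in reading order so <NAME_1> is the first name seen.
    for entity in sorted(entities, key=lambda e: e["start"]):
        key = (entity["label"], text[entity["start"]:entity["end"]])
        if key not in placeholders:
            number = counters.get(entity["label"], 0) + 1
            counters[entity["label"]] = number
            tag = entity["label"].upper().replace(" ", "_")
            placeholders[key] = f"<{tag}_{number}>"

    # Substitute from the end of the text so earlier offsets stay valid.
    result = text
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        key = (entity["label"], text[entity["start"]:entity["end"]])
        result = result[:entity["start"]] + placeholders[key] + result[entity["end"]:]
    return result, placeholders


def find_spans(text, surface, label):
    """Hypothetical stand-in for model detections: every occurrence of `surface`."""
    return [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in re.finditer(re.escape(surface), text)
    ]


sample = "John Smith emailed Jane Doe. Later John Smith called."
found = find_spans(sample, "John Smith", "name") + find_spans(sample, "Jane Doe", "name")
pseudonymized, mapping = pseudonymize(sample, found)
print(pseudonymized)
```

The returned mapping can be stored separately under stricter access control if re-identification is ever required.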
Extended Applications
Enhanced Search & Content Understanding
# Extract key entities from user queries for better search
query = "Find restaurants near Stanford University in Palo Alto"
search_entities = ["organization", "location city", "business type"]
# Intelligent document tagging
document = "This quarterly report discusses Microsoft's Azure growth..."
doc_entities = ["organization", "product", "time period"]
GDPR-Compliant Chatbot Logs
def sanitize_chat_log(message):
    """Remove PII from chat logs per GDPR requirements"""
    sensitive_types = [
        "name", "email address", "phone number", "location address",
        "credit card", "ssn", "passport number"
    ]
    entities = model.predict_entities(message, sensitive_types)
    if entities:
        # Log anonymized version, alert compliance team
        return anonymize_text(message, sensitive_types)
    return message
Secure Mobile & Edge Processing
# Process sensitive data entirely on-device
def process_locally(user_input):
    """Process PII detection without cloud APIs"""
    pii_types = ["name", "phone number", "email address", "ssn", "credit card"]
    # All processing happens locally - no data leaves device
    detected_pii = model.predict_entities(user_input, pii_types)
    if detected_pii:
        return "⚠️ Sensitive information detected - proceed with caution"
    return "✅ No PII detected - safe to share"
Performance Benchmarks
Accuracy Evaluation
The following benchmarks were run on the synthetic-multi-pii-ner-v1 dataset. We compare multiple GLiNER-based PII models, including our new Knowledgator GLiNER PII Edge v1.0.
Model Path | Precision | Recall | F1 Score |
---|---|---|---|
knowledgator/gliner-pii-edge-v1.0 | 78.96% | 72.34% | 75.50% |
knowledgator/gliner-pii-small-v1.0 | 78.99% | 74.80% | 76.84% |
knowledgator/gliner-pii-base-v1.0 | 79.28% | 82.78% | 80.99% |
knowledgator/gliner-pii-large-v1.0 | 87.42% | 79.40% | 83.25% |
urchade/gliner_multi_pii-v1 | 79.19% | 74.67% | 76.86% |
E3-JSI/gliner-multi-pii-domains-v1 | 78.35% | 74.46% | 76.36% |
gravitee-io/gliner-pii-detection | 81.27% | 56.76% | 66.84% |
Key Takeaways
- Large Model (knowledgator/gliner-pii-large-v1.0) achieves the highest F1 score (83.25%) and the highest precision (87.42%) in this comparison.
- Base Model (knowledgator/gliner-pii-base-v1.0) achieves the highest recall (82.78%) and the second-highest F1 score (80.99%).
- Knowledgator Edge Model (knowledgator/gliner-pii-edge-v1.0) is optimized for edge environments, trading a slight decrease in recall for lower latency and footprint.
- Gravitee-io Model shows strong precision (81.27%) but the lowest recall (56.76%), indicating it is tuned for high confidence but misses more entities.
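The F1 column in the table above is the harmonic mean of precision and recall. As a sanity check, the sketch below recomputes it from the reported precision and recall and picks the model with the highest reported F1. Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one by a few hundredths of a point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the benchmark table.
reported = {
    "knowledgator/gliner-pii-edge-v1.0": (78.96, 72.34, 75.50),
    "knowledgator/gliner-pii-small-v1.0": (78.99, 74.80, 76.84),
    "knowledgator/gliner-pii-base-v1.0": (79.28, 82.78, 80.99),
    "knowledgator/gliner-pii-large-v1.0": (87.42, 79.40, 83.25),
    "urchade/gliner_multi_pii-v1": (79.19, 74.67, 76.86),
    "E3-JSI/gliner-multi-pii-domains-v1": (78.35, 74.46, 76.36),
    "gravitee-io/gliner-pii-detection": (81.27, 56.76, 66.84),
}

for model_path, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{model_path}: recomputed {recomputed:.2f} vs reported {f1:.2f}")

best_model = max(reported, key=lambda path: reported[path][2])
print("Highest reported F1:", best_model)
```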
Comparison with Alternatives
Solution | Speed | Privacy | Accuracy | Flexibility | Cost |
---|---|---|---|---|---|
GLiNER | ★★★★ | ★★★★★ | ★★★★★ | ★★★★ | Free |
Cloud NER APIs | ★★★ | ★★★ | ★★★★★ | ★★★ | $$$ |
Large Language Models | ★★ | ★★ | ★★★★ | ★★★★ | $$$$ |
Traditional NER | ★★★★★ | ★★★★★ | ★★★★ | ★ | Free |
Alternative Implementations
While Python provides the most comprehensive PII detection capabilities, GLiNER is available across multiple languages for different deployment scenarios.
Rust Implementation (gline-rs)
Best for: High-performance backend services, microservices
[dependencies]
"gline-rs" = "1"
use gline_rs::{GLiNER, TextInput, Parameters, RuntimeParameters};
let model = GLiNER::<TokenMode>::new(
    Parameters::default(),
    RuntimeParameters::default(),
    "tokenizer.json",
    "model.onnx",
)?;
let input = TextInput::from_str(
    &["My name is James Bond."],
    &["person"],
)?;
let output = model.inference(input)?;
Performance: 4x faster than Python on CPU, 37x faster with GPU acceleration.
C++ Implementation (GLiNER.cpp)
Best for: Embedded systems, mobile apps, edge devices
#include "GLiNER/model.hpp"
gliner::Config config{12, 512};
gliner::Model model("./model.onnx", "./tokenizer.json", config);
std::vector<std::string> texts = {"John works at Microsoft"};
std::vector<std::string> entities = {"person", "organization"};
auto output = model.inference(texts, entities);
JavaScript Implementation (GLiNER.js)
Best for: Web applications, browser-based processing
npm install gliner
import { Gliner } from 'gliner';
const gliner = new Gliner({
  tokenizerPath: "onnx-community/gliner_small-v2",
  onnxSettings: {
    modelPath: "public/model.onnx",
    executionProvider: "webgpu",
  }
});
await gliner.initialize();
const results = await gliner.inference({
  texts: ["John Smith works at Microsoft"],
  entities: ["person", "organization"],
  threshold: 0.1,
});
Model Architecture & Training
Quantization-Aware Pretraining
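GLiNER models use quantization-aware pretraining: the model is exposed to quantization effects during training, so accuracy degrades less when weights are stored at reduced precision. This allows efficient inference even with the quantized ONNX models listed below.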
Available ONNX Formats
Format | Size | Use Case |
---|---|---|
FP16 | 330MB | Balanced performance/accuracy |
UINT8 | 197MB | Maximum efficiency |
Model Conversion
python convert_to_onnx.py \
--model_path knowledgator/gliner-pii-base-v1.0 \
--save_path ./model \
--quantize True # For UINT8 quantization
Acknowledgments
Special thanks to all GLiNER contributors and the Wordcab team, and additional thanks to the maintainers of the Rust, C++, and JavaScript implementations.
Support
- Hugging Face: Ihor/gliner-pii-small
- GitHub Issues: Report bugs and request features
- Discord: Join community discussions
GLiNER: Open-source privacy-first entity recognition for production applications.