---
license: apache-2.0
language:
- en
library_name: gliner
pipeline_tag: token-classification
tags:
- NER
- GLiNER
- information extraction
- PII
- PHI
- PCI
- entity recognition
- multilingual
---

# GLiNER-PII: Zero-shot PII model

A production-grade open-source model for privacy-focused PII, PHI, and PCI detection with zero-shot entity recognition capabilities.
This model was developed in collaboration between [Wordcab](https://wordcab.com/) and [Knowledgator](https://www.knowledgator.com/). For enterprise-ready, specialized PII/PHI/PCI models, contact us at [email protected].

## What is GLiNER?

GLiNER (Generalist and Lightweight Named Entity Recognition) is a bidirectional transformer model that can identify **any entity type** without predefined categories. Unlike traditional NER models, which are limited to a fixed set of entity classes, GLiNER lets you specify exactly which entities to extract at runtime.

### Key Advantages

- **Zero-shot recognition**: Extract any entity type without retraining
- **Privacy-first**: Process sensitive data locally without API calls
- **Lightweight**: Much faster than large language models for NER tasks
- **Production-ready**: Quantization-aware training with FP16 and UINT8 ONNX models
- **Comprehensive**: 60+ predefined PII categories with custom entity support

### How GLiNER Works

Instead of predicting from a fixed set of entity classes, GLiNER takes both text and a list of desired entity types as input, then identifies spans that match those categories:

```python
text = "John Smith called from 415-555-1234 to discuss his account."
entities = ["name", "phone number", "account number"]
# GLiNER finds: "John Smith" → name, "415-555-1234" → phone number
```

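The examples in this card read `text`, `label`, `score`, `start`, and `end` from each returned entity. As a minimal sketch of working with results in that shape, the snippet below groups matches by label. The `entities` list is hand-written for illustration (it is not real model output), and `group_by_label` is a hypothetical helper, not part of the `gliner` package.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group entity texts by label, keeping the order in which they appear."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written sample in the shape the examples in this card consume (illustrative only).
text = "John Smith called from 415-555-1234 to discuss his account."
entities = [
    {"start": 0, "end": 10, "text": "John Smith", "label": "name", "score": 0.9},
    {"start": 23, "end": 35, "text": "415-555-1234", "label": "phone number", "score": 0.9},
]

# start/end are character offsets into the original string.
for entity in entities:
    assert text[entity["start"]:entity["end"]] == entity["text"]

print(group_by_label(entities))
```

Because the offsets index into the original string, the same results can drive redaction or highlighting without searching the text again.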
## Python Implementation

The primary GLiNER implementation provides comprehensive PII detection with 60+ entity categories, fine-tuned specifically for privacy and compliance use cases.

### Installation

```bash
pip install gliner
```

### Quick Start

```python
from gliner import GLiNER

# Load the model (downloads automatically on first use)
model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")

text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
labels = ["name", "phone number", "account number"]

entities = model.predict_entities(text, labels, threshold=0.3)

for entity in entities:
    print(f"{entity['text']} => {entity['label']} (confidence: {entity['score']:.2f})")
```

Output:
```
John Smith => name (confidence: 0.95)
415-555-1234 => phone number (confidence: 0.92)
12345678 => account number (confidence: 0.88)
```

### Comprehensive PII Detection

The model was specifically optimized for the 60+ predefined PII categories listed below by domain. It also works zero-shot, so you can pass any other labels you need:

#### Personal Identifiers

```python
personal_labels = [
    "name",                       # Full names
    "first name",                 # First names
    "last name",                  # Last names
    "name medical professional",  # Healthcare provider names
    "dob",                        # Date of birth
    "age",                        # Age information
    "gender",                     # Gender identifiers
    "marital status"              # Marital status
]
```

#### Contact Information

```python
contact_labels = [
    "email address",     # Email addresses
    "phone number",      # Phone numbers
    "ip address",        # IP addresses
    "url",               # URLs
    "location address",  # Street addresses
    "location street",   # Street names
    "location city",     # City names
    "location state",    # State/province names
    "location country",  # Country names
    "location zip"       # ZIP/postal codes
]
```

#### Financial Information

```python
financial_labels = [
    "account number",          # Account numbers
    "bank account",            # Bank account numbers
    "routing number",          # Routing numbers
    "credit card",             # Credit card numbers
    "credit card expiration",  # Card expiration dates
    "cvv",                     # CVV/security codes
    "ssn",                     # Social Security Numbers
    "money"                    # Monetary amounts
]
```

#### Healthcare Information

```python
healthcare_labels = [
    "condition",                      # Medical conditions
    "medical process",                # Medical procedures
    "drug",                           # Drugs
    "dose",                           # Dosage information
    "blood type",                     # Blood types
    "injury",                         # Injuries
    "organization medical facility",  # Healthcare facility names
    "healthcare number",              # Healthcare numbers
    "medical code"                    # Medical codes
]
```

#### Identification Documents

```python
id_labels = [
    "passport number",  # Passport numbers
    "driver license",   # Driver's license numbers
    "username",         # Usernames
    "password",         # Passwords
    "vehicle id"        # Vehicle IDs
]
```

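The groups above can be combined into a single label list for one pass over a text. A small sketch, assuming a hypothetical helper `merge_labels` that normalizes case and whitespace and drops duplicates while keeping first-seen order; the two short lists are excerpts of the groups above, with one deliberately messy duplicate added for illustration.

```python
def merge_labels(*groups):
    """Concatenate label groups, dropping duplicates but keeping first-seen order."""
    seen = set()
    merged = []
    for group in groups:
        for label in group:
            key = label.strip().lower()  # labels in this card are lowercase
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


contact_excerpt = ["email address", "phone number", "ip address"]
financial_excerpt = ["account number", "credit card", "Phone Number "]  # duplicate on purpose

all_labels = merge_labels(contact_excerpt, financial_excerpt)
print(all_labels)
```

The merged list can then be passed as the `labels` argument in the examples that follow.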
### Advanced Usage Examples

#### Multi-Category Detection
```python
text = """
Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024
from St. Mary's Hospital. Contact: [email protected], (555) 123-4567.
Insurance policy: POL-789456123.
"""

labels = [
    "name", "dob", "discharge date", "organization medical facility",
    "email address", "phone number", "policy number"
]

entities = model.predict_entities(text, labels, threshold=0.3)

for entity in entities:
    print(f"Found '{entity['text']}' as {entity['label']}")
```

#### Batch Processing for High Throughput
```python
documents = [
    "Customer John called about his credit card ending in 4532.",
    "Sarah's SSN 123-45-6789 needs verification.",
    "Email [email protected] for account 987654321 issues."
]

labels = ["name", "credit card", "ssn", "email address", "account number"]

# Process multiple documents efficiently
results = model.run(documents, labels, threshold=0.3, batch_size=8)

for doc_idx, entities in enumerate(results):
    print(f"\nDocument {doc_idx + 1}:")
    for entity in entities:
        print(f"  {entity['text']} => {entity['label']}")
```

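Transformer encoders accept a bounded input length, so very long documents are usually split before inference. The sketch below is one possible approach, not something this card prescribes: it slides overlapping character windows over the text and shifts the returned offsets back to document coordinates. `predict_long` is a hypothetical helper, the window sizes are placeholders rather than tuned values, and `fake_predict` is only a stand-in with the same return shape as `model.predict_entities` so the example runs without a model.

```python
def predict_long(text, labels, predict, window=1000, overlap=100):
    """Run `predict` over overlapping character windows and map offsets back."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    found = {}
    step = window - overlap
    for offset in range(0, max(len(text), 1), step):
        chunk = text[offset:offset + window]
        for e in predict(chunk, labels):
            shifted = dict(e, start=e["start"] + offset, end=e["end"] + offset)
            key = (shifted["start"], shifted["end"], shifted["label"])
            # The same span can be found in two overlapping windows; keep the best score.
            if key not in found or shifted["score"] > found[key]["score"]:
                found[key] = shifted
        if offset + window >= len(text):
            break
    return sorted(found.values(), key=lambda e: e["start"])


def fake_predict(chunk, labels):
    """Stand-in for model.predict_entities: flags every literal 'Alice' as a name."""
    hits, pos = [], chunk.find("Alice")
    while pos != -1:
        hits.append({"start": pos, "end": pos + 5, "text": "Alice", "label": "name", "score": 0.9})
        pos = chunk.find("Alice", pos + 1)
    return hits


doc = "x" * 30 + "Alice" + "y" * 30 + "Alice"
entities = predict_long(doc, ["name"], fake_predict, window=40, overlap=10)
print([(e["start"], e["end"]) for e in entities])
```

With a real model, `predict` could be `lambda chunk, labels: model.predict_entities(chunk, labels, threshold=0.3)`. An entity longer than the overlap can still be cut at a window boundary, so the overlap should exceed the longest span you expect.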
#### Custom Entity Detection
```python
# GLiNER isn't limited to PII - you can detect any entities
text = "The MacBook Pro with M2 chip costs $1,999 at the Apple Store in Manhattan."
custom_labels = ["product", "processor", "price", "store", "location"]

entities = model.predict_entities(text, custom_labels, threshold=0.3)
```

#### Threshold Optimization
```python
# Lower threshold: Higher recall, more false positives
high_recall = model.predict_entities(text, labels, threshold=0.2)

# Higher threshold: Higher precision, fewer false positives
high_precision = model.predict_entities(text, labels, threshold=0.6)

# Recommended starting point for production
balanced = model.predict_entities(text, labels, threshold=0.3)
```

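A single global threshold treats every category the same. One way to tune the trade-off per category is to call the model once with a low threshold and filter the results afterwards. This is a sketch under that assumption: `filter_by_label_threshold` is a hypothetical helper, the candidate list is hand-written rather than model output, and the threshold values are placeholders, not recommendations.

```python
def filter_by_label_threshold(entities, thresholds, default=0.3):
    """Keep entities whose score meets the threshold configured for their label."""
    return [
        e for e in entities
        if e["score"] >= thresholds.get(e["label"], default)
    ]


# Hand-written candidates, as if returned by a low-threshold call (illustrative only).
candidates = [
    {"text": "John", "label": "name", "score": 0.35},
    {"text": "123-45-6789", "label": "ssn", "score": 0.25},
    {"text": "4532", "label": "credit card", "score": 0.28},
]

# Stricter for names, more permissive for SSNs where a miss is costly.
per_label = {"name": 0.5, "ssn": 0.2}
kept = filter_by_label_threshold(candidates, per_label)
print([e["text"] for e in kept])
```

Labels without an entry in `per_label` fall back to `default`, so only the categories that need special handling have to be listed.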
## Use Cases

GLiNER excels in privacy-focused applications where traditional cloud-based NER services pose compliance risks.

### **Primary Applications**

#### Privacy-First Voice & Transcription
```python
# Automatically redact PII from voice transcriptions
transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"
pii_labels = ["name", "phone number", "email address", "ssn"]

entities = model.predict_entities(transcription, pii_labels)
# Redact or anonymize detected PII before storage
```

#### Compliance-Ready Document Processing
```python
# Healthcare: HIPAA-compliant note processing
medical_note = "Patient John Doe, MRN 123456, diagnosed with diabetes..."
phi_labels = ["name", "medical record number", "condition", "dob"]

# Finance: PCI-DSS compliant transaction logs
transaction_log = "Card ****4532 charged $299.99 to John Smith"
pci_labels = ["credit card", "money", "name"]

# Legal: Attorney-client privilege protection
legal_doc = "Client Jane Doe vs. Corporation ABC, case #2024-CV-001"
legal_labels = ["name", "organization", "case number"]
```

#### Real-Time Data Anonymization
```python
def anonymize_text(text, entity_types):
    """Anonymize PII in real-time"""
    entities = model.predict_entities(text, entity_types)

    # Sort by position to replace from end to start
    entities.sort(key=lambda x: x['start'], reverse=True)

    anonymized = text
    for entity in entities:
        placeholder = f"<{entity['label'].upper()}>"
        anonymized = anonymized[:entity['start']] + placeholder + anonymized[entity['end']:]

    return anonymized

original = "John Smith's SSN is 123-45-6789"
anonymized = anonymize_text(original, ["name", "ssn"])
print(anonymized)  # "<NAME>'s SSN is <SSN>"
```

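The replacement loop in `anonymize_text` relies on spans that do not overlap. If entity lists from several passes are merged (for example, one pass per label group), two spans can cover the same characters, and replacing both would corrupt the offsets. A minimal sketch of one way to guard against that: the hypothetical helper `drop_overlaps` greedily keeps the highest-scoring span in each overlapping cluster. The `merged` list is hand-written for illustration.

```python
def drop_overlaps(entities):
    """Greedily keep the highest-scoring spans that do not overlap each other."""
    kept = []
    for candidate in sorted(entities, key=lambda e: e["score"], reverse=True):
        if all(candidate["end"] <= k["start"] or candidate["start"] >= k["end"] for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda e: e["start"])


def redact(text, entities):
    """Replace each kept span with a <LABEL> placeholder, working right to left."""
    for entity in sorted(drop_overlaps(entities), key=lambda e: e["start"], reverse=True):
        text = text[:entity["start"]] + f"<{entity['label'].upper()}>" + text[entity["end"]:]
    return text


original = "John Smith's SSN is 123-45-6789"
# Hand-written spans, as if merged from two passes (illustrative only).
merged = [
    {"start": 0, "end": 10, "label": "name", "score": 0.9},
    {"start": 0, "end": 4, "label": "first name", "score": 0.6},
    {"start": 20, "end": 31, "label": "ssn", "score": 0.8},
]
print(redact(original, merged))  # <NAME>'s SSN is <SSN>
```

Preferring the longest span instead of the highest score is an equally reasonable policy; only the sort key in `drop_overlaps` would change.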
### **Extended Applications**

#### Enhanced Search & Content Understanding
```python
# Extract key entities from user queries for better search
query = "Find restaurants near Stanford University in Palo Alto"
search_entities = ["organization", "location city", "business type"]

# Intelligent document tagging
document = "This quarterly report discusses Microsoft's Azure growth..."
doc_entities = ["organization", "product", "time period"]
```

#### GDPR-Compliant Chatbot Logs
```python
def sanitize_chat_log(message):
    """Remove PII from chat logs per GDPR requirements"""
    sensitive_types = [
        "name", "email address", "phone number", "location address",
        "credit card", "ssn", "passport number"
    ]

    entities = model.predict_entities(message, sensitive_types)
    if entities:
        # Log anonymized version, alert compliance team
        return anonymize_text(message, sensitive_types)
    return message
```

#### Secure Mobile & Edge Processing
```python
# Process sensitive data entirely on-device
def process_locally(user_input):
    """Process PII detection without cloud APIs"""
    pii_types = ["name", "phone number", "email address", "ssn", "credit card"]

    # All processing happens locally - no data leaves device
    detected_pii = model.predict_entities(user_input, pii_types)

    if detected_pii:
        return "⚠️ Sensitive information detected - proceed with caution"
    return "✅ No PII detected - safe to share"
```

## Performance Benchmarks

### Accuracy Evaluation

The following benchmarks were run on the **synthetic-multi-pii-ner-v1** dataset.
We compare multiple GLiNER-based PII models, including our new **Knowledgator GLiNER PII Edge v1.0**.

| Model                                    | Precision | Recall | F1 Score   |
| ---------------------------------------- | --------- | ------ | ---------- |
| **knowledgator/gliner-pii-edge-v1.0**    | 78.96%    | 72.34% | **75.50%** |
| **knowledgator/gliner-pii-small-v1.0**   | 78.99%    | 74.80% | **76.84%** |
| **knowledgator/gliner-pii-base-v1.0**    | 79.28%    | 82.78% | **80.99%** |
| **knowledgator/gliner-pii-large-v1.0**   | 87.42%    | 79.40% | **83.25%** |
| **urchade/gliner\_multi\_pii-v1**        | 79.19%    | 74.67% | **76.86%** |
| **E3-JSI/gliner-multi-pii-domains-v1**   | 78.35%    | 74.46% | **76.36%** |
| **gravitee-io/gliner-pii-detection**     | 81.27%    | 56.76% | **66.84%** |

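The card does not spell out the evaluation protocol, but the F1 column is consistent with the usual definition, the harmonic mean of precision and recall: F1 = 2PR / (P + R). A short sketch that recomputes two rows from the table, assuming that definition; `f1_score` is a helper written for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above, in percent.
rows = {
    "knowledgator/gliner-pii-small-v1.0": (78.99, 74.80),
    "knowledgator/gliner-pii-base-v1.0": (79.28, 82.78),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}%")
```

This prints 76.84% and 80.99%, matching the reported values. Other rows can differ in the last digit because the published precision and recall are themselves rounded.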
### Key Takeaways

* **Base Model** (`knowledgator/gliner-pii-base-v1.0`) achieves the **highest recall (82.78%)** and an F1 score of 80.99%, second only to the large model (`knowledgator/gliner-pii-large-v1.0`), which has the **highest F1 score (83.25%)** and the highest precision (87.42%).
* **Knowledgator Edge Model** (`knowledgator/gliner-pii-edge-v1.0`) is optimized for **edge environments**, trading a slight decrease in recall for lower latency and footprint.
* **Gravitee-io Model** shows strong precision (81.27%) but much lower recall (56.76%), indicating it is tuned for high confidence but misses more entities.

### Comparison with Alternatives

| Solution              | Speed | Privacy | Accuracy | Flexibility | Cost     |
| --------------------- | ----- | ------- | -------- | ----------- | -------- |
| **GLiNER**            | ⭐⭐⭐⭐  | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐     | ⭐⭐⭐⭐⭐       | Free     |
| Cloud NER APIs        | ⭐⭐⭐   | ⭐⭐      | ⭐⭐⭐⭐     | ⭐⭐⭐         | \$\$\$   |
| Large Language Models | ⭐⭐    | ⭐⭐      | ⭐⭐⭐⭐⭐    | ⭐⭐⭐⭐⭐       | \$\$\$\$ |
| Traditional NER       | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐   | ⭐⭐⭐      | ⭐           | Free     |

## Alternative Implementations

While Python provides the most comprehensive PII detection capabilities, GLiNER is available across multiple languages for different deployment scenarios.

### Rust Implementation (gline-rs)

**Best for**: High-performance backend services, microservices

```toml
[dependencies]
"gline-rs" = "1"
```

```rust
use gline_rs::{GLiNER, TextInput, Parameters, RuntimeParameters};

let model = GLiNER::<TokenMode>::new(
    Parameters::default(),
    RuntimeParameters::default(),
    "tokenizer.json",
    "model.onnx",
)?;

let input = TextInput::from_str(
    &["My name is James Bond."],
    &["person"],
)?;

let output = model.inference(input)?;
```

**Performance**: 4x faster than Python on CPU, 37x faster with GPU acceleration.

### C++ Implementation (GLiNER.cpp)

**Best for**: Embedded systems, mobile apps, edge devices

```cpp
#include "GLiNER/model.hpp"

gliner::Config config{12, 512};
gliner::Model model("./model.onnx", "./tokenizer.json", config);

std::vector<std::string> texts = {"John works at Microsoft"};
std::vector<std::string> entities = {"person", "organization"};

auto output = model.inference(texts, entities);
```

### JavaScript Implementation (GLiNER.js)

**Best for**: Web applications, browser-based processing

```bash
npm install gliner
```

```javascript
import { Gliner } from 'gliner';

const gliner = new Gliner({
  tokenizerPath: "onnx-community/gliner_small-v2",
  onnxSettings: {
    modelPath: "public/model.onnx",
    executionProvider: "webgpu",
  }
});

await gliner.initialize();

const results = await gliner.inference({
  texts: ["John Smith works at Microsoft"],
  entities: ["person", "organization"],
  threshold: 0.1,
});
```

## Model Architecture & Training

### Quantization-Aware Pretraining

GLiNER models use quantization-aware pretraining, which is intended to keep accuracy close to full precision after the weights are quantized. This allows efficient inference even with quantized models.

### Available ONNX Formats

| Format | Size | Use Case |
|--------|------|----------|
| **FP16** | 330 MB | Balanced performance/accuracy |
| **UINT8** | 197 MB | Maximum efficiency |

### Model Conversion

```bash
python convert_to_onnx.py \
    --model_path knowledgator/gliner-pii-base-v1.0 \
    --save_path ./model \
    --quantize True  # For UINT8 quantization
```

## References

- [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)
- [GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks](https://arxiv.org/abs/2406.12925)
- [Named Entity Recognition as Structured Span Prediction](https://arxiv.org/abs/2212.13415)

## Acknowledgments

Special thanks to all GLiNER contributors and the Wordcab team, and additional thanks to the maintainers of the Rust, C++, and JavaScript implementations.

## Support

- **Hugging Face**: [Ihor/gliner-pii-small](https://huggingface.co/Ihor/gliner-pii-small)
- **GitHub Issues**: [Report bugs and request features](https://github.com/info-wordcab/wordcab-pii)
- **Discord**: [Join community discussions](https://discord.gg/wRF7tuY9)

---

*GLiNER: Open-source privacy-first entity recognition for production applications.*