+---
+license: apache-2.0
+language:
+- en
+library_name: gliner
+pipeline_tag: token-classification
+tags:
+- NER
+- GLiNER
+- information extraction
+- PII
+- PHI
+- PCI
+- entity recognition
+- multilingual
+---
+# GLiNER-PII: Zero-shot PII model
+A production-grade open-source model for privacy-focused PII, PHI, and PCI detection with zero-shot entity recognition capabilities.
+This model was developed in collaboration between [Wordcab](https://wordcab.com/) and [Knowledgator](https://www.knowledgator.com/). For enterprise-ready, specialized PII/PHI/PCI models, contact us at [email protected].
+## 🧠 What is GLiNER?
+GLiNER (Generalist and Lightweight Named Entity Recognition) is a bidirectional transformer model that can identify **any entity type** without predefined categories. Unlike traditional NER models that are limited to specific entity classes, GLiNER allows you to specify exactly what entities you want to extract at runtime.
+### Key Advantages
+- **Zero-shot recognition**: Extract any entity type without retraining
+- **Privacy-first**: Process sensitive data locally without API calls
+- **Lightweight**: Much faster than large language models for NER tasks
+- **Production-ready**: Quantization-aware training with FP16 and UINT8 ONNX models
+- **Comprehensive**: 60+ predefined PII categories with custom entity support
+### How GLiNER Works
+Instead of predicting from a fixed set of entity classes, GLiNER takes both text and a list of desired entity types as input, then identifies spans that match those categories:
+```python
+text = "John Smith called from 415-555-1234 to discuss his account."
+entities = ["name", "phone number", "account number"]
+# GLiNER finds: "John Smith" → name, "415-555-1234" → phone number
+```
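Each prediction is returned as a dictionary; the examples later in this card read the keys `text`, `label`, `score`, `start`, and `end`. The sketch below illustrates that shape with hand-written dictionaries rather than real model output. The scores are made up and `spans_are_consistent` is a hypothetical helper, not part of the library:

```python
# Hand-written example of the result shape; scores are invented for illustration.
text = "John Smith called from 415-555-1234 to discuss his account."
entities = [
    {"start": 0, "end": 10, "text": "John Smith", "label": "name", "score": 0.95},
    {"start": 23, "end": 35, "text": "415-555-1234", "label": "phone number", "score": 0.92},
]


def spans_are_consistent(text, entities):
    """Return True if every entity's offsets slice back to its reported text."""
    return all(text[e["start"]:e["end"]] == e["text"] for e in entities)


print(spans_are_consistent(text, entities))
```

A check like this is a cheap safeguard before using the offsets for redaction, since a mismatch would mean the wrong characters get replaced.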
+## 🐍 Python Implementation
+The primary GLiNER implementation provides comprehensive PII detection with 60+ entity categories, fine-tuned specifically for privacy and compliance use cases.
+### Installation
+```bash
+pip install gliner
+```
+### Quick Start
+```python
+from gliner import GLiNER
+# Load the model (downloads automatically on first use)
+model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")
+text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
+labels = ["name", "phone number", "account number"]
+entities = model.predict_entities(text, labels, threshold=0.3)
+for entity in entities:
+    print(f"{entity['text']} => {entity['label']} (confidence: {entity['score']:.2f})")
+```
+Output:
+```
+John Smith => name (confidence: 0.95)
+415-555-1234 => phone number (confidence: 0.92)
+12345678 => account number (confidence: 0.88)
+```
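For downstream code it is often handier to have the matches grouped by label than as a flat list. A minimal sketch, assuming the entity dictionaries carry the `text`, `label`, and `score` keys used above; `group_by_label` is a hypothetical helper and the sample scores are illustrative:

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Map each label to its matched strings, in order of appearance."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written stand-in for the output of model.predict_entities(...).
sample = [
    {"text": "John Smith", "label": "name", "score": 0.95},
    {"text": "415-555-1234", "label": "phone number", "score": 0.92},
    {"text": "12345678", "label": "account number", "score": 0.88},
]
print(group_by_label(sample, min_score=0.9))
```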
+### Comprehensive PII Detection
+The model is optimized for 60+ predefined PII categories, organized by domain below. It also works zero-shot, so you can pass any other labels you need:
+#### Personal Identifiers
+```python
+personal_labels = [
+    "name",                       # Full names
+    "first name",                 # First names
+    "last name",                  # Last names
+    "name medical professional",  # Healthcare provider names
+    "dob",                        # Date of birth
+    "age",                        # Age information
+    "gender",                     # Gender identifiers
+    "marital status"              # Marital status
+]
+```
+#### Contact Information
+```python
+contact_labels = [
+    "email address",          # Email addresses
+    "phone number",           # Phone numbers
+    "ip address",             # IP addresses
+    "url",                    # URLs
+    "location address",       # Street addresses
+    "location street",        # Street names
+    "location city",          # City names
+    "location state",         # State/province names
+    "location country",       # Country names
+    "location zip"            # ZIP/postal codes
+]
+```
+#### Financial Information
+```python
+financial_labels = [
+    "account number",         # Account numbers
+    "bank account",           # Bank account numbers
+    "routing number",         # Routing numbers
+    "credit card",            # Credit card numbers
+    "credit card expiration", # Card expiration dates
+    "cvv",                    # CVV/security codes
+    "ssn",                    # Social Security Numbers
+    "money"                   # Monetary amounts
+]
+```
+#### Healthcare Information
+```python
+healthcare_labels = [
+    "condition",                    # Medical conditions
+    "medical process",              # Medical procedures
+    "drug",                         # Drugs
+    "dose",                         # Dosage information
+    "blood type",                   # Blood types
+    "injury",                       # Injuries
+    "organization medical facility",# Healthcare facility names
+    "healthcare number",            # Healthcare numbers
+    "medical code"                  # Medical codes
+]
+```
+#### Identification Documents
+```python
+id_labels = [
+    "passport number",       # Passport numbers
+    "driver license",        # Driver's license numbers
+    "username",              # Usernames
+    "password",              # Passwords
+    "vehicle id"             # Vehicle IDs
+]
+```
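The domain lists above can be combined into a single label list for one call. The sketch below shows one way to do that without duplicates and to remember which domain each label came from. `merge_labels`, `label_categories`, and the `support_desk` group are hypothetical names introduced for illustration, and the lists are shortened stand-ins for the ones above:

```python
def merge_labels(*groups):
    """Concatenate label lists, dropping duplicates while keeping first-seen order."""
    return list(dict.fromkeys(label for group in groups for label in group))


def label_categories(named_groups):
    """Map each label to the first category that lists it."""
    lookup = {}
    for category, labels in named_groups.items():
        for label in labels:
            lookup.setdefault(label, category)
    return lookup


# Shortened stand-ins for the domain lists above.
contact = ["email address", "phone number"]
financial = ["account number", "credit card", "ssn"]
support_desk = ["phone number", "account number"]  # hypothetical extra group

labels = merge_labels(contact, financial, support_desk)
categories = label_categories({"contact": contact, "financial": financial})
print(labels)
print(categories["ssn"])
```

The merged list can then be passed as the `labels` argument of `predict_entities`, and `categories` used to report findings per domain.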
+### Advanced Usage Examples
+#### Multi-Category Detection
+```python
+text = """
+Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024
+from St. Mary's Hospital. Contact: [email protected], (555) 123-4567.
+Insurance policy: POL-789456123.
+"""
+labels = [
+    "name", "dob", "discharge date", "organization medical facility",
+    "email address", "phone number", "policy number"
+]
+entities = model.predict_entities(text, labels, threshold=0.3)
+for entity in entities:
+    print(f"Found '{entity['text']}' as {entity['label']}")
+```
+#### Batch Processing for High Throughput
+```python
+documents = [
+    "Customer John called about his credit card ending in 4532.",
+    "Sarah's SSN 123-45-6789 needs verification.",
+    "Email [email protected] for account 987654321 issues."
+]
+labels = ["name", "credit card", "ssn", "email address", "account number"]
+# Process multiple documents efficiently
+results = model.run(documents, labels, threshold=0.3, batch_size=8)
+for doc_idx, entities in enumerate(results):
+    print(f"\nDocument {doc_idx + 1}:")
+    for entity in entities:
+        print(f"  {entity['text']} => {entity['label']}")
+```
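Transformer encoders typically accept a bounded input length, so very long documents may need to be split before inference. The following is a minimal sketch under that assumption and is not part of the GLiNER API: `chunk_text` and `predict_long` are hypothetical helpers, the character-based window sizes are arbitrary, and `predict` is any callable that returns entity dictionaries with `start`, `end`, and `label` keys.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Yield (offset, chunk) windows that cover text with some overlap."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    for offset in range(0, max(len(text), 1), step):
        yield offset, text[offset:offset + max_chars]
        if offset + max_chars >= len(text):
            break  # this window already reaches the end of the text


def predict_long(text, labels, predict, **chunk_kwargs):
    """Run predict(chunk, labels) per window and shift spans to document offsets."""
    seen = set()
    merged = []
    for offset, chunk in chunk_text(text, **chunk_kwargs):
        for entity in predict(chunk, labels):
            start, end = entity["start"] + offset, entity["end"] + offset
            key = (start, end, entity["label"])
            if key in seen:  # same span found again in the overlap region
                continue
            seen.add(key)
            merged.append({**entity, "start": start, "end": end})
    return sorted(merged, key=lambda e: e["start"])
```

With a loaded model, `predict` could be `lambda chunk, labels: model.predict_entities(chunk, labels, threshold=0.3)`. An entity that straddles a window boundary is only recovered if the overlap is longer than the entity, so choose `overlap` accordingly.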
+#### Custom Entity Detection
+```python
+# GLiNER isn't limited to PII - you can detect any entities
+text = "The MacBook Pro with M2 chip costs $1,999 at the Apple Store in Manhattan."
+custom_labels = ["product", "processor", "price", "store", "location"]
+entities = model.predict_entities(text, custom_labels, threshold=0.3)
+```
+#### Threshold Optimization
+```python
+# Lower threshold: Higher recall, more false positives
+high_recall = model.predict_entities(text, labels, threshold=0.2)
+# Higher threshold: Higher precision, fewer false positives
+high_precision = model.predict_entities(text, labels, threshold=0.6)
+# Recommended starting point for production
+balanced = model.predict_entities(text, labels, threshold=0.3)
+```
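A single global threshold is not always the best fit, because a missed SSN may cost more than a false-positive name. One option is to call the model once with a low threshold and then filter per label. The sketch below assumes entity dictionaries with `label` and `score` keys; `filter_by_label_threshold` and the threshold values are illustrative, not tuned recommendations:

```python
def filter_by_label_threshold(entities, thresholds, default=0.3):
    """Keep entities whose score meets the threshold configured for their label."""
    return [e for e in entities if e["score"] >= thresholds.get(e["label"], default)]


# Hand-written stand-in for model.predict_entities(text, labels, threshold=0.2).
candidates = [
    {"text": "Sarah", "label": "name", "score": 0.45},
    {"text": "123-45-6789", "label": "ssn", "score": 0.25},
    {"text": "4532", "label": "credit card", "score": 0.28},
]
# Stricter for names, more permissive for SSNs; other labels use the default.
per_label = {"name": 0.5, "ssn": 0.2}
kept = filter_by_label_threshold(candidates, per_label)
print([e["label"] for e in kept])
```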
+## 💡 Use Cases
+GLiNER excels in privacy-focused applications where traditional cloud-based NER services pose compliance risks.
+### 🎯 **Primary Applications**
+#### Privacy-First Voice & Transcription
+```python
+# Automatically redact PII from voice transcriptions
+transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"
+pii_labels = ["name", "phone number", "email address", "ssn"]
+entities = model.predict_entities(transcription, pii_labels)
+# Redact or anonymize detected PII before storage
+```
+#### Compliance-Ready Document Processing
+```python
+# Healthcare: HIPAA-compliant note processing
+medical_note = "Patient John Doe, MRN 123456, diagnosed with diabetes..."
+phi_labels = ["name", "medical record number", "condition", "dob"]
+# Finance: PCI-DSS compliant transaction logs
+transaction_log = "Card ****4532 charged $299.99 to John Smith"
+pci_labels = ["credit card", "money", "name"]
+# Legal: Attorney-client privilege protection
+legal_doc = "Client Jane Doe vs. Corporation ABC, case #2024-CV-001"
+legal_labels = ["name", "organization", "case number"]
+```
+#### Real-Time Data Anonymization
+```python
+def anonymize_text(text, entity_types):
+    """Anonymize PII in real-time"""
+    entities = model.predict_entities(text, entity_types)
+    # Sort by position to replace from end to start
+    entities.sort(key=lambda x: x['start'], reverse=True)
+    anonymized = text
+    for entity in entities:
+        placeholder = f"<{entity['label'].upper()}>"
+        anonymized = anonymized[:entity['start']] + placeholder + anonymized[entity['end']:]
+    return anonymized
+original = "John Smith's SSN is 123-45-6789"
+anonymized = anonymize_text(original, ["name", "ssn"])
+print(anonymized)  # "<NAME>'s SSN is <SSN>"
+```
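When the anonymized text still needs to be readable, it can help to give each distinct value its own numbered placeholder so that repeated mentions stay linked. A minimal sketch of that variant, assuming non-overlapping entities with `start`, `end`, `text`, and `label` keys; `pseudonymize` is a hypothetical helper and the entities are hand-written:

```python
def pseudonymize(text, entities):
    """Replace entities with numbered placeholders; equal values share a number."""
    ids = {}       # (label, surface text) -> placeholder
    counters = {}  # label -> number of distinct values seen so far
    for entity in sorted(entities, key=lambda e: e["start"]):
        key = (entity["label"], entity["text"])
        if key not in ids:
            counters[entity["label"]] = counters.get(entity["label"], 0) + 1
            tag = entity["label"].upper().replace(" ", "_")
            ids[key] = f"<{tag}_{counters[entity['label']]}>"
    result = text
    # Replace from the end so that earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = ids[(entity["label"], entity["text"])]
        result = result[:entity["start"]] + placeholder + result[entity["end"]:]
    mapping = {placeholder: value for (_, value), placeholder in ids.items()}
    return result, mapping


message = "John Smith emailed Jane Doe. John Smith then called."
found = [
    {"start": 0, "end": 10, "text": "John Smith", "label": "name"},
    {"start": 19, "end": 27, "text": "Jane Doe", "label": "name"},
    {"start": 29, "end": 39, "text": "John Smith", "label": "name"},
]
redacted, mapping = pseudonymize(message, found)
print(redacted)  # <NAME_1> emailed <NAME_2>. <NAME_1> then called.
```

The returned `mapping` contains the original values, so it is itself sensitive and should be stored as carefully as the source text, or discarded.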
+### 🌟 **Extended Applications**
+#### Enhanced Search & Content Understanding
+```python
+# Extract key entities from user queries for better search
+query = "Find restaurants near Stanford University in Palo Alto"
+search_entities = ["organization", "location city", "business type"]
+# Intelligent document tagging
+document = "This quarterly report discusses Microsoft's Azure growth..."
+doc_entities = ["organization", "product", "time period"]
+```
+#### GDPR-Compliant Chatbot Logs
+```python
+def sanitize_chat_log(message):
+    """Remove PII from chat logs per GDPR requirements"""
+    sensitive_types = [
+        "name", "email address", "phone number", "location address",
+        "credit card", "ssn", "passport number"
+    ]
+    # anonymize_text (defined above) runs detection itself and returns the
+    # message unchanged when no entities are found, so one call is enough.
+    return anonymize_text(message, sensitive_types)
+```
+#### Secure Mobile & Edge Processing
+```python
+# Process sensitive data entirely on-device
+def process_locally(user_input):
+    """Process PII detection without cloud APIs"""
+    pii_types = ["name", "phone number", "email address", "ssn", "credit card"]
+    # All processing happens locally - no data leaves device
+    detected_pii = model.predict_entities(user_input, pii_types)
+    if detected_pii:
+        return "⚠️ Sensitive information detected - proceed with caution"
+    return "✅ No PII detected - safe to share"
+```
+## 📊 Performance Benchmarks
+### Accuracy Evaluation
+The following benchmarks were run on the **synthetic-multi-pii-ner-v1** dataset.
+We compare multiple GLiNER-based PII models, including our new **Knowledgator GLiNER PII Edge v1.0**.
+| Model Path                             | Precision | Recall | F1 Score   |
+| -------------------------------------- | --------- | ------ | ---------- |
+| **knowledgator/gliner-pii-edge-v1.0**  | 78.96%    | 72.34% | **75.50%** |
+| **knowledgator/gliner-pii-small-v1.0** | 78.99%    | 74.80% | **76.84%** |
+| **knowledgator/gliner-pii-base-v1.0**  | 79.28%    | 82.78% | **80.99%** |
+| **knowledgator/gliner-pii-large-v1.0** | 87.42%    | 79.4%  | **83.25%** |
+| **urchade/gliner\_multi\_pii-v1**      | 79.19%    | 74.67% | **76.86%** |
+| **E3-JSI/gliner-multi-pii-domains-v1** | 78.35%    | 74.46% | **76.36%** |
+| **gravitee-io/gliner-pii-detection**   | 81.27%    | 56.76% | **66.84%** |
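The F1 column is the harmonic mean of precision and recall. The sketch below recomputes it for one row and shows how span-level scores could be derived from predictions. The card does not state the matching criterion used for the benchmark, so the exact-match comparison on `(start, end, label)` triples in `span_prf` is an assumption, and both function names are hypothetical:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def span_prf(predicted, gold):
    """Exact-match precision, recall, and F1 over (start, end, label) triples."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall, f1_score(precision, recall)


# Recompute the F1 of the base model row from its precision and recall.
print(round(f1_score(79.28, 82.78), 2))  # 80.99

# Toy example: one of two predictions matches one of two gold spans.
gold = [(0, 10, "name"), (23, 35, "phone number")]
predicted = [(0, 10, "name"), (40, 45, "ssn")]
print(span_prf(predicted, gold))  # (0.5, 0.5, 0.5)
```

Because the table values are rounded, recomputing F1 from the listed precision and recall can differ from the listed F1 in the last digits.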
+### Key Takeaways
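+* **Large Model** (`knowledgator/gliner-pii-large-v1.0`) achieves the **highest F1 score (83.25%)** and the highest precision (87.42%), indicating the strongest overall performance.
+* **Base Model** (`knowledgator/gliner-pii-base-v1.0`) achieves the **highest recall (82.78%)** and the second-highest F1 score (80.99%).
+* **Knowledgator Edge Model** (`knowledgator/gliner-pii-edge-v1.0`) is optimized for **edge environments**, trading a slight decrease in recall (72.34% versus 74.80% for the small model, at nearly identical precision) for lower latency and footprint.
+* **Gravitee-io Model** shows strong precision (81.27%) but the lowest recall (56.76%), suggesting it is tuned for high-confidence predictions and misses more entities.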
+### Comparison with Alternatives
+| Solution              | Speed | Privacy | Accuracy | Flexibility | Cost     |
+| --------------------- | ----- | ------- | -------- | ----------- | -------- |
+| **GLiNER**            | ⭐⭐⭐⭐  | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐⭐    | ⭐⭐⭐⭐        | Free     |
+| Cloud NER APIs        | ⭐⭐⭐   | ⭐⭐⭐     | ⭐⭐⭐⭐⭐    | ⭐⭐⭐         | \$\$\$   |
+| Large Language Models | ⭐⭐    | ⭐⭐      | ⭐⭐⭐⭐     | ⭐⭐⭐⭐        | \$\$\$\$ |
+| Traditional NER       | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐     | ⭐           | Free     |
+## 🚀 Alternative Implementations
+While Python provides the most comprehensive PII detection capabilities, GLiNER is available across multiple languages for different deployment scenarios.
+### 🦀 Rust Implementation (gline-rs)
+**Best for**: High-performance backend services, microservices
+```toml
+[dependencies]
+"gline-rs" = "1"
+```
+```rust
+use gline_rs::{GLiNER, TextInput, Parameters, RuntimeParameters};
+let model = GLiNER::<TokenMode>::new(
+    Parameters::default(),
+    RuntimeParameters::default(),
+    "tokenizer.json",
+    "model.onnx",
+)?;
+let input = TextInput::from_str(
+    &["My name is James Bond."],
+    &["person"],
+)?;
+let output = model.inference(input)?;
+```
+**Performance**: 4x faster than Python on CPU, 37x faster with GPU acceleration.
+### ⚡ C++ Implementation (GLiNER.cpp)
+**Best for**: Embedded systems, mobile apps, edge devices
+```cpp
+#include "GLiNER/model.hpp"
+gliner::Config config{12, 512};
+gliner::Model model("./model.onnx", "./tokenizer.json", config);
+std::vector<std::string> texts = {"John works at Microsoft"};
+std::vector<std::string> entities = {"person", "organization"};
+auto output = model.inference(texts, entities);
+```
+### 🌐 JavaScript Implementation (GLiNER.js)
+**Best for**: Web applications, browser-based processing
+```bash
+npm install gliner
+```
+```javascript
+import { Gliner } from 'gliner';
+const gliner = new Gliner({
+  tokenizerPath: "onnx-community/gliner_small-v2",
+  onnxSettings: {
+    modelPath: "public/model.onnx",
+    executionProvider: "webgpu",
+  }
+});
+await gliner.initialize();
+const results = await gliner.inference({
+  texts: ["John Smith works at Microsoft"],
+  entities: ["person", "organization"],
+  threshold: 0.1,
+});
+```
+## 🏗️ Model Architecture & Training
+### Quantization-Aware Pretraining
+GLiNER models use quantization-aware pretraining: the effects of reduced numeric precision are simulated during training, so accuracy degrades less when the weights are later quantized. This allows efficient inference even with quantized models.
+### Available ONNX Formats
+| Format | Size | Use Case |
+|--------|------|----------|
+| **FP16** | 330MB | Balanced performance/accuracy |
+| **UINT8** | 197MB | Maximum efficiency |
+### Model Conversion
+```bash
+python convert_to_onnx.py \
+  --model_path knowledgator/gliner-pii-base-v1.0 \
+  --save_path ./model \
+  --quantize True  # For UINT8 quantization
+```
+## 📄 References
+- [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)
+- [GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks](https://arxiv.org/abs/2406.12925)
+- [Named Entity Recognition as Structured Span Prediction](https://arxiv.org/abs/2212.13415)
+## 🙏 Acknowledgments
+Special thanks to all GLiNER contributors and the Wordcab team, with additional thanks to the maintainers of the Rust, C++, and JavaScript implementations.
+## 📞 Support
+- **Hugging Face**: [Ihor/gliner-pii-small](https://huggingface.co/Ihor/gliner-pii-small)
+- **GitHub Issues**: [Report bugs and request features](https://github.com/info-wordcab/wordcab-pii)
+- **Discord**: [Join community discussions](https://discord.gg/wRF7tuY9)
+---
+*GLiNER: Open-source privacy-first entity recognition for production applications.*