---
license: apache-2.0
language:
- en
library_name: gliner
pipeline_tag: token-classification
tags:
- NER
- GLiNER
- information extraction
- PII
- PHI
- PCI
- entity recognition
- multilingual
---


# GLiNER-PII: Zero-shot PII model

A production-grade open-source model for privacy-focused PII, PHI, and PCI detection with zero-shot entity recognition capabilities. 
This model was developed in collaboration between [Wordcab](https://wordcab.com/) and [Knowledgator](https://www.knowledgator.com/). For enterprise-ready, specialized PII/PHI/PCI models, contact us at [email protected].

## 🧠 What is GLiNER?

GLiNER (Generalist and Lightweight Named Entity Recognition) is a bidirectional transformer model that can identify **any entity type** without predefined categories. Unlike traditional NER models that are limited to specific entity classes, GLiNER allows you to specify exactly what entities you want to extract at runtime.

### Key Advantages

- **Zero-shot recognition**: Extract any entity type without retraining
- **Privacy-first**: Process sensitive data locally without API calls
- **Lightweight**: Much faster than large language models for NER tasks
- **Production-ready**: Quantization-aware training with FP16 and UINT8 ONNX models
- **Comprehensive**: 60+ predefined PII categories with custom entity support

### How GLiNER Works

Instead of predicting from a fixed set of entity classes, GLiNER takes both text and a list of desired entity types as input, then identifies spans that match those categories:

```python
text = "John Smith called from 415-555-1234 to discuss his account."
entities = ["name", "phone number", "account number"]
# GLiNER finds: "John Smith" → name, "415-555-1234" → phone number
```

## 🐍 Python Implementation

The primary GLiNER implementation provides comprehensive PII detection with 60+ entity categories, fine-tuned specifically for privacy and compliance use cases.

### Installation

```bash
pip install gliner
```

### Quick Start

```python
from gliner import GLiNER

# Load the model (downloads automatically on first use)
model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")

text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
labels = ["name", "phone number", "account number"]

entities = model.predict_entities(text, labels, threshold=0.3)

for entity in entities:
    print(f"{entity['text']} => {entity['label']} (confidence: {entity['score']:.2f})")
```

Output:
```
John Smith => name (confidence: 0.95)
415-555-1234 => phone number (confidence: 0.92)
12345678 => account number (confidence: 0.88)
```

### Comprehensive PII Detection

The model is optimized for 60+ predefined PII categories, organized by domain below. It also works zero-shot, so you can pass any other labels you need:

#### Personal Identifiers

```python
personal_labels = [
    "name",                       # Full names
    "first name",                 # First names  
    "last name",                  # Last names
    "name medical professional",  # Healthcare provider names
    "dob",                        # Date of birth
    "age",                        # Age information
    "gender",                     # Gender identifiers
    "marital status"              # Marital status
]
```

#### Contact Information

```python
contact_labels = [
    "email address",          # Email addresses
    "phone number",           # Phone numbers
    "ip address",             # IP addresses
    "url",                    # URLs
    "location address",       # Street addresses
    "location street",        # Street names
    "location city",          # City names
    "location state",         # State/province names
    "location country",       # Country names
    "location zip"            # ZIP/postal codes
]
```

#### Financial Information

```python
financial_labels = [
    "account number",         # Account numbers
    "bank account",           # Bank account numbers
    "routing number",         # Routing numbers
    "credit card",            # Credit card numbers
    "credit card expiration", # Card expiration dates  
    "cvv",                    # CVV/security codes
    "ssn",                    # Social Security Numbers
    "money"                   # Monetary amounts
]
```

#### Healthcare Information

```python
healthcare_labels = [
    "condition",                    # Medical conditions
    "medical process",              # Medical procedures
    "drug",                         # Drugs
    "dose",                         # Dosage information
    "blood type",                   # Blood types
    "injury",                       # Injuries
    "organization medical facility",# Healthcare facility names
    "healthcare number",            # Healthcare numbers
    "medical code"                  # Medical codes
]
```

#### Identification Documents

```python
id_labels = [
    "passport number",       # Passport numbers
    "driver license",        # Driver's license numbers
    "username",              # Usernames
    "password",              # Passwords
    "vehicle id"             # Vehicle IDs
]
```

### Advanced Usage Examples

#### Multi-Category Detection
```python
text = """
Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024 
from St. Mary's Hospital. Contact: [email protected], (555) 123-4567.
Insurance policy: POL-789456123.
"""

labels = [
    "name", "dob", "discharge date", "organization medical facility",
    "email address", "phone number", "policy number"
]

entities = model.predict_entities(text, labels, threshold=0.3)

for entity in entities:
    print(f"Found '{entity['text']}' as {entity['label']}")
```

#### Batch Processing for High Throughput
```python
documents = [
    "Customer John called about his credit card ending in 4532.",
    "Sarah's SSN 123-45-6789 needs verification.",
    "Email [email protected] for account 987654321 issues."
]

labels = ["name", "credit card", "ssn", "email address", "account number"]

# Process multiple documents efficiently
results = model.run(documents, labels, threshold=0.3, batch_size=8)

for doc_idx, entities in enumerate(results):
    print(f"\nDocument {doc_idx + 1}:")
    for entity in entities:
        print(f"  {entity['text']} => {entity['label']}")
```

#### Custom Entity Detection
```python
# GLiNER isn't limited to PII - you can detect any entities
text = "The MacBook Pro with M2 chip costs $1,999 at the Apple Store in Manhattan."
custom_labels = ["product", "processor", "price", "store", "location"]

entities = model.predict_entities(text, custom_labels, threshold=0.3)
```

#### Threshold Optimization
```python
# Lower threshold: Higher recall, more false positives
high_recall = model.predict_entities(text, labels, threshold=0.2)

# Higher threshold: Higher precision, fewer false positives
high_precision = model.predict_entities(text, labels, threshold=0.6)

# Recommended starting point for production
balanced = model.predict_entities(text, labels, threshold=0.3)
```
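
A single global threshold may not suit every label: a missed SSN is usually costlier than a missed city name. The following is a minimal sketch, assuming the entity dictionaries carry `label` and `score` keys as in the Quick Start output. The idea is to call the model once with a low threshold and then apply stricter per-label cut-offs. The helper name `filter_by_label` and the threshold values are hypothetical and not part of the GLiNER API.

```python
def filter_by_label(entities, label_thresholds, default=0.3):
    """Keep entities whose score meets the threshold configured for their label."""
    kept = []
    for entity in entities:
        # Fall back to the default cut-off for labels without an explicit entry
        cutoff = label_thresholds.get(entity["label"], default)
        if entity["score"] >= cutoff:
            kept.append(entity)
    return kept


# Hand-written entities in the shape returned by predict_entities (illustrative scores)
candidates = [
    {"text": "John Smith", "label": "name", "score": 0.55},
    {"text": "123-45-6789", "label": "ssn", "score": 0.25},
    {"text": "Paris", "label": "location city", "score": 0.35},
]

# Permissive for SSNs (favor recall), strict for names (favor precision)
thresholds = {"ssn": 0.2, "name": 0.6}
filtered = filter_by_label(candidates, thresholds)
print([e["label"] for e in filtered])
```

In practice `candidates` would come from `model.predict_entities(text, labels, threshold=0.2)`, so that low-scoring spans are still available for the per-label filter.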

## 💡 Use Cases

GLiNER excels in privacy-focused applications where traditional cloud-based NER services pose compliance risks.

### 🎯 **Primary Applications**

#### Privacy-First Voice & Transcription
```python
# Automatically redact PII from voice transcriptions
transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"
pii_labels = ["name", "phone number", "email address", "ssn"]

entities = model.predict_entities(transcription, pii_labels)
# Redact or anonymize detected PII before storage
```

#### Compliance-Ready Document Processing  
```python
# Healthcare: HIPAA-compliant note processing
medical_note = "Patient John Doe, MRN 123456, diagnosed with diabetes..."
phi_labels = ["name", "medical record number", "condition", "dob"]

# Finance: PCI-DSS compliant transaction logs
transaction_log = "Card ****4532 charged $299.99 to John Smith"
pci_labels = ["credit card", "money", "name"]

# Legal: Attorney-client privilege protection
legal_doc = "Client Jane Doe vs. Corporation ABC, case #2024-CV-001"
legal_labels = ["name", "organization", "case number"]
```

#### Real-Time Data Anonymization
```python
def anonymize_text(text, entity_types):
    """Anonymize PII in real-time"""
    entities = model.predict_entities(text, entity_types)
    
    # Sort by position to replace from end to start
    entities.sort(key=lambda x: x['start'], reverse=True)
    
    anonymized = text
    for entity in entities:
        placeholder = f"<{entity['label'].upper()}>"
        anonymized = anonymized[:entity['start']] + placeholder + anonymized[entity['end']:]
    
    return anonymized

original = "John Smith's SSN is 123-45-6789"
anonymized = anonymize_text(original, ["name", "ssn"])
print(anonymized)  # "<NAME>'s SSN is <SSN>"
```
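
End-to-start replacement keeps offsets valid only when spans do not overlap. If overlapping spans are returned (for example, when nested or multi-label prediction is enabled), one way to stay safe is to resolve overlaps before substituting. The sketch below assumes each entity has `start`, `end`, `label`, and `score` keys; the helper `redact_spans` and its tie-breaking rule (higher score first, then longer span) are illustrative choices, not part of GLiNER.

```python
def redact_spans(text, entities):
    """Replace entity spans with <LABEL> placeholders, resolving overlaps first."""
    # Prefer higher-scoring spans; on ties prefer the longer one
    ranked = sorted(
        entities,
        key=lambda e: (e.get("score", 0.0), e["end"] - e["start"]),
        reverse=True,
    )
    chosen = []
    for entity in ranked:
        overlaps = any(
            entity["start"] < kept["end"] and kept["start"] < entity["end"]
            for kept in chosen
        )
        if not overlaps:
            chosen.append(entity)

    # Replace from the end of the string so earlier offsets stay valid
    redacted = text
    for entity in sorted(chosen, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{entity['label'].upper()}>"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


sample = "John Smith's SSN is 123-45-6789"
spans = [
    {"start": 0, "end": 10, "label": "name", "score": 0.95},
    {"start": 0, "end": 4, "label": "first name", "score": 0.80},  # overlaps "name"
    {"start": 20, "end": 31, "label": "ssn", "score": 0.90},
]
print(redact_spans(sample, spans))  # <NAME>'s SSN is <SSN>
```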

### 🌟 **Extended Applications**

#### Enhanced Search & Content Understanding
```python
# Extract key entities from user queries for better search
query = "Find restaurants near Stanford University in Palo Alto"
search_entities = ["organization", "location city", "business type"]

# Intelligent document tagging
document = "This quarterly report discusses Microsoft's Azure growth..."
doc_entities = ["organization", "product", "time period"]
```

#### GDPR-Compliant Chatbot Logs
```python
def sanitize_chat_log(message):
    """Remove PII from chat logs per GDPR requirements"""
    sensitive_types = [
        "name", "email address", "phone number", "location address",
        "credit card", "ssn", "passport number"
    ]
    
    entities = model.predict_entities(message, sensitive_types)
    if entities:
        # Log anonymized version, alert compliance team
        return anonymize_text(message, sensitive_types)
    return message
```

#### Secure Mobile & Edge Processing
```python
# Process sensitive data entirely on-device
def process_locally(user_input):
    """Process PII detection without cloud APIs"""
    pii_types = ["name", "phone number", "email address", "ssn", "credit card"]
    
    # All processing happens locally - no data leaves device
    detected_pii = model.predict_entities(user_input, pii_types)
    
    if detected_pii:
        return "⚠️ Sensitive information detected - proceed with caution"
    return "✅ No PII detected - safe to share"
```

## 📊 Performance Benchmarks

### Accuracy Evaluation

The following benchmarks were run on the **synthetic-multi-pii-ner-v1** dataset.
We compare multiple GLiNER-based PII models, including our new **Knowledgator GLiNER PII Edge v1.0**.

| Model Path                             | Precision | Recall | F1 Score   |
| -------------------------------------- | --------- | ------ | ---------- |
| **knowledgator/gliner-pii-edge-v1.0**  | 78.96%    | 72.34% | **75.50%** |
| **knowledgator/gliner-pii-small-v1.0** | 78.99%    | 74.80% | **76.84%** |
| **knowledgator/gliner-pii-base-v1.0**  | 79.28%    | 82.78% | **80.99%** |
| **knowledgator/gliner-pii-large-v1.0** | 87.42%    | 79.4%  | **83.25%** |
| **urchade/gliner\_multi\_pii-v1**      | 79.19%    | 74.67% | **76.86%** |
| **E3-JSI/gliner-multi-pii-domains-v1** | 78.35%    | 74.46% | **76.36%** |
| **gravitee-io/gliner-pii-detection**   | 81.27%    | 56.76% | **66.84%** |

### Key Takeaways

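* **Large Model** (`knowledgator/gliner-pii-large-v1.0`) achieves the **highest F1 score (83.25%)** and the highest precision (87.42%), while the **Base Model** (`knowledgator/gliner-pii-base-v1.0`) has the highest recall (82.78%) and the second-highest F1 score (80.99%).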
* **Knowledgator Edge Model** (`knowledgator/gliner-pii-edge-v1.0`) is optimized for **edge environments**, trading a slight decrease in recall for lower latency and footprint.
* **Gravitee-io Model** shows strong precision but lower recall, indicating it is tuned for high confidence but misses more entities.
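
As a quick sanity check on the benchmark table, the F1 column can be recomputed from precision and recall. This sketch assumes the reported F1 is the usual harmonic mean, F1 = 2PR / (P + R), of the precision and recall in the same row; small differences in the last digit are expected because the table values are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from three rows of the benchmark table
reported = {
    "knowledgator/gliner-pii-base-v1.0": (79.28, 82.78, 80.99),
    "knowledgator/gliner-pii-small-v1.0": (78.99, 74.80, 76.84),
    "gravitee-io/gliner-pii-detection": (81.27, 56.76, 66.84),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{name}: computed {f1:.2f}, reported {f1_reported:.2f}")
```

The Gravitee-io row illustrates why F1 penalizes imbalance: precision above 81% cannot compensate for recall below 57%.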


### Comparison with Alternatives

| Solution              | Speed | Privacy | Accuracy | Flexibility | Cost     |
| --------------------- | ----- | ------- | -------- | ----------- | -------- |
| **GLiNER**            | ⭐⭐⭐⭐  | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐⭐    | ⭐⭐⭐⭐        | Free     |
| Cloud NER APIs        | ⭐⭐⭐   | ⭐⭐⭐     | ⭐⭐⭐⭐⭐    | ⭐⭐⭐         | \$\$\$   |
| Large Language Models | ⭐⭐    | ⭐⭐      | ⭐⭐⭐⭐     | ⭐⭐⭐⭐        | \$\$\$\$ |
| Traditional NER       | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐     | ⭐           | Free     |


## 🚀 Alternative Implementations

While Python provides the most comprehensive PII detection capabilities, GLiNER is available across multiple languages for different deployment scenarios.

### 🦀 Rust Implementation (gline-rs)

**Best for**: High-performance backend services, microservices

```toml
[dependencies]
"gline-rs" = "1"
```

```rust
use gline_rs::{GLiNER, TextInput, Parameters, RuntimeParameters};

let model = GLiNER::<TokenMode>::new(
    Parameters::default(),
    RuntimeParameters::default(),
    "tokenizer.json",
    "model.onnx",
)?;

let input = TextInput::from_str(
    &["My name is James Bond."],
    &["person"],
)?;

let output = model.inference(input)?;
```

**Performance**: 4x faster than Python on CPU, 37x faster with GPU acceleration.

### ⚑ C++ Implementation (GLiNER.cpp)

**Best for**: Embedded systems, mobile apps, edge devices

```cpp
#include "GLiNER/model.hpp"

gliner::Config config{12, 512};
gliner::Model model("./model.onnx", "./tokenizer.json", config);

std::vector<std::string> texts = {"John works at Microsoft"};
std::vector<std::string> entities = {"person", "organization"};

auto output = model.inference(texts, entities);
```

### 🌐 JavaScript Implementation (GLiNER.js) 

**Best for**: Web applications, browser-based processing

```bash
npm install gliner
```

```javascript
import { Gliner } from 'gliner';

const gliner = new Gliner({
  tokenizerPath: "onnx-community/gliner_small-v2",
  onnxSettings: {
    modelPath: "public/model.onnx",
    executionProvider: "webgpu",
  }
});

await gliner.initialize();

const results = await gliner.inference({
  texts: ["John Smith works at Microsoft"],
  entities: ["person", "organization"],
  threshold: 0.1,
});
```

## 🏗️ Model Architecture & Training

### Quantization-Aware Pretraining

GLiNER models use quantization-aware pretraining, which optimizes performance while maintaining accuracy. This allows efficient inference even with quantized models.

### Available ONNX Formats

| Format | Size | Use Case |
|--------|------|----------|
| **FP16** | 330MB | Balanced performance/accuracy |
| **UINT8** | 197MB | Maximum efficiency |

### Model Conversion

```bash
python convert_to_onnx.py \
  --model_path knowledgator/gliner-pii-base-v1.0 \
  --save_path ./model \
  --quantize True  # For UINT8 quantization
```


## 📄 References

- [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)
- [GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks](https://arxiv.org/abs/2406.12925) 
- [Named Entity Recognition as Structured Span Prediction](https://arxiv.org/abs/2212.13415)

## 🙏 Acknowledgments

Special thanks to all GLiNER contributors and the Wordcab team, and additional thanks to the maintainers of the Rust, C++, and JavaScript implementations.

## 📞 Support

- **Hugging Face**: [Ihor/gliner-pii-small](https://huggingface.co/Ihor/gliner-pii-small)
- **GitHub Issues**: [Report bugs and request features](https://github.com/info-wordcab/wordcab-pii)
- **Discord**: [Join community discussions](https://discord.gg/wRF7tuY9)

---

*GLiNER: Open-source privacy-first entity recognition for production applications.*