alexandrlukashov committed on
Commit
de016fa
·
verified ·
1 Parent(s): e10568a

Upload README.md

Files changed (1)
  1. README.md +468 -3
README.md CHANGED
@@ -1,3 +1,468 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: gliner
6
+ pipeline_tag: token-classification
7
+ tags:
8
+ - NER
9
+ - GLiNER
10
+ - information extraction
11
+ - PII
12
+ - PHI
13
+ - PCI
14
+ - entity recognition
15
+ - multilingual
16
+ ---
17
+
18
+
19
+ # GLiNER-PII: Zero-shot PII model
20
+
21
+ A production-grade open-source model for privacy-focused PII, PHI, and PCI detection with zero-shot entity recognition capabilities.
22
+ This model was developed in collaboration between [Wordcab](https://wordcab.com/) and [Knowledgator](https://www.knowledgator.com/). For enterprise-ready, specialized PII/PHI/PCI models, contact us at [email protected].
23
+
24
+ ## 🧠 What is GLiNER?
25
+
26
+ GLiNER (Generalist and Lightweight Named Entity Recognition) is a bidirectional transformer model that can identify **any entity type** without predefined categories. Unlike traditional NER models that are limited to specific entity classes, GLiNER allows you to specify exactly what entities you want to extract at runtime.
27
+
28
+ ### Key Advantages
29
+
30
+ - **Zero-shot recognition**: Extract any entity type without retraining
31
+ - **Privacy-first**: Process sensitive data locally without API calls
32
+ - **Lightweight**: Much faster than large language models for NER tasks
33
+ - **Production-ready**: Quantization-aware training with FP16 and UINT8 ONNX models
34
+ - **Comprehensive**: 60+ predefined PII categories with custom entity support
35
+
36
+ ### How GLiNER Works
37
+
38
+ Instead of predicting from a fixed set of entity classes, GLiNER takes both text and a list of desired entity types as input, then identifies spans that match those categories:
39
+
40
+ ```python
41
+ text = "John Smith called from 415-555-1234 to discuss his account."
42
+ entities = ["name", "phone number", "account number"]
43
+ # GLiNER finds: "John Smith" → name, "415-555-1234" → phone number
44
+ ```
45
+
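As a rough illustration of the input/output relationship described above, the sketch below checks which requested labels produced no span. The entity dicts are hand-written stand-ins that mimic the keys used elsewhere in this README (`text`, `label`, `start`, `end`); they are not model output, and `make_entity` and `missing_labels` are hypothetical helpers, not part of the `gliner` package.

```python
text = "John Smith called from 415-555-1234 to discuss his account."
requested = ["name", "phone number", "account number"]


def make_entity(text, span, label):
    """Build an entity dict with character offsets for a known substring."""
    start = text.index(span)
    # The end offset is exclusive, matching the slicing used for redaction.
    return {"text": span, "label": label, "start": start, "end": start + len(span)}


# Hand-written stand-ins for the two spans mentioned in the comment above.
entities = [
    make_entity(text, "John Smith", "name"),
    make_entity(text, "415-555-1234", "phone number"),
]


def missing_labels(requested, entities):
    """Return the requested labels that produced no span, in request order."""
    found = {entity["label"] for entity in entities}
    return [label for label in requested if label not in found]


print(missing_labels(requested, entities))  # ['account number']
```

A label that is requested but absent from the text simply yields no entity, so a check like this can help distinguish "nothing to redact" from "label never matched".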
46
+ ## 🐍 Python Implementation
47
+
48
+ The primary GLiNER implementation provides comprehensive PII detection with 60+ entity categories, fine-tuned specifically for privacy and compliance use cases.
49
+
50
+ ### Installation
51
+
52
+ ```bash
53
+ pip install gliner
54
+ ```
55
+
56
+ ### Quick Start
57
+
58
+ ```python
59
+ from gliner import GLiNER
60
+
61
+ # Load the model (downloads automatically on first use)
62
+ model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")
63
+
64
+ text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
65
+ labels = ["name", "phone number", "account number"]
66
+
67
+ entities = model.predict_entities(text, labels, threshold=0.3)
68
+
69
+ for entity in entities:
70
+     print(f"{entity['text']} => {entity['label']} (confidence: {entity['score']:.2f})")
71
+ ```
72
+
73
+ Output:
74
+ ```
75
+ John Smith => name (confidence: 0.95)
76
+ 415-555-1234 => phone number (confidence: 0.92)
77
+ 12345678 => account number (confidence: 0.88)
78
+ ```
79
+
80
+ ### Comprehensive PII Detection
81
+
82
+ The model is optimized for 60+ predefined PII categories, grouped by domain below. It also works zero-shot, so you can pass any other labels you need:
83
+
84
+ #### Personal Identifiers
85
+
86
+ ```python
87
+ personal_labels = [
88
+     "name",  # Full names
89
+     "first name",  # First names
90
+     "last name",  # Last names
91
+     "name medical professional",  # Healthcare provider names
92
+     "dob",  # Date of birth
93
+     "age",  # Age information
94
+     "gender",  # Gender identifiers
95
+     "marital status"  # Marital status
96
+ ]
97
+ ```
98
+
99
+ #### Contact Information
100
+
101
+ ```python
102
+ contact_labels = [
103
+     "email address",  # Email addresses
103
+     "phone number",  # Phone numbers
104
+     "ip address",  # IP addresses
105
+     "url",  # URLs
106
+     "location address",  # Street addresses
107
+     "location street",  # Street names
108
+     "location city",  # City names
109
+     "location state",  # State/province names
110
+     "location country",  # Country names
111
+     "location zip"  # ZIP/postal codes
113
+ ]
114
+ ```
115
+
116
+ #### Financial Information
117
+
118
+ ```python
119
+ financial_labels = [
120
+     "account number",  # Account numbers
120
+     "bank account",  # Bank account numbers
121
+     "routing number",  # Routing numbers
122
+     "credit card",  # Credit card numbers
123
+     "credit card expiration",  # Card expiration dates
124
+     "cvv",  # CVV/security codes
125
+     "ssn",  # Social Security Numbers
126
+     "money"  # Monetary amounts
128
+ ]
129
+ ```
130
+
131
+ #### Healthcare Information
132
+
133
+ ```python
134
+ healthcare_labels = [
135
+     "condition",  # Medical conditions
135
+     "medical process",  # Medical procedures
136
+     "drug",  # Drugs
137
+     "dose",  # Dosage information
138
+     "blood type",  # Blood types
139
+     "injury",  # Injuries
140
+     "organization medical facility",  # Healthcare facility names
141
+     "healthcare number",  # Healthcare numbers
142
+     "medical code"  # Medical codes
144
+ ]
145
+ ```
146
+
147
+ #### Identification Documents
148
+
149
+ ```python
150
+ id_labels = [
151
+     "passport number",  # Passport numbers
151
+     "driver license",  # Driver's license numbers
152
+     "username",  # Usernames
153
+     "password",  # Passwords
154
+     "vehicle id"  # Vehicle IDs
156
+ ]
157
+ ```
158
+
159
+ ### Advanced Usage Examples
160
+
161
+ #### Multi-Category Detection
162
+ ```python
163
+ text = """
164
+ Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024
165
+ from St. Mary's Hospital. Contact: [email protected], (555) 123-4567.
166
+ Insurance policy: POL-789456123.
167
+ """
168
+
169
+ labels = [
170
+     "name", "dob", "discharge date", "organization medical facility",
170
+     "email address", "phone number", "policy number"
172
+ ]
173
+
174
+ entities = model.predict_entities(text, labels, threshold=0.3)
175
+
176
+ for entity in entities:
177
+     print(f"Found '{entity['text']}' as {entity['label']}")
178
+ ```
179
+
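When many labels are requested at once, it can help to group the detected spans by label before reviewing them. The following is a minimal sketch under that assumption; `group_by_label` is a hypothetical helper, and the sample list is hand-written in the same shape as the entity dicts used in this README rather than real model output.

```python
from collections import defaultdict


def group_by_label(entities):
    """Collect detected span texts under their label, keeping detection order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written stand-ins shaped like predict_entities results.
sample = [
    {"text": "Mary Johnson", "label": "name"},
    {"text": "01/15/1980", "label": "dob"},
    {"text": "St. Mary's Hospital", "label": "organization medical facility"},
    {"text": "POL-789456123", "label": "policy number"},
]

for label, values in group_by_label(sample).items():
    print(f"{label}: {values}")
```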
180
+ #### Batch Processing for High Throughput
181
+ ```python
182
+ documents = [
183
+     "Customer John called about his credit card ending in 4532.",
184
+     "Sarah's SSN 123-45-6789 needs verification.",
185
+     "Email [email protected] for account 987654321 issues."
186
+ ]
187
+
188
+ labels = ["name", "credit card", "ssn", "email address", "account number"]
189
+
190
+ # Process multiple documents efficiently
191
+ results = model.run(documents, labels, threshold=0.3, batch_size=8)
192
+
193
+ for doc_idx, entities in enumerate(results):
194
+     print(f"\nDocument {doc_idx + 1}:")
195
+     for entity in entities:
196
+         print(f"  {entity['text']} => {entity['label']}")
197
+ ```
198
+
199
+ #### Custom Entity Detection
200
+ ```python
201
+ # GLiNER isn't limited to PII - you can detect any entities
202
+ text = "The MacBook Pro with M2 chip costs $1,999 at the Apple Store in Manhattan."
203
+ custom_labels = ["product", "processor", "price", "store", "location"]
204
+
205
+ entities = model.predict_entities(text, custom_labels, threshold=0.3)
206
+ ```
207
+
208
+ #### Threshold Optimization
209
+ ```python
210
+ # Lower threshold: Higher recall, more false positives
211
+ high_recall = model.predict_entities(text, labels, threshold=0.2)
212
+
213
+ # Higher threshold: Higher precision, fewer false positives
214
+ high_precision = model.predict_entities(text, labels, threshold=0.6)
215
+
216
+ # Recommended starting point for production
217
+ balanced = model.predict_entities(text, labels, threshold=0.3)
218
+ ```
219
+
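One possible way to combine these settings is a single low-threshold pass followed by per-label cutoffs, so that high-risk labels can be held to a stricter score. This is only a sketch: `filter_by_label_threshold` is a hypothetical helper (not part of the `gliner` API), and the candidate entities and cutoff values are made up for illustration.

```python
def filter_by_label_threshold(entities, thresholds, default=0.3):
    """Keep entities whose score meets the cutoff configured for their label."""
    return [
        entity
        for entity in entities
        if entity["score"] >= thresholds.get(entity["label"], default)
    ]


# Hand-written stand-ins for the result of a low-threshold prediction pass.
candidates = [
    {"text": "John Smith", "label": "name", "score": 0.95},
    {"text": "account", "label": "account number", "score": 0.25},
    {"text": "123-45-6789", "label": "ssn", "score": 0.55},
]

# Hypothetical cutoffs: strict for "ssn", lenient for "name", default otherwise.
thresholds = {"ssn": 0.6, "name": 0.2}

kept = filter_by_label_threshold(candidates, thresholds)
print([entity["text"] for entity in kept])  # ['John Smith']
```

Suitable cutoffs depend on the data, so they are best tuned against a labeled sample rather than copied from this example.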
220
+ ## 💡 Use Cases
221
+
222
+ GLiNER excels in privacy-focused applications where traditional cloud-based NER services pose compliance risks.
223
+
224
+ ### 🎯 **Primary Applications**
225
+
226
+ #### Privacy-First Voice & Transcription
227
+ ```python
228
+ # Automatically redact PII from voice transcriptions
229
+ transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"
230
+ pii_labels = ["name", "phone number", "email address", "ssn"]
231
+
232
+ entities = model.predict_entities(transcription, pii_labels)
233
+ # Redact or anonymize detected PII before storage
234
+ ```
235
+
236
+ #### Compliance-Ready Document Processing
237
+ ```python
238
+ # Healthcare: HIPAA-compliant note processing
239
+ medical_note = "Patient John Doe, MRN 123456, diagnosed with diabetes..."
240
+ phi_labels = ["name", "medical record number", "condition", "dob"]
241
+
242
+ # Finance: PCI-DSS compliant transaction logs
243
+ transaction_log = "Card ****4532 charged $299.99 to John Smith"
244
+ pci_labels = ["credit card", "money", "name"]
245
+
246
+ # Legal: Attorney-client privilege protection
247
+ legal_doc = "Client Jane Doe vs. Corporation ABC, case #2024-CV-001"
248
+ legal_labels = ["name", "organization", "case number"]
249
+ ```
250
+
251
+ #### Real-Time Data Anonymization
252
+ ```python
253
+ def anonymize_text(text, entity_types):
254
+     """Anonymize PII in real-time"""
255
+     entities = model.predict_entities(text, entity_types)
256
+
257
+     # Sort by position to replace from end to start
258
+     entities.sort(key=lambda x: x['start'], reverse=True)
259
+
260
+     anonymized = text
261
+     for entity in entities:
262
+         placeholder = f"<{entity['label'].upper()}>"
263
+         anonymized = anonymized[:entity['start']] + placeholder + anonymized[entity['end']:]
264
+
265
+     return anonymized
266
+
267
+ original = "John Smith's SSN is 123-45-6789"
268
+ anonymized = anonymize_text(original, ["name", "ssn"])
269
+ print(anonymized) # "<NAME>'s SSN is <SSN>"
270
+ ```
271
+
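A variation on the approach above is to number the placeholders, so that repeated mentions of the same value map to the same tag and the replacement can be reversed later. The sketch below works on already-detected entity dicts and assumes the spans do not overlap; `pseudonymize` and `find_spans` are hypothetical helpers, and the entities are built by string search instead of a model call.

```python
import re


def find_spans(text, value, label):
    """Hypothetical helper: build entity dicts for every occurrence of value."""
    return [
        {"text": value, "label": label, "start": m.start(), "end": m.end()}
        for m in re.finditer(re.escape(value), text)
    ]


def pseudonymize(text, entities):
    """Replace each span with a numbered placeholder; return text and mapping."""
    ids = {}       # (label, original text) -> placeholder
    counters = {}  # label -> number of distinct values seen so far
    # Assign numbers in reading order so the first name becomes <NAME_1>.
    for entity in sorted(entities, key=lambda e: e["start"]):
        key = (entity["label"], entity["text"])
        if key not in ids:
            counters[entity["label"]] = counters.get(entity["label"], 0) + 1
            tag = entity["label"].upper().replace(" ", "_")
            ids[key] = f"<{tag}_{counters[entity['label']]}>"
    # Replace from the end so that earlier offsets stay valid.
    result = text
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = ids[(entity["label"], entity["text"])]
        result = result[:entity["start"]] + placeholder + result[entity["end"]:]
    mapping = {placeholder: original for (_, original), placeholder in ids.items()}
    return result, mapping


text = "John Smith emailed Jane Doe. John Smith's SSN is 123-45-6789."
entities = (
    find_spans(text, "John Smith", "name")
    + find_spans(text, "Jane Doe", "name")
    + find_spans(text, "123-45-6789", "ssn")
)
redacted, mapping = pseudonymize(text, entities)
print(redacted)  # <NAME_1> emailed <NAME_2>. <NAME_1>'s SSN is <SSN_1>.
```

The returned mapping contains the original values, so it is itself sensitive and would need to be stored as carefully as the source text.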
272
+ ### 🌟 **Extended Applications**
273
+
274
+ #### Enhanced Search & Content Understanding
275
+ ```python
276
+ # Extract key entities from user queries for better search
277
+ query = "Find restaurants near Stanford University in Palo Alto"
278
+ search_entities = ["organization", "location city", "business type"]
279
+
280
+ # Intelligent document tagging
281
+ document = "This quarterly report discusses Microsoft's Azure growth..."
282
+ doc_entities = ["organization", "product", "time period"]
283
+ ```
284
+
285
+ #### GDPR-Compliant Chatbot Logs
286
+ ```python
287
+ def sanitize_chat_log(message):
288
+     """Remove PII from chat logs per GDPR requirements"""
289
+     sensitive_types = [
290
+         "name", "email address", "phone number", "location address",
291
+         "credit card", "ssn", "passport number"
292
+     ]
293
+
294
+     entities = model.predict_entities(message, sensitive_types)
295
+     if entities:
296
+         # Return the anonymized version; a compliance alert could be raised here
297
+         return anonymize_text(message, sensitive_types)
298
+     return message
299
+ ```
300
+
301
+ #### Secure Mobile & Edge Processing
302
+ ```python
303
+ # Process sensitive data entirely on-device
304
+ def process_locally(user_input):
305
+     """Process PII detection without cloud APIs"""
306
+     pii_types = ["name", "phone number", "email address", "ssn", "credit card"]
307
+
308
+     # All processing happens locally - no data leaves device
309
+     detected_pii = model.predict_entities(user_input, pii_types)
310
+
311
+     if detected_pii:
312
+         return "⚠️ Sensitive information detected - proceed with caution"
313
+     return "✅ No PII detected - safe to share"
314
+ ```
315
+
316
+ ## 📊 Performance Benchmarks
317
+
318
+ ### Accuracy Evaluation
319
+
320
+ The following benchmarks were run on the **synthetic-multi-pii-ner-v1** dataset.
321
+ We compare multiple GLiNER-based PII models, including our new **Knowledgator GLiNER PII Edge v1.0**.
322
+
323
+ | Model Path | Precision | Recall | F1 Score |
324
+ | ---------------------------------------------------------------------- | --------- | ------ | ---------- |
325
+ | **knowledgator/gliner-pii-edge-v1.0** | 78.96% | 72.34% | **75.50%** |
326
+ | **knowledgator/gliner-pii-small-v1.0** | 78.99% | 74.80% | **76.84%** |
327
+ | **knowledgator/gliner-pii-base-v1.0** | 79.28% | 82.78% | **80.99%** |
328
+ | **knowledgator/gliner-pii-large-v1.0** | 87.42% | 79.4% | **83.25%** |
329
+ | **urchade/gliner\_multi\_pii-v1** | 79.19% | 74.67% | **76.86%** |
330
+ | **E3-JSI/gliner-multi-pii-domains-v1** | 78.35% | 74.46% | **76.36%** |
331
+ | **gravitee-io/gliner-pii-detection** | 81.27% | 56.76% | **66.84%** |
332
+
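For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the rounded percentages in the table; because the inputs are already rounded, the recomputed value can differ from the reported one in the last digits (the `rows` dict simply restates the table, and `f1` is a helper written for this example).

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the table above.
rows = {
    "knowledgator/gliner-pii-edge-v1.0": (78.96, 72.34, 75.50),
    "knowledgator/gliner-pii-small-v1.0": (78.99, 74.80, 76.84),
    "knowledgator/gliner-pii-base-v1.0": (79.28, 82.78, 80.99),
    "knowledgator/gliner-pii-large-v1.0": (87.42, 79.4, 83.25),
    "urchade/gliner_multi_pii-v1": (79.19, 74.67, 76.86),
    "E3-JSI/gliner-multi-pii-domains-v1": (78.35, 74.46, 76.36),
    "gravitee-io/gliner-pii-detection": (81.27, 56.76, 66.84),
}

for name, (precision, recall, reported) in rows.items():
    print(f"{name}: recomputed {f1(precision, recall):.2f}, reported {reported:.2f}")
```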
333
+ ### Key Takeaways
334
+
335
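+ * **Large Model** (`knowledgator/gliner-pii-large-v1.0`) achieves the **highest F1 score (83.25%)** and the highest precision (87.42%), while the **Base Model** (`knowledgator/gliner-pii-base-v1.0`) has the highest recall (82.78%) and the second-highest F1 score (80.99%).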
336
+ * **Knowledgator Edge Model** (`knowledgator/gliner-pii-edge-v1.0`) is optimized for **edge environments**, trading a slight decrease in recall for lower latency and footprint.
337
+ * **Gravitee-io Model** shows strong precision but lower recall, indicating it is tuned for high confidence but misses more entities.
338
+
339
+
340
+ ### Comparison with Alternatives
341
+
342
+ | Solution | Speed | Privacy | Accuracy | Flexibility | Cost |
343
+ | --------------------- | ----- | ------- | -------- | ----------- | -------- |
344
+ | **GLiNER**            | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Free     |
345
+ | Cloud NER APIs        | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | \$\$\$   |
346
+ | Large Language Models | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | \$\$\$\$ |
347
+ | Traditional NER       | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | Free     |
348
+
349
+
350
+ ## 🚀 Alternative Implementations
351
+
352
+ While the Python package provides the most comprehensive PII detection capabilities, GLiNER is also available in several other programming languages for different deployment scenarios.
353
+
354
+ ### 🦀 Rust Implementation (gline-rs)
355
+
356
+ **Best for**: High-performance backend services, microservices
357
+
358
+ ```toml
359
+ [dependencies]
360
+ "gline-rs" = "1"
361
+ ```
362
+
363
+ ```rust
364
+ use gline_rs::{GLiNER, TextInput, Parameters, RuntimeParameters};
365
+
366
+ let model = GLiNER::<TokenMode>::new(
367
+     Parameters::default(),
368
+     RuntimeParameters::default(),
369
+     "tokenizer.json",
370
+     "model.onnx",
371
+ )?;
372
+
373
+ let input = TextInput::from_str(
374
+     &["My name is James Bond."],
375
+     &["person"],
376
+ )?;
377
+
378
+ let output = model.inference(input)?;
379
+ ```
380
+
381
+ **Performance**: 4x faster than Python on CPU, 37x faster with GPU acceleration.
382
+
383
+ ### ⚡ C++ Implementation (GLiNER.cpp)
384
+
385
+ **Best for**: Embedded systems, mobile apps, edge devices
386
+
387
+ ```cpp
388
+ #include "GLiNER/model.hpp"
389
+
390
+ gliner::Config config{12, 512};
391
+ gliner::Model model("./model.onnx", "./tokenizer.json", config);
392
+
393
+ std::vector<std::string> texts = {"John works at Microsoft"};
394
+ std::vector<std::string> entities = {"person", "organization"};
395
+
396
+ auto output = model.inference(texts, entities);
397
+ ```
398
+
399
+ ### 🌐 JavaScript Implementation (GLiNER.js)
400
+
401
+ **Best for**: Web applications, browser-based processing
402
+
403
+ ```bash
404
+ npm install gliner
405
+ ```
406
+
407
+ ```javascript
408
+ import { Gliner } from 'gliner';
409
+
410
+ const gliner = new Gliner({
411
+   tokenizerPath: "onnx-community/gliner_small-v2",
412
+   onnxSettings: {
413
+     modelPath: "public/model.onnx",
414
+     executionProvider: "webgpu",
415
+   }
416
+ });
417
+
418
+ await gliner.initialize();
419
+
420
+ const results = await gliner.inference({
421
+   texts: ["John Smith works at Microsoft"],
422
+   entities: ["person", "organization"],
423
+   threshold: 0.1,
424
+ });
425
+ ```
426
+
427
+ ## 🏗️ Model Architecture & Training
428
+
429
+ ### Quantization-Aware Pretraining
430
+
431
+ GLiNER models use quantization-aware pretraining: the effects of reduced numeric precision are accounted for during training, so the quantized models stay close to full-precision accuracy while running more efficiently.
432
+
433
+ ### Available ONNX Formats
434
+
435
+ | Format | Size | Use Case |
436
+ |--------|------|----------|
437
+ | **FP16** | 330MB | Balanced performance/accuracy |
438
+ | **UINT8** | 197MB | Maximum efficiency |
439
+
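As a quick worked comparison of the two sizes listed above (the only inputs are the figures from the table):

```python
# Sizes in MB, copied from the table above.
fp16_mb = 330
uint8_mb = 197

# Relative size reduction of UINT8 compared with FP16.
reduction = 1 - uint8_mb / fp16_mb
print(f"UINT8 is about {reduction:.0%} smaller than FP16")  # about 40%
```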
440
+ ### Model Conversion
441
+
442
+ ```bash
443
+ python convert_to_onnx.py \
444
+     --model_path knowledgator/gliner-pii-base-v1.0 \
445
+     --save_path ./model \
446
+     --quantize True # For UINT8 quantization
447
+ ```
448
+
449
+
450
+ ## 📄 References
451
+
452
+ - [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)
453
+ - [GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks](https://arxiv.org/abs/2406.12925)
454
+ - [Named Entity Recognition as Structured Span Prediction](https://arxiv.org/abs/2212.13415)
455
+
456
+ ## 🙏 Acknowledgments
457
+
458
+ Special thanks to all GLiNER contributors and the Wordcab team, with additional thanks to the maintainers of the Rust, C++, and JavaScript implementations.
459
+
460
+ ## 📞 Support
461
+
462
+ - **Hugging Face**: [Ihor/gliner-pii-small](https://huggingface.co/Ihor/gliner-pii-small)
463
+ - **GitHub Issues**: [Report bugs and request features](https://github.com/info-wordcab/wordcab-pii)
464
+ - **Discord**: [Join community discussions](https://discord.gg/wRF7tuY9)
465
+
466
+ ---
467
+
468
+ *GLiNER: Open-source privacy-first entity recognition for production applications.*