Upload README.md
README.md
CHANGED
@@ -1,3 +1,468 @@
|
|
1 |
-
---
|
2 |
-
license: apache-2.0
|
3 |
-
|
1 |
+
---
|
2 |
+
license: apache-2.0
|
3 |
+
language:
|
4 |
+
- en
|
5 |
+
library_name: gliner
|
6 |
+
pipeline_tag: token-classification
|
7 |
+
tags:
|
8 |
+
- NER
|
9 |
+
- GLiNER
|
10 |
+
- information extraction
|
11 |
+
- PII
|
12 |
+
- PHI
|
13 |
+
- PCI
|
14 |
+
- entity recognition
|
15 |
+
- multilingual
|
16 |
+
---
|
17 |
+
|
18 |
+
|
19 |
+
# GLiNER-PII: Zero-shot PII model
|
20 |
+
|
21 |
+
A production-grade open-source model for privacy-focused PII, PHI, and PCI detection with zero-shot entity recognition capabilities.
|
22 |
+
This model was developed in collaboration between [Wordcab](https://wordcab.com/) and [Knowledgator](https://www.knowledgator.com/). For enterprise-ready, specialized PII/PHI/PCI models, contact us at [email protected].
|
23 |
+
|
24 |
+
## What is GLiNER?
|
25 |
+
|
26 |
+
GLiNER (Generalist and Lightweight Named Entity Recognition) is a bidirectional transformer model that can identify **any entity type** without predefined categories. Unlike traditional NER models that are limited to specific entity classes, GLiNER allows you to specify exactly what entities you want to extract at runtime.
|
27 |
+
|
28 |
+
### Key Advantages
|
29 |
+
|
30 |
+
- **Zero-shot recognition**: Extract any entity type without retraining
|
31 |
+
- **Privacy-first**: Process sensitive data locally without API calls
|
32 |
+
- **Lightweight**: Much faster than large language models for NER tasks
|
33 |
+
- **Production-ready**: Quantization-aware training with FP16 and UINT8 ONNX models
|
34 |
+
- **Comprehensive**: 60+ predefined PII categories with custom entity support
|
35 |
+
|
36 |
+
### How GLiNER Works
|
37 |
+
|
38 |
+
Instead of predicting from a fixed set of entity classes, GLiNER takes both text and a list of desired entity types as input, then identifies spans that match those categories:
|
39 |
+
|
40 |
+
```python
|
41 |
+
text = "John Smith called from 415-555-1234 to discuss his account."
|
42 |
+
entities = ["name", "phone number", "account number"]
|
43 |
+
# GLiNER finds: "John Smith" → name, "415-555-1234" → phone number
|
44 |
+
```
|
45 |
+
|
46 |
+
## Python Implementation
|
47 |
+
|
48 |
+
The primary GLiNER implementation provides comprehensive PII detection with 60+ entity categories, fine-tuned specifically for privacy and compliance use cases.
|
49 |
+
|
50 |
+
### Installation
|
51 |
+
|
52 |
+
```bash
|
53 |
+
pip install gliner
|
54 |
+
```
|
55 |
+
|
56 |
+
### Quick Start
|
57 |
+
|
58 |
+
```python
|
59 |
+
from gliner import GLiNER
|
60 |
+
|
61 |
+
# Load the model (downloads automatically on first use)
|
62 |
+
model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")
|
63 |
+
|
64 |
+
text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
|
65 |
+
labels = ["name", "phone number", "account number"]
|
66 |
+
|
67 |
+
entities = model.predict_entities(text, labels, threshold=0.3)
|
68 |
+
|
69 |
+
for entity in entities:
|
70 |
+
    print(f"{entity['text']} => {entity['label']} (confidence: {entity['score']:.2f})")
|
71 |
+
```
|
72 |
+
|
73 |
+
Output:
|
74 |
+
```
|
75 |
+
John Smith => name (confidence: 0.95)
|
76 |
+
415-555-1234 => phone number (confidence: 0.92)
|
77 |
+
12345678 => account number (confidence: 0.88)
|
78 |
+
```
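
The results printed above are plain dictionaries, so they can be post-processed without another model call. The following is a minimal sketch, assuming each result carries the `text`, `label`, and `score` keys used in the Quick Start. The `group_by_label` helper and the hard-coded `entities` list are illustrative only and are not part of the gliner API:

```python
from collections import defaultdict

# Hypothetical stand-in for the list returned by model.predict_entities(...);
# the values are copied from the Quick Start output above.
entities = [
    {"text": "John Smith", "label": "name", "score": 0.95},
    {"text": "415-555-1234", "label": "phone number", "score": 0.92},
    {"text": "12345678", "label": "account number", "score": 0.88},
]


def group_by_label(entities, min_score=0.0):
    """Map each label to the texts found for it, dropping low-confidence hits."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


print(group_by_label(entities, min_score=0.9))
```

Grouping by label is convenient when each category is routed to a different handler, for example masking account numbers while only logging that a name was present.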
|
79 |
+
|
80 |
+
### Comprehensive PII Detection
|
81 |
+
|
82 |
+
The model is optimized for 60+ predefined PII categories, organized by domain below. It also works zero-shot, so you can pass any labels you need:
|
83 |
+
|
84 |
+
#### Personal Identifiers
|
85 |
+
|
86 |
+
```python
|
87 |
+
personal_labels = [
|
88 |
+
"name", # Full names
|
89 |
+
"first name", # First names
|
90 |
+
"last name", # Last names
|
91 |
+
"name medical professional", # Healthcare provider names
|
92 |
+
"dob", # Date of birth
|
93 |
+
"age", # Age information
|
94 |
+
"gender", # Gender identifiers
|
95 |
+
"marital status" # Marital status
|
96 |
+
]
|
97 |
+
```
|
98 |
+
|
99 |
+
#### Contact Information
|
100 |
+
|
101 |
+
```python
|
102 |
+
contact_labels = [
|
103 |
+
"email address", # Email addresses
|
104 |
+
"phone number", # Phone numbers
|
105 |
+
"ip address", # IP addresses
|
106 |
+
"url", # URLs
|
107 |
+
"location address", # Street addresses
|
108 |
+
"location street", # Street names
|
109 |
+
"location city", # City names
|
110 |
+
"location state", # State/province names
|
111 |
+
"location country", # Country names
|
112 |
+
"location zip" # ZIP/postal codes
|
113 |
+
]
|
114 |
+
```
|
115 |
+
|
116 |
+
#### Financial Information
|
117 |
+
|
118 |
+
```python
|
119 |
+
financial_labels = [
|
120 |
+
"account number", # Account numbers
|
121 |
+
"bank account", # Bank account numbers
|
122 |
+
"routing number", # Routing numbers
|
123 |
+
"credit card", # Credit card numbers
|
124 |
+
"credit card expiration", # Card expiration dates
|
125 |
+
"cvv", # CVV/security codes
|
126 |
+
"ssn", # Social Security Numbers
|
127 |
+
"money" # Monetary amounts
|
128 |
+
]
|
129 |
+
```
|
130 |
+
|
131 |
+
#### Healthcare Information
|
132 |
+
|
133 |
+
```python
|
134 |
+
healthcare_labels = [
|
135 |
+
"condition", # Medical conditions
|
136 |
+
"medical process", # Medical procedures
|
137 |
+
"drug", # Drugs
|
138 |
+
"dose", # Dosage information
|
139 |
+
"blood type", # Blood types
|
140 |
+
"injury", # Injuries
|
141 |
+
"organization medical facility",# Healthcare facility names
|
142 |
+
"healthcare number", # Healthcare numbers
|
143 |
+
"medical code" # Medical codes
|
144 |
+
]
|
145 |
+
```
|
146 |
+
|
147 |
+
#### Identification Documents
|
148 |
+
|
149 |
+
```python
|
150 |
+
id_labels = [
|
151 |
+
"passport number", # Passport numbers
|
152 |
+
"driver license", # Driver's license numbers
|
153 |
+
"username", # Usernames
|
154 |
+
"password", # Passwords
|
155 |
+
"vehicle id" # Vehicle IDs
|
156 |
+
]
|
157 |
+
```
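
The per-domain lists above can be combined into a single label set when one pass should cover several domains. This is a sketch under stated assumptions: the lists are abbreviated copies, `merge_labels` is a hypothetical helper, and lowercasing assumes labels are written in lowercase as in the lists above:

```python
# Abbreviated copies of the per-domain lists above; extend them as needed.
personal_labels = ["name", "dob", "age"]
contact_labels = ["email address", "phone number", "location address"]
financial_labels = ["account number", "credit card", "ssn"]


def merge_labels(*label_groups):
    """Concatenate label lists, dropping duplicates but keeping first-seen order."""
    seen = set()
    merged = []
    for group in label_groups:
        for label in group:
            key = label.strip().lower()  # assumption: labels are lowercase strings
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


all_labels = merge_labels(personal_labels, contact_labels, financial_labels, ["name", "SSN"])
print(all_labels)
```

The merged list can then be passed as the labels argument of `model.predict_entities`, in the same way as the shorter lists in the Quick Start.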
|
158 |
+
|
159 |
+
### Advanced Usage Examples
|
160 |
+
|
161 |
+
#### Multi-Category Detection
|
162 |
+
```python
|
163 |
+
text = """
|
164 |
+
Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024
|
165 |
+
from St. Mary's Hospital. Contact: [email protected], (555) 123-4567.
|
166 |
+
Insurance policy: POL-789456123.
|
167 |
+
"""
|
168 |
+
|
169 |
+
labels = [
|
170 |
+
"name", "dob", "discharge date", "organization medical facility",
|
171 |
+
"email address", "phone number", "policy number"
|
172 |
+
]
|
173 |
+
|
174 |
+
entities = model.predict_entities(text, labels, threshold=0.3)
|
175 |
+
|
176 |
+
for entity in entities:
|
177 |
+
    print(f"Found '{entity['text']}' as {entity['label']}")
|
178 |
+
```
|
179 |
+
|
180 |
+
#### Batch Processing for High Throughput
|
181 |
+
```python
|
182 |
+
documents = [
|
183 |
+
"Customer John called about his credit card ending in 4532.",
|
184 |
+
"Sarah's SSN 123-45-6789 needs verification.",
|
185 |
+
"Email [email protected] for account 987654321 issues."
|
186 |
+
]
|
187 |
+
|
188 |
+
labels = ["name", "credit card", "ssn", "email address", "account number"]
|
189 |
+
|
190 |
+
# Process multiple documents efficiently
|
191 |
+
results = model.run(documents, labels, threshold=0.3, batch_size=8)
|
192 |
+
|
193 |
+
for doc_idx, entities in enumerate(results):
|
194 |
+
    print(f"\nDocument {doc_idx + 1}:")
|
195 |
+
    for entity in entities:
|
196 |
+
        print(f" {entity['text']} => {entity['label']}")
|
197 |
+
```
|
198 |
+
|
199 |
+
#### Custom Entity Detection
|
200 |
+
```python
|
201 |
+
# GLiNER isn't limited to PII - you can detect any entities
|
202 |
+
text = "The MacBook Pro with M2 chip costs $1,999 at the Apple Store in Manhattan."
|
203 |
+
custom_labels = ["product", "processor", "price", "store", "location"]
|
204 |
+
|
205 |
+
entities = model.predict_entities(text, custom_labels, threshold=0.3)
|
206 |
+
```
|
207 |
+
|
208 |
+
#### Threshold Optimization
|
209 |
+
```python
|
210 |
+
# Lower threshold: Higher recall, more false positives
|
211 |
+
high_recall = model.predict_entities(text, labels, threshold=0.2)
|
212 |
+
|
213 |
+
# Higher threshold: Higher precision, fewer false positives
|
214 |
+
high_precision = model.predict_entities(text, labels, threshold=0.6)
|
215 |
+
|
216 |
+
# Recommended starting point for production
|
217 |
+
balanced = model.predict_entities(text, labels, threshold=0.3)
|
218 |
+
```
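
One way to choose between these thresholds is to score a small labelled sample at several cut-offs. The sketch below is self-contained and does not call the model: `predictions` and `gold` are made-up data, `precision_recall` is a hypothetical helper, and keeping spans with `score >= threshold` is an assumption about how the cut-off is applied:

```python
def precision_recall(predictions, gold, threshold):
    """Score predictions against gold spans at a given confidence cut-off.

    predictions: list of (start, end, label, score) tuples
    gold: set of (start, end, label) tuples
    """
    kept = {(s, e, label) for s, e, label, score in predictions if score >= threshold}
    true_positives = len(kept & gold)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Illustrative data: two correct spans and one low-confidence false positive.
predictions = [
    (0, 10, "name", 0.95),
    (23, 35, "phone number", 0.55),
    (40, 44, "name", 0.25),
]
gold = {(0, 10, "name"), (23, 35, "phone number")}

for threshold in (0.2, 0.3, 0.6):
    p, r = precision_recall(predictions, gold, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On real data the predictions would come from one low-threshold call to the model, so the sweep itself costs no extra inference.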
|
219 |
+
|
220 |
+
## Use Cases
|
221 |
+
|
222 |
+
GLiNER excels in privacy-focused applications where traditional cloud-based NER services pose compliance risks.
|
223 |
+
|
224 |
+
### **Primary Applications**
|
225 |
+
|
226 |
+
#### Privacy-First Voice & Transcription
|
227 |
+
```python
|
228 |
+
# Automatically redact PII from voice transcriptions
|
229 |
+
transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"
|
230 |
+
pii_labels = ["name", "phone number", "email address", "ssn"]
|
231 |
+
|
232 |
+
entities = model.predict_entities(transcription, pii_labels)
|
233 |
+
# Redact or anonymize detected PII before storage
|
234 |
+
```
|
235 |
+
|
236 |
+
#### Compliance-Ready Document Processing
|
237 |
+
```python
|
238 |
+
# Healthcare: HIPAA-compliant note processing
|
239 |
+
medical_note = "Patient John Doe, MRN 123456, diagnosed with diabetes..."
|
240 |
+
phi_labels = ["name", "medical record number", "condition", "dob"]
|
241 |
+
|
242 |
+
# Finance: PCI-DSS compliant transaction logs
|
243 |
+
transaction_log = "Card ****4532 charged $299.99 to John Smith"
|
244 |
+
pci_labels = ["credit card", "money", "name"]
|
245 |
+
|
246 |
+
# Legal: Attorney-client privilege protection
|
247 |
+
legal_doc = "Client Jane Doe vs. Corporation ABC, case #2024-CV-001"
|
248 |
+
legal_labels = ["name", "organization", "case number"]
|
249 |
+
```
|
250 |
+
|
251 |
+
#### Real-Time Data Anonymization
|
252 |
+
```python
|
253 |
+
def anonymize_text(text, entity_types):
|
254 |
+
    """Anonymize PII in real-time"""
|
255 |
+
    entities = model.predict_entities(text, entity_types)
|
256 |
+
|
257 |
+
    # Sort by position to replace from end to start
|
258 |
+
    entities.sort(key=lambda x: x['start'], reverse=True)
|
259 |
+
|
260 |
+
    anonymized = text
|
261 |
+
    for entity in entities:
|
262 |
+
        placeholder = f"<{entity['label'].upper()}>"
|
263 |
+
        anonymized = anonymized[:entity['start']] + placeholder + anonymized[entity['end']:]
|
264 |
+
|
265 |
+
    return anonymized
|
266 |
+
|
267 |
+
original = "John Smith's SSN is 123-45-6789"
|
268 |
+
anonymized = anonymize_text(original, ["name", "ssn"])
|
269 |
+
print(anonymized) # "<NAME>'s SSN is <SSN>"
|
270 |
+
```
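
Replacing from the end of the string, as above, relies on the detected spans not overlapping. If overlaps can occur in your setup (for example when results from several calls are merged), one option is to keep only the highest-scoring span of each overlapping group before replacing. The sketch below assumes entities carry `start`, `end`, `label`, and `score` keys; `span`, `drop_overlaps`, and `redact` are hypothetical helpers, and the entity list is hand-written rather than model output:

```python
def span(text, substring, label, score):
    """Build an entity dict for the first occurrence of substring (illustrative)."""
    start = text.index(substring)
    return {"start": start, "end": start + len(substring), "label": label, "score": score}


def drop_overlaps(entities):
    """Greedily keep the highest-scoring spans, skipping any that overlap a kept one."""
    kept = []
    for entity in sorted(entities, key=lambda e: e["score"], reverse=True):
        if all(entity["end"] <= k["start"] or entity["start"] >= k["end"] for k in kept):
            kept.append(entity)
    return sorted(kept, key=lambda e: e["start"])


def redact(text, entities):
    """Replace each non-overlapping span with a <LABEL> placeholder, end to start."""
    for entity in sorted(drop_overlaps(entities), key=lambda e: e["start"], reverse=True):
        placeholder = f"<{entity['label'].upper()}>"
        text = text[:entity["start"]] + placeholder + text[entity["end"]:]
    return text


original = "John Smith's SSN is 123-45-6789"
found = [
    span(original, "John Smith", "name", 0.95),
    span(original, "John", "first name", 0.60),  # overlaps the full name
    span(original, "123-45-6789", "ssn", 0.90),
]
print(redact(original, found))
```

The greedy rule is a design choice: preferring the longest span instead of the highest score is an equally reasonable policy for redaction, where over-masking is usually safer than under-masking.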
|
271 |
+
|
272 |
+
### **Extended Applications**
|
273 |
+
|
274 |
+
#### Enhanced Search & Content Understanding
|
275 |
+
```python
|
276 |
+
# Extract key entities from user queries for better search
|
277 |
+
query = "Find restaurants near Stanford University in Palo Alto"
|
278 |
+
search_entities = ["organization", "location city", "business type"]
|
279 |
+
|
280 |
+
# Intelligent document tagging
|
281 |
+
document = "This quarterly report discusses Microsoft's Azure growth..."
|
282 |
+
doc_entities = ["organization", "product", "time period"]
|
283 |
+
```
|
284 |
+
|
285 |
+
#### GDPR-Compliant Chatbot Logs
|
286 |
+
```python
|
287 |
+
def sanitize_chat_log(message):
|
288 |
+
    """Remove PII from chat logs per GDPR requirements"""
|
289 |
+
    sensitive_types = [
|
290 |
+
        "name", "email address", "phone number", "location address",
|
291 |
+
        "credit card", "ssn", "passport number"
|
292 |
+
    ]
|
293 |
+
|
294 |
+
    entities = model.predict_entities(message, sensitive_types)
|
295 |
+
    if entities:
|
296 |
+
        # Log anonymized version, alert compliance team
|
297 |
+
        return anonymize_text(message, sensitive_types)
|
298 |
+
    return message
|
299 |
+
```
|
300 |
+
|
301 |
+
#### Secure Mobile & Edge Processing
|
302 |
+
```python
|
303 |
+
# Process sensitive data entirely on-device
|
304 |
+
def process_locally(user_input):
|
305 |
+
    """Process PII detection without cloud APIs"""
|
306 |
+
    pii_types = ["name", "phone number", "email address", "ssn", "credit card"]
|
307 |
+
|
308 |
+
    # All processing happens locally - no data leaves device
|
309 |
+
    detected_pii = model.predict_entities(user_input, pii_types)
|
310 |
+
|
311 |
+
    if detected_pii:
|
312 |
+
        return "⚠️ Sensitive information detected - proceed with caution"
|
313 |
+
    return "✅ No PII detected - safe to share"
|
314 |
+
```
|
315 |
+
|
316 |
+
## Performance Benchmarks
|
317 |
+
|
318 |
+
### Accuracy Evaluation
|
319 |
+
|
320 |
+
The following benchmarks were run on the **synthetic-multi-pii-ner-v1** dataset.
|
321 |
+
We compare multiple GLiNER-based PII models, including our new **Knowledgator GLiNER PII Edge v1.0**.
|
322 |
+
|
323 |
+
| Model Path | Precision | Recall | F1 Score |
|
324 |
+
| ---------------------------------------------------------------------- | --------- | ------ | ---------- |
|
325 |
+
| **knowledgator/gliner-pii-edge-v1.0** | 78.96% | 72.34% | **75.50%** |
|
326 |
+
| **knowledgator/gliner-pii-small-v1.0** | 78.99% | 74.80% | **76.84%** |
|
327 |
+
| **knowledgator/gliner-pii-base-v1.0** | 79.28% | 82.78% | **80.99%** |
|
328 |
+
| **knowledgator/gliner-pii-large-v1.0** | 87.42% | 79.4% | **83.25%** |
|
329 |
+
| **urchade/gliner\_multi\_pii-v1** | 79.19% | 74.67% | **76.86%** |
|
330 |
+
| **E3-JSI/gliner-multi-pii-domains-v1** | 78.35% | 74.46% | **76.36%** |
|
331 |
+
| **gravitee-io/gliner-pii-detection** | 81.27% | 56.76% | **66.84%** |
|
332 |
+
|
333 |
+
### Key Takeaways
|
334 |
+
|
335 |
+
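* **Large Model** (`knowledgator/gliner-pii-large-v1.0`) achieves the **highest F1 score (83.25%)** and the highest precision (87.42%), while the **Base Model** (`knowledgator/gliner-pii-base-v1.0`) has the **highest recall (82.78%)** with an F1 score of 80.99%.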
|
336 |
+
* **Knowledgator Edge Model** (`knowledgator/gliner-pii-edge-v1.0`) is optimized for **edge environments**, trading a slight decrease in recall for lower latency and footprint.
|
337 |
+
* **Gravitee-io Model** shows strong precision but lower recall, indicating it is tuned for high confidence but misses more entities.
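
The F1 column can be cross-checked from the precision and recall columns. The sketch below assumes the table reports the standard F1, i.e. the harmonic mean of precision and recall; `f1_score` is a local helper, and the pairs are copied from three rows of the table:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, copied from the benchmark table above.
rows = {
    "knowledgator/gliner-pii-edge-v1.0": (78.96, 72.34),
    "knowledgator/gliner-pii-base-v1.0": (79.28, 82.78),
    "gravitee-io/gliner-pii-detection": (81.27, 56.76),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}%")
```

Because the table values are rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit.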
|
338 |
+
|
339 |
+
|
340 |
+
### Comparison with Alternatives
|
341 |
+
|
342 |
+
| Solution | Speed | Privacy | Accuracy | Flexibility | Cost |
|
343 |
+
| --------------------- | ----- | ------- | -------- | ----------- | -------- |
|
344 |
+
| **GLiNER** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Free |
|
345 |
+
| Cloud NER APIs | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | \$\$\$ |
|
346 |
+
| Large Language Models | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | \$\$\$\$ |
|
347 |
+
| Traditional NER | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | Free |
|
348 |
+
|
349 |
+
|
350 |
+
## Alternative Implementations
|
351 |
+
|
352 |
+
While Python provides the most comprehensive PII detection capabilities, GLiNER is available across multiple languages for different deployment scenarios.
|
353 |
+
|
354 |
+
### Rust Implementation (gline-rs)
|
355 |
+
|
356 |
+
**Best for**: High-performance backend services, microservices
|
357 |
+
|
358 |
+
```toml
|
359 |
+
[dependencies]
|
360 |
+
"gline-rs" = "1"
|
361 |
+
```
|
362 |
+
|
363 |
+
```rust
|
364 |
+
use gline_rs::{GLiNER, TextInput, Parameters, RuntimeParameters};
|
365 |
+
|
366 |
+
let model = GLiNER::<TokenMode>::new(
|
367 |
+
Parameters::default(),
|
368 |
+
RuntimeParameters::default(),
|
369 |
+
"tokenizer.json",
|
370 |
+
"model.onnx",
|
371 |
+
)?;
|
372 |
+
|
373 |
+
let input = TextInput::from_str(
|
374 |
+
&["My name is James Bond."],
|
375 |
+
&["person"],
|
376 |
+
)?;
|
377 |
+
|
378 |
+
let output = model.inference(input)?;
|
379 |
+
```
|
380 |
+
|
381 |
+
**Performance**: 4x faster than Python on CPU, 37x faster with GPU acceleration.
|
382 |
+
|
383 |
+
### C++ Implementation (GLiNER.cpp)
|
384 |
+
|
385 |
+
**Best for**: Embedded systems, mobile apps, edge devices
|
386 |
+
|
387 |
+
```cpp
|
388 |
+
#include "GLiNER/model.hpp"
|
389 |
+
|
390 |
+
gliner::Config config{12, 512};
|
391 |
+
gliner::Model model("./model.onnx", "./tokenizer.json", config);
|
392 |
+
|
393 |
+
std::vector<std::string> texts = {"John works at Microsoft"};
|
394 |
+
std::vector<std::string> entities = {"person", "organization"};
|
395 |
+
|
396 |
+
auto output = model.inference(texts, entities);
|
397 |
+
```
|
398 |
+
|
399 |
+
### JavaScript Implementation (GLiNER.js)
|
400 |
+
|
401 |
+
**Best for**: Web applications, browser-based processing
|
402 |
+
|
403 |
+
```bash
|
404 |
+
npm install gliner
|
405 |
+
```
|
406 |
+
|
407 |
+
```javascript
|
408 |
+
import { Gliner } from 'gliner';
|
409 |
+
|
410 |
+
const gliner = new Gliner({
|
411 |
+
tokenizerPath: "onnx-community/gliner_small-v2",
|
412 |
+
onnxSettings: {
|
413 |
+
modelPath: "public/model.onnx",
|
414 |
+
executionProvider: "webgpu",
|
415 |
+
}
|
416 |
+
});
|
417 |
+
|
418 |
+
await gliner.initialize();
|
419 |
+
|
420 |
+
const results = await gliner.inference({
|
421 |
+
texts: ["John Smith works at Microsoft"],
|
422 |
+
entities: ["person", "organization"],
|
423 |
+
threshold: 0.1,
|
424 |
+
});
|
425 |
+
```
|
426 |
+
|
427 |
+
## Model Architecture & Training
|
428 |
+
|
429 |
+
### Quantization-Aware Pretraining
|
430 |
+
|
431 |
+
GLiNER models use quantization-aware pretraining, so accuracy is largely preserved when the weights are quantized. This keeps inference efficient with the FP16 and UINT8 ONNX exports listed below.
|
432 |
+
|
433 |
+
### Available ONNX Formats
|
434 |
+
|
435 |
+
| Format | Size | Use Case |
|
436 |
+
|--------|------|----------|
|
437 |
+
| **FP16** | 330MB | Balanced performance/accuracy |
|
438 |
+
| **UINT8** | 197MB | Maximum efficiency |
|
439 |
+
|
440 |
+
### Model Conversion
|
441 |
+
|
442 |
+
```bash
|
443 |
+
python convert_to_onnx.py \
|
444 |
+
--model_path knowledgator/gliner-pii-base-v1.0 \
|
445 |
+
--save_path ./model \
|
446 |
+
--quantize True # For UINT8 quantization
|
447 |
+
```
|
448 |
+
|
449 |
+
|
450 |
+
## References
|
451 |
+
|
452 |
+
- [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)
|
453 |
+
- [GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks](https://arxiv.org/abs/2406.12925)
|
454 |
+
- [Named Entity Recognition as Structured Span Prediction](https://arxiv.org/abs/2212.13415)
|
455 |
+
|
456 |
+
## Acknowledgments
|
457 |
+
|
458 |
+
Special thanks to all GLiNER contributors and the Wordcab team, and additional thanks to the maintainers of the Rust, C++, and JavaScript implementations.
|
459 |
+
|
460 |
+
## Support
|
461 |
+
|
462 |
+
- **Hugging Face**: [Ihor/gliner-pii-small](https://huggingface.co/Ihor/gliner-pii-small)
|
463 |
+
- **GitHub Issues**: [Report bugs and request features](https://github.com/info-wordcab/wordcab-pii)
|
464 |
+
- **Discord**: [Join community discussions](https://discord.gg/wRF7tuY9)
|
465 |
+
|
466 |
+
---
|
467 |
+
|
468 |
+
*GLiNER: Open-source privacy-first entity recognition for production applications.*
|