gravitee-io/bert-small-pii-detection πŸš€

A compact PII detector (~28.5M parameters) fine-tuned from prajjwal1/bert-small on the datasets described in the metadata.

About the dataset:

We combined several datasets to cover a wide range of document formats, such as:

  1. JSON
  2. HTML
  3. XML
  4. SQL
  5. Free-text documents

Label Set

AGE, COORDINATE, CREDIT_CARD, DATE_TIME, EMAIL_ADDRESS, FINANCIAL, IBAN_CODE, IMEI,
IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NUMBER,
TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
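
The checkpoint's config stores the exact tag inventory it predicts (the types above, possibly with B-/I- prefixes for token-level tagging); a minimal sketch to inspect it:

from transformers import AutoConfig

# Inspect the label mapping shipped with the checkpoint
config = AutoConfig.from_pretrained("gravitee-io/bert-small-pii-detection")
print(config.id2label)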

How to Use

Quick start (pipeline)

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo = "gravitee-io/bert-small-pii-detection"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(repo)

pipe = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
# Illustrative example text containing several PII types
text = "Contact Jane Doe at jane.doe@example.com or +1-202-555-0173."
print(pipe(text))
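
ONNX Runtime (optional)

The repository is also tagged with ONNX; below is a minimal sketch of loading it through ONNX Runtime via the optimum package. This assumes optimum[onnxruntime] is installed; export=True converts the PyTorch weights on the fly if no ONNX file is bundled with the repo.

from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer, pipeline

repo = "gravitee-io/bert-small-pii-detection"
tok = AutoTokenizer.from_pretrained(repo)
# export=True falls back to an on-the-fly export if the repo ships no ONNX weights (assumption)
ort_model = ORTModelForTokenClassification.from_pretrained(repo, export=True)

ort_pipe = pipeline("token-classification", model=ort_model, tokenizer=tok, aggregation_strategy="simple")
print(ort_pipe("My IBAN is DE89 3704 0044 0532 0130 00."))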

Evaluation

Metrics: precision, recall, and F1 per entity, with micro, macro, and weighted averages

Entity Precision Recall F1-score Support
AGE 0.9898 0.8858 0.9349 219
COORDINATE 0.9627 0.8738 0.9161 325
CREDIT_CARD 0.9273 0.8870 0.9067 115
DATE_TIME 0.8598 0.7364 0.7933 3255
EMAIL_ADDRESS 0.9428 0.8941 0.9178 387
FINANCIAL 0.9862 0.9565 0.9711 299
IBAN_CODE 0.9577 0.9252 0.9412 147
IMEI 0.9885 0.9663 0.9773 89
IP_ADDRESS 0.9338 0.8812 0.9068 160
LOCATION 0.8849 0.8222 0.8524 4264
MAC_ADDRESS 0.9889 1.0000 0.9944 89
NRP 1.0000 0.9818 0.9908 494
ORGANIZATION 0.7454 0.6688 0.7051 3551
PASSWORD 0.8384 0.8137 0.8259 102
PERSON 0.9123 0.8826 0.8972 4454
PHONE_NUMBER 0.9462 0.8199 0.8785 322
TITLE 0.9887 0.9734 0.9810 451
URL 1.0000 0.9787 0.9892 188
US_BANK_NUMBER 1.0000 0.9579 0.9785 95
US_DRIVER_LICENSE 0.9167 0.9167 0.9167 120
US_ITIN 0.9659 0.8763 0.9189 97
US_LICENSE_PLATE 1.0000 0.9000 0.9474 90
US_PASSPORT 0.9200 0.9200 0.9200 100
US_SSN 0.9744 0.9580 0.9661 119
micro avg 0.8804 0.8141 0.8460 19532
macro avg 0.9429 0.8948 0.9178 19532
weighted avg 0.8785 0.8141 0.8446 19532
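
The per-entity scores above are standard entity-level metrics; a minimal sketch of how such a report can be reproduced with seqeval, assuming BIO-style gold and predicted tag sequences from a held-out set (the sequences below are illustrative):

from seqeval.metrics import classification_report

# Illustrative gold and predicted BIO tag sequences; in practice these come
# from running the model over a labeled evaluation set
y_true = [["O", "B-PERSON", "I-PERSON", "O", "B-EMAIL_ADDRESS"]]
y_pred = [["O", "B-PERSON", "I-PERSON", "O", "B-EMAIL_ADDRESS"]]

print(classification_report(y_true, y_pred, digits=4))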

Intended Uses & Limitations

Use this model for:

  • Low-resource environments
  • Redacting PII in customer support logs, dev/test environments, API traces, and articles (see the redaction sketch after this list)
  • Real-time hints in form fields or data entry systems
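
For the redaction use case, here is a minimal sketch that masks detected spans in place, assuming the pipe object from the quick start above (the placeholder format and score threshold are illustrative):

def redact(text, pii_pipe, score_threshold=0.5):
    """Replace each detected PII span with an [ENTITY_TYPE] placeholder."""
    entities = pii_pipe(text)
    # Replace spans from the end of the string so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] >= score_threshold:
            text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(redact("Contact Jane Doe at jane.doe@example.com.", pipe))
# e.g. "Contact [PERSON] at [EMAIL_ADDRESS]." if both spans are detected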

Limitations:

  • English-focused; performance will degrade on other languages
  • Domain drift is real: audit the model on your own data before relying on it

Citation

If you use this model, please consider citing the following papers and datasets:

@misc{bhargava2021generalization,
      title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
      author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
      year={2021},
      eprint={2110.01518},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@article{DBLP:journals/corr/abs-1908-08962,
  author    = {Iulia Turc and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {Well-Read Students Learn Better: The Impact of Student Initialization
               on Knowledge Distillation},
  journal   = {CoRR},
  volume    = {abs/1908.08962},
  year      = {2019},
  url       = {http://arxiv.org/abs/1908.08962},
  eprinttype = {arXiv},
  eprint    = {1908.08962},
  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@online{kilimnik2022privy,
  author = {Benjamin Kilimnik},
  title = {{Privy} Synthetic PII Protocol Trace Dataset},
  year = 2022,
  url = {https://huggingface.co/datasets/beki/privy},
}

@online{gretel2023,
  author = {Gretel.ai},
  title = {{Synthetic PII Finance Multilingual Dataset}},
  year = 2023,
  url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
}

@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
    title = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
    author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
    booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
    year = "2003",
    url = "https://aclanthology.org/W03-0419",
}