BANonymizer-PL

This model is a fine-tuned version of HerBERT-large-cased, a Polish language model developed by Allegro. It is specialized in anonymizing sensitive and personal information in Polish texts.

Training and Purpose

The model was fine-tuned on the BAN-PL dataset, which contains over 20,000 manually labeled training examples and a test set of more than 2,000 examples. It is designed to detect and anonymize entities such as pseudonyms and surnames, excluding those of deceased individuals, historical figures, and fictional characters.

Applications

This model is particularly useful for privacy-preserving tasks, such as anonymizing datasets for research purposes. Unlike other publicly available tools that primarily focus on surnames, this model uniquely handles both surnames and pseudonyms, enhancing its utility in various anonymization workflows.

Usage

Example code:

from transformers import pipeline

model_name = "NASK-PIB/BANonymizer-PL"
# Build a token-classification (NER) pipeline; "simple" aggregation merges
# sub-word tokens into whole entity spans.
ner = pipeline(
    "token-classification",
    model=model_name,
    aggregation_strategy="simple",
)

text = "Pan Kowalski, znany jako 'Cichy', mieszka w Warszawie"
result = ner(text)

print(result)
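
With aggregation_strategy="simple", the pipeline returns a list of entity spans, each with character offsets (start, end) and an entity_group label. Below is a minimal sketch of one way to redact the detected spans by replacing each with its label in brackets; the anonymize helper is illustrative, and the exact label names depend on the model's tag set, which is not specified here.

def anonymize(text, entities):
    # Replace spans from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(anonymize(text, result))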

License

Apache-2.0

Citation

If you use this model, please cite the following paper:

@misc{kołos2024banpl,
      title={BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service}, 
      author={Anna Kołos and Inez Okulska and Kinga Głąbińska and Agnieszka Karlińska and Emilia Wiśnios and Paweł Ellerik and Andrzej Prałat},
      year={2024},
      eprint={2308.10592},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}