Fsoft-AIC/pii-phi

Overview

pii-phi is a fine-tuned version of Phi-3.5-mini-instruct designed to extract Personally Identifiable Information (PII) from unstructured text. The model outputs PII entities in a structured JSON format according to strict schema guidelines.

Training Prompt Format

# GUIDELINES
- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.

# EXPECTED OUTPUT
- The json output must be in the format below:
{
    "result": [
        {"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
        ...
    ]
}

Supported Entities

PERSON_NAME
BUSINESS_NAME
API_KEY
USERNAME
API_ENDPOINT
WEBSITE_ADDRESS
PHONE_NUMBER
EMAIL_ADDRESS
ID
PASSWORD
ADDRESS
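
Because the output schema and the entity list are fixed, the model's response can be checked before it is used downstream. The following is a minimal sketch, not part of the model or of vLLM: `parse_pii_output` is a hypothetical helper name, and the sample string is hand-written in the documented format rather than real model output.

```python
import json

# Entity types listed in this model card.
SUPPORTED_ENTITIES = {
    "PERSON_NAME", "BUSINESS_NAME", "API_KEY", "USERNAME", "API_ENDPOINT",
    "WEBSITE_ADDRESS", "PHONE_NUMBER", "EMAIL_ADDRESS", "ID", "PASSWORD",
    "ADDRESS",
}


def parse_pii_output(raw: str) -> list[dict]:
    """Parse the model's JSON text and keep only well-formed, supported items.

    Hypothetical helper. Returns an empty list when the text is not valid
    JSON or does not follow the {"result": [...]} schema.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    items = data.get("result", [])
    if not isinstance(items, list):
        return []

    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        entity, value = item.get("entity"), item.get("value")
        # Keep the item only if both keys are strings and the type is known.
        if (
            isinstance(entity, str)
            and entity in SUPPORTED_ENTITIES
            and isinstance(value, str)
        ):
            valid.append({"entity": entity, "value": value})
    return valid


# Hand-written sample in the documented format, not real model output.
sample = (
    '{"result": ['
    '{"entity": "PERSON_NAME", "value": "James Jake"}, '
    '{"entity": "ID", "value": "123123123"}]}'
)
print(parse_pii_output(sample))
```

Returning an empty list on malformed output is one possible design choice; a stricter pipeline might raise an error instead so that failed extractions are not mistaken for text without PII.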

Intended Use

The model is intended for PII detection in text documents to support tasks such as data anonymization, compliance, and security auditing.
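
As an illustration of the anonymization use case, the sketch below replaces each extracted value with a placeholder naming its entity type. It assumes the entities have already been parsed into a list of `{"entity": ..., "value": ...}` dictionaries; `redact` is a hypothetical helper, and the entity list here is hand-written rather than real model output.

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each extracted value in text with an [ENTITY_TYPE] placeholder.

    Hypothetical helper for illustration only.
    """
    # Replace longer values first so that a short value contained in a
    # longer one does not split the longer value apart.
    ordered = sorted(entities, key=lambda e: len(e["value"]), reverse=True)
    for item in ordered:
        value = item["value"]
        if value:  # skip empty strings, which str.replace would match everywhere
            text = text.replace(value, f"[{item['entity']}]")
    return text


text = "I am James Jake and my employee number is 123123123"
# Hand-written entities in the documented format, not real model output.
entities = [
    {"entity": "PERSON_NAME", "value": "James Jake"},
    {"entity": "ID", "value": "123123123"},
]
print(redact(text, entities))
# I am [PERSON_NAME] and my employee number is [ID]
```

Plain string replacement is the simplest approach but has edge cases: a value that happens to match part of an already inserted placeholder would be replaced again, so a production pipeline would more likely work with character offsets.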

Limitations

The model is not guaranteed to detect all forms of PII in every context.
It may return false positives or omit contextually relevant information.
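
One inexpensive safeguard against a specific kind of false positive is to check that every extracted value actually occurs in the input text. The sketch below is an assumption about how such a check could look, not something the model card prescribes; `split_by_presence` is a hypothetical helper, and the entity list is hand-written. It cannot detect missed PII or a value assigned to the wrong entity type.

```python
def split_by_presence(
    text: str, entities: list[dict]
) -> tuple[list[dict], list[dict]]:
    """Split entities into those whose value occurs verbatim in text and the rest.

    Hypothetical helper for illustration only.
    """
    found, missing = [], []
    for item in entities:
        value = item["value"]
        # An empty value is treated as missing, since "" is in every string.
        if value and value in text:
            found.append(item)
        else:
            missing.append(item)
    return found, missing


text = "I am James Jake and my employee number is 123123123"
# Hand-written entities; the second one does not occur in the text.
entities = [
    {"entity": "PERSON_NAME", "value": "James Jake"},
    {"entity": "EMAIL_ADDRESS", "value": "james@example.com"},
]
found, missing = split_by_presence(text, entities)
print("found:", found)
print("missing:", missing)
```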

Installation

Install the vllm package to run the model efficiently:

pip install vllm

Example:

from vllm import LLM, SamplingParams

# Load the fine-tuned model by its Hugging Face repository name.
llm = LLM(model="Fsoft-AIC/pii-phi")

system_prompt = """
# GUIDELINES
- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.

# EXPECTED OUTPUT
- The json output must be in the format below:
{
    "result": [
        {"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
        ...
    ]
}
"""
# Example input text containing PII.
pii_message = "I am James Jake and my employee number is 123123123"

# temperature=0 selects greedy decoding; max_tokens caps the response length.
sampling_params = SamplingParams(temperature=0, max_tokens=1000)
outputs = llm.chat(
    [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": pii_message},
    ],
    sampling_params,
)

# Print the generated JSON text for each returned output.
for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)