
Overview

pii-phi is a fine-tuned version of Phi-3.5-mini-instruct designed to extract Personally Identifiable Information (PII) from unstructured text. The model outputs PII entities in a structured JSON format according to strict schema guidelines.

Training Prompt Format

# GUIDELINES
- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.

# EXPECTED OUTPUT
- The json output must be in the format below:
{
    "result": [
        {"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
        ...
    ]
}
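
The output format above can be consumed with a small parser. The following is a minimal sketch, not part of the model's own tooling: the function name `parse_pii_output` and the choice to silently drop malformed or unsupported items are assumptions made for illustration.

```python
import json

# Entity types listed in the guidelines above.
SUPPORTED_ENTITIES = {
    "PERSON_NAME", "BUSINESS_NAME", "API_KEY", "USERNAME", "API_ENDPOINT",
    "WEBSITE_ADDRESS", "PHONE_NUMBER", "EMAIL_ADDRESS", "ID", "PASSWORD",
    "ADDRESS",
}


def parse_pii_output(generated_text):
    """Return (entity, value) pairs from the model output, or [] if it is malformed."""
    try:
        payload = json.loads(generated_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    items = payload.get("result", [])
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        entity, value = item.get("entity"), item.get("value")
        # Keep only well-formed items with a supported entity type.
        if isinstance(entity, str) and isinstance(value, str) and entity in SUPPORTED_ENTITIES:
            pairs.append((entity, value))
    return pairs


# Hand-written sample in the documented format (not actual model output).
sample = '{"result": [{"entity": "PERSON_NAME", "value": "James Jake"}]}'
print(parse_pii_output(sample))
```

Returning an empty list on malformed input is one possible design choice; a stricter pipeline might raise an error instead so that unparseable generations are not mistaken for text without PII.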

Supported Entities

  • PERSON_NAME
  • BUSINESS_NAME
  • API_KEY
  • USERNAME
  • API_ENDPOINT
  • WEBSITE_ADDRESS
  • PHONE_NUMBER
  • EMAIL_ADDRESS
  • ID
  • PASSWORD
  • ADDRESS

Intended Use

The model is intended for PII detection in text documents to support tasks such as data anonymization, compliance, and security auditing.
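
For the anonymization use case, extracted entities can be turned into placeholders in the original text. This is a hedged sketch under stated assumptions: the helper `redact`, the `[ENTITY_TYPE]` placeholder style, and the hand-written entity list are all hypothetical and are not provided by the model or its repository.

```python
def redact(text, entities):
    """Replace each extracted value in text with a [ENTITY_TYPE] placeholder."""
    # Replace longer values first so a short value never splits a longer one.
    ordered = sorted(entities, key=lambda pair: len(pair[1]), reverse=True)
    for entity, value in ordered:
        if value:  # an empty value would match everywhere, so skip it
            text = text.replace(value, f"[{entity}]")
    return text


# Hand-written entities for illustration (not actual model output).
entities = [("PERSON_NAME", "James Jake"), ("ID", "123123123")]
message = "I am James Jake and my employee number is 123123123"
print(redact(message, entities))
# I am [PERSON_NAME] and my employee number is [ID]
```

Plain string replacement redacts every occurrence of a value, which is usually what anonymization needs, but it cannot tell apart two identical strings that play different roles in the text.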

Limitations

  • Not guaranteed to detect all forms of PII in every context.
  • May return false positives or omit contextually relevant information.
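
One class of false positive can be caught cheaply: an extracted value that does not occur in the input text at all. The sketch below assumes entities are held as `(entity, value)` pairs; the helper name `keep_grounded` is hypothetical. It does not address missed entities or values that appear in the text but are mislabeled.

```python
def keep_grounded(text, entities):
    """Keep only (entity, value) pairs whose value appears verbatim in text."""
    return [(entity, value) for entity, value in entities if value and value in text]


# Hand-written entities for illustration; the second value is not in the text.
message = "I am James Jake and my employee number is 123123123"
candidates = [("PERSON_NAME", "James Jake"), ("EMAIL_ADDRESS", "jake@example.com")]
print(keep_grounded(message, candidates))
# [('PERSON_NAME', 'James Jake')]
```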

Installation

Install the vllm package to run the model efficiently:

pip install vllm

Example:

from vllm import LLM, SamplingParams

# Load the model weights from the Hugging Face Hub.
llm = LLM("Fsoft-AIC/pii-phi")

system_prompt = """
# GUIDELINES
- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.

# EXPECTED OUTPUT
- The json output must be in the format below:
{
    "result": [
        {"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
        ...
    ]
}
"""
pii_message = "I am James Jake and my employee number is 123123123"

# Greedy decoding (temperature=0) keeps the extraction repeatable.
sampling_params = SamplingParams(temperature=0, max_tokens=1000)
outputs = llm.chat(
    [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": pii_message},
    ],
    sampling_params,
)

# Print the generated JSON for each request.
for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
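
To process several documents, one conversation can be built per input text, each pairing the same system prompt with a different user message. This is a small sketch; the helper `build_conversations` is hypothetical, and whether `llm.chat` accepts a list of conversations in a single call depends on the installed vLLM version, so treat that part as an assumption to verify.

```python
def build_conversations(system_prompt, texts):
    """Return one [system, user] conversation per input text."""
    return [
        [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ]
        for text in texts
    ]


conversations = build_conversations(
    "# GUIDELINES ...",  # placeholder; pass the full system prompt from the example above
    ["I am James Jake", "My employee number is 123123123"],
)
print(len(conversations))
```

If the installed vLLM version supports batched chat, `llm.chat(conversations, sampling_params)` would return one output per conversation; otherwise each conversation can be passed to `llm.chat` in a loop, as in the example above.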