### Overview

`pii-phi` is a fine-tuned version of `Phi-3.5-mini-instruct` designed to extract Personally Identifiable Information (PII) from unstructured text. The model outputs PII entities in a structured JSON format according to strict schema guidelines.

### Training Prompt Format

```text
# GUIDELINES
- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.

# EXPECTED OUTPUT
- The json output must be in the format below:
{
  "result": [
    {"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
    ...
  ]
}
```

### Supported Entities

* PERSON\_NAME
* BUSINESS\_NAME
* API\_KEY
* USERNAME
* API\_ENDPOINT
* WEBSITE\_ADDRESS
* PHONE\_NUMBER
* EMAIL\_ADDRESS
* ID
* PASSWORD
* ADDRESS

### Intended Use

The model is intended for PII detection in text documents to support tasks such as data anonymization, compliance, and security auditing.

### Limitations

* Not guaranteed to detect all forms of PII in every context.
* May return false positives or omit contextually relevant information.

---

### Installation

Install the `vllm` package to run the model efficiently:

```bash
pip install vllm
```

---

### Example:

```python
from vllm import LLM, SamplingParams

llm = LLM("Fsoft-AIC/pii-phi")

system_prompt = """
# GUIDELINES
- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.

# EXPECTED OUTPUT
- The json output must be in the format below:
{
  "result": [
    {"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
    ...
  ]
}
"""

pii_message = "I am James Jake and my employee number is 123123123"

sampling_params = SamplingParams(temperature=0, max_tokens=1000)

outputs = llm.chat(
    [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": pii_message},
    ],
    sampling_params,
)

for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
```
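
### Example: Parsing the Output

Because the model returns its findings as JSON text, downstream code usually needs to parse and validate that text before using it. The following is a minimal sketch, not part of the model's own tooling: the helper names `parse_pii_output` and `redact` are hypothetical, the `sample_output` string is an illustrative stand-in for what the model might return, and the `[ENTITY_TYPE]` placeholder style is an assumption. It only relies on the `{"result": [{"entity": ..., "value": ...}]}` format and the supported entity list described above.

```python
import json

# Entity types listed under "Supported Entities".
SUPPORTED_ENTITIES = {
    "PERSON_NAME", "BUSINESS_NAME", "API_KEY", "USERNAME", "API_ENDPOINT",
    "WEBSITE_ADDRESS", "PHONE_NUMBER", "EMAIL_ADDRESS", "ID", "PASSWORD",
    "ADDRESS",
}


def parse_pii_output(generated_text):
    """Return a list of (entity, value) pairs, or [] if parsing fails."""
    # Assumption: the model might wrap the JSON in extra text, so keep only
    # the span between the first "{" and the last "}".
    start, end = generated_text.find("{"), generated_text.rfind("}")
    if start == -1 or end <= start:
        return []
    try:
        payload = json.loads(generated_text[start:end + 1])
    except json.JSONDecodeError:
        return []
    results = payload.get("result", []) if isinstance(payload, dict) else []
    if not isinstance(results, list):
        return []
    pairs = []
    for item in results:
        if not isinstance(item, dict):
            continue
        entity, value = item.get("entity"), item.get("value")
        # Keep only well-formed items with a supported entity type.
        if (isinstance(entity, str) and entity in SUPPORTED_ENTITIES
                and isinstance(value, str) and value):
            pairs.append((entity, value))
    return pairs


def redact(text, pairs):
    """Replace each extracted value with a placeholder such as [PERSON_NAME]."""
    # Replace longer values first so a short value cannot split a longer one.
    for entity, value in sorted(pairs, key=lambda p: len(p[1]), reverse=True):
        text = text.replace(value, f"[{entity}]")
    return text


# Hypothetical model output for the example message; real output may differ.
sample_output = (
    '{"result": [{"entity": "PERSON_NAME", "value": "James Jake"}, '
    '{"entity": "ID", "value": "123123123"}]}'
)
message = "I am James Jake and my employee number is 123123123"

pairs = parse_pii_output(sample_output)
print(redact(message, pairs))
```

In practice, `generated_text` from the vLLM example would be passed to `parse_pii_output` in place of `sample_output`. Returning an empty list on malformed output keeps the caller simple, but since the model may miss PII or return false positives, an empty result should not be read as proof that the text is free of PII.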