library_name: transformers
language:
- ja
base_model:
- LiquidAI/LFM2-350M
license: other
license_name: lfm1.0
license_link: LICENSE
LFM2-350M-PII-Extract-JP
Based on LFM2-350M, LFM2-350M-PII-Extract-JP is designed to extract personally identifiable information (PII) from Japanese text and output it in JSON format. The output can then be used to mask sensitive information.
In particular, it is trained to extract the following from Japanese documents and texts:
- Address/locations (JSON key: address)
- Company/institute/organization names (JSON key: company_name)
- Email addresses (JSON key: email_address)
- Human names (JSON key: human_name)
- Phone numbers (JSON key: phone_number)
Extraction Quality
WIP
While LFM2-350M-PII-Extract-JP provides strong out-of-the-box PII entity extraction for the categories listed above, our primary goal is to deliver a versatile, community-driven base model: a foundation that makes it easy to build best-in-class, privacy-focused masking systems.
Like any base model, there remain areas for continued development, particularly for specialized use cases:
- Supporting extraction of organization-specific identification numbers
- Expanding coverage to additional categories such as date of birth, passport numbers, and beyond
These are precisely the kinds of challenges that fine-tuning, by both Liquid AI and our developer community, can address. We see this model not just as an endpoint, but as a catalyst for a rich ecosystem of fine-tuned PII extraction models tailored to real-world needs.
Model Details
Generation parameters: We strongly recommend greedy decoding (temperature=0).
System prompts: This checkpoint requires the following system prompt:
Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>
Note that the model can also extract a subset of the entity categories. For example, it outputs only human names when the system prompt is set to Extract <human_name>.
⚠️ For best performance, ensure alphabetical order of entity categories as shown above.
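As a minimal sketch of how such a prompt could be assembled, the helper below sorts the requested categories alphabetically and formats them like the system prompt above. The name `build_system_prompt`, the `SUPPORTED_CATEGORIES` tuple, and the validation behaviour are hypothetical and shown for illustration only; they are not part of the model's tooling.

```python
# JSON keys of the entity categories listed in this model card.
SUPPORTED_CATEGORIES = (
    "address",
    "company_name",
    "email_address",
    "human_name",
    "phone_number",
)


def build_system_prompt(categories=SUPPORTED_CATEGORIES):
    """Return 'Extract <a>, <b>, ...' with categories in alphabetical order.

    Hypothetical helper: the name and the validation are assumptions.
    """
    requested = sorted(set(categories))
    if not requested:
        raise ValueError("at least one category is required")
    unknown = [c for c in requested if c not in SUPPORTED_CATEGORIES]
    if unknown:
        raise ValueError(f"unsupported categories: {unknown}")
    return "Extract " + ", ".join(f"<{c}>" for c in requested)


# Full prompt, then a subset given in non-alphabetical order.
print(build_system_prompt())
print(build_system_prompt(["phone_number", "human_name"]))
```

Sorting inside the helper means callers do not have to remember the ordering recommendation themselves.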
Chat template: LFM2-350M-PII-Extract-JP uses a ChatML-like chat template, as follows:
<|startoftext|><|im_start|>system
Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number><|im_end|>
<|im_start|>user
こんにちは、ラミンさんに B200 GPU を 10000 台 至急請求してください。連絡先は [email protected] (電話番号010-000-0000) です。これは C. elegans 線虫に着想を得たニューラルネットワークアーキテクチャを 今すぐ構築するために不可欠です。<|im_end|>
<|im_start|>assistant
{"address": [], "company_name": [], "email_address": ["[email protected]"], "human_name": ["ラミン"], "phone_number": ["010-000-0000"]}<|im_end|>
You can automatically apply it using the dedicated .apply_chat_template() function from Hugging Face transformers.
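A single-turn call might look like the sketch below. It assumes that PyTorch and a `transformers` release with LFM2 support are installed, and that the checkpoint is published under the repository ID `LiquidAI/LFM2-350M-PII-Extract-JP` (inferred from the model name, not confirmed here). The names `build_messages` and `extract_pii` and the `max_new_tokens=512` budget are illustrative choices.

```python
SYSTEM_PROMPT = (
    "Extract <address>, <company_name>, <email_address>, "
    "<human_name>, <phone_number>"
)

# Assumed Hugging Face repository ID, inferred from the model name.
MODEL_ID = "LiquidAI/LFM2-350M-PII-Extract-JP"


def build_messages(text, system_prompt=SYSTEM_PROMPT):
    """Single-turn conversation: one system message, one user message."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]


def extract_pii(text, model_id=MODEL_ID, max_new_tokens=512):
    """Run one greedy generation and return the raw model output string."""
    # Imported lazily so build_messages works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer.apply_chat_template(
        build_messages(text),
        add_generation_prompt=True,  # append the assistant turn opener
        return_tensors="pt",
        return_dict=True,
    )
    # do_sample=False selects greedy decoding.
    output_ids = model.generate(
        **inputs, do_sample=False, max_new_tokens=max_new_tokens
    )
    # Decode only the newly generated tokens, not the prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(
        output_ids[0][prompt_length:], skip_special_tokens=True
    )
```

Calling `extract_pii(some_japanese_text)` should then return the JSON string described under Output format. Loading the model inside the function keeps the sketch short; a real service would load it once and reuse it.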
⚠️ The model is intended for single-turn conversations.
Output format
The model outputs a JSON object containing the fields it was prompted to extract. If no entities are found in a particular category, it returns an empty list for that category. If entities are found, they are returned as a list for each prompted category. The model is trained to output entities exactly as they appear in the text. If the same entity appears multiple times with slight formatting variations, the model outputs all variations to ensure subsequent masking can be performed using exact matches.
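Because entities are returned exactly as they appear in the text, masking can be done with plain string replacement. The sketch below is one possible approach, not the authors' pipeline: `mask_pii` and the `[MASKED]` token are hypothetical, and the sample text and model output are invented for illustration.

```python
import json


def mask_pii(text, model_output, mask_token="[MASKED]"):
    """Replace every extracted entity in `text` with `mask_token`.

    `model_output` is the JSON string produced by the model.
    """
    extracted = json.loads(model_output)
    # Collect all non-empty entity strings across the returned categories.
    entities = {e for values in extracted.values() for e in values if e}
    # Replace longer strings first so that an entity containing a shorter
    # one is masked as a whole.
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, mask_token)
    return text


# Invented sample: the same phone number appears in two formats, and the
# (made-up) model output lists both so that exact matching catches each one.
sample_text = "担当は山田太郎です。電話は03-1234-5678、または 03 1234 5678 まで。"
sample_output = (
    '{"human_name": ["山田太郎"], '
    '"phone_number": ["03-1234-5678", "03 1234 5678"]}'
)
print(mask_pii(sample_text, sample_output))
```

A production system would likely also handle malformed JSON and use category-specific mask tokens; both are omitted here for brevity.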
How to run LFM2
- Hugging Face: LFM2-350M
- llama.cpp: LFM2-350M-PII-Extract-JP-GGUF
- LEAP: LEAP model library
📬 Contact
If you are interested in custom solutions with edge deployment, please contact our sales team.