

library_name: transformers
language:
  - ja
base_model:
  - LiquidAI/LFM2-350M
license: other
license_name: lfm1.0
license_link: LICENSE

LFM2-350M-PII-Extract-JP

Based on LFM2-350M, LFM2-350M-PII-Extract-JP is designed to extract personally identifiable information (PII) from Japanese text and output it in JSON format. The output can then be used to mask sensitive information.

In particular, it is trained to extract the following from Japanese documents and text:

Addresses/locations (JSON key: address)
Company/institute/organization names (JSON key: company_name)
Email addresses (JSON key: email_address)
Human names (JSON key: human_name)
Phone numbers (JSON key: phone_number)

Extraction Quality

WIP

📝 While LFM2-350M-PII-Extract-JP provides strong out-of-the-box PII entity extraction for the categories listed above, our primary goal is to deliver a versatile, community-driven base model—a foundation that makes it easy to build best-in-class, privacy-focused masking systems.

Like any base model, there remain areas for continued development, particularly for specialized use cases:

Supporting extraction of organization-specific identification numbers

Expanding coverage to additional categories such as date of birth, passport numbers, and beyond

These are precisely the kinds of challenges that fine-tuning, by both Liquid AI and our developer community, can address. We see this model not just as an endpoint, but as a catalyst for a rich ecosystem of fine-tuned PII extraction models tailored to real-world needs.

Model Details

Generation parameters: We strongly recommend using greedy decoding (temperature=0).

System prompts: This checkpoint requires the following system prompt:

Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>

Note that the model can also extract a subset of the entity categories. For example, it outputs only human names when the system prompt is set to Extract <human_name>.

⚠️ For best performance, list the entity categories in alphabetical order, as shown above.
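As an illustration of the prompt format above, the following minimal sketch builds a system prompt for any subset of the five categories and keeps them in alphabetical order. The helper name build_system_prompt and the SUPPORTED tuple are hypothetical and are not part of the model or of any library.

```python
# Entity categories listed in this model card.
SUPPORTED = ("address", "company_name", "email_address", "human_name", "phone_number")


def build_system_prompt(categories):
    """Return 'Extract <a>, <b>, ...' with the categories sorted alphabetically.

    Hypothetical helper: it only formats the string described in this README.
    """
    requested = set(categories)
    if not requested:
        raise ValueError("at least one category is required")
    unknown = requested - set(SUPPORTED)
    if unknown:
        raise ValueError(f"unsupported categories: {sorted(unknown)}")
    # Sorting enforces the alphabetical order recommended above.
    return "Extract " + ", ".join(f"<{name}>" for name in sorted(requested))


print(build_system_prompt(SUPPORTED))
print(build_system_prompt(["phone_number", "human_name"]))
```

Passing the categories in any order yields the same prompt, so callers do not need to remember the ordering rule themselves.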

Chat template: LFM2-350M-PII-Extract-JP uses a ChatML-like chat template as follows:

<|startoftext|><|im_start|>system
Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number><|im_end|>
<|im_start|>user
こんにちは、ラミンさんに B200 GPU を 10000 台 至急請求してください。連絡先は [email protected] (電話番号010-000-0000) で、これは C. elegans 線虫に着想を得たニューラルネットワークアーキテクチャを 今すぐ構築するために不可欠です。<|im_end|>
<|im_start|>assistant
{"address": [], "company_name": [], "email_address": ["[email protected]"], "human_name": ["ラミン"], "phone_number": ["010-000-0000"]}<|im_end|>

You can automatically apply it using the dedicated .apply_chat_template() function from Hugging Face transformers.

⚠️ The model is intended for single turn conversations.
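The recommendations in this section (fixed system prompt, chat template, greedy decoding, single turn) can be combined roughly as follows. This is a sketch rather than an official recipe: the repository id in MODEL_ID is an assumption inferred from the model name and the base model's namespace, the function names are hypothetical, and max_new_tokens=512 is an arbitrary choice.

```python
import json

# Assumption: the checkpoint is published under this repository id.
MODEL_ID = "LiquidAI/LFM2-350M-PII-Extract-JP"
SYSTEM_PROMPT = (
    "Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>"
)


def build_messages(text, system_prompt=SYSTEM_PROMPT):
    """Single-turn conversation: one system message followed by one user message."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]


def extract_pii(text, max_new_tokens=512):
    """Run the model once on `text` and return the parsed JSON object."""
    # Imported lazily so build_messages() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # The tokenizer's chat template adds the special tokens shown above.
    inputs = tokenizer.apply_chat_template(
        build_messages(text),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    # do_sample=False selects greedy decoding.
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, then parse them as JSON.
    prompt_length = inputs["input_ids"].shape[-1]
    completion = tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
    return json.loads(completion)
```

In a real service the tokenizer and model would be loaded once and reused across calls, and json.loads would be wrapped in error handling in case the output is truncated or malformed.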

Output format

The model outputs a JSON object containing the fields it was prompted to extract. If no entities are found in a particular category, it returns an empty list for that category. If entities are found, they are returned as a list for each prompted category. The model is trained to output entities exactly as they appear in the text. If the same entity appears multiple times with slight formatting variations, the model outputs all variations to ensure subsequent masking can be performed using exact matches.
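Because entities are returned exactly as they appear in the text, masking can be done with plain exact matching. The sketch below shows one possible approach; the function mask_pii, the [CATEGORY] placeholder style, and the sample sentence and model output are all made up for illustration.

```python
import json
import re


def mask_pii(text, model_output):
    """Replace every extracted entity in `text` with a [CATEGORY] placeholder.

    `model_output` is the JSON string produced by the model.
    """
    entities = json.loads(model_output)

    # Map each surface string to its placeholder, e.g. "田中太郎" -> "[HUMAN_NAME]".
    label_for = {}
    for category, values in entities.items():
        for value in values:
            if value:
                label_for.setdefault(value, f"[{category.upper()}]")
    if not label_for:
        return text

    # Longest strings first, so an entity contained in a longer one
    # does not split the longer match.
    ordered = sorted(label_for, key=len, reverse=True)
    pattern = "|".join(re.escape(value) for value in ordered)

    # A single pass avoids re-matching text inside inserted placeholders.
    return re.sub(pattern, lambda m: label_for[m.group(0)], text)


# Made-up sample: "Tanaka Taro's phone number is 03-1234-5678."
sample_text = "田中太郎さんの電話番号は03-1234-5678です。"
sample_output = (
    '{"address": [], "company_name": [], "email_address": [], '
    '"human_name": ["田中太郎"], "phone_number": ["03-1234-5678"]}'
)
print(mask_pii(sample_text, sample_output))
# prints: [HUMAN_NAME]さんの電話番号は[PHONE_NUMBER]です。
```

Exact matching only masks the variants the model actually returned, which is why the model is trained to list every formatting variation it finds.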

🏃 How to run LFM2

Hugging Face: LFM2-350M
llama.cpp: LFM2-350M-PII-Extract-JP-GGUF
LEAP: LEAP model library

📬 Contact

If you are interested in custom solutions with edge deployment, please contact our sales team.