Model Card for SNU Thunder-DeID
Model Summary
SNU Thunder-DeID is a family of transformer encoder-based language models developed for Named Entity Recognition (NER)-based de-identification of Korean court judgments.
Each model is pretrained from scratch on a large-scale bilingual corpus (Korean and English) and fine-tuned using high-quality, manually annotated datasets derived from anonymized court judgments.
The models are designed to identify and label personal identifiers and quasi-identifiers in a token classification setting, supporting accurate and privacy-preserving processing of Korean court judgments.
The SNU Thunder-DeID models are released in three sizes:
- SNU Thunder-DeID-340M
- SNU Thunder-DeID-750M
- SNU Thunder-DeID-1.5B (here)
Intended Use
The SNU Thunder-DeID models are intended to support:
- De-identification of Korean court judgments
- NER tasks focused on court judgment entities
- Fine-tuning for privacy-preserving AI systems
How to Use
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
model = AutoModelForTokenClassification.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
inputs = tokenizer("""ํผ๊ณ ์ธ ์ด๊ท์ฑ์ ์์ธ๋ํ๊ต ๋ฐ์ดํฐ์ฌ์ด์ธ์ค๋ํ์ ๋ฐ์ฌ๊ณผ์ ์ ์ฌํ ์ค์ด๋ฉฐ, ๊ฐ์ ์ฐ๊ตฌ์ค ์์ ํจ์ฑ์, ๋ฐํ์ง์ ํจ๊ป AI ๋ชจ๋ธ ๋น์๋ณํ์ ๊ด๋ จ๋ ์ฐ๊ตฌ๋ฅผ ์งํ ์ค์ด๋ค.
๊ทธ๋ ํด๋น ๊ธฐ์ ์ด ์ด๋ฏธ ์ฌ๋ฌ ๊ณต๊ณต๊ธฐ๊ด ๋ฐ ๋๊ธฐ์
์ผ๋ก๋ถํฐ ์์ฉํ ์ ์์ ๋ฐ๊ณ ์๋ค๊ณ ํ์๋ก ์ฃผ์ฅํ๋ฉฐ, ์ปค๋ฎค๋ํฐ ์ฌ์ดํธ โ์๋ธ๋ฆฌํ์โ์ โ๋น์๋ณํ ๊ธฐ์ ํฌ์์ ๋ชจ์งโ์ด๋ผ๋ ์ ๋ชฉ์ ๊ธ์ ๊ฒ์ํ์๋ค.
ํด๋น ๊ธ์๋ โ์ด๋ฏธ ๊ฒ์ฆ๋ ์๊ณ ๋ฆฌ์ฆ, ์ ์ ํฌ์ ์ ์ง๋ถ ์ฐ์ ๋ฐฐ์ โ, โํนํ ์์ต ๋ฐฐ๋ถ ์์ โ ๋ฑ์ ๋ฌธ๊ตฌ์ ํจ๊ป ์์ ๋ช
์์ ์ฐ๋ฆฌ์ํ ๊ณ์ข (9429-424-343942)๋ฅผ ๊ธฐ์ฌํ๊ณ ,
1์ธ๋น 10๋ง ์์ ์ด๊ธฐ ํฌ์๊ธ์ ์๊ตฌํ์๋ค. ์ด์ ๋ฐ๋ผ ์ด๊ท์ฑ์ ์์์ค, ์กฐ๊ฒฝ์ , ์ด๋์, ์์ฐ๊ฒฝ, ์์งํ ๋ฑ 5๋ช
์ผ๋ก๋ถํฐ ์ด 50๋ง ์์ ์ก๊ธ๋ฐ์ ํธ์ทจํ์๋ค.""", return_tensors="pt")
outputs = model(**inputs)
โ ๏ธ Note
To obtain the final de-identified text, use the inference toolkit provided in our SNU_Thunder-DeID GitHub repository.
The toolkit handles the full postprocessing pipeline, including:
- `id2label` and `label2id` mappings
- token-to-text alignment
- entity merging, whitespace recovery, and formatting
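For a quick look at the raw predictions before that postprocessing, a minimal sketch that continues from the snippet above (it relies only on standard `transformers` config fields and assumes the usual `O` tag for non-entity tokens):

```python
# Map each token's highest-scoring logit to a label name via config.id2label.
# This shows raw per-token predictions only; the GitHub toolkit is still needed
# to merge entities, recover whitespace, and produce the de-identified text.
pred_ids = outputs.logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, pred_id in zip(tokens, pred_ids):
    label = model.config.id2label[pred_id]
    if label != "O":  # "O" (non-entity) is assumed; check model.config.id2label
        print(token, label)
```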
Model Details
Model Architecture
All SNU Thunder-DeID models are based on the DeBERTa-v2 architecture with relative positional encoding and disentangled attention.
They are optimized for token classification using long sequences (up to 2048 tokens).
| Model | Layers | Hidden Size | Heads | Intermediate Size | Vocab Size | Max Position | Tokens Used for Pretraining |
|---|---|---|---|---|---|---|---|
| SNU Thunder-DeID-340M | 24 | 1024 | 16 | 4096 | 32,000 | 2048 | 14B |
| SNU Thunder-DeID-750M | 36 | 1280 | 20 | 5120 | 32,000 | 2048 | 30B |
| SNU Thunder-DeID-1.5B | 24 | 2048 | 32 | 5504 | 128,000 | 2048 | 60B |
All models use:
- `hidden_act`: GELU
- `dropout`: 0.1
- `pos_att_type`: `p2c|c2p` (position-to-content and content-to-position attention)
- `relative_attention`: True
- `tokenizer`: custom BPE + MeCab-ko tokenizer, trained from scratch on Korean court judgment data
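These settings can be checked directly against the hosted configuration with the standard `transformers` API; a minimal sketch (the repo id with underscores is assumed, and the expected values in the comments mirror the table and list above):

```python
from transformers import AutoConfig

# Read the published config and print the architecture fields described above.
config = AutoConfig.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")

print(config.model_type)                # expected: "deberta-v2"
print(config.num_hidden_layers,         # 24 layers for the 340M model
      config.hidden_size,               # 1024
      config.num_attention_heads,       # 16
      config.intermediate_size)         # 4096
print(config.max_position_embeddings)   # 2048
print(config.pos_att_type,              # ["p2c", "c2p"]
      config.relative_attention)        # True
print(config.hidden_act,                # "gelu"
      config.hidden_dropout_prob)       # 0.1
```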
Tokenizer
All SNU Thunder-DeID models use a custom tokenizer trained from scratch on a large-scale Korean corpus.
The tokenizer combines:
- MeCab-ko for morpheme-based segmentation
- Byte-Pair Encoding (BPE) for subword representation
Two vocabulary sizes were used depending on the model:
- 32,000 tokens (used in 340M and 750M models)
- 128,000 tokens (used in 1.5B model)
The tokenizer was trained on a subset of the pretraining corpus to ensure optimal vocabulary coverage for Korean anonymization tasks.
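Assuming the tokenizers load through the standard `AutoTokenizer` API, as in the usage example above (the repo ids with underscores are assumptions), a small sketch for inspecting them:

```python
from transformers import AutoTokenizer

# Load the custom MeCab-ko + BPE tokenizers and compare vocabulary sizes,
# which should correspond to the 32,000 / 128,000 entries described above.
tok_340m = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
tok_1_5b = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-1.5B")
print(tok_340m.vocab_size, tok_1_5b.vocab_size)  # expected: 32000 and 128000

# Tokenize a short Korean sentence; morpheme segmentation (MeCab-ko) is applied
# before BPE, so subwords generally follow morpheme boundaries.
print(tok_340m.tokenize("피고인은 서울중앙지방법원에서 재판을 받았다."))
```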
Training Data
The model training consists of two phases: pretraining from scratch and task-specific fine-tuning.
Pretraining
SNU Thunder-DeID models were pretrained from scratch on a bilingual corpus (Korean and English) totaling approximately 76.7GB,
using 14B / 30B / 60B tokens for the 340M, 750M, and 1.5B models respectively.
Fine-tuning
Fine-tuning was performed on the SNU Thunder-DeID Annotated court judgments dataset, using additional entity information from the SNU Thunder-DeID Entity mention list resource.
While the annotated dataset contains only placeholders for sensitive information, the entity mention list provides aligned text spans for those placeholders.
This alignment enables full token-level supervision for NER training. Key characteristics of the fine-tuning data:
- 4,500 anonymized and manually annotated court judgment texts
- Covers three major criminal case types: fraud, crime of violence, and indecent act by compulsion
- 27,402 labeled entity spans, using a three-tiered taxonomy of 595 entity labels tailored for Korean judicial anonymization
- Annotations are inserted in-line using special tokens for structured NER training
While the base annotated dataset contains only generic placeholders, the entity mention dataset aligns these with realistic entity spans to enable effective NER-based de-identification training.
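To illustrate why this alignment matters, here is a minimal, hypothetical sketch; the placeholder format, the mention mapping, and the label name are invented for illustration and do not reflect the dataset's actual schema:

```python
# Hypothetical annotated sentence with a generic placeholder, plus the aligned
# mention and entity label taken from a (made-up) mention list entry.
annotated = "피고인 [PERSON_1]은 피해자에게 전화하였다."
mentions = {"[PERSON_1]": ("김민준", "PERSON_NAME")}  # surface form, entity label

# Substitute each mention back in and record its character span; these spans are
# what make token-level supervision (e.g. BIO tagging) possible for NER training.
text = annotated
spans = []
for placeholder, (surface, label) in mentions.items():
    start = text.index(placeholder)
    text = text.replace(placeholder, surface, 1)
    spans.append((start, start + len(surface), label))

print(text)   # 피고인 김민준은 피해자에게 전화하였다.
print(spans)  # [(4, 7, 'PERSON_NAME')]
```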
Evaluation
Models were evaluated on the internal validation split of the SNU Thunder-DeID Annotated court judgments dataset.
| Metric | 340M | 750M | 1.5B |
|---|---|---|---|
| Binary Token-Level Micro F1 | 0.9894 | 0.9891 | 0.9910 |
| Token-Level Micro F1 | 0.8917 | 0.8862 | 0.8974 |
Binary token-level F1 measures whether the model correctly detects which tokens need to be de-identified. Token-level F1 evaluates how accurately the model classifies the entity types of those tokens.
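To make the distinction concrete, a small sketch with toy labels (it collapses all entity tags into a single positive class for the binary score and requires the exact tag for the typed score):

```python
# Toy per-token gold and predicted tags; the last entity token is mistyped.
gold = ["O", "B-PERSON", "I-PERSON", "O", "B-ORG"]
pred = ["O", "B-PERSON", "I-PERSON", "O", "B-PERSON"]

# Binary token-level micro F1: only "sensitive vs. not" matters.
gold_bin = [t != "O" for t in gold]
pred_bin = [t != "O" for t in pred]
tp = sum(g and p for g, p in zip(gold_bin, pred_bin))
fp = sum(p and not g for g, p in zip(gold_bin, pred_bin))
fn = sum(g and not p for g, p in zip(gold_bin, pred_bin))
binary_f1 = 2 * tp / (2 * tp + fp + fn)          # 1.0: every sensitive token found

# Token-level (typed) micro F1: the entity class must also match.
tp_t = sum(g == p != "O" for g, p in zip(gold, pred))
fp_t = sum(p != "O" and g != p for g, p in zip(gold, pred))
fn_t = sum(g != "O" and g != p for g, p in zip(gold, pred))
typed_f1 = 2 * tp_t / (2 * tp_t + fp_t + fn_t)   # ~0.67: the ORG token was mistyped

print(binary_f1, typed_f1)
```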
Limitations
- Trained only on criminal court cases → not guaranteed to generalize to civil or administrative rulings
- Designed for Korean texts → not applicable to other languages or domains
- Not suitable for identifying sensitive content outside of structured NER targets
Ethical Considerations
- The model is trained on already-anonymized court documents
- Deployment in real-world settings should still include human oversight and legal compliance checks
License
This repository contains original work licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
Portions of this repository (including tokenizer vocabulary and/or model weights)
are derived from Meta Llama 3.1 and are subject to the Meta Llama 3.1 Community License.
- Meta Llama 3.1 Community License: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE
- Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: https://creativecommons.org/licenses/by-nc-sa/4.0/
Citation
If you use this model, please cite:
@misc{hahm2025thunderdeidaccurateefficientdeidentification,
title={Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments},
author={Sungen Hahm and Heejin Kim and Gyuseong Lee and Hyunji Park and Jaejin Lee},
year={2025},
eprint={2506.15266},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.15266},
}
Contact
If you have questions or issues, contact:
[email protected]