Model Card for SNU Thunder-DeID
Model Summary
SNU Thunder-DeID is a family of transformer encoder-based language models developed for Named Entity Recognition (NER)-based de-identification of Korean court judgments.
Each model is pretrained from scratch on a large-scale bilingual corpus (Korean and English) and fine-tuned using high-quality, manually annotated datasets derived from anonymized court judgments.
The models are designed to identify and label personal identifiers and quasi-identifiers in a token classification setting, supporting accurate and privacy-preserving processing of Korean court judgments.
The SNU Thunder-DeID models are released in three sizes:
- SNU Thunder-DeID-340M
- SNU Thunder-DeID-750M
- SNU Thunder-DeID-1.5B (here)
Intended Use
The SNU Thunder-DeID models are intended to support:
- De-identification of Korean court judgments
- NER tasks focused on court judgment entities
- Fine-tuning for privacy-preserving AI systems
How to Use
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
model = AutoModelForTokenClassification.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
inputs = tokenizer("""ํผ๊ณ ์ธ ์ด๊ท์ฑ์ ์์ธ๋ํ๊ต ๋ฐ์ดํฐ์ฌ์ด์ธ์ค๋ํ์ ๋ฐ์ฌ๊ณผ์ ์ ์ฌํ ์ค์ด๋ฉฐ, ๊ฐ์ ์ฐ๊ตฌ์ค ์์ ํจ์ฑ์, ๋ฐํ์ง์ ํจ๊ป AI ๋ชจ๋ธ ๋น์๋ณํ์ ๊ด๋ จ๋ ์ฐ๊ตฌ๋ฅผ ์งํ ์ค์ด๋ค.
๊ทธ๋ ํด๋น ๊ธฐ์ ์ด ์ด๋ฏธ ์ฌ๋ฌ ๊ณต๊ณต๊ธฐ๊ด ๋ฐ ๋๊ธฐ์
์ผ๋ก๋ถํฐ ์์ฉํ ์ ์์ ๋ฐ๊ณ ์๋ค๊ณ ํ์๋ก ์ฃผ์ฅํ๋ฉฐ, ์ปค๋ฎค๋ํฐ ์ฌ์ดํธ โ์๋ธ๋ฆฌํ์โ์ โ๋น์๋ณํ ๊ธฐ์ ํฌ์์ ๋ชจ์งโ์ด๋ผ๋ ์ ๋ชฉ์ ๊ธ์ ๊ฒ์ํ์๋ค.
ํด๋น ๊ธ์๋ โ์ด๋ฏธ ๊ฒ์ฆ๋ ์๊ณ ๋ฆฌ์ฆ, ์ ์ ํฌ์ ์ ์ง๋ถ ์ฐ์ ๋ฐฐ์ โ, โํนํ ์์ต ๋ฐฐ๋ถ ์์ โ ๋ฑ์ ๋ฌธ๊ตฌ์ ํจ๊ป ์์ ๋ช
์์ ์ฐ๋ฆฌ์ํ ๊ณ์ข (9429-424-343942)๋ฅผ ๊ธฐ์ฌํ๊ณ ,
1์ธ๋น 10๋ง ์์ ์ด๊ธฐ ํฌ์๊ธ์ ์๊ตฌํ์๋ค. ์ด์ ๋ฐ๋ผ ์ด๊ท์ฑ์ ์์์ค, ์กฐ๊ฒฝ์ , ์ด๋์, ์์ฐ๊ฒฝ, ์์งํ ๋ฑ 5๋ช
์ผ๋ก๋ถํฐ ์ด 50๋ง ์์ ์ก๊ธ๋ฐ์ ํธ์ทจํ์๋ค.""", return_tensors="pt")
outputs = model(**inputs)
โ ๏ธ Note
To obtain the final de-identified text, use the inference toolkit provided in our SNU_Thunder-DeID GitHub repository.
The toolkit handles the full postprocessing pipeline, including:
- `id2label` and `label2id` mappings
- token-to-text alignment
- entity merging, whitespace recovery, and formatting
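For a quick look at the raw predictions before that postprocessing, a minimal sketch that continues from the snippet above (it relies only on standard `transformers` config fields and assumes the usual `O` tag for non-entity tokens):

```python
# Map each token's highest-scoring logit to a label name via config.id2label.
# This shows raw per-token predictions only; the GitHub toolkit is still needed
# to merge entities, recover whitespace, and produce the de-identified text.
pred_ids = outputs.logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, pred_id in zip(tokens, pred_ids):
    label = model.config.id2label[pred_id]
    if label != "O":  # "O" (non-entity) is assumed; check model.config.id2label
        print(token, label)
```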
Model Details
Model Architecture
All SNU Thunder-DeID models are based on the DeBERTa-v2 architecture with relative positional encoding and disentangled attention.
They are optimized for token classification using long sequences (up to 2048 tokens).
| Model | Layers | Hidden Size | Heads | Intermediate Size | Vocab Size | Max Position | Tokens Used for Pretraining |
|---|---|---|---|---|---|---|---|
| SNU Thunder-DeID-340M | 24 | 1024 | 16 | 4096 | 32,000 | 2048 | 14B |
| SNU Thunder-DeID-750M | 36 | 1280 | 20 | 5120 | 32,000 | 2048 | 30B |
| SNU Thunder-DeID-1.5B | 24 | 2048 | 32 | 5504 | 128,000 | 2048 | 60B |
All models use:
- `hidden_act`: GELU
- `dropout`: 0.1
- `pos_att_type`: `p2c|c2p` (position-to-content and content-to-position attention)
- `relative_attention`: True
- `tokenizer`: custom BPE + MeCab-ko tokenizer, trained from scratch on Korean court judgment data
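These settings can be checked directly against the hosted configuration with the standard `transformers` API; a minimal sketch (the repo id with underscores is assumed, and the expected values in the comments mirror the table and list above):

```python
from transformers import AutoConfig

# Read the published config and print the architecture fields described above.
config = AutoConfig.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")

print(config.model_type)                # expected: "deberta-v2"
print(config.num_hidden_layers,         # 24 layers for the 340M model
      config.hidden_size,               # 1024
      config.num_attention_heads,       # 16
      config.intermediate_size)         # 4096
print(config.max_position_embeddings)   # 2048
print(config.pos_att_type,              # ["p2c", "c2p"]
      config.relative_attention)        # True
print(config.hidden_act,                # "gelu"
      config.hidden_dropout_prob)       # 0.1
```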
Tokenizer
All SNU Thunder-DeID models use a custom tokenizer trained from scratch on a large-scale Korean corpus.
The tokenizer combines:
- MeCab-ko for morpheme-based segmentation
- Byte-Pair Encoding (BPE) for subword representation
Two vocabulary sizes were used depending on the model:
- 32,000 tokens (used in 340M and 750M models)
- 128,000 tokens (used in 1.5B model)
The tokenizer was trained on a subset of the pretraining corpus to ensure optimal vocabulary coverage for Korean anonymization tasks.
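Assuming the tokenizers load through the standard `AutoTokenizer` API, as in the usage example above (the repo ids with underscores are assumptions), a small sketch for inspecting them:

```python
from transformers import AutoTokenizer

# Load the custom MeCab-ko + BPE tokenizers and compare vocabulary sizes,
# which should correspond to the 32,000 / 128,000 entries described above.
tok_340m = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
tok_1_5b = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-1.5B")
print(tok_340m.vocab_size, tok_1_5b.vocab_size)  # expected: 32000 and 128000

# Tokenize a short Korean sentence; morpheme segmentation (MeCab-ko) is applied
# before BPE, so subwords generally follow morpheme boundaries.
print(tok_340m.tokenize("피고인은 서울중앙지방법원에서 재판을 받았다."))
```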
Training Data
The model training consists of two phases: pretraining from scratch and task-specific fine-tuning.
Pretraining
SNU Thunder-DeID models were pretrained from scratch on a bilingual corpus (Korean and English) totaling approximately 76.7GB,
using 14B / 30B / 60B tokens for the 340M, 750M, and 1.5B models respectively.
Fine-tuning
Fine-tuning was performed on the SNU Thunder-DeID Annotated court judgments dataset, using additional entity information from the SNU Thunder-DeID Entity mention list resource.
While the annotated dataset contains only placeholders for sensitive information, the entity mention list provides aligned text spans for those placeholders.
This alignment enables full token-level supervision for NER training. Key characteristics of the fine-tuning data:
- 4,500 anonymized and manually annotated court judgment texts
- Covers three major criminal case types: fraud, crime of violence, and indecent act by compulsion
- 27,402 labeled entity spans, using a three-tiered taxonomy of 595 entity labels tailored for Korean judicial anonymization
- Annotations are inserted in-line using special tokens for structured NER training
While the base annotated dataset contains only generic placeholders, the entity mention dataset aligns these with realistic entity spans to enable effective NER-based de-identification training.
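To illustrate why this alignment matters, here is a minimal, hypothetical sketch; the placeholder format, the mention mapping, and the label name are invented for illustration and do not reflect the dataset's actual schema:

```python
# Hypothetical annotated sentence with a generic placeholder, plus the aligned
# mention and entity label taken from a (made-up) mention list entry.
annotated = "피고인 [PERSON_1]은 피해자에게 전화하였다."
mentions = {"[PERSON_1]": ("김민준", "PERSON_NAME")}  # surface form, entity label

# Substitute each mention back in and record its character span; these spans are
# what make token-level supervision (e.g. BIO tagging) possible for NER training.
text = annotated
spans = []
for placeholder, (surface, label) in mentions.items():
    start = text.index(placeholder)
    text = text.replace(placeholder, surface, 1)
    spans.append((start, start + len(surface), label))

print(text)   # 피고인 김민준은 피해자에게 전화하였다.
print(spans)  # [(4, 7, 'PERSON_NAME')]
```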
Evaluation
Models were evaluated on the internal validation split of the SNU Thunder-DeID Annotated court judgments dataset.
| Metric | 340M | 750M | 1.5B |
|---|---|---|---|
| Binary Token-Level Micro F1 | 0.9894 | 0.9891 | 0.9910 |
| Token-Level Micro F1 | 0.8917 | 0.8862 | 0.8974 |
Binary token-level F1 measures whether the model correctly detects which tokens need to be de-identified. Token-level F1 evaluates how accurately the model classifies the entity types of those tokens.
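To make the distinction concrete, a small sketch with toy labels (it collapses all entity tags into a single positive class for the binary score and requires the exact tag for the typed score):

```python
# Toy per-token gold and predicted tags; the last entity token is mistyped.
gold = ["O", "B-PERSON", "I-PERSON", "O", "B-ORG"]
pred = ["O", "B-PERSON", "I-PERSON", "O", "B-PERSON"]

# Binary token-level micro F1: only "sensitive vs. not" matters.
gold_bin = [t != "O" for t in gold]
pred_bin = [t != "O" for t in pred]
tp = sum(g and p for g, p in zip(gold_bin, pred_bin))
fp = sum(p and not g for g, p in zip(gold_bin, pred_bin))
fn = sum(g and not p for g, p in zip(gold_bin, pred_bin))
binary_f1 = 2 * tp / (2 * tp + fp + fn)          # 1.0: every sensitive token found

# Token-level (typed) micro F1: the entity class must also match.
tp_t = sum(g == p != "O" for g, p in zip(gold, pred))
fp_t = sum(p != "O" and g != p for g, p in zip(gold, pred))
fn_t = sum(g != "O" and g != p for g, p in zip(gold, pred))
typed_f1 = 2 * tp_t / (2 * tp_t + fp_t + fn_t)   # ~0.67: the ORG token was mistyped

print(binary_f1, typed_f1)
```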
Limitations
- Trained only on criminal court cases → not guaranteed to generalize to civil or administrative rulings
- Designed for Korean texts → not applicable to other languages or domains
- Not suitable for identifying sensitive content outside of structured NER targets
Ethical Considerations
- The model is trained on already-anonymized court documents
- Deployment in real-world settings should still include human oversight and legal compliance checks
License
This repository contains original work licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
Portions of this repository (including tokenizer vocabulary and/or model weights)
are derived from Meta Llama 3.1 and are subject to the Meta Llama 3.1 Community License.
- Meta Llama 3.1 Community License: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE
- Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: https://creativecommons.org/licenses/by-nc-sa/4.0/
Citation
If you use this model, please cite:
@misc{hahm2025thunderdeidaccurateefficientdeidentification,
title={Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments},
author={Sungen Hahm and Heejin Kim and Gyuseong Lee and Hyunji Park and Jaejin Lee},
year={2025},
eprint={2506.15266},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.15266},
}
Contact
If you have questions or issues, contact:
[email protected]