Model Card for SNU Thunder-DeID

Model Summary

SNU Thunder-DeID is a family of transformer encoder-based language models developed for Named Entity Recognition (NER)-based de-identification of Korean court judgments.
Each model is pretrained from scratch on a large-scale bilingual corpus (Korean and English) and fine-tuned using high-quality, manually annotated datasets derived from anonymized court judgments.
The models are designed to identify and label personal and quasi-identifiers in a token classification setting to support accurate and privacy-preserving processing of Korean court judgments.

The SNU Thunder-DeID models are released in three sizes:

  • SNU Thunder-DeID-340M
  • SNU Thunder-DeID-750M
  • SNU Thunder-DeID-1.5B

Intended Use

The SNU Thunder-DeID models are intended to support:

  • De-identification of Korean court judgments
  • NER tasks focused on court judgment entities
  • Fine-tuning for privacy-preserving AI systems

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
model = AutoModelForTokenClassification.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")

inputs = tokenizer("""ํ”ผ๊ณ ์ธ ์ด๊ทœ์„ฑ์€ ์„œ์šธ๋Œ€ํ•™๊ต ๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šค๋Œ€ํ•™์› ๋ฐ•์‚ฌ๊ณผ์ •์— ์žฌํ•™ ์ค‘์ด๋ฉฐ, ๊ฐ™์€ ์—ฐ๊ตฌ์‹ค ์†Œ์† ํ•จ์„ฑ์€, ๋ฐ•ํ˜„์ง€์™€ ํ•จ๊ป˜ AI ๋ชจ๋ธ ๋น„์‹๋ณ„ํ™”์™€ ๊ด€๋ จ๋œ ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ ์ค‘์ด๋‹ค.
๊ทธ๋Š” ํ•ด๋‹น ๊ธฐ์ˆ ์ด ์ด๋ฏธ ์—ฌ๋Ÿฌ ๊ณต๊ณต๊ธฐ๊ด€ ๋ฐ ๋Œ€๊ธฐ์—…์œผ๋กœ๋ถ€ํ„ฐ ์ƒ์šฉํ™” ์ œ์•ˆ์„ ๋ฐ›๊ณ  ์žˆ๋‹ค๊ณ  ํ—ˆ์œ„๋กœ ์ฃผ์žฅํ•˜๋ฉฐ, ์ปค๋ฎค๋‹ˆํ‹ฐ ์‚ฌ์ดํŠธ โ€˜์—๋ธŒ๋ฆฌํƒ€์ž„โ€™์— โ€œ๋น„์‹๋ณ„ํ™” ๊ธฐ์ˆ  ํˆฌ์ž์ž ๋ชจ์ง‘โ€์ด๋ผ๋Š” ์ œ๋ชฉ์˜ ๊ธ€์„ ๊ฒŒ์‹œํ•˜์˜€๋‹ค.
ํ•ด๋‹น ๊ธ€์—๋Š” โ€œ์ด๋ฏธ ๊ฒ€์ฆ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜, ์„ ์  ํˆฌ์ž ์‹œ ์ง€๋ถ„ ์šฐ์„  ๋ฐฐ์ •โ€, โ€œํŠนํ—ˆ ์ˆ˜์ต ๋ฐฐ๋ถ„ ์˜ˆ์ •โ€ ๋“ฑ์˜ ๋ฌธ๊ตฌ์™€ ํ•จ๊ป˜ ์ž์‹  ๋ช…์˜์˜ ์šฐ๋ฆฌ์€ํ–‰ ๊ณ„์ขŒ (9429-424-343942)๋ฅผ ๊ธฐ์žฌํ•˜๊ณ ,
1์ธ๋‹น 10๋งŒ ์›์˜ ์ดˆ๊ธฐ ํˆฌ์ž๊ธˆ์„ ์š”๊ตฌํ•˜์˜€๋‹ค. ์ด์— ๋”ฐ๋ผ ์ด๊ทœ์„ฑ์€ ์†์˜์ค€, ์กฐ๊ฒฝ์ œ, ์ด๋™์˜, ์†Œ์—ฐ๊ฒฝ, ์„์ง€ํ—Œ ๋“ฑ 5๋ช…์œผ๋กœ๋ถ€ํ„ฐ ์ด 50๋งŒ ์›์„ ์†ก๊ธˆ๋ฐ›์•„ ํŽธ์ทจํ•˜์˜€๋‹ค.""", return_tensors="pt")
outputs = model(**inputs)

โš ๏ธ Note
To obtain the final de-identified text, use the inference toolkit provided in our SNU_Thunder-DeID GitHub repository.
The toolkit handles the full postprocessing pipeline, including:

  • id2label and label2id mappings
  • token-to-text alignment
  • entity merging, whitespace recovery, and formatting
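
For a quick look at the raw predictions without the toolkit, the logits can be decoded into per-token labels as in the minimal sketch below. This is only a rough preview, not a replacement for the toolkit's postprocessing, and it assumes the config's id2label mapping uses "O" for non-entity tokens.

# Continues from the snippet above (`tokenizer`, `model`, `inputs`, `outputs`).
# Rough per-token preview only; the official toolkit performs the complete
# postprocessing pipeline (entity merging, whitespace recovery, formatting).
predicted_ids = outputs.logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, label_id in zip(tokens, predicted_ids):
    label = model.config.id2label[label_id]
    if label != "O":  # assumes "O" marks non-entity tokens
        print(token, label)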

Model Details

Model Architecture

All SNU Thunder-DeID models are based on the DeBERTa-v2 architecture with relative positional encoding and disentangled attention.
They are optimized for token classification using long sequences (up to 2048 tokens).

Model                   Layers   Hidden Size   Heads   Intermediate Size   Vocab Size   Max Position   Tokens Used for Pretraining
SNU Thunder-DeID-340M   24       1024          16      4096                32,000       2048           14B
SNU Thunder-DeID-750M   36       1280          20      5120                32,000       2048           30B
SNU Thunder-DeID-1.5B   24       2048          32      5504                128,000      2048           60B

All models use:

  • hidden_act: GELU
  • dropout: 0.1
  • pos_att_type: p2c|c2p (position-to-content and content-to-position attention)
  • relative_attention: True
  • tokenizer: Custom BPE + MeCab-ko tokenizer, trained from scratch on Korean court judgment data
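
For reference, the 340M row of the table above corresponds roughly to the DebertaV2Config sketched below. This is an illustrative reconstruction from the table, not the shipped configuration file; the exact values should be read with AutoConfig.from_pretrained.

from transformers import DebertaV2Config

# Illustrative reconstruction of the SNU Thunder-DeID-340M settings from the
# table above; the authoritative values come from the released config file
# (e.g. AutoConfig.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")).
config_340m = DebertaV2Config(
    vocab_size=32000,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=2048,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    relative_attention=True,
    pos_att_type=["p2c", "c2p"],
)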

Tokenizer

All SNU Thunder-DeID models use a custom tokenizer trained from scratch on a large-scale Korean corpus.
The tokenizer combines:

  • MeCab-ko for morpheme-based segmentation
  • Byte-Pair Encoding (BPE) for subword representation

Two vocabulary sizes were used depending on the model:

  • 32,000 tokens (used in 340M and 750M models)
  • 128,000 tokens (used in 1.5B model)

The tokenizer was trained on a subset of the pretraining corpus to ensure optimal vocabulary coverage for Korean anonymization tasks.
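
To inspect the morpheme-aware subword segmentation, the released tokenizer can be loaded directly. The snippet below is a minimal example using the 340M checkpoint; the 750M and 1.5B tokenizers work the same way.

from transformers import AutoTokenizer

# Load the released tokenizer and inspect how a Korean sentence from a court
# judgment is segmented into morpheme-aware subwords.
tokenizer = AutoTokenizer.from_pretrained("thunder-research-group/SNU_Thunder-DeID-340M")
print(tokenizer.tokenize("피고인 이규성은 서울대학교 데이터사이언스대학원 박사과정에 재학 중이다."))
print(tokenizer.vocab_size)  # 32,000 for 340M/750M, 128,000 for 1.5B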

Training Data

The model training consists of two phases: pretraining from scratch and task-specific fine-tuning.

Pretraining

SNU Thunder-DeID models were pretrained from scratch on a bilingual corpus (Korean and English) totaling approximately 76.7 GB,
using 14B, 30B, and 60B tokens for the 340M, 750M, and 1.5B models, respectively.

Fine-tuning

Fine-tuning was performed on the SNU Thunder-DeID Annotated court judgments dataset, using additional entity information from the SNU Thunder-DeID Entity mention list resource.
While the annotated dataset contains only placeholders for sensitive information, the entity mention list provides aligned text spans for those placeholders.
This alignment enables full token-level supervision for NER training.

  • 4,500 anonymized and manually annotated court judgment texts
  • Covers three major criminal case types: fraud, crime of violence, and indecent act by compulsion
  • 27,402 labeled entity spans, using a three-tiered taxonomy of 595 entity labels tailored for Korean judicial anonymization
  • Annotations are inserted in-line using special tokens for structured NER training

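As a rough illustration of how the placeholder/mention alignment can produce token-level supervision, the hypothetical sketch below fills a placeholder with its aligned mention and emits word-level BIO labels. The placeholder format, label name, and helper function are assumptions made for illustration; they are not the actual schema of the SNU Thunder-DeID datasets.

# Hypothetical sketch: fill a placeholder with its aligned mention and build
# parallel word / BIO-label lists (whitespace tokenization for simplicity).
# The placeholder format "[PERSON_1]" and the label "PERSON" are illustrative
# assumptions, not the datasets' actual 595-label taxonomy.
def fill_and_label(template, placeholder, mention, label):
    before, after = template.split(placeholder, 1)
    words = before.split() + mention.split() + after.split()
    labels = (
        ["O"] * len(before.split())
        + [f"B-{label}"] + [f"I-{label}"] * (len(mention.split()) - 1)
        + ["O"] * len(after.split())
    )
    return words, labels

words, labels = fill_and_label(
    "피고인 [PERSON_1] 은 초기 투자금을 편취하였다.", "[PERSON_1]", "이규성", "PERSON"
)
# words  -> ['피고인', '이규성', '은', '초기', '투자금을', '편취하였다.']
# labels -> ['O', 'B-PERSON', 'O', 'O', 'O', 'O']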

Evaluation

Models were evaluated on the internal validation split of the SNU Thunder-DeID Annotated court judgments dataset.

Metric                         340M     750M     1.5B
Binary Token-Level Micro F1    0.9894   0.9891   0.9910
Token-Level Micro F1           0.8917   0.8862   0.8974

Binary token-level F1 measures whether the model correctly detects which tokens need to be de-identified. Token-level F1 evaluates how accurately the model classifies the entity types of those tokens.
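
As a toy illustration of the difference between the two metrics (not the paper's evaluation script), the sketch below pools token-level counts into micro F1; the label names here are placeholders, not the actual taxonomy.

# Toy illustration of the two metrics, not the actual evaluation script.
# Binary micro F1: was a token flagged as an entity at all (any non-"O" label)?
# Token-level micro F1: was the predicted entity type also correct?
def micro_f1(gold, pred, match):
    tp = sum(1 for g, p in zip(gold, pred) if g != "O" and p != "O" and match(g, p))
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and not (g != "O" and match(g, p)))
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and not (p != "O" and match(g, p)))
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

gold = ["O", "B-PERSON", "I-PERSON", "O", "B-ORG"]
pred = ["O", "B-PERSON", "I-PERSON", "O", "B-DATE"]

print(micro_f1(gold, pred, lambda g, p: True))    # binary: 1.0 (every entity token was detected)
print(micro_f1(gold, pred, lambda g, p: g == p))  # typed: ~0.67 (one entity token got the wrong type)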

Limitations

  • Trained only on criminal court cases โ€” not guaranteed to generalize to civil or administrative rulings
  • Designed for Korean texts โ€” not applicable to other languages or domains
  • Not suitable for identifying sensitive content outside of structured NER targets

Ethical Considerations

  • The model is trained on already-anonymized court documents
  • Deployment in real-world settings should still include human oversight and legal compliance checks

License

This repository contains original work licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Portions of this repository (including tokenizer vocabulary and/or model weights)
are derived from Meta Llama 3.1 and are subject to the Meta Llama 3.1 Community License
(https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE).

Citation

If you use this model, please cite:

@misc{hahm2025thunderdeidaccurateefficientdeidentification,
      title={Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments}, 
      author={Sungen Hahm and Heejin Kim and Gyuseong Lee and Hyunji Park and Jaejin Lee},
      year={2025},
      eprint={2506.15266},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.15266}, 
}

Contact

If you have questions or issues, contact:
[email protected]
