Diachronic Language Models for 20th Century Korean

Model Card: chosunilbo-LMs

Model Description

This repository provides diachronic word embedding models trained on **Chosun Ilbo article text (1920-1999)** for studying how word meanings in 20th-century Korean changed over time. Each model is a snapshot of the language at a particular point in time, so researchers in history, sociology, linguistics, and related fields can use the models to trace and analyze quantitatively how the meaning of a given concept changed.

The repository contains two kinds of embedding models (Word2Vec and fastText), each built at **decade** and **yearly** granularity, so analyses can be run at whatever temporal resolution a study requires.

Model Details

| Model type | Time unit | Features and strengths |
|---|---|---|
| Word2Vec | Decade / yearly | Learns fine-grained semantic relations among the core vocabulary of a given period. |
| fastText | Decade / yearly | Decomposes words into smaller units (character n-grams), which makes it robust to out-of-vocabulary (OOV) words such as typos and rare vocabulary. Particularly useful for analyzing historical text. |
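
To make the n-gram decomposition concrete, here is a minimal sketch of how a word can be split into fastText-style character n-grams with `<` and `>` boundary markers. The helper name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters reflects the fastText library's defaults rather than settings confirmed for these models.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of `word` wrapped in < and > boundary markers.

    Illustrative only: the real library also hashes n-grams into buckets
    and keeps a separate vector for the whole word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# A two-syllable word with a narrower range, so the output stays short.
print(char_ngrams("경제", min_n=2, max_n=3))
# ['<경', '경제', '제>', '<경제', '경제>']
```

Because an unseen word still shares some of these n-grams with words seen during training, a vector can be composed for it from the n-gram vectors, which is what lets fastText handle OOV tokens.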

Training Data

  • Data source: Chosun Ilbo text archive (1920-1999)
  • Scope: about 2.77 million texts of type 'article'
  • Preprocessing:
    1. Hybrid text selection: the Hangul-converted text (body_korean) is used through 1953, and the original text (body_archaic) from 1954 onward.
    2. Morphological analysis: the full text is segmented into morphemes with konlpy.tag.Okt.
    3. Training data: only **nouns (Noun)** are extracted from the analyzed morphemes and used as the training data for each model.
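
The three preprocessing steps could be sketched roughly as follows. This is not the author's pipeline: the record layout (a dict with a `year` key next to the `body_korean` and `body_archaic` fields) and the helper names are assumptions made for illustration. The tagger is passed in as an argument so the logic does not depend on a particular installation; any object with a KoNLPy-style `nouns()` method works.

```python
CUTOFF_YEAR = 1953  # through 1953: body_korean; from 1954 onward: body_archaic


def select_body(record):
    """Pick the text field for one article (the record layout is an assumption)."""
    field = "body_korean" if record["year"] <= CUTOFF_YEAR else "body_archaic"
    return record.get(field) or ""


def extract_nouns(text, tagger):
    """Return the noun tokens of `text` using a KoNLPy-style tagger."""
    return tagger.nouns(text)


def build_corpus(records, tagger):
    """Turn article records into lists of nouns, skipping empty results."""
    corpus = []
    for record in records:
        nouns = extract_nouns(select_body(record), tagger)
        if nouns:
            corpus.append(nouns)
    return corpus
```

With KoNLPy and a Java runtime installed, the tagger would be created with `from konlpy.tag import Okt` and `Okt()`, and `build_corpus(records, Okt())` would return one list of nouns per article.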

Note: Copyright in the original text data belongs to The Chosun Ilbo. These models may be used only for non-commercial academic research.

Training Procedure

Each model was trained with the following parameters.

  • vector_size: 100
  • window: 5
  • min_count: 5
  • model / sg: skipgram
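
As a rough sketch, the parameters above map onto gensim and the `fasttext` package as shown below. The mapping of names (`vector_size` to `dim`, `window` to `ws`, `min_count` to `minCount`) and the two helper functions are assumptions for illustration; the card does not state which training calls were actually used, and all other settings are left at library defaults.

```python
# Parameters from the list above, using gensim's argument names.
W2V_PARAMS = {"vector_size": 100, "window": 5, "min_count": 5, "sg": 1}  # sg=1 selects skip-gram

# The same settings under the fasttext package's names (assumed mapping).
FASTTEXT_PARAMS = {"model": "skipgram", "dim": 100, "ws": 5, "minCount": 5}


def train_word2vec(sentences):
    """Train a gensim Word2Vec model on an iterable of token lists."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim
    return Word2Vec(sentences=sentences, **W2V_PARAMS)


def train_fasttext(corpus_path):
    """Train a fastText model on a text file with one tokenized document per line."""
    import fasttext  # imported lazily; requires the fasttext package
    return fasttext.train_unsupervised(input=corpus_path, **FASTTEXT_PARAMS)
```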

How to Use

Word2Vec usage example (gensim)

from huggingface_hub import hf_hub_download
from gensim.models import Word2Vec

# Example: load the 1975 Word2Vec model
model_path = hf_hub_download(
    repo_id="ddokbaro/chosunilbo-LMs",
    filename="word2vec/yearly/word2vec_1975.model"
)
model = Word2Vec.load(model_path)

# Find the words most similar to '경제' (economy) in 1975
print("--- Nearest neighbors of '경제' in 1975 ---")
print(model.wv.most_similar('경제', topn=5))

fastText usage example

from huggingface_hub import hf_hub_download
import fasttext

# Example: load the 1995 fastText model
model_path = hf_hub_download(
    repo_id="ddokbaro/chosunilbo-LMs",
    filename="fasttext/yearly/fasttext_1995.bin"
)
model = fasttext.load_model(model_path)

# Find the words most similar to '미래' (future) in 1995
print("\n--- Nearest neighbors of '미래' in 1995 ---")
print(model.get_nearest_neighbors('미래', k=5))
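
Because each period's model is trained separately, vectors from two different models do not share a coordinate space, so comparing them directly with cosine similarity is not meaningful without an alignment step. One simple alternative, sketched below as an illustration rather than as the method used in the related study, is to compare a word's nearest-neighbor sets across two periods. The helper names are hypothetical. The input format is gensim's `most_similar` output, a list of `(word, score)` pairs; fastText's `get_nearest_neighbors` returns `(score, word)` pairs, so those would need to be swapped first.

```python
def neighbor_words(neighbors):
    """Collect the words from a list of (word, score) pairs."""
    return {word for word, _score in neighbors}


def jaccard_overlap(neighbors_a, neighbors_b):
    """Jaccard similarity of two neighbor lists: 1.0 means identical sets, 0.0 disjoint."""
    a, b = neighbor_words(neighbors_a), neighbor_words(neighbors_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder neighbor lists (not real model output) to show the call shape.
period_1 = [("w1", 0.9), ("w2", 0.8), ("w3", 0.7)]
period_2 = [("w2", 0.9), ("w3", 0.8), ("w4", 0.7)]
print(jaccard_overlap(period_1, period_2))  # 0.5
```

In practice the two lists would come from calls such as `model.wv.most_similar('경제', topn=50)` on models from two different periods. A low overlap suggests that the word's usage context shifted between them.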

Related Research Platform

  • The full analysis code, final result data, and Colab-based educational tutorials for the Koselleck-style conceptual history study that uses these language models are available in the integrated analysis platform repository below.

  • Repository URL: https://huggingface.co/datasets/ddokbaro/chosunilbo-koselleck-analysis-platform (temporary address)

Citation

If you use these models in your research, please cite the following:

@misc{kimbaro_chosunilbo_lms_2025,
  author = {Kim, Baro},
  title = {20th Century Korean Diachronic Language Models from Chosun Ilbo Text},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/chosunilbo-LMs}},
}