---
license: cc-by-nc-sa-4.0
language:
- ko
tags:
- historical-korean
- word-embeddings
- diachronic
- word2vec
- fasttext
- digital-humanities
---
# Diachronic Language Models for 20th Century Korean
## Model Card: chosunilbo-LMs
### Model Description
This repository provides diachronic word embedding models trained on **Chosun Ilbo article text (1920-1999)** for studying how word meanings in 20th century Korean changed over time. Each model captures a linguistic snapshot of a specific period, so researchers in history, sociology, linguistics, and related fields can use them to trace and analyze semantic change in a given concept quantitatively.
Two kinds of embedding models (`Word2Vec`, `fastText`) are provided, each built at **decade** and **yearly** granularity, to support analysis at the temporal resolution a study requires.
### Model Details
| Model type | Time unit | Characteristics and strengths |
| :--- | :--- | :--- |
| **Word2Vec** | Decade / Year | Learns fine-grained semantic relations among the core vocabulary of a given period. |
| **fastText** | Decade / Year | Decomposes words into smaller units (character n-grams), which makes it robust to out-of-vocabulary (OOV) words such as misspellings and rare terms. Especially useful for analyzing historical text. |
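
The subword decomposition mentioned in the table can be illustrated with a simplified sketch. It wraps a word in `<` and `>` boundary markers and lists its character n-grams, which is the general idea behind fastText's subword vectors. The n-gram range used to train the models in this repository is not stated, so the `min_n` and `max_n` defaults below are assumptions (they mirror the fastText library's defaults), and the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """List the character n-grams of `word` with fastText-style boundary markers.

    Simplified illustration only; the real library also hashes n-grams into buckets.
    """
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# '경제학' (economics) shares the n-gram '<경제' with '경제' (economy),
# so an unseen or misspelled variant can still receive a related vector.
print(char_ngrams("경제학", min_n=3, max_n=4))
# ['<경제', '경제학', '제학>', '<경제학', '경제학>']
```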
### Training Data
* **Data source**: Chosun Ilbo text archive (1920-1999)
* **Scope**: about 2.77 million texts of type 'article'
* **Preprocessing**:
1. **Hybrid text selection**: the Hangul-converted version (`body_korean`) is used for text up to 1953, and the original text (`body_archaic`) from 1954 onward.
2. **Morphological analysis**: the full text is segmented into morphemes with `konlpy.tag.Okt`.
3. **Training input**: only **nouns** are extracted from the analyzed morphemes and used as training data for each model.
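
The three preprocessing steps above could be sketched roughly as follows. This is not the author's pipeline: the record layout (a dict with a `year` key) and the function names are hypothetical, and treating 1953 itself as part of the earlier period is an assumption about the cutoff. The noun extractor is passed in as a function so the sketch runs without KoNLPy; with KoNLPy installed, `Okt().nouns` is one extractor that fits.

```python
from typing import Callable, Iterable


def select_body(record: dict) -> str:
    """Pick the text field by year: Hangul-converted up to 1953, original from 1954."""
    # Assumption: each record has 'year', 'body_korean', and 'body_archaic' keys.
    field = "body_korean" if record["year"] <= 1953 else "body_archaic"
    return record.get(field) or ""


def to_noun_sentences(
    records: Iterable[dict],
    extract_nouns: Callable[[str], list[str]],
) -> list[list[str]]:
    """Turn article records into noun-only token lists, skipping empty results."""
    sentences = []
    for record in records:
        nouns = extract_nouns(select_body(record))
        if nouns:
            sentences.append(nouns)
    return sentences


# With KoNLPy available, the extractor could be supplied like this:
#   from konlpy.tag import Okt
#   sentences = to_noun_sentences(records, Okt().nouns)
```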

**Note**: Copyright of the original text data belongs to The Chosun Ilbo. These models may be used only for non-commercial academic research.
### Training Procedure
Each model was trained with the following parameters.
* **`vector_size`**: 100
* **`window`**: 5
* **`min_count`**: 5
* **`model` / `sg`**: skipgram
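
As a rough illustration, the parameters listed above map onto the two libraries' keyword arguments as shown below. The training scripts are not part of this README, so the function names and the choice of `fasttext.train_unsupervised` are assumptions; only the four listed values come from the source, and every other setting is left at the library default.

```python
# Values taken from the parameter list above.
VECTOR_SIZE, WINDOW, MIN_COUNT = 100, 5, 5

# gensim's Word2Vec selects skip-gram with sg=1.
W2V_KWARGS = {
    "vector_size": VECTOR_SIZE,
    "window": WINDOW,
    "min_count": MIN_COUNT,
    "sg": 1,
}

# The fasttext Python package uses different names for the same settings.
FASTTEXT_KWARGS = {
    "model": "skipgram",
    "dim": VECTOR_SIZE,
    "ws": WINDOW,
    "minCount": MIN_COUNT,
}


def train_word2vec(sentences):
    """Train on an iterable of token lists (hypothetical helper, requires gensim)."""
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **W2V_KWARGS)


def train_fasttext(corpus_path):
    """Train on a whitespace-tokenized text file (hypothetical helper, requires fasttext)."""
    import fasttext
    return fasttext.train_unsupervised(corpus_path, **FASTTEXT_KWARGS)
```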
### How to Use
#### Word2Vec example (gensim)
```python
from huggingface_hub import hf_hub_download
from gensim.models import Word2Vec
# Example: load the 1975 Word2Vec model
model_path = hf_hub_download(
    repo_id="ddokbaro/chosunilbo-LMs",
    filename="word2vec/yearly/word2vec_1975.model"
)
model = Word2Vec.load(model_path)
# Find the words most similar to '경제' (economy) in 1975
print("--- Nearest neighbors of '경제' in 1975 ---")
print(model.wv.most_similar('경제', topn=5))
```
#### fastText example
```python
from huggingface_hub import hf_hub_download
import fasttext
# Example: load the 1995 fastText model
model_path = hf_hub_download(
    repo_id="ddokbaro/chosunilbo-LMs",
    filename="fasttext/yearly/fasttext_1995.bin"
)
model = fasttext.load_model(model_path)
# Find the words most similar to '미래' (future) in 1995
print("\n--- Nearest neighbors of '미래' in 1995 ---")
print(model.get_nearest_neighbors('미래', k=5))
```
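
Because each period's model is trained independently, vectors from different years do not share a coordinate space and cannot be compared directly without an alignment step. One simple alternative is to compare a word's nearest-neighbor sets across periods. The sketch below computes the Jaccard overlap of two neighbor lists; the helper names are hypothetical, and this is only one possible measure, not the method used in the related study.

```python
def words_from_gensim(pairs):
    """gensim's most_similar returns (word, similarity) pairs."""
    return [word for word, _ in pairs]


def words_from_fasttext(pairs):
    """fastText's get_nearest_neighbors returns (similarity, word) pairs."""
    return [word for _, word in pairs]


def neighbor_overlap(neighbors_a, neighbors_b) -> float:
    """Jaccard overlap of two neighbor lists: 1.0 = identical sets, 0.0 = disjoint."""
    set_a, set_b = set(neighbors_a), set(neighbors_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Placeholder neighbor lists for illustration, not real model output.
early = words_from_gensim([("a", 0.9), ("b", 0.8), ("c", 0.7)])
late = words_from_gensim([("b", 0.9), ("c", 0.8), ("d", 0.7)])
print(neighbor_overlap(early, late))  # 0.5
```

A low overlap for the same query word between, say, a 1930s model and a 1990s model suggests that the word's typical contexts shifted, which is a starting point for closer reading rather than a conclusion in itself.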
## Related Research Platform
* The full analysis code, final result data, and Colab-based educational tutorials for the Koselleck-style conceptual history study that uses these language models are available in the integrated analysis platform repository below.
* Repository address: https://huggingface.co/datasets/ddokbaro/chosunilbo-koselleck-analysis-platform (temporary address)
## Citation
If you use these models in your research, please cite the following:
```bibtex
@misc{kimbaro_chosunilbo_lms_2025,
  author = {Kim, Baro},
  title = {20th Century Korean Diachronic Language Models from Chosun Ilbo Text},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/chosunilbo-LMs}},
}
```