---
license: cc-by-nc-sa-4.0
language:
- ko
tags:
- historical-korean
- word-embeddings
- diachronic
- word2vec
- fasttext
- digital-humanities
---

# Diachronic Language Models for 20th Century Korean

## Model Card: chosunilbo-LMs

### Model Description

This repository provides diachronic word embedding models trained on **Chosun Ilbo article text (1920-1999)** for studying how word meanings in 20th-century Korean changed over time. Each model captures a linguistic snapshot of a specific period, so researchers in history, sociology, linguistics, and related fields can use them to quantitatively trace and analyze changes in the meaning of particular concepts.

Two types of embedding model (`Word2Vec` and `fastText`) are each built at **decade** and **yearly** granularity, supporting analysis at different temporal resolutions depending on the research goal.

### Model Details

| Model type | Time unit | Characteristics and strengths |
| :--- | :--- | :--- |
| **Word2Vec** | Decade / Year | Learns fine-grained semantic relations among the core vocabulary of a given period. |
| **fastText** | Decade / Year | Decomposes words into smaller units (character n-grams), which makes it robust to out-of-vocabulary (OOV) words such as typos and rare vocabulary. Particularly useful for analyzing historical text. |

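To make the n-gram decomposition mentioned in the table concrete, the sketch below enumerates fastText-style character n-grams for a word. The helper `char_ngrams` is hypothetical and written only for illustration. The `minn=3` and `maxn=6` defaults mirror the fastText library defaults and are an assumption here, since this card does not state which values these models were trained with.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Enumerate fastText-style character n-grams of `word`.

    The word is wrapped in '<' and '>' boundary markers, then every
    substring of length minn..maxn is collected. Lengths are counted in
    Unicode code points, so one Hangul syllable block counts as one character.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Trigrams only, to keep the output short.
print(char_ngrams("where", minn=3, maxn=3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("경제", minn=3, maxn=3))
```

Because an unseen word still shares some of these n-grams with words seen in training, fastText can compose a vector for it, which is the OOV robustness the table refers to.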
### Training Data

* **Data source**: Chosun Ilbo text archive (1920-1999)
* **Scope**: approximately 2.77 million texts of type 'article'
* **Preprocessing**:
    1. **Hybrid text selection**: the Hangul-converted version (`body_korean`) is used through 1953, and the original text (`body_archaic`) from 1954 onward.
    2. **Morphological analysis**: all text is analyzed into morphemes with `konlpy.tag.Okt`.
    3. **Training data**: only **nouns (Noun)** are extracted from the analyzed morphemes and used as the training data for each model.

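The three preprocessing steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the author's actual pipeline: the record layout (a dict with an integer `year` key), the helper names `select_body`, `extract_nouns`, and `build_corpus`, and the choice of the tagger's `nouns()` method are all assumptions. Only the field names `body_korean` and `body_archaic` and the use of `konlpy.tag.Okt` come from this card. A stand-in tagger is used so the sketch runs without KoNLPy.

```python
def select_body(record):
    """Hybrid text selection: pick the text field by publication year."""
    # Assumption: each record is a dict with an integer 'year' key.
    field = "body_korean" if record["year"] <= 1953 else "body_archaic"
    return record.get(field) or ""


def extract_nouns(text, tagger):
    """Return the noun tokens of `text` using a KoNLPy-style tagger."""
    return tagger.nouns(text)


def build_corpus(records, tagger):
    """Build one noun list per article, skipping articles with no nouns."""
    corpus = []
    for record in records:
        nouns = extract_nouns(select_body(record), tagger)
        if nouns:
            corpus.append(nouns)
    return corpus


class WhitespaceTagger:
    """Stand-in tagger for demonstration only; it just splits on spaces."""

    def nouns(self, text):
        return text.split()


# With KoNLPy installed (it needs a Java runtime), swap in the real tagger:
#   from konlpy.tag import Okt
#   tagger = Okt()
tagger = WhitespaceTagger()

# Made-up records that only illustrate the assumed layout.
records = [
    {"year": 1935, "body_korean": "경제 정책", "body_archaic": "經濟 政策"},
    {"year": 1975, "body_korean": "", "body_archaic": "수출 산업"},
]
print(build_corpus(records, tagger))
```

Passing the tagger in as an argument keeps the selection and filtering logic testable without a Java runtime.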
**Notice**: Copyright in the original text data belongs to the Chosun Ilbo company. These models may be used only for non-commercial academic research.

### Training Procedure

Each model was trained with the following parameters.

* **`vector_size`**: 100
* **`window`**: 5
* **`min_count`**: 5
* **`model` / `sg`**: skipgram

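As a rough guide to how the parameters above map onto library calls, the sketch below expresses them as keyword arguments for gensim's `Word2Vec` and for the `fasttext` package's `train_unsupervised`. The mapping to fastText argument names (`dim`, `ws`, `minCount`) and the wrapper functions `train_word2vec` and `train_fasttext` are assumptions for illustration. The card does not say which other settings (epochs, n-gram lengths, and so on) were used, so library defaults apply to everything not listed.

```python
# Parameters from the list above, using gensim's Word2Vec argument names.
# sg=1 selects the skip-gram architecture.
W2V_PARAMS = {"vector_size": 100, "window": 5, "min_count": 5, "sg": 1}

# The same settings under the fasttext package's argument names (assumed mapping).
FASTTEXT_PARAMS = {"model": "skipgram", "dim": 100, "ws": 5, "minCount": 5}


def train_word2vec(sentences):
    """Train a Word2Vec model on an iterable of token lists."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    return Word2Vec(sentences=sentences, **W2V_PARAMS)


def train_fasttext(corpus_path):
    """Train a fastText model on a text file with one tokenized article per line."""
    import fasttext  # imported lazily; requires the fasttext package

    return fasttext.train_unsupervised(input=corpus_path, **FASTTEXT_PARAMS)
```

For yearly or decade models, a function like these would be called once per time slice, each on that slice's noun corpus only.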
### How to Use

#### Word2Vec Usage Example (gensim)

```python
from huggingface_hub import hf_hub_download
from gensim.models import Word2Vec

# Example: load the 1975 Word2Vec model
model_path = hf_hub_download(
    repo_id="ddokbaro/chosunilbo-LMs",
    filename="word2vec/yearly/word2vec_1975.model"
)
model = Word2Vec.load(model_path)

# Find the words most similar to '경제' (economy) in 1975
print("--- Nearest neighbors of '경제' in 1975 ---")
print(model.wv.most_similar('경제', topn=5))
```

#### fastText Usage Example

```python
from huggingface_hub import hf_hub_download
import fasttext

# Example: load the 1995 fastText model
model_path = hf_hub_download(
    repo_id="ddokbaro/chosunilbo-LMs",
    filename="fasttext/yearly/fasttext_1995.bin"
)
model = fasttext.load_model(model_path)

# Find the words most similar to '미래' (future) in 1995
print("\n--- Nearest neighbors of '미래' in 1995 ---")
print(model.get_nearest_neighbors('미래', k=5))
```

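Since the stated purpose of these models is to trace meaning change over time, it may help to see one simple way of comparing a word across two periods. Vectors from separately trained models are not directly comparable, so the sketch below compares nearest-neighbor sets instead, using Jaccard overlap. This is one common approach and not necessarily the method used in the associated study. The function `neighbor_jaccard` and the toy neighbor lists are hypothetical.

```python
def neighbor_jaccard(neighbors_a, neighbors_b):
    """Jaccard overlap between two nearest-neighbor lists.

    Each argument is a list of (word, score) pairs, the shape returned by
    gensim's `most_similar`. 1.0 means identical neighbor sets and 0.0 means
    no shared neighbors; lower overlap suggests the word's usage has shifted.
    """
    words_a = {word for word, _ in neighbors_a}
    words_b = {word for word, _ in neighbors_b}
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# Toy neighbor lists standing in for two yearly models (made-up values).
early = [("산업", 0.81), ("무역", 0.78), ("재정", 0.74)]
late = [("산업", 0.83), ("성장", 0.80), ("수출", 0.77)]
print(neighbor_jaccard(early, late))  # 1 shared word out of 5 distinct -> 0.2

# With two loaded gensim models, the inputs would come from calls such as:
#   early = model_a.wv.most_similar("경제", topn=20)
#   late = model_b.wv.most_similar("경제", topn=20)
```

Note that fastText's `get_nearest_neighbors` returns `(score, word)` pairs, so the tuple order needs to be swapped before passing its output to this function.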
## Related Research Platform

* The full analysis code for the Koselleck-style conceptual history study built on these language models, the final result data, and Colab-based educational tutorials are available in the integrated analysis platform repository below.
* Repository: https://huggingface.co/datasets/ddokbaro/chosunilbo-koselleck-analysis-platform (planned address)

## Citation

If you use these models in your research, please cite the following:

```bibtex
@misc{kimbaro_chosunilbo_lms_2025,
  author = {Kim, Baro},
  title = {20th Century Korean Diachronic Language Models from Chosun Ilbo Text},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/chosunilbo-LMs}},
}
```