---
license: cc-by-nc-sa-4.0
language:
- ko
tags:
- historical-korean
- word-embeddings
- diachronic
- word2vec
- fasttext
- digital-humanities
---

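# Diachronic Language Models for 20th Century Korean

## Model Card: chosunilbo-LMs

### Model Description

This repository provides diachronic word embedding models trained on **Chosun Ilbo article text (1920-1999)** for studying how word meanings in 20th century Korean changed over time. Each model captures a linguistic snapshot of one period, so researchers in history, sociology, linguistics, and related fields can use the models to trace and analyze shifts in the meaning of a concept quantitatively.

Two kinds of embedding model (`Word2Vec` and `fastText`) are each built at two time resolutions, **per decade** and **per year (yearly)**, so that the resolution of an analysis can be matched to the research question.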

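### Model Details

| Model type | Time unit | Characteristics and strengths |
| :--- | :--- | :--- |
| **Word2Vec** | Decade / year | Learns fine-grained semantic relations among the core vocabulary of a given period. |
| **fastText** | Decade / year | Decomposes words into smaller units (character n-grams), which makes it robust to typos, rare words, and other out-of-vocabulary (OOV) items. This is particularly useful for historical text. |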
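
To make the n-gram decomposition in the table concrete, the sketch below enumerates character n-grams roughly the way fastText does: the word is wrapped in `<` and `>` boundary markers and every substring of length `minn` to `maxn` is collected. The function name `char_ngrams` is hypothetical, and the values 3 and 6 are the fastText library defaults, assumed here because this card does not state the subword settings used in training.

```python
from typing import List


def char_ngrams(word: str, minn: int = 3, maxn: int = 6) -> List[str]:
    """Simplified illustration of fastText-style character n-grams.

    minn/maxn are the fastText defaults, assumed here; the settings used
    to train the models in this repository are not documented.
    """
    wrapped = f"<{word}>"  # boundary markers, as fastText adds them
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# A two-syllable word yields only three n-grams under these settings.
print(char_ngrams("경제"))
```

Because a vector can be assembled from these pieces, a word that never appeared in training can still receive an embedding, which is the OOV robustness the table refers to.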

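### Training Data

* **Data source**: Chosun Ilbo text archive (1920-1999)
* **Scope**: roughly 2.77 million texts of type 'article'
* **Preprocessing**:
    1.  **Hybrid text selection**: for 1953 and earlier, the Hangul-converted version (`body_korean`) is used; from 1954 onward, the original text (`body_archaic`) is used.
    2.  **Morphological analysis**: the full text is segmented into morphemes with `konlpy.tag.Okt`.
    3.  **Training input**: only morphemes tagged as **nouns (Noun)** are extracted and used as the training data for each model.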
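
The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the author's pipeline: the record layout (a dict with a `year` key) and the function names `select_body` and `keep_nouns` are hypothetical, and the tagged pairs are written by hand rather than produced by the tagger. In a real run, the `(token, tag)` pairs would come from `Okt().pos(text)` in `konlpy.tag`.

```python
from typing import Dict, Iterable, List, Tuple

# Years up to and including 1953 use body_korean; 1954 onward uses body_archaic.
LAST_YEAR_WITH_KOREAN_BODY = 1953


def select_body(record: Dict) -> str:
    """Pick the text field for one article record (hypothetical dict layout)."""
    if record["year"] <= LAST_YEAR_WITH_KOREAN_BODY:
        return record["body_korean"]
    return record["body_archaic"]


def keep_nouns(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep tokens tagged 'Noun' from (token, tag) pairs,
    the shape returned by konlpy's Okt().pos(text)."""
    return [token for token, tag in tagged if tag == "Noun"]


# Made-up record and hand-written tags, for illustration only.
record = {"year": 1948, "body_korean": "경제가 성장하다", "body_archaic": "經濟가 成長하다"}
text = select_body(record)
tagged = [("경제", "Noun"), ("가", "Josa"), ("성장하다", "Verb")]
nouns = keep_nouns(tagged)
print(text, nouns)
```

Each article then contributes one list of nouns, which is the sentence format that embedding trainers such as gensim expect.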

**Note**: Copyright in the original text data belongs to Chosun Ilbo. These models may be used only for non-commercial academic research.

### Training Procedure

Each model was trained with the following parameters.
* **`vector_size`**: 100
* **`window`**: 5
* **`min_count`**: 5
* **`model` / `sg`**: skipgram
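
As a rough guide to how these settings map onto library calls, the sketch below collects them as keyword arguments. The gensim names are the ones listed above, with `sg=1` selecting skip-gram. The fastText names (`dim`, `ws`, `minCount`) are the official fastText Python equivalents and are an assumption, since this card does not say which fastText implementation was used for training. Anything not listed here (epochs, seeds, subword lengths) is left at library defaults, which may differ from the original runs. The wrapper functions `train_word2vec` and `train_fasttext` are hypothetical.

```python
# Parameters from this card, under their gensim names (sg=1 means skip-gram).
W2V_KWARGS = {"vector_size": 100, "window": 5, "min_count": 5, "sg": 1}

# Assumed equivalents for the official fastText Python package.
FASTTEXT_KWARGS = {"model": "skipgram", "dim": 100, "ws": 5, "minCount": 5}


def train_word2vec(sentences):
    """Train on an iterable of token lists. Requires gensim 4.x."""
    from gensim.models import Word2Vec  # imported lazily so the constants work without gensim
    return Word2Vec(sentences=sentences, **W2V_KWARGS)


def train_fasttext(corpus_path):
    """Train on a text file with one space-separated document per line."""
    import fasttext  # imported lazily for the same reason
    return fasttext.train_unsupervised(corpus_path, **FASTTEXT_KWARGS)


print(W2V_KWARGS, FASTTEXT_KWARGS)
```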

### How to Use

#### Word2Vec example (gensim)

```python
from huggingface_hub import hf_hub_download
from gensim.models import Word2Vec

# Example: load the 1975 Word2Vec model
model_path = hf_hub_download(
    repo_id="ddokbaro/chosunilbo-LMs",
    filename="word2vec/yearly/word2vec_1975.model"
)
model = Word2Vec.load(model_path)

# Find the words most similar to '경제' (economy) in 1975
print("--- Words most similar to '경제' in 1975 ---")
print(model.wv.most_similar('경제', topn=5))
```

#### fastText example

```python
from huggingface_hub import hf_hub_download
import fasttext

# Example: load the 1995 fastText model
model_path = hf_hub_download(
    repo_id="ddokbaro/chosunilbo-LMs",
    filename="fasttext/yearly/fasttext_1995.bin"
)
model = fasttext.load_model(model_path)

# Find the words most similar to '미래' (future) in 1995
print("\n--- Words most similar to '미래' in 1995 ---")
print(model.get_nearest_neighbors('미래', k=5))
```
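
One simple way to turn neighbor lists from two periods into a change measure is the Jaccard overlap of their word sets. The sketch below is an illustration under that assumption, not the analysis method used by the author. It also handles a difference between the two libraries: gensim's `most_similar` returns `(word, score)` pairs, while fastText's `get_nearest_neighbors` returns `(score, word)` pairs. The helper names and the sample lists are made up.

```python
def neighbor_words(neighbors):
    """Extract the words from a neighbor list in either pair order:
    (word, score) as in gensim, or (score, word) as in fastText."""
    words = set()
    for first, second in neighbors:
        words.add(first if isinstance(first, str) else second)
    return words


def jaccard_overlap(neighbors_a, neighbors_b):
    """Share of neighbor words common to both lists (0.0 to 1.0)."""
    a, b = neighbor_words(neighbors_a), neighbor_words(neighbors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up neighbor lists, for illustration only.
gensim_style = [("시장", 0.81), ("산업", 0.74)]
fasttext_style = [(0.88, "시장"), (0.69, "무역")]
print(jaccard_overlap(gensim_style, fasttext_style))
```

A low overlap between two years suggests that a word's nearest neighbors have shifted, although neighbor lists from separately trained models are noisy and are best compared over several values of `topn` or `k`.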

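## Related Research Platform

* The full analysis code, final result data, and Colab-based teaching tutorials for the Koselleck-style conceptual history research that uses these language models are available in the integrated analysis platform repository below.

* Repository: https://huggingface.co/datasets/ddokbaro/chosunilbo-koselleck-analysis-platform (temporary address)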

## Citation
If you use these models in your research, please cite the following:

```bibtex
@misc{kimbaro_chosunilbo_lms_2025,
  author = {Kim, Baro},
  title = {20th Century Korean Diachronic Language Models from Chosun Ilbo Text},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/ddokbaro/chosunilbo-LMs}},
}
```