๊ธˆ์œต ํ…์ŠคํŠธ ๊ฐ„์†Œํ™” ๋ชจ๋ธ (Financial Text Simplifier)

๋ชจ๋ธ ์„ค๋ช…

fin_simplifier is an encoder-decoder model that converts complex financial terminology and sentences into plain Korean that general readers can easily understand.

๋ชจ๋ธ ๊ตฌ์กฐ (config.json ๊ธฐ๋ฐ˜)

  • ๋ชจ๋ธ ํƒ€์ž…: EncoderDecoderModel
  • ์ธ์ฝ”๋”: snunlp/KR-FinBert-SC (์€๋‹‰ ์ฐจ์›: 768)
  • ๋””์ฝ”๋”: skt/kogpt2-base-v2 (์–ดํœ˜ ํฌ๊ธฐ: 51,201)
  • ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜: ์•ฝ 255M
  • ํŒŒ์ผ ํฌ๊ธฐ: 1.02GB (safetensors ํ˜•์‹)

Key Features

  • Converts specialized financial terminology into plain everyday language
  • Optimized for Korean financial documents
  • Simplifies complex financial concepts (PER, ROE, derivatives, etc.)
  • Suitable for bank consultations and financial education

Intended Use

Primary Use Cases

  1. ๊ธˆ์œต ์ƒ๋‹ด ์ง€์›: ์€ํ–‰ ์ƒ๋‹ด ์‹œ ๊ณ ๊ฐ ์ดํ•ด๋„ ํ–ฅ์ƒ
  2. ๊ธˆ์œต ๊ต์œก: ๋ณต์žกํ•œ ๊ธˆ์œต ๊ฐœ๋…์„ ์‰ฝ๊ฒŒ ์„ค๋ช…
  3. ๋ฌธ์„œ ๊ฐ„์†Œํ™”: ์•ฝ๊ด€, ์ƒํ’ˆ ์„ค๋ช…์„œ ๋“ฑ์„ ์ดํ•ดํ•˜๊ธฐ ์‰ฝ๊ฒŒ ๋ณ€ํ™˜
  4. ์ ‘๊ทผ์„ฑ ๊ฐœ์„ : ๊ธˆ์œต ์†Œ์™ธ๊ณ„์ธต์˜ ๊ธˆ์œต ์„œ๋น„์Šค ์ ‘๊ทผ์„ฑ ํ–ฅ์ƒ

Usage Restrictions

The model should not be used for:

  • Drafting legally binding documents
  • Substituting for investment advice or professional financial consultation
  • Tasks requiring precise figures or calculations

Usage

Installation and Loading
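
The only dependencies are the transformers and torch packages (a minimal install; the original card does not pin versions):

pip install transformers torch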

from transformers import EncoderDecoderModel, AutoTokenizer
import torch

# Model loading
model = EncoderDecoderModel.from_pretrained("combe4259/fin_simplifier")
encoder_tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert-SC")
decoder_tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")

# Set special tokens
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token
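
Depending on how the uploaded config was saved, generation with an EncoderDecoderModel may also require a decoder start token and a pad token. The following is a defensive sketch (an assumption; the published config may already define both):

# Assumption: fall back to the decoder tokenizer's BOS/pad tokens if the
# model config does not already define them.
if model.config.decoder_start_token_id is None:
    model.config.decoder_start_token_id = decoder_tokenizer.bos_token_id
if model.config.pad_token_id is None:
    model.config.pad_token_id = decoder_tokenizer.pad_token_id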

Inference Example

def simplify_text(text, model, encoder_tokenizer, decoder_tokenizer):
    # Tokenize input
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True
    )
    
    # Generate simplified text
    with torch.no_grad():
        generated = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=128,
            num_beams=6,
            repetition_penalty=1.2,
            length_penalty=0.8,
            early_stopping=True,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )
    
    # Decode output
    simplified = decoder_tokenizer.decode(generated[0], skip_special_tokens=True)
    return simplified

# Example usage
complex_text = "์ฃผ๊ฐ€์ˆ˜์ต๋น„์œจ(PER)์€ ์ฃผ๊ฐ€๋ฅผ ์ฃผ๋‹น์ˆœ์ด์ต์œผ๋กœ ๋‚˜๋ˆˆ ์ง€ํ‘œ์ž…๋‹ˆ๋‹ค."
simple_text = simplify_text(complex_text, model, encoder_tokenizer, decoder_tokenizer)
print(f"์›๋ฌธ: {complex_text}")
print(f"๊ฐ„์†Œํ™”: {simple_text}")
# ์ถœ๋ ฅ ์˜ˆ์‹œ: ๋ชจ๋ธ์ด ์ƒ์„ฑํ•˜๋Š” ๊ฐ„์†Œํ™”๋œ ํ…์ŠคํŠธ
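
The example above runs on CPU. To speed up inference, the model can optionally be moved to a GPU; a minimal sketch continuing from the loading code (assumes a CUDA device is available):

# Optional: GPU inference.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Inside simplify_text, the tokenized inputs would then need to move as well:
# inputs = {k: v.to(device) for k, v in inputs.items()}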

ํ•™์Šต ์ƒ์„ธ ์ •๋ณด

ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹

๋ฐ์ดํ„ฐ์…‹ ์ž์ฒด ์ œ์ž‘ ๋ฐ์ดํ„ฐ์…‹ -์ถœ์ฒ˜: NH๋†ํ˜‘์€ํ–‰ -NH๋†ํ˜‘์€ํ–‰ ์ƒํ’ˆ์„ค๋ช…์„œ๋ฅผ gemma ๋ชจ๋ธ์— ํˆฌ์ž…ํ•˜์—ฌ ๋ณ€ํ™˜ํ•˜์—ฌ ์ƒ์„ฑ

Training Configuration (from trainer_state.json)

  • ์—ํฌํฌ: 10
  • ๋ฐฐ์น˜ ํฌ๊ธฐ: 4 (gradient accumulation steps: 2)
  • ์ตœ๋Œ€ ํ•™์Šต๋ฅ : 2.99e-05
  • ์ตœ์ข… ํ•™์Šต๋ฅ : 8.82e-09
  • ์˜ตํ‹ฐ๋งˆ์ด์ €: AdamW (warmup steps: 200)
  • ๋ ˆ์ด๋ธ” ์Šค๋ฌด๋”ฉ: 0.1
  • ๋“œ๋กญ์•„์›ƒ: 0.2 (์ธ์ฝ”๋” ๋ฐ ๋””์ฝ”๋”)
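
For reference, here are the reported settings restated as Hugging Face Seq2SeqTrainingArguments. This is a sketch only: the training script was not released, so output_dir and the rounded learning rate are assumptions.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./fin_simplifier",   # assumed; not specified in the card
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=3e-5,              # reported peak learning rate: 2.99e-05
    warmup_steps=200,
    label_smoothing_factor=0.1,
)
# Dropout (0.2) is a model-config setting rather than a training argument;
# it would be applied to the encoder and decoder configs before training.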

์ƒ์„ฑ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ

  • Beam search: 6 beams (bundled into a GenerationConfig in the sketch after this list)
  • Repetition Penalty: 1.2
  • Length Penalty: 0.8
  • Temperature: 0.7
  • Top-k: 50
  • Top-p: 0.95
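
These match the arguments passed to model.generate in the inference example above. They can also be bundled into a reusable GenerationConfig via the standard transformers API (a sketch):

from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_length=128,
    num_beams=6,
    repetition_penalty=1.2,
    length_penalty=0.8,
    early_stopping=True,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
)
# Then: generated = model.generate(**inputs, generation_config=generation_config)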

Evaluation Results

Training Performance (from trainer_state.json)

  • ์ดˆ๊ธฐ ์†์‹ค: 13.53
  • ์ตœ์ข… ์†์‹ค: 3.76
  • ์†์‹ค ๊ฐ์†Œ์œจ: 72.2%
  • ์ด ํ•™์Šต ์Šคํ…: 3,600
  • ์ˆ˜๋ ด ํŒจํ„ด: ์—ํฌํฌ 8๋ถ€ํ„ฐ ์•ˆ์ •์  ์ˆ˜๋ ด

์—ํฌํฌ๋ณ„ ํ‰๊ท  ์†์‹ค

์—ํฌํฌ ํ‰๊ท  ์†์‹ค
1 8.98
2 6.93
3 5.95
4 5.28
5 4.81
6 4.44
7 4.17
8 3.97
9 3.82
10 3.73

Example Outputs

The following pairs are verbatim Korean model inputs and outputs.

Original:   ์‹œ๊ฐ€์ด์•ก์€ ๋ฐœํ–‰์ฃผ์‹์ˆ˜์— ์ฃผ๊ฐ€๋ฅผ ๊ณฑํ•œ ๊ฐ’์œผ๋กœ ๊ธฐ์—…์˜ ์‹œ์žฅ๊ฐ€์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
Simplified: ์‹œ๊ฐ€์ด์•ก์€ ํšŒ์‚ฌ์˜ ๋ชจ๋“  ์ฃผ์‹์„ ํ•ฉ์นœ ๊ฐ€๊ฒฉ์ž…๋‹ˆ๋‹ค.

Original:   ํŒŒ์ƒ๊ฒฐํ•ฉ์ฆ๊ถŒ์€ ๊ธฐ์ดˆ์ž์‚ฐ์˜ ๊ฐ€๊ฒฉ๋ณ€๋™์— ์—ฐ๊ณ„ํ•˜์—ฌ ์ˆ˜์ต์ด ๊ฒฐ์ •๋˜๋Š” ์ฆ๊ถŒ์ž…๋‹ˆ๋‹ค.
Simplified: ํŒŒ์ƒ๊ฒฐํ•ฉ์ฆ๊ถŒ์€ ๋‹ค๋ฅธ ์ƒํ’ˆ ๊ฐ€๊ฒฉ์— ๋”ฐ๋ผ ์ˆ˜์ต์ด ๋ฐ”๋€Œ๋Š” ํˆฌ์ž ์ƒํ’ˆ์ž…๋‹ˆ๋‹ค.

Original:   ํ™˜๋งค์กฐ๊ฑด๋ถ€์ฑ„๊ถŒ(RP)์€ ์ผ์ •๊ธฐ๊ฐ„ ํ›„ ๋‹ค์‹œ ๋งค์ž…ํ•˜๋Š” ์กฐ๊ฑด์œผ๋กœ ๋งค๋„ํ•˜๋Š” ์ฑ„๊ถŒ์ž…๋‹ˆ๋‹ค.
Simplified: RP๋Š” ๋‚˜์ค‘์— ๋‹ค์‹œ ์‚ฌ๊ฒ ๋‹ค๊ณ  ์•ฝ์†ํ•˜๊ณ  ์ผ๋‹จ ํŒŒ๋Š” ์ฑ„๊ถŒ์ž…๋‹ˆ๋‹ค.

Original:   ์œ ๋™์„ฑ์œ„ํ—˜์€ ์ž์‚ฐ์„ ์ ์ •๊ฐ€๊ฒฉ์— ํ˜„๊ธˆํ™”ํ•˜์ง€ ๋ชปํ•  ์œ„ํ—˜์ž…๋‹ˆ๋‹ค.
Simplified: ์œ ๋™์„ฑ์œ„ํ—˜์€ ๊ธ‰ํ•˜๊ฒŒ ํŒ” ๋•Œ ์ œ๊ฐ’์„ ๋ชป ๋ฐ›์„ ์œ„ํ—˜์ž…๋‹ˆ๋‹ค.

Original:   ์›๋ฆฌ๊ธˆ๊ท ๋“ฑ์ƒํ™˜์€ ๋งค์›” ๋™์ผํ•œ ๊ธˆ์•ก์œผ๋กœ ์›๊ธˆ๊ณผ ์ด์ž๋ฅผ ์ƒํ™˜ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.
Simplified: ์›๋ฆฌ๊ธˆ๊ท ๋“ฑ์ƒํ™˜์€ ๋งค๋‹ฌ ๊ฐ™์€ ๊ธˆ์•ก์„ ๊ฐš๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

Citation

@misc{fin_simplifier2024,
  title={Financial Text Simplifier: Korean Financial Terms Simplification Model},
  author={combe4259},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/combe4259/fin_simplifier}
}

๊ฐ์‚ฌ์˜ ๋ง

  • KR-FinBert-SC: provides the finance-domain-specialized encoder
  • SKT KoGPT2: provides the Korean language generation model

์—ฐ๋ฝ์ฒ˜

  • HuggingFace: combe4259
  • Model Card: for questions, please use the Discussions tab on HuggingFace
