polyglot-ko-1b-txt2sql

polyglot-ko-1b-txt2sql์€ ํ•œ๊ตญ์–ด ์ž์—ฐ์–ด ์งˆ๋ฌธ์„ SQL ์ฟผ๋ฆฌ๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ ์œ„ํ•ด ํŒŒ์ธํŠœ๋‹๋œ ํ…์ŠคํŠธ ์ƒ์„ฑ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ๋ชจ๋ธ์€ EleutherAI/polyglot-ko-1.3b๋ฅผ ์‚ฌ์šฉํ–ˆ์œผ๋ฉฐ, LoRA๋ฅผ ํ†ตํ•ด ๊ฒฝ๋Ÿ‰ ํŒŒ์ธํŠœ๋‹๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

ํŒŒ์ธํŠœ๋‹์„ ์ฒ˜์Œ ํ•ด๋ณธ ๊ธ€์“ด์ด๊ฐ€ ์‹ค์Šต์šฉ์œผ๋กœ ๋งŒ๋“  ์ฒซ ๋ชจ๋ธ๋กœ ์„ฑ๋Šฅ์„ ๋ณด์žฅํ•  ์ˆœ ์—†์œผ๋‹ˆ ์ฐธ๊ณ ๋ฐ”๋ž๋‹ˆ๋‹ค.


๋ชจ๋ธ ์ •๋ณด

  • Base model: EleutherAI/polyglot-ko-1.3b
  • Fine-tuning: QLoRA (4-bit quantization + PEFT)
  • Task: Text2SQL (natural language → SQL)
  • Tokenizer: same tokenizer as the base model
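The QLoRA setup listed above can be sketched with bitsandbytes 4-bit loading plus a PEFT LoRA adapter. This is only an illustrative sketch, not the author's exact training configuration: the LoRA rank, alpha, dropout, and target modules below are assumptions (polyglot-ko uses a GPT-NeoX architecture, whose fused attention projection is named `query_key_value`).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (QLoRA-style quantized loading)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/polyglot-ko-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach a small trainable LoRA adapter (hyperparameters are illustrative)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # GPT-NeoX-style attention in polyglot-ko
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

With this setup only the adapter parameters are updated during training, which is what keeps the fine-tuning lightweight.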

ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹

๋ชจ๋ธ์€ ํ•œ๊ตญ์–ด SQL ๋ณ€ํ™˜ ํƒœ์Šคํฌ๋ฅผ ์œ„ํ•ด ์„ค๊ณ„๋œ ์ž์—ฐ์–ด ์งˆ๋ฌธ-์ฟผ๋ฆฌ ํŽ˜์–ด๋กœ ํŒŒ์ธํŠœ๋‹๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

  • A subset of the shangrilar/ko_text2sql dataset

  • ์ „์ฒ˜๋ฆฌ: DDL-Question-SQL ๊ตฌ์กฐ๋กœ prompt ๊ตฌ์„ฑ

  • ํฌ๊ธฐ: ์•ฝ 25,000๊ฑด์˜ DDL + ์ž์—ฐ์–ด ์งˆ๋ฌธ + SQL ์ •๋‹ต ์Œ


ํ‰๊ฐ€ ๊ฒฐ๊ณผ

  • ํ‰๊ฐ€ ๋ฐฉ์‹: GPT-4.1-nano ๋ชจ๋ธ์—๊ฒŒ gen_sql๊ณผ gt_sql ๋น„๊ต ํ›„ ํ‰๊ฐ€ ์š”์ฒญ
  • ํ‰๊ฐ€ ๊ธฐ์ค€: ๊ฒฐ๊ณผ ๋™์ผ ์—ฌ๋ถ€ ๊ธฐ๋ฐ˜ yes/no ํŒ๋‹จ (JSON response: {"resolve_yn": "yes"})
  • ํ‰๊ฐ€ ๊ฒฐ๊ณผ:
    • ๋ฒ ์ด์Šค ๋ชจ๋ธ ์ •ํ™•๋„: 68%
    • ํŒŒ์ธํŠœ๋‹ ๋ชจ๋ธ ์ •ํ™•๋„: 19%

๋ฌธ์ œ์ 

  • ๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ์€ gen_sql์— SQL ์ฟผ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜์ง€ ๋ชปํ•˜๊ณ , ์งˆ๋ฌธ์„ ๋ฐ˜๋ณตํ•˜๊ฑฐ๋‚˜ ์˜๋ฏธ ์—†๋Š” ํ…์ŠคํŠธ๋ฅผ ์ถœ๋ ฅํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•˜๋‹ค.

  • ํŒŒ์ธํŠœ๋‹ ๋ชจ๋ธ์€ SQL ํ˜•ํƒœ๋ฅผ ํ‰๋‚ด๋‚ด๊ธด ํ–ˆ์ง€๋งŒ, ์กด์žฌํ•˜์ง€ ์•Š๋Š” ์ปฌ๋Ÿผ๋ช…์ด๋‚˜ ํ…Œ์ด๋ธ”๋ช…์„ ํฌํ•จํ•˜๋Š” ๋“ฑ ๋…ผ๋ฆฌ์ ์œผ๋กœ ํ‹€๋ฆฐ ์ฟผ๋ฆฌ๋ฅผ ์ƒ์„ฑํ–ˆ๋‹ค.

  • ํ‰๊ฐ€ ๋ชจ๋ธ(GPT-4.1-nano)์€ ๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ์ด ์ž˜๋ชป ์ƒ์„ฑํ•œ ์ฟผ๋ฆฌ์— ๋Œ€ํ•ด "resolve_yn": "yes"๋ผ๊ณ  ์ž˜๋ชป ํŒ๋‹จํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•˜๋‹ค.

  • ์˜ˆ๋ฅผ ๋“ค์–ด, gen_sql์ด SQL ํ˜•์‹์„ ์ „ํ˜€ ๋”ฐ๋ฅด์ง€ ์•Š๋”๋ผ๋„ resolve_yn = yes๋กœ ์ž˜๋ชป ํ‰๊ฐ€๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์—ˆ๋‹ค.

  • ์ปฌ๋Ÿผ๋ช… ๋ฐ ํ…Œ์ด๋ธ”๋ช…์ด ์กด์žฌํ•˜์ง€ ์•Š๊ฑฐ๋‚˜ ์ž˜๋ชป๋œ ์ฟผ๋ฆฌ์ž„์—๋„ resolve_yn = yes๋กœ ์ž˜๋ชป ๋ถ„๋ฅ˜๋œ ๊ฒฝ์šฐ๊ฐ€ ์กด์žฌํ–ˆ๋‹ค.

  • ํ‰๊ฐ€์ž(GPT ๋ชจ๋ธ)๋Š” ๋ฌธ๋ฒ•์  ํƒ€๋‹น์„ฑ์ด๋‚˜ ํ…Œ์ด๋ธ” ๊ตฌ์กฐ ๋ฐ˜์˜ ์—ฌ๋ถ€๋ฅผ ์ œ๋Œ€๋กœ ํŒ๋‹จํ•˜์ง€ ๋ชปํ•˜๊ณ , ๋‹จ์ˆœ ํ…์ŠคํŠธ ์œ ์‚ฌ์„ฑ์— ๊ธฐ๋ฐ˜ํ•ด ํŒ๋ณ„ํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์˜€๋‹ค.


์‚ฌ์šฉ ์˜ˆ์‹œ

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Load the fine-tuned model and its tokenizer (repository id from the model page)
model = AutoModelForCausalLM.from_pretrained("castellina/polyglot-ko-txt2sql", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("castellina/polyglot-ko-txt2sql")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = """
๋‹น์‹ ์€ SQL ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.

### DDL:
CREATE TABLE players (
  player_id INT PRIMARY KEY AUTO_INCREMENT,
  username VARCHAR(255) UNIQUE NOT NULL,
  email VARCHAR(255) UNIQUE NOT NULL,
  password_hash VARCHAR(255) NOT NULL,
  date_joined DATETIME NOT NULL,
  last_login DATETIME
);

### Question:
์‚ฌ์šฉ์ž ์ด๋ฆ„์— 'admin'์ด ํฌํ•จ๋œ ๊ณ„์ • ์ˆ˜๋Š”?

### SQL:
"""

# Greedy decoding; the completion after "### SQL:" is the generated query
outputs = generator(prompt, do_sample=False, max_new_tokens=128)
print(outputs[0]["generated_text"])
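Since the pipeline returns the prompt plus the continuation, a small post-processing step is needed to isolate the generated SQL. A minimal sketch (the `extract_sql` helper and its truncation rules are my own, assuming the "### SQL:" marker format shown above):

```python
def extract_sql(generated: str) -> str:
    """Keep only the text after the final "### SQL:" marker,
    cut at the first blank line or at a repeated "###" section header."""
    tail = generated.rsplit("### SQL:", 1)[-1]
    sql_lines = []
    for line in tail.strip().splitlines():
        if line.startswith("###") or not line.strip():
            break  # model started a new section or finished the query
        sql_lines.append(line)
    return "\n".join(sql_lines).strip()
```

This guards against the model rambling on after the query, e.g. by hallucinating another DDL section.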