Llama-3.1-8B-Spider-SQL-Ko

ν•œκ΅­μ–΄ μ§ˆλ¬Έμ„ SQL 쿼리둜 λ³€ν™˜ν•˜λŠ” Text-to-SQL λͺ¨λΈμž…λ‹ˆλ‹€. Spider train 데이터셋을 ν•œκ΅­μ–΄λ‘œ λ²ˆμ—­ν•œ spider-ko 데이터셋을 ν™œμš©ν•˜μ—¬ λ―Έμ„Έμ‘°μ •ν•˜μ˜€μŠ΅λ‹ˆλ‹€.

πŸ“Š μ£Όμš” μ„±λŠ₯

Spider ν•œκ΅­μ–΄ 검증 데이터셋(1,034개) 평가 κ²°κ³Ό:

  • μ •ν™• 일치율: 42.65% (441/1034)
  • μ‹€ν–‰ 정확도: 65.47% (677/1034)

πŸ’‘ μ‹€ν–‰ 정확도가 μ •ν™• μΌμΉ˜μœ¨λ³΄λ‹€ 높은 μ΄μœ λŠ”, SQL 문법이 λ‹€λ₯΄λ”라도 λ™μΌν•œ κ²°κ³Όλ₯Ό λ°˜ν™˜ν•˜λŠ” κ²½μš°κ°€ 많기 λ•Œλ¬Έμž…λ‹ˆλ‹€.

πŸš€ λ°”λ‘œ μ‹œμž‘ν•˜κΈ°

from unsloth import FastLanguageModel

# Load the model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable inference mode

# Korean question → SQL (prompt strings stay in Korean, matching the training format)
question = "가수는 몇 명이 있나요?"  # "How many singers are there?"
schema = """테이블: singer
컬럼: singer_id, name, country, age"""

prompt = f"""데이터베이스 스키마:
{schema}

질문: {question}
SQL:"""

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Generated query: SELECT count(*) FROM singer

πŸ“ λͺ¨λΈ μ†Œκ°œ

  • 기반 λͺ¨λΈ: Llama 3.1 8B Instruct (4bit μ–‘μžν™”)
  • ν•™μŠ΅ 데이터: spider-ko (1-epoch)
  • 지원 DB: 166개의 λ‹€μ–‘ν•œ 도메인 λ°μ΄ν„°λ² μ΄μŠ€ ( spider dataset )
  • ν•™μŠ΅ 방법: LoRA (r=16, alpha=32)

πŸ’¬ ν™œμš© μ˜ˆμ‹œ

κΈ°λ³Έ μ‚¬μš©λ²•

def generate_sql(question, schema_info):
    """Convert a Korean question into a SQL query."""
    # The prompt template is kept in Korean to match the fine-tuning format
    prompt = f"""다음 데이터베이스 스키마를 참고하여 질문에 대한 SQL 쿼리를 생성하세요.

### 데이터베이스 스키마:
{schema_info}

### 질문: {question}

### SQL 쿼리:"""

    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Low temperature keeps decoding nearly deterministic
    outputs = model.generate(inputs, max_new_tokens=150, do_sample=True, temperature=0.1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # The generated query follows the final "### SQL 쿼리:" marker
    return response.split("### SQL 쿼리:")[-1].strip()

μ‹€μ œ μ‚¬μš© μ˜ˆμ‹œ

# μ˜ˆμ‹œ 1: 집계 ν•¨μˆ˜
question = "λΆ€μ„œμž₯λ“€ 쀑 56세보닀 λ‚˜μ΄κ°€ λ§Žμ€ μ‚¬λžŒμ΄ λͺ‡ λͺ…μž…λ‹ˆκΉŒ?"
# κ²°κ³Ό: SELECT count(*) FROM head WHERE age > 56

# μ˜ˆμ‹œ 2: 쑰인
question = "κ°€μž₯ λ§Žμ€ λŒ€νšŒλ₯Ό κ°œμ΅œν•œ λ„μ‹œμ˜ μƒνƒœλŠ” λ¬΄μ—‡μΈκ°€μš”?"
# κ²°κ³Ό: SELECT T1.Status FROM city AS T1 JOIN farm_competition AS T2 ON T1.City_ID = T2.Host_city_ID GROUP BY T2.Host_city_ID ORDER BY COUNT(*) DESC LIMIT 1

# μ˜ˆμ‹œ 3: μ„œλΈŒμΏΌλ¦¬
question = "κΈ°μ—…κ°€κ°€ μ•„λ‹Œ μ‚¬λžŒλ“€μ˜ 이름은 λ¬΄μ—‡μž…λ‹ˆκΉŒ?"
# κ²°κ³Ό: SELECT Name FROM people WHERE People_ID NOT IN (SELECT People_ID FROM entrepreneur)

⚠️ μ‚¬μš© μ‹œ μ£Όμ˜μ‚¬ν•­

μ œν•œμ‚¬ν•­

  • βœ… μ˜μ–΄ ν…Œμ΄λΈ”/컬럼λͺ… μ‚¬μš© (ν•œκ΅­μ–΄ 질문 β†’ μ˜μ–΄ SQL)
  • βœ… Spider 데이터셋 도메인에 μ΅œμ ν™”
  • ❌ NoSQL, κ·Έλž˜ν”„ DB 미지원
  • ❌ 맀우 λ³΅μž‘ν•œ 쀑첩 μΏΌλ¦¬λŠ” 정확도 ν•˜λ½

πŸ”§ 기술 사양

ν•™μŠ΅ ν™˜κ²½

  • GPU: NVIDIA Tesla T4 (16GB)
  • ν•™μŠ΅ μ‹œκ°„: μ•½ 4μ‹œκ°„
  • λ©”λͺ¨λ¦¬ μ‚¬μš©: μ΅œλŒ€ 7.6GB VRAM

ν•˜μ΄νΌνŒŒλΌλ―Έν„°

training_args = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-4,
    "num_train_epochs": 1,
    "optimizer": "adamw_8bit",
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05
}

lora_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", 
                      "gate_proj", "up_proj", "down_proj"]
}
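For reproduction, these settings map onto the Unsloth/TRL training stack roughly as follows. This is a configuration sketch under stated assumptions, not the exact training script: `dataset` stands for spider-ko formatted into the prompt template above, and `output_dir` is a placeholder.

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import FastLanguageModel

# Attach LoRA adapters per lora_config above
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=32, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # spider-ko, rendered into the prompt template
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size 8
        learning_rate=5e-4,
        num_train_epochs=1,
        optim="adamw_8bit",
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        output_dir="outputs",
    ),
)
trainer.train()
```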

πŸ“š μ°Έκ³  자료

인용

@misc{llama31_spider_sql_ko_2025,
  title={Llama-3.1-8B-Spider-SQL-Ko: Korean Text-to-SQL Model},
  author={Sim, Sohyun and Cho, Youngjun and Choi, Seongwoo},
  year={2025},
  publisher={Hugging Face KREW},
  url={https://huggingface.co/huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko}
}

κ΄€λ ¨ λ…Όλ¬Έ

🀝 κΈ°μ—¬μž

@sim-so, @choincnp, @nuatmochoi
