CypressTree-1.0-7B

CypressTree-1.0-7B is a 7B-parameter language model specialized for converting Korean natural-language questions into SQL queries. Given a database schema and a question in Korean, it is designed to interpret the intent of the question against that schema and generate accurate, efficient SQL.

The model is based on Alibaba's Qwen2 architecture and was trained on the daje/kotext-to-sql-v1 dataset to strengthen SQL generation from Korean input.

Key Features:

  • Advanced SQL Generation: Accurately generates highly complex SQL queries, including multi-table JOINs, subqueries, and compound WHERE conditions.
  • Korean Semantic Analysis: Grasps the contextual nuances of the Korean language to map them correctly to the database schema.
  • High-Accuracy SQL Conversion: Reaches BLEU 86.25 and Token-F1 0.92 on a 30,000-sample hold-out set after training on 200,000 Text-to-SQL samples.

Model Information

  • Model Developer: namuai-x
  • Model: CypressTree-1.0-7B
  • Model Type: Korean Text-to-SQL Language Model
  • Language(s): Korean (primary), English (secondary)

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Load the model and tokenizer
model_id = "namuai-x/CypressTree-1.0-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 2. Construct the prompt (Example with complex Multi-JOIN)
question = "보스턴에서 샌프란시스코로 비행하는 항공사는 어떤 것들이 있나요?"  # "Which airlines fly from Boston to San Francisco?"
schema = """
CREATE TABLE airline (airline_code VARCHAR, airline_name TEXT);
CREATE TABLE airport_service (city_code VARCHAR, airport_code VARCHAR);
CREATE TABLE city (city_code VARCHAR, city_name VARCHAR);
CREATE TABLE flight (airline_code VARCHAR, from_airport VARCHAR, to_airport VARCHAR)
"""

prompt = f"""### Instruction:
{question}

### Input:
{schema}

### Response:
"""

# 3. Run inference
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False, num_beams=1)
generated_sql = tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

print(generated_sql)
# Model Output (Correct):
# SELECT DISTINCT airline.airline_code FROM airline, airport_service AS AIRPORT_SERVICE_0, airport_service AS AIRPORT_SERVICE_1, city AS CITY_0, city AS CITY_1, flight WHERE CITY_0.city_code = AIRPORT_SERVICE_0.city_code AND CITY_0.city_name = 'BOSTON' AND CITY_1.city_code = AIRPORT_SERVICE_1.city_code AND CITY_1.city_name = 'SAN FRANCISCO' AND flight.airline_code = airline.airline_code AND flight.from_airport = AIRPORT_SERVICE_0.airport_code AND flight.to_airport = AIRPORT_SERVICE_1.airport_code
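
The prompt in step 2 follows a fixed three-part template (`### Instruction:`, `### Input:`, `### Response:`). As a minimal sketch, the template could be factored into a small helper so the same layout is reused for every question. `build_prompt` is a hypothetical name used for illustration and is not part of the model's API; only the template text comes from the example above.

```python
def build_prompt(question: str, schema: str) -> str:
    """Assemble the Instruction/Input/Response prompt from the usage example.

    Hypothetical helper: the section markers mirror the example prompt,
    everything else is an illustration.
    """
    return (
        "### Instruction:\n"
        f"{question}\n"
        "\n"
        "### Input:\n"
        f"{schema}\n"
        "\n"
        "### Response:\n"
    )


example_schema = "CREATE TABLE city (city_code VARCHAR, city_name VARCHAR);"
example_question = "보스턴에서 샌프란시스코로 비행하는 항공사는 어떤 것들이 있나요?"
example_prompt = build_prompt(example_question, example_schema)
print(example_prompt)
```

The returned string can be passed to the tokenizer in place of the inline f-string.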

Performance

Non-executing evaluation results on a 30,000-sample hold-out set.

| Metric | Qwen/Qwen2.5-Coder-7B-Instruct | CypressTree-1.0-7B | Improvement (Δ) |
|---|---|---|---|
| BLEU | 5.8701 | 86.2469 | +80.38 |
| chrF | 11.2665 | 90.1089 | +78.84 |
| ROUGE-L | 0.1634 | 0.9271 | +0.76 |
| Token-F1 | 0.1087 | 0.9237 | +0.82 |
| Syntactic-Validity | 0.8745 | 0.9977 | +0.12 |
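
The card does not say how Token-F1 was computed. As a rough illustration of what such a non-executing metric measures, the sketch below computes F1 over whitespace-separated tokens, counting repeated tokens with multiplicity. The tokenization and the function name `token_f1` are assumptions, not the evaluation code behind the table.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """F1 over whitespace tokens (assumed tokenization, for illustration)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "SELECT name FROM city WHERE code = 'BOS'"
prediction = "SELECT name FROM city"
score = token_f1(prediction, reference)
print(round(score, 4))
```

A score of this kind rewards surface overlap with the reference query, so two queries that return the same rows but are written differently can still score below 1.0.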

Training Data

  • Dataset: daje/kotext-to-sql-v1 (200,000 samples)
  • Epochs: 3
  • Max Sequence Length: 2048

Intended Use & Limitations

Primary Use Cases

  • Natural language-based data search and retrieval systems.
  • Real-time data analysis via chatbots.
  • Automatic query generation for data visualization dashboards.

Out of Scope

  • Questions requiring real-time information or knowledge outside the database schema.
  • Generation of professional medical, legal, or financial advice.
  • Optimal performance in languages other than Korean.
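
Generated queries are not guaranteed to be valid (the reported Syntactic-Validity is 0.9977, not 1.0), so an application may want to check a query before running it. The sketch below shows one possible approach, assuming a SQLite-compatible dialect: it creates the schema in an empty in-memory database and asks SQLite to compile the query with `EXPLAIN`, which returns the query plan without running the query itself. The helper name `is_valid_sql` is hypothetical, and other database engines may accept or reject different syntax.

```python
import sqlite3


def is_valid_sql(schema: str, query: str) -> bool:
    """Return True if SQLite can compile `query` against `schema`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)        # create empty tables only
        conn.execute("EXPLAIN " + query)  # compile the query without running it
        return True
    except (sqlite3.Error, sqlite3.Warning):
        # Syntax errors, unknown tables or columns, or multiple statements.
        return False
    finally:
        conn.close()


schema = """
CREATE TABLE airline (airline_code VARCHAR, airline_name TEXT);
CREATE TABLE flight (airline_code VARCHAR, from_airport VARCHAR, to_airport VARCHAR);
"""

good = (
    "SELECT DISTINCT airline.airline_code FROM airline, flight "
    "WHERE flight.airline_code = airline.airline_code"
)
bad = "SELECT DISTINCT FROM airline WHERE"

print(is_valid_sql(schema, good))
print(is_valid_sql(schema, bad))
```

Because the query is compiled against the schema, this check also rejects references to tables or columns that do not exist, which is stricter than a pure syntax check.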

Hardware Requirements

Inference

  • Minimum: 24GB VRAM
  • Recommended: 32GB VRAM (FP16)
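
As a rough sanity check on these figures, the weights alone of a nominal 7B-parameter model stored in FP16 (2 bytes per parameter) take about 13 GiB. The sketch below shows that arithmetic. It deliberately ignores activations, the KV cache, and framework overhead, all of which add to the total at inference time, and `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights only, in GiB (2 bytes per param = FP16)."""
    return num_params * bytes_per_param / 2**30


fp16_gib = weight_memory_gib(7e9)
print(f"{fp16_gib:.1f} GiB")
```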

Citation & License

Citation

@misc{namuai-x-2025-cypresstree,
  title={CypressTree-1.0-7B: A Specialized Language Model for Korean Text-to-SQL},
  author={namuai-x},
  year={2025},
  url={https://huggingface.co/namuai-x/CypressTree-1.0-7B}
}

License

Apache 2.0 license.

Model Tree

  • Base model: Qwen/Qwen2.5-7B
  • This model: namuai-x/CypressTree-1.0-7B (fine-tuned from the base model)