CypressTree-1.0-7B

CypressTree-1.0-7B is a 7B-parameter language model specialized for converting Korean natural-language questions into SQL queries. Given a database schema and a question in Korean, it is designed to generate accurate and efficient SQL.

The model is fine-tuned from Qwen/Qwen2.5-Coder-7B-Instruct, which uses Alibaba's Qwen2 architecture, on the daje/kotext-to-sql-v1 dataset to strengthen Korean SQL generation.

Key Features:

Advanced SQL Generation: Accurately generates highly complex SQL queries, including multi-table JOINs, subqueries, and compound WHERE conditions.
Korean Semantic Analysis: Grasps the contextual nuances of the Korean language to map them correctly to the database schema.
High-Accuracy SQL Conversion: After training on 200,000 Text-to-SQL samples, reaches BLEU 86.25 and Token-F1 0.92 against reference SQL on a 30,000-sample hold-out set.

Model Information

Model Developer: namuai-x
Model: CypressTree-1.0-7B
Model Type: Korean Text-to-SQL Language Model
Language(s): Korean (primary), English (secondary)

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Load the model and tokenizer
model_id = "namuai-x/CypressTree-1.0-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 2. Construct the prompt (Example with complex Multi-JOIN)
question = "보스턴에서 샌프란시스코로 비행하는 항공사는 어떤 것들이 있나요?" # "Which airlines fly from Boston to San Francisco?"
schema = """
CREATE TABLE airline (airline_code VARCHAR, airline_name TEXT);
CREATE TABLE airport_service (city_code VARCHAR, airport_code VARCHAR);
CREATE TABLE city (city_code VARCHAR, city_name VARCHAR);
CREATE TABLE flight (airline_code VARCHAR, from_airport VARCHAR, to_airport VARCHAR)
"""

prompt = f"""### Instruction:
{question}

### Input:
{schema}

### Response:
"""

# 3. Run inference
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False, num_beams=1)
generated_sql = tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

print(generated_sql)
# Model Output (Correct):
# SELECT DISTINCT airline.airline_code FROM airline, airport_service AS AIRPORT_SERVICE_0, airport_service AS AIRPORT_SERVICE_1, city AS CITY_0, city AS CITY_1, flight WHERE CITY_0.city_code = AIRPORT_SERVICE_0.city_code AND CITY_0.city_name = 'BOSTON' AND CITY_1.city_code = AIRPORT_SERVICE_1.city_code AND CITY_1.city_name = 'SAN FRANCISCO' AND flight.airline_code = airline.airline_code AND flight.from_airport = AIRPORT_SERVICE_0.airport_code AND flight.to_airport = AIRPORT_SERVICE_1.airport_code
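The prompt template used above can be wrapped in a small helper so that every question is formatted the same way. This is a minimal sketch: `build_prompt` and `extract_sql` are hypothetical helper names, not part of any published API for this model, and the sketch assumes the model expects exactly the `### Instruction` / `### Input` / `### Response` layout shown in the example.

```python
def build_prompt(question: str, schema: str) -> str:
    """Format a question and schema with the template from the usage example."""
    return (
        "### Instruction:\n"
        f"{question}\n\n"
        "### Input:\n"
        f"{schema}\n\n"
        "### Response:\n"
    )


def extract_sql(generated_text: str) -> str:
    """Trim the decoded completion down to the answer text."""
    # Assumption: if the model keeps generating, a new "###" header
    # marks the end of the SQL answer.
    answer = generated_text.split("###", 1)[0]
    return answer.strip()


prompt = build_prompt(
    "보스턴에서 샌프란시스코로 비행하는 항공사는 어떤 것들이 있나요?",
    "CREATE TABLE city (city_code VARCHAR, city_name VARCHAR);",
)
print(prompt)
```

The resulting string can be passed to `tokenizer(...)` in place of the inline f-string, and `extract_sql` can be applied to the decoded output before it is used.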

Performance

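Non-executing evaluation results on a 30,000-sample hold-out set: generated SQL is compared with the reference SQL as text, and queries are not run against a database. BLEU and chrF are reported on a 0-100 scale; the other metrics are on a 0-1 scale.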

| Metric | Qwen/Qwen2.5-Coder-7B-Instruct | CypressTree-1.0-7B | Improvement (Δ) |
|---|---|---|---|
| BLEU | 5.8701 | 86.2469 | +80.38 |
| chrF | 11.2665 | 90.1089 | +78.84 |
| ROUGE-L | 0.1634 | 0.9271 | +0.76 |
| Token-F1 | 0.1087 | 0.9237 | +0.82 |
| Syntactic-Validity | 0.8745 | 0.9977 | +0.12 |
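
The card does not state how Token-F1 is computed. A common definition, assumed here, splits both queries on whitespace and takes the harmonic mean of precision and recall over the overlapping tokens, counted as multisets. `token_f1` is a hypothetical helper for illustration, not the evaluation script behind the reported numbers.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token F1 between a predicted and a reference query."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; only one empty is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each token counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "SELECT name FROM city WHERE city_code = 'BOS'"
prediction = "SELECT name FROM city"
print(round(token_f1(prediction, reference), 4))  # 0.6667
```

In this example the prediction has perfect precision (4 of 4 tokens match) but covers only half of the 8 reference tokens, which gives an F1 of 2/3.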

Training Data

Dataset: daje/kotext-to-sql-v1 (200,000 samples)
Epochs: 3
Max Sequence Length: 2048

Intended Use & Limitations

Primary Use Cases

Natural language-based data search and retrieval systems.
Real-time data analysis via chatbots.
Automatic query generation for data visualization dashboards.
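
In use cases like these, generated SQL usually ends up running against a live database, so it is prudent to check it first. The sketch below is one possible guard, not something this model card prescribes: it loads the schema into an in-memory SQLite database and asks SQLite to compile the query with `EXPLAIN`, which parses and plans the statement without running it against real data. `is_safe_select` is a hypothetical helper, and SQLite's dialect is assumed to be close enough to the target database.

```python
import sqlite3


def is_safe_select(sql: str, schema: str) -> bool:
    """Return True if sql is a single SELECT that compiles against schema."""
    statement = sql.strip().rstrip(";").strip()
    # Conservative filter: only plain SELECT statements, and no extra
    # semicolons (this also rejects semicolons inside string literals).
    if not statement.upper().startswith("SELECT") or ";" in statement:
        return False
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)  # create empty tables from the DDL
        conn.execute(f"EXPLAIN {statement}")  # compile only; raises on invalid SQL
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


schema = "CREATE TABLE city (city_code VARCHAR, city_name VARCHAR);"
print(is_safe_select("SELECT city_name FROM city WHERE city_code = 'BOS'", schema))  # True
print(is_safe_select("SELECT missing_column FROM city", schema))  # False
print(is_safe_select("DROP TABLE city", schema))  # False
```

A check like this catches references to unknown tables or columns and blocks write statements, but it does not verify that the query answers the user's question.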

Out of Scope

Questions requiring real-time information or knowledge outside the database schema.
Generation of professional medical, legal, or financial advice.
Optimal performance in languages other than Korean.

Hardware Requirements

Inference

Minimum: 24GB VRAM
Recommended: 32GB VRAM (FP16)
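
As a rough sanity check on these figures, the weights alone take about two bytes per parameter in FP16. The sketch below assumes roughly 8 billion parameters (the checkpoint size reported on the model page) and ignores activations, the KV cache, and framework overhead, which account for the remaining headroom. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB (2 bytes per parameter for FP16)."""
    return num_params * bytes_per_param / 1024**3


print(f"{weight_memory_gib(8e9):.1f} GiB")  # 14.9 GiB
```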

Citation & License

Citation

@misc{namuai-x-2025-cypresstree,
  title={CypressTree-1.0-7B: A Specialized Language Model for Korean Text-to-SQL},
  author={namuai-x},
  year={2025},
  url={https://huggingface.co/namuai-x/CypressTree-1.0-7B}
}

License

Apache 2.0 license.

Safetensors

Model size: 8B params
Tensor type: F16

Model tree for namuai-x/CypressTree-1.0-7B

Base model: Qwen/Qwen2.5-7B
Finetuned: Qwen/Qwen2.5-Coder-7B
Finetuned: Qwen/Qwen2.5-Coder-7B-Instruct
Finetuned: this model
