---
base_model: unsloth/qwen2.5-coder-32b-instruct-bnb-4bit
library_name: peft
datasets:
  - 100suping/ko-bird-sql-schema
  - won75/text_to_sql_ko
language:
  - ko
pipeline_tag: text-generation
tags:
  - SQL
  - lora
  - adapter
  - instruction-tuning
---

# 100suping/Qwen2.5-Coder-34B-Instruct-kosql-adapter

This repo contains a LoRA (Low-Rank Adaptation) adapter for [unsloth/qwen2.5-coder-32b-instruct-bnb-4bit](https://huggingface.co/unsloth/qwen2.5-coder-32b-instruct-bnb-4bit).

The adapter was created through instruction tuning to improve the model's SQL generation capability for Korean questions in a multi-DB context.

## Model Details

### Model Description

- Base Model: unsloth/Qwen2.5-Coder-32B-Instruct
- Model type: Causal language model (LoRA adapter, loaded via PEFT)
- Task: Instruction following / text-to-SQL
- Language(s): Korean questions, SQL output
- Training Data: 100suping/ko-bird-sql-schema, won75/text_to_sql_ko

## How to Get Started with the Model

To use this LoRA adapter, refer to the following code:

### Prompt

The prompt strings are Korean; English translations are given in the comments.

```python
# "You are a member of a team that turns user input into MySQL queries.
#  Your job is to use the (context) below, which contains the DB name and the
#  meta information of the tables in that DB, to write a MySQL query that
#  matches the given question (user_question)."
GENERAL_QUERY_PREFIX = """당신은 사용자의 입력을 MySQL 쿼리문으로 바꾸어주는 조직의 팀원입니다.
당신의 임무는 DB 이름 그리고 DB내 테이블의 메타 정보가 담긴 아래의 (context)를 이용해서 주어진 질문(user_question)에 걸맞는 MySQL 쿼리문을 작성하는 것입니다.

(context)
{context}
"""

# "Please write a syntactically correct MySQL query for the given question
#  (user_question)."
GENERATE_QUERY_INSTRUCTIONS = """
주어진 질문(user_question)에 대해서 문법적으로 올바른 MySQL 쿼리문을 작성해 주세요.
"""

### Preprocess Functions

```python
def get_conversation_data(examples):
    """Turn a batch of (question, schema, SQL) columns into chat-format conversations."""
    questions = examples["question"]
    schemas = examples["schema"]
    sql_queries = examples["SQL"]
    convos = []
    for question, schema, sql in zip(questions, schemas, sql_queries):
        conv = [
            {"role": "system", "content": GENERAL_QUERY_PREFIX.format(context=schema) + GENERATE_QUERY_INSTRUCTIONS},
            {"role": "user", "content": question},
            {"role": "assistant", "content": "```sql\n" + sql + ";\n```"},
        ]
        convos.append(conv)
    return {"conversation": convos}

def formatting_prompts_func(examples):
    """Render each conversation into a plain string via the tokenizer's chat template."""
    convos = examples["conversation"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}
```
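
As a self-contained sanity check (a sketch with toy data; the shortened prompt strings are stand-ins, not the originals), the mapping above turns parallel dataset columns into three-turn conversations:

```python
# Shortened stand-ins for the Korean prompt strings (assumptions for demo only).
GENERAL_QUERY_PREFIX = "(context)\n{context}\n"
GENERATE_QUERY_INSTRUCTIONS = "Write a MySQL query.\n"
FENCE = "`" * 3  # literal triple backticks, built indirectly so this example nests cleanly

def get_conversation_data(examples):
    """Same mapping as above: one system/user/assistant conversation per row."""
    convos = []
    for question, schema, sql in zip(examples["question"], examples["schema"], examples["SQL"]):
        convos.append([
            {"role": "system", "content": GENERAL_QUERY_PREFIX.format(context=schema) + GENERATE_QUERY_INSTRUCTIONS},
            {"role": "user", "content": question},
            {"role": "assistant", "content": FENCE + "sql\n" + sql + ";\n" + FENCE},
        ])
    return {"conversation": convos}

batch = {
    "question": ["What is the most popular movie?"],
    "schema": ["DB: movie_platform"],
    "SQL": ["SELECT movie_title FROM movies ORDER BY movie_popularity DESC LIMIT 1"],
}
conv = get_conversation_data(batch)["conversation"][0]
print([turn["role"] for turn in conv])  # ['system', 'user', 'assistant']
```

With the 🤗 Datasets library, these functions would typically be applied with `dataset.map(..., batched=True)`.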

### Example input

The rendered training sample below uses the Korean system prompt; the user turn asks "What is the most popular movie? When was it released, and who is its director?"

````
<|im_start|>system
당신은 사용자의 입력을 MySQL 쿼리문으로 바꾸어주는 조직의 팀원입니다.
당신의 임무는 DB 이름 그리고 DB내 테이블의 메타 정보가 담긴 아래의 (context)를 이용해서 주어진 질문(user_question)에 걸맞는 MySQL 쿼리문을 작성하는 것입니다.

(context)
DB: movie_platform
table DDL: CREATE TABLE `movies` ( `movie_id` INTEGER `movie_title` TEXT `movie_release_year` INTEGER `movie_url` TEXT `movie_title_language` TEXT `movie_popularity` INTEGER `movie_image_url` TEXT `director_id` TEXT `director_name` TEXT `director_url` TEXT PRIMARY KEY (movie_id) FOREIGN KEY (user_id) REFERENCES `lists_users`(user_id) FOREIGN KEY (user_id) REFERENCES `lists_users`(user_id) FOREIGN KEY (user_id) REFERENCES `lists`(user_id) FOREIGN KEY (list_id) REFERENCES `lists`(list_id) FOREIGN KEY (user_id) REFERENCES `ratings_users`(user_id) FOREIGN KEY (user_id) REFERENCES `lists_users`(user_id) FOREIGN KEY (movie_id) REFERENCES `movies`(movie_id) );

주어진 질문(user_question)에 대해서 문법적으로 올바른 MySQL 쿼리문을 작성해 주세요.
<|im_end|>
<|im_start|>user
가장 인기 있는 영화는 무엇인가요? 그 영화는 언제 개봉되었고 누가 감독인가요?<|im_end|>
<|im_start|>assistant
```sql
SELECT movie_title, movie_release_year, director_name FROM movies ORDER BY movie_popularity DESC LIMIT 1 ;
```<|im_end|>
````

### Inference

The snippet assumes `model` and `tokenizer` are already loaded (the base model with this adapter applied, e.g. via transformers and peft), and that `context`, `user_question`, and `max_new_tokens` are defined.

```python
messages = [
    {"role": "system", "content": GENERAL_QUERY_PREFIX.format(context=context) + GENERATE_QUERY_INSTRUCTIONS},
    {"role": "user", "content": "user_question: " + user_question},
]

# Render the chat template with a generation prompt so the model answers as the assistant.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=max_new_tokens,
)
# Strip the prompt tokens so only the newly generated tokens remain.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
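
Because the assistant turns in training wrap every query in a ```sql fence, the decoded response normally arrives fenced as well. A small post-processing helper (a sketch, not part of the original card) can pull out the bare statement:

```python
import re

def extract_sql(response: str) -> str:
    """Extract the SQL statement from a fenced ```sql ... ``` model response."""
    match = re.search(r"`{3}sql\s*(.*?)\s*`{3}", response, flags=re.DOTALL)
    return match.group(1) if match else response.strip()

fence = "`" * 3  # literal triple backticks, built indirectly so this example nests cleanly
demo = fence + "sql\nSELECT movie_title FROM movies LIMIT 1;\n" + fence
print(extract_sql(demo))  # SELECT movie_title FROM movies LIMIT 1;
```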

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

## Training Details

### Training Data

[More Information Needed]

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- Training regime: [More Information Needed]

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]

### Framework versions

- PEFT 0.13.2