Model Card: Gemma 3 1B Text-to-SQL Finetuned

Model Description

This model is a finetuned version of the google/gemma-3-1b large language model, adapted for the text-to-SQL task. It was finetuned with Quantized Low-Rank Adaptation (QLoRA), which keeps the pretrained weights frozen in 4-bit precision and trains only small adapter matrices, making both finetuning and inference feasible on systems with limited computational resources.

The primary function of this model is to translate natural language questions and provided database schemas into executable SQL queries. This capability is crucial for applications requiring natural language interaction with databases, such as business intelligence tools, data analysis platforms, and conversational AI agents.

Intended Use

This model is intended for research and development purposes related to text-to-SQL generation. It can be used to:

  • Generate SQL queries from natural language prompts and database schemas (see the inference sketch after this list).
  • Serve as a component in larger systems that require natural language interaction with databases.
  • Further research into efficient finetuning techniques for large language models.
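
As a minimal illustration of the first use case, the sketch below loads the finetuned weights with the transformers library and generates a query for the example schema from the Training Data section. The repository id Yuk050/gemma-3-1b-text-to-sql-model is the one this card describes; whether the repository ships merged weights or only LoRA adapters affects the exact loading call, so treat this as an outline.

```python
# Hedged inference sketch. Assumes merged weights are published at
# Yuk050/gemma-3-1b-text-to-sql-model; if only LoRA adapters are shipped,
# load the base model and attach the adapter with peft instead.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yuk050/gemma-3-1b-text-to-sql-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Given the following database schema:\n\n"
    "CREATE TABLE Employees (id INT, name VARCHAR(255), salary INT);\n\n"
    "Generate the SQL query for: Select all employees with salary greater than 50000"
)

# The model was trained in a conversational format, so apply the chat template.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```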

Out-of-Scope Use Cases

This model is not intended for:

  • Generating SQL queries for highly sensitive or mission-critical systems without thorough validation and human oversight (a minimal dry-run validation check is sketched after this list).
  • Deployment in production environments without rigorous testing and adherence to security best practices.
  • Generating SQL for databases with unknown or complex schemas without proper adaptation and training.
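
One lightweight form of the validation mentioned above is to check that a generated query at least compiles against the schema before anything executes it. The sketch below dry-runs EXPLAIN against an empty in-memory SQLite database; SQLite is chosen here only because it ships with Python, and other dialects would need their own equivalent.

```python
# Hedged sketch: compile-check a generated query against the schema without
# executing it. EXPLAIN makes SQLite prepare the statement but not run it.
import sqlite3

def sql_compiles_against_schema(schema_sql: str, query: str) -> bool:
    """Return True if `query` parses and binds against `schema_sql`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)    # create the (empty) tables
        conn.execute(f"EXPLAIN {query}")  # compile only, no data touched
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

schema = "CREATE TABLE Employees (id INT, name VARCHAR(255), salary INT);"
print(sql_compiles_against_schema(schema, "SELECT * FROM Employees WHERE salary > 50000"))  # True
print(sql_compiles_against_schema(schema, "SELECT * FROM Staff"))                           # False
```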

Training Data

The model was finetuned on the gretelai/synthetic_text_to_sql dataset (see the Citation section). This dataset is a high-quality, synthetically generated collection of text-to-SQL samples. Key characteristics of the dataset include:

  • Size: 105,851 records (100,000 for training, 5,851 for testing).
  • Content: Each record includes a natural language prompt (sql_prompt), database schema (sql_context as CREATE TABLE statements), the corresponding SQL query (sql), and an explanation of the SQL query (sql_explanation).
  • Diversity: Covers 100 distinct domains/verticals and a wide range of SQL complexity levels (e.g., aggregations, joins, subqueries, window functions).

The training data was transformed into a conversational format, where the user provides the database schema and natural language query, and the assistant responds with the SQL query. An example of the input format is:

Given the following database schema:

CREATE TABLE Employees (id INT, name VARCHAR(255), salary INT);

Generate the SQL query for: Select all employees with salary greater than 50000
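
As a hedged sketch of this transformation (the field names sql_prompt, sql_context, and sql come from the dataset description above; the exact message layout used during training may differ slightly):

```python
# Sketch: map one gretelai/synthetic_text_to_sql record into the
# conversational format described above.
from datasets import load_dataset

def to_conversation(record):
    user_msg = (
        "Given the following database schema:\n\n"
        f"{record['sql_context']}\n\n"
        f"Generate the SQL query for: {record['sql_prompt']}"
    )
    return {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": record["sql"]},
        ]
    }

dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")
dataset = dataset.map(to_conversation, remove_columns=dataset.column_names)
print(dataset[0]["messages"][0]["content"])
```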

Training Procedure

The model was finetuned using the QLoRA technique, implemented with the unsloth library, and trained with the SFTTrainer from the trl library. A minimal configuration sketch follows the parameter lists below.

Base Model: google/gemma-3-1b

Finetuning Parameters (QLoRA):

  • LoRA Rank (r): 16
  • LoRA Alpha (lora_alpha): 16
  • LoRA Dropout (lora_dropout): 0.05
  • Bias: none
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Gradient Checkpointing: Enabled (unsloth optimized)

Training Arguments (SFTTrainer):

  • Per Device Train Batch Size: 2
  • Gradient Accumulation Steps: 4
  • Warmup Steps: 5
  • Max Steps: 100 (can be adjusted for full dataset training)
  • Learning Rate: 2e-4
  • Optimizer: adamw_8bit
  • Precision: bf16 (if supported by GPU), otherwise fp16
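
Putting the parameters above together, the following is a minimal configuration sketch using unsloth and trl. It is an outline rather than the exact training script: the SFTTrainer signature varies across trl versions, and the dataset variable is assumed to hold the conversational dataset built in the Training Data section.

```python
# Hedged QLoRA training sketch (unsloth + trl). Hyperparameters mirror the
# lists above; library APIs vary by version, so treat this as an outline.
import torch
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b",  # base model from this card
    load_in_4bit=True,               # QLoRA: base weights frozen in 4-bit
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # conversational dataset from the previous section
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=100,          # increase for full-dataset training
        learning_rate=2e-4,
        optim="adamw_8bit",
        bf16=torch.cuda.is_bf16_supported(),      # bf16 if the GPU supports it,
        fp16=not torch.cuda.is_bf16_supported(),  # otherwise fp16
        output_dir="outputs",
    ),
)
trainer.train()
```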

Model Architecture

The Gemma 3 1B model is a decoder-only transformer architecture. During QLoRA finetuning, low-rank adapters are injected into the specified layers, allowing for efficient training by only updating a small fraction of the model's parameters while keeping the majority of the pre-trained weights frozen in 4-bit quantized form.
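
In the standard LoRA formulation (which QLoRA inherits), each adapted projection computes, for a frozen pretrained weight matrix:

```latex
% LoRA forward pass: W_0 is the frozen (here 4-bit quantized) pretrained
% weight; B A is the trainable low-rank update with r << min(d, k).
% With r = 16 and lora_alpha = 16 as above, the scaling alpha/r = 1.
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k}
```

Only A and B are updated during training, which at rank 16 amounts to a small fraction of the roughly one billion base parameters.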

Performance and Limitations

Because the training data is synthetic, the model's performance on real-world, noisy, or highly complex database schemas may degrade. Further evaluation, and potentially finetuning on domain-specific data, is recommended before production use.

Limitations include:

  • Schema Complexity: May struggle with highly intricate database schemas or those with ambiguous column names.
  • Natural Language Ambiguity: Performance can be affected by ambiguous or underspecified natural language queries.
  • SQL Dialect: Primarily trained on standard SQL syntax. May require further adaptation for specific SQL dialects (e.g., PostgreSQL, MySQL, SQL Server).

Environmental Impact

Finetuning with QLoRA significantly reduces the computational resources and energy consumption compared to full finetuning. The specific energy consumption for this finetuning run would depend on the hardware used and the duration of training.

Citation

If you use this model or the finetuning approach, please consider citing the original Gemma model and the gretelai/synthetic_text_to_sql dataset:

@article{gemma2024,
  author = {Google},
  title = {Gemma: A Family of Lightweight, State-of-the-Art Open Models},
  year = {2024},
  url = {https://ai.google.dev/gemma}
}

@software{gretel-synthetic-text-to-sql-2024,
  author = {Meyer, Yev and Emadi, Marjan and Nathawani, Dhruv and Ramaswamy, Lipika and Boyd, Kendrick and Van Segbroeck, Maarten and Grossman, Matthew and Mlocek, Piotr and Newberry, Drew},
  title = {{Synthetic-Text-To-SQL}: A synthetic dataset for training language models to generate SQL queries from natural language prompts},
  month = {April},
  year = {2024},
  url = {https://huggingface.co/datasets/gretelai/synthetic-text-to-sql}
}