BanQL-1B

Model Description

BanQL is a family of code LLMs dedicated solely to the text-to-SQL task in the financial domain.

The checkpoint included in this repository is based on bigcode/starcoderbase and further fine-tuned on BanQies, a dataset of NLQ-SQL pairs in the financial domain generated with SelectCraft.

Finetuning Procedure

BanQL was fine-tuned using PEFT (Parameter-Efficient Fine-Tuning) techniques, specifically LoRA (Low-Rank Adaptation) adapters.
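To illustrate the LoRA idea mentioned above, here is a minimal NumPy sketch (a toy illustration only, not BanQL's actual training code): instead of updating a full weight matrix W, LoRA learns two small matrices B and A with rank r much smaller than the layer dimensions, and the adapted layer computes W + (alpha / r) * B @ A.

```python
import numpy as np

# Toy LoRA sketch: W is the frozen pretrained weight (d_out x d_in);
# B (d_out x r) and A (r x d_in) are the small trainable adapter matrices.
d_out, d_in, r, alpha = 8, 8, 2, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in))       # trainable, low-rank
B = np.zeros((d_out, r))             # initialised to zero => adapter starts as a no-op

def adapted_forward(x):
    """Forward pass through the LoRA-adapted layer: x @ (W + (alpha/r) * B A)^T."""
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(1, d_in))
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(adapted_forward(x), x @ W.T)
# The adapter trains far fewer parameters than the full matrix.
assert A.size + B.size < W.size
```

Only A and B are updated during fine-tuning, which is why PEFT makes adapting a 1B-parameter base model tractable on modest hardware.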

Intended Use and Limitations

The model was designed as a use case to demonstrate the effectiveness of SelectCraft in generating large-scale, high-quality, domain-specific text-to-SQL datasets. The model is mainly fine-tuned on the database schemas shown in the examples below. The prompt format is defined below.
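The prompt format used in the examples can be assembled programmatically. The helper below is a hypothetical convenience function (not part of this repository); it reproduces the structure the examples use: one or more CREATE TABLE blocks, a fixed instruction comment, the question as a comment, and a trailing "SELECT" for the model to complete.

```python
# Hypothetical helper (not part of the repository) that builds a BanQL prompt
# in the format shown in the usage examples below.
def build_prompt(schemas: list, question: str) -> str:
    parts = list(schemas)
    parts.append("-- Using valid SQLite, answer the following question for the tables provided above.")
    parts.append(f"-- {question}")
    parts.append("SELECT")
    return "\n\n".join(parts)

schema = """CREATE TABLE Loans {
    loan_id number,
    client_id number,
    budget real,
    duration number,
    interest real,
    status varchar
}"""

prompt = build_prompt([schema], "What is the duration and budget of the loan id 16342?")
print(prompt)
```

The resulting string can be passed directly as `input_text` in the examples below.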

How to Use

Example 1: Loans_DB

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Fall back to CPU if no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("salmane11/BanQL-1b")
model = AutoModelForCausalLM.from_pretrained("salmane11/BanQL-1b").to(device)

input_text = """
      CREATE TABLE Loans {
            loan_id number, 
            client_id number, 
            budget real, 
            duration number, 
            interest real, 
            status varchar
        }

        CREATE TABLE Clients {
            client_id number, 
            first_name varchar, 
            last_name varchar, 
            email varchar, 
            city varchar, 
            year_of_birth number
        }

        CREATE TABLE Accounts {
            account_id number, 
            client_id number, 
            balance real, 
            type varchar
        }

        CREATE TABLE Deposits{
            deposit_id number, 
            account_id number, 
            source varchar, 
            amount real
        }

        -- Using valid SQLite, answer the following question for the tables provided above.

        -- What is the duration and budget of the loan id 16342?

        SELECT"""

encoding = tokenizer(input_text, return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_length=512,
    do_sample=True,
    top_k=120,
    top_p=0.95,
)
line = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
query_beginning = line.find("SELECT")
print(line[query_beginning:])

Example 2: Transactions_DB

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Fall back to CPU if no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("salmane11/BanQL-1b")
model = AutoModelForCausalLM.from_pretrained("salmane11/BanQL-1b").to(device)

input_text = """
        CREATE TABLE Transactions {
            transaction_id number, 
            timestamp_id number, 
            primary_contract_id number, 
            client_id number, 
            beneficiary_id number, 
            transaction_amount real, 
            is_fraudulent boolean, 
            product_family_code varchar, 
            amount_currency varchar
        }

        CREATE TABLE Beneficiary {
            beneficiary_id number, 
            bank_branch_id number, 
            country_name varchar, 
            country_code varchar
        }

        CREATE TABLE Source {
            primary_contract_id number, 
            client_id number, 
            counterparty_bank_branch_id number, 
            counterparty_donor_id number
        }

        CREATE TABLE Time{
            timestamp_id number, 
            week_number number, 
            day_number number, 
            hour_number number, 
            day_name varchar, 
            year number,
            month_number number
        }

        -- Using valid SQLite, answer the following question for the tables provided above.

        -- How many transactions for the client id 15482?

        SELECT"""


encoding = tokenizer(input_text, return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_length=512,
    do_sample=True,
    top_k=120,
    top_p=0.95,
)
line = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
query_beginning = line.find("SELECT")
print(line[query_beginning:])
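Because the decoded text contains the whole prompt followed by the completion, slicing from "SELECT" returns everything the model generated, including any trailing text after the query. A hypothetical post-processing step (not part of the repository) can cut at the first semicolon to keep a single SQL statement:

```python
# Hypothetical post-processing (an assumption, not the repository's code):
# slice the decoded text from its final "SELECT" and stop at the first
# semicolon so only one complete SQL statement is returned.
def extract_query(decoded: str) -> str:
    start = decoded.rfind("SELECT")
    if start == -1:
        return ""
    query = decoded[start:]
    end = query.find(";")
    return query[: end + 1] if end != -1 else query

decoded = (
    "CREATE TABLE ...\n"
    "-- How many transactions for the client id 15482?\n"
    "SELECT COUNT(*) FROM Transactions WHERE client_id = 15482; -- extra tokens"
)
print(extract_query(decoded))
# prints: SELECT COUNT(*) FROM Transactions WHERE client_id = 15482;
```

Whether to cut at the semicolon depends on how the generations are consumed; keeping the raw slice is also reasonable when inspecting outputs by hand.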

Cite our work


Model size: 1.14B parameters (F32, stored in Safetensors format)
