This model is a fine-tuned version of mistralai/Mistral-7B-v0.1
for generating SPARQL queries from English natural language questions, specifically targeting the Wikidata knowledge graph.
## Model Details

### Model Description
This model was fine-tuned using QLoRA with 4-bit quantization. It takes an English natural-language question and the corresponding entity/relationship context as input and aims to produce a SPARQL query for Wikidata. The model is part of experiments investigating continual multilingual pre-training.
- Developed by: Julio Cesar Perez Duran
- Funded by: DFKI
- Model type: Decoder-only Transformer-based language model
- Language: en (English)
- License: MIT
- Finetuned from model: mistralai/Mistral-7B-v0.1
## Bias, Risks, and Limitations
- Context-reliant: Performance depends on the provided entity/relationship context mappings.
- Output format: The model generates extraneous text after the SPARQL query, requiring post-processing (extracting the content within ```sparql ... ``` delimiters, as the `extract_sparql` helper in the example below does).
## How to Get Started with the Model
The following Python script provides an example of how to load the model and tokenizer to generate a SPARQL query.
```python
import re

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM

# Model ID for the Mistral English v2 model
model_id = "julioc-p/mistral_txt_sparql_en_v2"

# 4-bit quantization configuration (same settings as used for fine-tuning)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id


def extract_sparql(text):
    """Extract the SPARQL query from the model's raw output."""
    # v2 models wrap the query in ```sparql ... ``` fences, so this is the main path.
    code_block_match = re.search(r"```(?:sparql)?\s*(.*?)\s*```", text, re.DOTALL | re.IGNORECASE)
    if code_block_match:
        text_to_search = code_block_match.group(1)
    else:
        # Fallback: search the raw text when no fenced block is present.
        text_to_search = text
    match = re.search(r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*?\}", text_to_search, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(0).strip()
    return ""


question = "Who was Barnard College's American female employee?"
example_context_json_str = """
{
  "entities": {
    "Barnard College": "Q167733",
    "American": "Q30",
    "female": "Q6581072",
    "employee": "Q5"
  },
  "relationships": {
    "instance of": "P31",
    "employer": "P108",
    "gender": "P21",
    "country of citizenship": "P27"
  }
}
"""

# The system prompt is kept verbatim (including its typos) to match the
# format the model saw during fine-tuning.
system_message_template = """You are an expert text to SparQL query translator. Users will ask you questions in English and you will generate a SparQL query based on the provided context encloses in ```sparql <respose_query>```.
CONTEXT:
{context}"""

messages = [
    {"role": "system", "content": system_message_template.format(context=example_context_json_str)},
    {"role": "user", "content": question},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
# Single unpadded sequence, so the attention mask is all ones.
attention_mask = torch.ones_like(input_ids)

# Generate the output
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens (everything after the prompt).
generated = outputs[0][input_ids.shape[-1]:]
assistant_response = tokenizer.decode(generated, skip_special_tokens=True).strip()
cleaned_sparql = extract_sparql(assistant_response)

print(f"Question: {question}")
print(f"Context: {example_context_json_str}")
print(f"Generated SPARQL: {cleaned_sparql}")
print(f"Assistant's raw response: {assistant_response}")
```
## Training Data
The model was fine-tuned on a subset of the `julioc-p/Question-Sparql` dataset: 80,000 English training examples, each with a `context` field containing Wikidata entity and relationship ID mappings.
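For reference, the sketch below shows one way to load and filter this data with the `datasets` library. The split name and the `language` column used for filtering are assumptions; consult the dataset card for the actual schema.

```python
# Minimal sketch: load the Question-Sparql dataset and keep English rows.
# The split name and `language` column are assumptions; see the dataset card
# (https://huggingface.co/datasets/julioc-p/Question-Sparql) for the schema.
from datasets import load_dataset

ds = load_dataset("julioc-p/Question-Sparql", split="train")
en_ds = ds.filter(lambda ex: ex["language"] == "en")  # assumed column name
print(en_ds[0])  # one question / context / SPARQL example
```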
### Training Hyperparameters

The following hyperparameters were used for fine-tuning; a hedged configuration sketch follows the list.
- LoRA configuration (v2 models):
  - `r` (LoRA rank): 256
  - `lora_alpha`: 128
  - `lora_dropout`: 0.05
  - `target_modules`: `"all-linear"`
- Training arguments (v2 models):
  - `num_train_epochs`: 3
  - Effective batch size: 6
  - `optim`: `"adamw_torch_fused"`
  - `learning_rate`: 2e-4
  - `fp16`: True
  - `max_grad_norm`: 0.3
  - `warmup_ratio`: 0.03
  - `lr_scheduler_type`: `"constant"`
  - `packing`: True
  - NEFTune `noise_alpha`: 5
- BitsAndBytesConfig (v2 models):
  - `load_in_4bit`: True
  - `bnb_4bit_quant_type`: `"nf4"`
  - `bnb_4bit_compute_dtype`: `torch.float16`
  - `bnb_4bit_use_double_quant`: True
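For illustration only, the listed values map onto `peft` and `transformers` configuration objects roughly as sketched below (trl 0.8.x style, where `packing` is passed to `SFTTrainer` rather than a config). The output path and the per-device/accumulation split of the effective batch size of 6 are assumptions, not the exact training script.

```python
# A hedged sketch of the fine-tuning configuration described above.
from peft import LoraConfig
from transformers import TrainingArguments

peft_config = LoraConfig(
    r=256,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="mistral_txt_sparql_en_v2",  # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=3,  # assumption: 3 x 2 accumulation = 6 effective
    gradient_accumulation_steps=2,  # assumption
    optim="adamw_torch_fused",
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    neftune_noise_alpha=5,
)
# These configs would then be passed to trl's SFTTrainer together with the
# model, tokenizer, dataset, and packing=True.
```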
### Speeds, Sizes, Times

- Fine-tuning took approximately 19-20 hours on a single NVIDIA V100 GPU.
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
- QALD-10 test set (English): a standardized benchmark; 394 English questions were evaluated for this model.
- v2 test set (English): 10,000 held-out English examples from the `julioc-p/Question-Sparql` dataset, including context.
#### Metrics

The QALD standard macro-averaged F1-score, precision, and recall. Non-executable queries score P = R = F1 = 0.
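As an illustration, the sketch below computes these macro-averaged scores from per-question answer sets. Treating a both-empty gold/prediction pair as fully correct is an assumption consistent with the "Both Empty" category in the results below; edge-case conventions vary across QALD evaluation scripts.

```python
# Hedged sketch of macro-averaged precision/recall/F1 over answer sets.
# A non-executable query simply yields an empty prediction set, which
# scores P = R = F1 = 0 against a non-empty gold set, as stated above.
def prf1(gold: set, pred: set):
    if not gold and not pred:
        return 1.0, 1.0, 1.0  # assumption: both empty counts as correct
    inter = len(gold & pred)
    p = inter / len(pred) if pred else 0.0
    r = inter / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_scores(pairs):
    """pairs: iterable of (gold_set, predicted_set), one per question."""
    ps, rs, f1s = zip(*(prf1(g, p) for g, p in pairs))
    n = len(ps)
    return sum(ps) / n, sum(rs) / n, sum(f1s) / n
```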
### Results
On QALD-10 (English, N=394):
- Macro F1-Score: 0.2846
- Macro Precision: 0.6612
- Macro Recall: 0.2844
- Executable Queries: 99.75% (393/394)
- Correctness (Exact Match + Both Empty): 27.41% (108/394)
- Correct (Exact Match): 25.89% (102/394)
- Correct (Both Empty): 1.52% (6/394)
On v2 Test Set (English, N=10000):
- Macro F1-Score: 0.8285
- Macro Precision: 0.9104
- Macro Recall: 0.8292
- Executable Queries: 99.63% (9963/10000)
- Correctness (Exact Match + Both Empty): 82.73% (8273/10000)
- Correct (Exact Match): 74.55% (7455/10000)
- Correct (Both Empty): 8.18% (818/10000)
## Environmental Impact
- Hardware Type: 1 x NVIDIA V100 32GB GPU
- Hours used: Approx. 19-20 hours for fine-tuning.
- Cloud Provider: DFKI HPC Cluster
- Compute Region: Germany
- Carbon Emitted: Approx. 2.96 kg CO2eq.
## Technical Specifications

### Compute Infrastructure

#### Hardware
- NVIDIA V100 GPU (32 GB RAM)
- Approx. 60 GB system RAM
#### Software
- Slurm, NVIDIA Enroot, CUDA 11.8.0
- Python, Hugging Face `transformers`, `peft` (0.13.2), `bitsandbytes`, `trl`, PyTorch
## More Information
- Thesis GitHub: https://github.com/julioc-p/cross-lingual-transferability-thesis
- Dataset: https://huggingface.co/datasets/julioc-p/Question-Sparql
## Framework versions
- PEFT 0.13.2
- Transformers 4.39.3
- BitsAndBytes 0.43.0
- trl 0.8.6
- PyTorch 2.1.0