This model is a fine-tuned version of occiglot/occiglot-7b-eu5 for generating SPARQL queries from English natural language questions, specifically targeting the Wikidata knowledge graph.

Model Details

Model Description

The model was fine-tuned using QLoRA with 4-bit quantization. It takes an English natural-language question and corresponding entity/relationship context mappings as input and aims to produce a SPARQL query for Wikidata; an illustrative input/output pair follows the list below. This model is part of experiments investigating continual multilingual pre-training.

  • Developed by: Julio Cesar Perez Duran
  • Funded by: DFKI
  • Model type: Decoder-only Transformer-based language model
  • Language(s) (NLP): en (English)
  • License: MIT
  • Finetuned from model: occiglot/occiglot-7b-eu5
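
For illustration, here is a hypothetical input/output pair. The question and context mappings are taken from the usage example later in this card; the query shown is one plausible target under those mappings, not guaranteed model output.

Question: Who was Barnard College's American female employee?
Context: entity and relationship mappings to Wikidata IDs (see the JSON in the usage example below).

SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 .
  ?person wdt:P108 wd:Q167733 .
  ?person wdt:P21 wd:Q6581072 .
  ?person wdt:P27 wd:Q30 .
}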

Bias, Risks, and Limitations

  • Context reliant: Performance depends heavily on the quality of the provided entity/relationship context mappings.
  • Output format: The model can generate extraneous text after the SPARQL query, so post-processing is required (extract the content within ```sparql ... ``` delimiters; see the extract_sparql helper in the example below).

How to Get Started with the Model

The following Python script provides an example of how to load the model and tokenizer to generate a SPARQL query.

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM
import re
import json

model_id = "julioc-p/occiglot_txt_sparql_en_v2"

# Configuration for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)


model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

# SPARQL extraction function
def extract_sparql(text):
    code_block_match = re.search(r"```(?:sparql)?\s*(.*?)\s*```", text, re.DOTALL | re.IGNORECASE)
    if code_block_match:
        text_to_search = code_block_match.group(1)
    else:
        text_to_search = text

    match = re.search(r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*?\}", text_to_search, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(0).strip()
    return ""

# --- Example usage ---
question = "Who was Barnard College's American female employee?"
example_context_json_str = '''
{
  "entities": {
    "Barnard College": "Q167733",
    "American": "Q30",
    "female": "Q6581072",
    "employee": "Q5"
  },
  "relationships": {
    "instance of": "P31",
    "employer": "P108",
    "gender": "P21",
    "country of citizenship": "P27"
  }
}
'''
# System prompt template for v2 models (English)
system_message_template = """You are an expert text to SparQL query translator. Users will ask you questions in English and you will generate a SparQL query based on the provided context, enclosed in ```sparql <response_query>```.
CONTEXT:
{context}"""

# Format the system message with the actual context
formatted_system_message = system_message_template.format(context=example_context_json_str)

chat_template = [
    {"role": "system", "content": formatted_system_message},
    {"role": "user", "content": question},
]

# apply_chat_template with return_tensors="pt" returns a tensor of token ids
input_ids = tokenizer.apply_chat_template(
    chat_template,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate the output
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=torch.ones_like(input_ids),  # single unpadded sequence
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode only the newly generated tokens (everything after the prompt)
generated_tokens = outputs[0][input_ids.shape[-1]:]
assistant_response_part = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

cleaned_sparql = extract_sparql(assistant_response_part)

print(f"Question: {question}")
print(f"Context: {example_context_json_str}")
print(f"Generated SPARQL: {cleaned_sparql}")
print(f"Assistant's Raw Response: {assistant_response_part}")

Training Data

The model was fine-tuned on a subset of the julioc-p/Question-Sparql dataset: 80,000 English examples, each including a context field with Wikidata entity and relationship ID mappings. The snippet below shows one way to inspect the data.
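
A minimal sketch using the Hugging Face datasets library; the split name and record layout are assumptions based on the description above.

from datasets import load_dataset

# Hypothetical: the split name may differ on the Hub.
ds = load_dataset("julioc-p/Question-Sparql", split="train")
print(ds[0])  # expected fields include the question, context mappings, and target SPARQL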

Training Hyperparameters

The following hyperparameters were used for fine-tuning (a code sketch of this configuration follows the list):

  • LoRA Configuration:
    • r (LoRA rank): 256
    • lora_alpha: 128
    • lora_dropout: 0.05
    • target_modules: "all-linear"
  • Training Arguments:
    • num_train_epochs: 3
    • Effective batch size: 6
    • optim: "adamw_torch_fused"
    • learning_rate: 2e-4
    • fp16: True
    • max_grad_norm: 0.3
    • warmup_ratio: 0.03
    • lr_scheduler_type: "constant"
    • packing: True
    • NEFTune noise_alpha: 5
  • BitsAndBytesConfig:
    • load_in_4bit: True
    • bnb_4bit_quant_type: "nf4"
    • bnb_4bit_compute_dtype: torch.float16
    • bnb_4bit_use_double_quant: True
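
For reference, a sketch of how these settings map onto peft and transformers configuration objects. The original training script is not included in this card; output_dir and the exact batch-size/accumulation split are assumptions.

from peft import LoraConfig
from transformers import TrainingArguments

peft_config = LoraConfig(
    r=256,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./occiglot-sparql-en-v2",  # assumption
    num_train_epochs=3,
    per_device_train_batch_size=3,         # assumption: 3 x 2 = effective batch size 6
    gradient_accumulation_steps=2,
    optim="adamw_torch_fused",
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    neftune_noise_alpha=5,
)
# packing=True is passed to trl's SFTTrainer, together with these objects
# and the 4-bit BitsAndBytesConfig listed above.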

Speeds, Sizes, Times

  • Training took approx. 8 hours on a single NVIDIA V100 GPU.

Evaluation

Testing Data, Factors & Metrics

Testing Data

  1. QALD-10 test set (English): a standard benchmark; 394 English questions were evaluated for this model.
  2. v2 Test Set (English): 10,000 English held-out examples from the julioc-p/Question-Sparql dataset, including context.

Metrics

QALD standard macro-averaged F1-score, Precision, and Recall. Non-executable queries are scored P = R = F1 = 0. The per-question computation is sketched below.
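
A minimal sketch of the per-question scoring, assuming result sets are compared as sets; the both-empty convention matches the correctness breakdown in the results below.

def question_prf1(gold, pred, executable=True):
    # gold, pred: sets of results returned by the reference and generated queries.
    if not executable:           # non-executable query: P = R = F1 = 0
        return 0.0, 0.0, 0.0
    if not gold and not pred:    # both result sets empty counts as fully correct
        return 1.0, 1.0, 1.0
    if not gold or not pred:
        return 0.0, 0.0, 0.0
    tp = len(gold & pred)
    precision = tp / len(pred)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Macro scores average each metric over all evaluated questions.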

Results

On QALD-10 (English, N=394):

  • Macro F1-Score: 0.1906
  • Macro Precision: 0.5890
  • Macro Recall: 0.1943
  • Executable Queries: 99.24% (391/394)
  • Correctness (Exact Match + Both Empty): 18.27% (72/394)
    • Correct (Exact Match): 16.75% (66/394)
    • Correct (Both Empty): 1.52% (6/394)

On v2 Test Set (English, N=10000):

  • Macro F1-Score: 0.8051
  • Macro Precision: 0.8906
  • Macro Recall: 0.8057
  • Executable Queries: 99.71% (9971/10000)
  • Correctness (Exact Match + Both Empty): 80.39% (8039/10000)
    • Correct (Exact Match): 72.29% (7229/10000)
    • Correct (Both Empty): 8.10% (810/10000)

Environmental Impact

  • Hardware Type: 1 x NVIDIA V100 32GB GPU
  • Hours used: Approx. 8 hours for fine-tuning.
  • Cloud Provider: DFKI HPC Cluster
  • Compute Region: Germany
  • Carbon Emitted: Approx. 0.30 kg CO2eq.

Technical Specifications

Compute Infrastructure

Hardware

  • NVIDIA V100 GPU (32 GB RAM)
  • Approx. 60 GB system RAM

Software

  • Slurm, NVIDIA Enroot, CUDA 11.8.0
  • Python, Hugging Face transformers, peft (0.13.2), bitsandbytes, trl, PyTorch.

More Information

Framework versions

  • PEFT: 0.13.2
  • Transformers: 4.39.3
  • BitsAndBytes: 0.43.0
  • TRL: 0.8.6
  • PyTorch: 2.1.0