This model is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.1
for generating SPARQL queries from English natural language questions, specifically targeting the Wikidata knowledge graph.
Model Details
Model Description
This model was fine-tuned from mistralai/Mistral-7B-Instruct-v0.1 using QLoRA with 4-bit quantization. It takes an English natural language question as input and aims to produce a corresponding SPARQL query that can be executed against the Wikidata knowledge graph. It is part of a series of experiments investigating the impact of continual multilingual pre-training on cross-lingual transferability and task-specific performance.
- Developed by: Julio Cesar Perez Duran
- Funded by: DFKI
- Model type: Decoder-only Transformer-based language model
- Language(s) (NLP): en (English)
- License: MIT
- Finetuned from model: mistralai/Mistral-7B-Instruct-v0.1
Bias, Risks, and Limitations
- Entity/Relationship Linking Bottleneck: A primary limitation of this model (and v1 models generally) is a significant deficiency in accurately mapping textual entities and relationships to their correct Wikidata identifiers (QIDs and PIDs) without explicit contextual aid. While the model might generate structurally valid SPARQL, the entities or properties could be incorrect. This significantly impacted recall.
- Output Cleaning: May occasionally produce queries with minor syntactic issues or extraneous text, requiring post-processing.
How to Get Started with the Model
The following Python script provides an example of how to load the model and tokenizer using the Hugging Face Transformers and PEFT libraries to generate a SPARQL query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import re
# Model ID for the Mistral English v1.1 fine-tuned model
model_id = "julioc-p/mistral_en_txt_sparql_4bit"
base_model_for_tokenizer = "mistralai/Mistral-7B-Instruct-v0.1"
# Configuration for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_for_tokenizer)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

def extract_sparql(text):
    # Extract the first SPARQL query (SELECT/ASK/CONSTRUCT/DESCRIBE ... }) from the raw model output
    match = re.search(
        r"(SELECT|ASK|CONSTRUCT|DESCRIBE)(.*?)\}",
        text,
        re.DOTALL | re.IGNORECASE | re.MULTILINE,
    )
    if match:
        sparql_query = match.group(0).strip()
        # Strip markdown code fences the model sometimes emits
        sparql_query = re.sub(
            r"^\s*```sparql\n", "", sparql_query, flags=re.IGNORECASE | re.MULTILINE
        )
        sparql_query = re.sub(r"\n```\s*$", "", sparql_query)
        return sparql_query.strip()
    # Fallback: same pattern without the MULTILINE flag
    match_simple = re.search(
        r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*?\}", text, re.DOTALL | re.IGNORECASE
    )
    if match_simple:
        return match_simple.group(0).strip()
    return ""
# --- Example usage ---
question = "What is the boiling point of water?"
knowledge_graph_target = "Wikidata"
prompt_content = f"Write a SparQL query that answers this request: '{question}' from the knowledge graph {knowledge_graph_target}."
chat_template = [
    {"role": "user", "content": prompt_content},
]

inputs = tokenizer.apply_chat_template(
    chat_template,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
# Generate the output
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs,  # apply_chat_template with return_tensors="pt" returns a tensor of token ids
        max_new_tokens=512,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens (skip the prompt)
input_length = inputs.shape[1]
generated_text_assistant_part = tokenizer.decode(
    outputs[0][input_length:], skip_special_tokens=True
)
cleaned_sparql = extract_sparql(generated_text_assistant_part)
print(f"Question: {question}")
print(f"Generated SPARQL: {cleaned_sparql}")
print(f"Raw generated text (assistant part): {generated_text_assistant_part}")
Training Data
The model was fine-tuned on a subset of the julioc-p/Question-Sparql dataset. Specifically, a 35,000-sample English subset filtered to include only Wikidata-related queries was used.
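For illustration, a comparable subset can be built with the datasets library as sketched below. The filtering column names (language and knowledge_graphs) are assumptions made for this example; consult the dataset card for the actual field names.

from datasets import load_dataset

# Load the full dataset from the Hugging Face Hub
dataset = load_dataset("julioc-p/Question-Sparql", split="train")

# Keep English questions targeting Wikidata; the column names below are assumed, not verified
filtered = dataset.filter(
    lambda ex: ex["language"] == "en" and "wikidata" in str(ex["knowledge_graphs"]).lower()
)

# Draw a 35,000-example sample for fine-tuning
subset = filtered.shuffle(seed=42).select(range(35_000))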
Training Hyperparameters
The following hyperparameters were used for the fine-tuning:
- LoRA Configuration:
  - r (LoRA rank): 16
  - lora_alpha: 16
  - lora_dropout: 0.1
  - bias: "none"
  - task_type: "CAUSAL_LM"
  - target_modules: "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"
- Training Arguments:
  - num_train_epochs: 5
  - per_device_train_batch_size: 1
  - gradient_accumulation_steps: 8
  - gradient_checkpointing: True
  - optim: "paged_adamw_32bit"
  - learning_rate: 1e-5
  - weight_decay: 0.05
  - bf16: False
  - fp16: True
  - max_grad_norm: 1.0
  - warmup_ratio: 0.01
  - lr_scheduler_type: "cosine"
  - group_by_length: True
  - packing: False
- BitsAndBytesConfig:
  - load_in_4bit: True
  - bnb_4bit_quant_type: "nf4"
  - bnb_4bit_compute_dtype: torch.float16
  - bnb_4bit_use_double_quant: False
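For reference, the sketch below shows how the values above map onto a peft LoraConfig and transformers TrainingArguments of the kind passed to trl's SFTTrainer. It is reconstructed from this list rather than taken from the original training script; output_dir is a placeholder, and packing=False is an SFTTrainer argument so it is omitted here.

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration matching the values listed above
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Training arguments matching the values listed above (output_dir is a placeholder)
training_args = TrainingArguments(
    output_dir="./mistral_en_txt_sparql_4bit",
    num_train_epochs=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    learning_rate=1e-5,
    weight_decay=0.05,
    bf16=False,
    fp16=True,
    max_grad_norm=1.0,
    warmup_ratio=0.01,
    lr_scheduler_type="cosine",
    group_by_length=True,
)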
Speeds, Sizes, Times
- The training took approximately 19-20 hours for 5 epochs on a single NVIDIA V100 GPU.
Evaluation
Testing Data, Factors & Metrics
Testing Data
- QALD-10 test set (English): Standardized benchmark with English questions targeting Wikidata. 391 English questions were attempted after filtering.
- v1 Test Set (English): 3,500 English held-out examples randomly sampled from the julioc-p/Question-Sparql dataset (Wikidata-focused).
Metrics
The primary evaluation metrics used were the QALD standard macro-averaged F1-score, Precision, and Recall. Non-executable queries resulted in P, R, F1 = 0. The percentage of Executable Queries was also tracked.
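The per-question computation can be sketched as follows. Gold and predicted answers are compared as sets, a non-executable query scores P = R = F1 = 0, and a question where both answer sets are empty is counted as correct (the "Both Empty" category in the results below). This is a simplification of the official QALD evaluation, not the exact evaluation script.

def question_scores(gold, predicted, executable=True):
    # Precision, recall and F1 for a single question, comparing answer sets
    if not executable:
        return 0.0, 0.0, 0.0
    gold, predicted = set(gold), set(predicted)
    if not gold and not predicted:
        return 1.0, 1.0, 1.0  # both empty counts as correct
    overlap = len(gold & predicted)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def macro_average(scores):
    # Macro-average (P, R, F1) over all questions
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))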
Results
On QALD-10 (English, N=391):
- Macro F1-Score: 0.0691
- Macro Precision: 0.7033
- Macro Recall: 0.0691
- Executable Queries: 98.47% (385/391)
- Correctness (Exact Match + Both Empty): 6.91% (27/391)
- Correct (Exact Match): 5.63% (22/391)
- Correct (Both Empty): 1.28% (5/391)
On v1 Test Set (English, N=3500):
- Macro F1-Score: 0.2268
- Macro Precision: 0.6244
- Macro Recall: 0.2269
- Executable Queries: 86.54% (3029/3500)
- Correctness (Exact Match + Both Empty): 22.63% (792/3500)
- Correct (Exact Match): 14.63% (512/3500)
- Correct (Both Empty): 8.00% (280/3500)
Environmental Impact
- Hardware Type: 1 x NVIDIA V100 32GB GPU
- Hours used: Approx. 19-20 hours for fine-tuning.
- Cloud Provider: DFKI HPC Cluster
- Compute Region: Germany
- Carbon Emitted: Approx. 2.96 kg CO2eq.
Technical Specifications
Compute Infrastructure
Hardware
- NVIDIA V100 GPU (32 GB RAM)
- Approx. 60 GB system RAM
Software
- Slurm, NVIDIA Enroot, CUDA 11.8.0
- Python, Hugging Face transformers, peft (0.13.2), bitsandbytes, trl, PyTorch.
More Information
- Thesis GitHub: https://github.com/julioc-p/cross-lingual-transferability-thesis
- Dataset: https://huggingface.co/datasets/julioc-p/Question-Sparql
Framework versions
- PEFT 0.13.2
- Transformers 4.39.3
- BitsAndBytes 0.43.0
- trl 0.8.6
- PyTorch (torch==2.1.0)