julioc-p/mistral_de_txt_sparql_4bit

This model is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.1 for generating SPARQL queries from German natural language questions, specifically targeting the Wikidata knowledge graph.

Model Details

Model Description

The model was fine-tuned with QLoRA and uses 4-bit quantization. It takes a German natural language question as input and aims to produce a corresponding SPARQL query that can be executed against the Wikidata knowledge graph. It is part of a series of experiments investigating the impact of continual multilingual pre-training on cross-lingual transferability and task-specific performance.

Developed by: Julio Cesar Perez Duran
Funded by: DFKI
Model type: Decoder-only Transformer-based language model
Language(s) (NLP): de (German)
License: mit
Finetuned from model: mistralai/Mistral-7B-Instruct-v0.1

Bias, Risks, and Limitations

Entity/Relationship Linking Bottleneck: A primary limitation of this model is that, without explicit contextual aid, it often fails to map German entity and relation mentions to their correct Wikidata identifiers (QIDs and PIDs). The model may generate structurally valid SPARQL whose entities or properties are nevertheless incorrect, which significantly reduced recall.
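Because a query can be structurally valid and still reference the wrong items or properties, it can help to list the identifiers in a generated query and check them by hand. The following is a minimal sketch, not part of the original evaluation code. The helper name `extract_wikidata_ids` is hypothetical, and the pattern only covers the common prefixed forms (`wd:`, `wdt:`, `p:`, `ps:`, `pq:`), not full IRIs.

```python
import re

# Matches identifiers written with common Wikidata prefixes, e.g. wd:Q42 or wdt:P31.
# Assumption: the query uses prefixed names rather than full <http://...> IRIs.
ID_PATTERN = re.compile(r"\b(?:wd|wdt|p|ps|pq):([QP]\d+)\b")


def extract_wikidata_ids(query):
    """Return (qids, pids) found in a SPARQL query, each deduplicated and sorted numerically."""
    found = set(ID_PATTERN.findall(query))
    by_number = lambda identifier: int(identifier[1:])
    qids = sorted((i for i in found if i.startswith("Q")), key=by_number)
    pids = sorted((i for i in found if i.startswith("P")), key=by_number)
    return qids, pids


# Hand-written illustrative query, not actual model output.
example = "SELECT ?value WHERE { wd:Q283 wdt:P2102 ?value . }"
qids, pids = extract_wikidata_ids(example)
print("Items to verify:", qids)
print("Properties to verify:", pids)
```

Each returned identifier can then be looked up on Wikidata to confirm that it matches the entity or relation mentioned in the German question.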

How to Get Started with the Model

The following Python script shows how to load the model and tokenizer with the Hugging Face Transformers and PEFT libraries and generate a SPARQL query.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import re

model_id = "julioc-p/mistral_de_txt_sparql_4bit"
base_model_for_tokenizer = "mistralai/Mistral-7B-Instruct-v0.1"

# Configuration for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the model automatically on the available device(s)
)
tokenizer = AutoTokenizer.from_pretrained(base_model_for_tokenizer)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id


sparql_pattern_strict = re.compile(
    r"""
    (SELECT|ASK|CONSTRUCT|DESCRIBE) # Match SPARQL query type
    .*?                               # Match any characters (non-greedy)
    \}                                # Match the first closing curly brace
    (                                 # Start of optional block for trailing clauses
        (?:                           # Non-capturing group for one or more trailing clauses
            \s* # Match any whitespace
            (?:                       # Non-capturing group for specific clauses
                (?:(?:GROUP|ORDER)\s+BY|HAVING)\s+.+?\s*(?=\s*(?:(?:GROUP|ORDER)\s+BY|HAVING|LIMIT|OFFSET|VALUES|$)) | # GROUP BY, ORDER BY, HAVING
                LIMIT\s+\d+ |         # LIMIT clause
                OFFSET\s+\d+ |        # OFFSET clause
                VALUES\s*(?:\{.*?\}|\w+|\(.*?\)) # VALUES clause
            )
        )* # Match zero or more trailing clauses
    )
    """,
    re.DOTALL | re.IGNORECASE | re.VERBOSE,
)

def extract_sparql(text):
    code_block_match = re.search(
        r"```(?:sparql)?\s*(.*?)\s*```", text, re.DOTALL | re.IGNORECASE
    )
    if code_block_match:
        text_to_search = code_block_match.group(1)
    else:
        text_to_search = text
    
    match = sparql_pattern_strict.search(text_to_search)
    if match:
        return match.group(0).strip()
    else:
        # Fallback to simpler regex if strict pattern doesn't match
        fallback_match = re.search(
            r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*?\}",
            text_to_search,
            re.DOTALL | re.IGNORECASE,
        )
        if fallback_match:
            return fallback_match.group(0).strip()
    return ""

# --- Example usage ---
question = "Was ist der Siedepunkt von Wasser?"
knowledge_graph_target = "Wikidata"

prompt_content = f"Write a SparQL query that answers this request: '{question}' from the knowledge graph {knowledge_graph_target}."

chat_template = [
    {"role": "user", "content": prompt_content},
]

inputs = tokenizer.apply_chat_template(
    chat_template,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate the output
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=512,
        do_sample=True,  # sampling is enabled, so the output can differ between runs
        pad_token_id=tokenizer.pad_token_id
    )

generated_text_assistant_part = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
cleaned_sparql = extract_sparql(generated_text_assistant_part)

print(f"Frage: {question}")
print(f"Generierte SPARQL: {cleaned_sparql}")
print(f"Rohe generierte Textausgabe (Assistent): {generated_text_assistant_part}")
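To check whether a generated query is executable, it can be sent to the public Wikidata SPARQL endpoint. The sketch below uses only the Python standard library and is an illustration, not the evaluation code used for this model. The function names `build_request` and `run_query`, the timeout, and the `User-Agent` string are assumptions chosen for this example.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, user_agent="sparql-model-card-example/0.1"):
    """Build a GET request for the Wikidata SPARQL endpoint (nothing is sent yet)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{WIKIDATA_ENDPOINT}?{params}",
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": user_agent,  # hypothetical identifier; replace with your own
        },
    )


def run_query(query, timeout=30):
    """Execute the query and return its results, or None if it could not be executed."""
    try:
        with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
            data = json.load(response)
    except (OSError, ValueError):
        # Network errors, HTTP errors (e.g. malformed SPARQL), or invalid JSON.
        return None
    if "boolean" in data:  # ASK queries return a single boolean
        return data["boolean"]
    return data.get("results", {}).get("bindings", [])
```

With the variables from the script above, `run_query(cleaned_sparql)` returns `None` for a query that cannot be executed and the result bindings otherwise.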

Training Data

The model was fine-tuned on a subset of the julioc-p/Question-Sparql dataset. Specifically, for the v1.1 Mistral German model, a 35,000-sample German subset was used.

Training Hyperparameters

The following hyperparameters were used for the fine-tuning:

LoRA Configuration (for Mistral v1.1):
- r (LoRA rank): 16 (Adjusted from 64 for Mistral due to stability, as per thesis)
- lora_alpha: 16 (Maintained from initial v1 setup, or potentially adjusted with r)
- lora_dropout: 0.1
- bias: "none"
- task_type: "CAUSAL_LM"
- target_modules: "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj" (Note: lm_head was removed for Mistral v1.1, as per thesis page 39)

Training Arguments:
- num_train_epochs: 5
- per_device_train_batch_size: 1
- gradient_accumulation_steps: 8
- gradient_checkpointing: True
- optim: "paged_adamw_32bit"
- learning_rate: 1e-5
- weight_decay: 0.05
- bf16: False
- fp16: True
- max_grad_norm: 1.0
- warmup_ratio: 0.01
- lr_scheduler_type: "cosine"
- group_by_length: True
- packing: False

BitsAndBytesConfig:
- load_in_4bit: True
- bnb_4bit_quant_type: "nf4"
- bnb_4bit_compute_dtype: torch.float16
- bnb_4bit_use_double_quant: False
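
The training arguments above imply an effective batch size and a rough number of optimizer steps. The following back-of-the-envelope calculation is a sketch under two assumptions that the card does not state explicitly: all 35,000 samples were used as training examples (no validation split held out), and training ran on one GPU without further data parallelism. The actual step count reported by the trainer may differ slightly.

```python
import math

# Values taken from the card
num_samples = 35_000             # German training subset
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 5
warmup_ratio = 0.01

# Assumption: a single GPU, so no multiplication by a device count
num_gpus = 1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"Effective batch size: {effective_batch_size}")  # 8
print(f"Optimizer steps per epoch: {steps_per_epoch}")  # 4375
print(f"Total optimizer steps: {total_steps}")          # 21875
print(f"Warmup steps: {warmup_steps}")                  # 219
```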

Speeds, Sizes, Times

The training took approximately 19-20 hours for 5 epochs on a single NVIDIA V100 GPU.

Evaluation

Testing Data, Factors & Metrics

Testing Data

QALD-10 test set (German): Standardized benchmark with German questions targeting Wikidata. 391 German questions were attempted after filtering.
v1 Test Set (German): 3,500 German held-out examples randomly sampled from the julioc-p/Question-Sparql dataset (Wikidata-focused).

Metrics

The primary evaluation metrics used were the QALD standard macro-averaged F1-score, Precision, and Recall. Non-executable queries resulted in P, R, F1 = 0. The percentage of Executable Queries was also tracked.
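
The metric computation can be sketched as follows. This is an illustrative implementation, not the evaluation script used for the reported numbers. Answer sets are modelled as Python sets, a non-executable query is represented by `None`, and the function names are hypothetical. QALD evaluators differ in how they score edge cases such as an empty prediction for a non-empty gold answer, so the handling here (all scores 0) is an assumption.

```python
from typing import Iterable, Optional, Set, Tuple


def question_scores(gold: Set[str], predicted: Optional[Set[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one question."""
    if predicted is None:            # query was not executable
        return 0.0, 0.0, 0.0
    if not gold and not predicted:   # both empty counts as correct
        return 1.0, 1.0, 1.0
    if not gold or not predicted:    # assumption: exactly one side empty scores 0
        return 0.0, 0.0, 0.0
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_scores(pairs: Iterable[Tuple[Set[str], Optional[Set[str]]]]) -> Tuple[float, float, float]:
    """Average the per-question precision, recall, and F1 over all questions."""
    scores = [question_scores(gold, predicted) for gold, predicted in pairs]
    count = len(scores)
    return tuple(sum(score[i] for score in scores) / count for i in range(3))


# Toy data: exact match, partial match, both empty, non-executable query.
toy = [
    ({"Q1"}, {"Q1"}),
    ({"Q1", "Q2"}, {"Q1"}),
    (set(), set()),
    ({"Q1"}, None),
]
precision, recall, f1 = macro_scores(toy)
print(f"Macro P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Macro averaging here means that F1 is computed per question and then averaged, rather than being derived from the averaged precision and recall.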

Results

On QALD-10 (German, N=391):

Macro F1-Score: 0.0563
Macro Precision: 0.6726
Macro Recall: 0.0563
Executable Queries: 94.88% (371/391)
Correctness (Exact Match + Both Empty): 5.63% (22/391)
- Correct (Exact Match): 4.60% (18/391)
- Correct (Both Empty): 1.02% (4/391)

On v1 Test Set (German, N=3500):

Macro F1-Score: 0.1003
Macro Precision: 0.7481
Macro Recall: 0.1006
Executable Queries: 89.11% (3119/3500)
Correctness (Exact Match + Both Empty): 9.97% (349/3500)
- Correct (Exact Match): 2.51% (88/3500)
- Correct (Both Empty): 7.46% (261/3500)

Environmental Impact

Hardware Type: 1 x NVIDIA V100 32GB GPU
Hours used: Approx. 19-20 hours for fine-tuning.
Cloud Provider: DFKI HPC Cluster
Compute Region: Germany
Carbon Emitted: Approx. 2.96 kg CO2eq.

Technical Specifications

Compute Infrastructure

Hardware

NVIDIA V100 GPU (32 GB GPU memory)
Approx. 60 GB system RAM

Software

Slurm, NVIDIA Enroot, CUDA 11.8.0
Python, Hugging Face transformers, peft (0.13.2), bitsandbytes, trl, PyTorch.

More Information

Thesis GitHub: https://github.com/julioc-p/cross-lingual-transferability-thesis
Dataset: https://huggingface.co/datasets/julioc-p/Question-Sparql
Model Link: https://huggingface.co/julioc-p/mistral_de_txt_sparql_4bit

Framework versions

PEFT 0.13.2
Transformers 4.39.3
BitsAndBytes 0.43.0
trl 0.8.6
PyTorch 2.1.0
