This model is a fine-tuned version of mistralai/Mistral-7B-v0.1 for generating SPARQL queries from German natural language questions, specifically targeting the Wikidata knowledge graph.

Model Details

Model Description

This model was fine-tuned using QLoRA with 4-bit quantization. It takes a German natural language question and the corresponding entity/relationship context as input and produces a SPARQL query for Wikidata. The model is part of experiments investigating continual multilingual pre-training.

  • Developed by: Julio Cesar Perez Duran
  • Funded by: DFKI
  • Model type: Decoder-only Transformer-based language model
  • Language: de (German)
  • License: MIT
  • Finetuned from model: mistralai/Mistral-7B-v0.1

Bias, Risks, and Limitations

  • Context reliance: Performance depends heavily on the accuracy and completeness of the provided entity/relationship context mappings.
  • Output format: V2 models sometimes generate extraneous text after the SPARQL query, requiring post-processing (extraction of the content within ```sparql ... ``` delimiters; see the extract_sparql helper in the example below).
  • EOS token generation: Inconsistent end-of-sequence token generation was observed, possibly influenced by dataset packing during training.

How to Get Started with the Model

The following Python script provides an example of how to load the model and tokenizer to generate a SPARQL query.

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM # Use AutoPeftModelForCausalLM for v2 models
import re
import json

# Model ID for the Mistral German v2 model
model_id = "julioc-p/mistral_txt_sparql_de_v2"

# 4-bit quantization configuration (matching the v2 fine-tuning setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16, 
    bnb_4bit_use_double_quant=True,
)

# Load the model and tokenizer
print(f"Loading model: {model_id}")
model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
print(f"Loading tokenizer for: {model_id}")
tokenizer = AutoTokenizer.from_pretrained(model_id)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

sparql_pattern_strict = re.compile(
    r"""
    (SELECT|ASK|CONSTRUCT|DESCRIBE) # Match the starting keyword
    .*?                               # Match any characters non-greedily
    \}                                # Match the closing brace of the main block
    (                                 # Start of optional block for trailing clauses
        (?:                           # Non-capturing group for one or more trailing clauses
            \s*
            (?:                       # Clause alternatives
                (?:(?:GROUP|ORDER)\s+BY|HAVING)\s+.+?\s*(?=\s*(?:(?:GROUP|ORDER)\s+BY|HAVING|LIMIT|OFFSET|VALUES|$)) |
                LIMIT\s+\d+ |
                OFFSET\s+\d+ |
                VALUES\s*(?:\{.*?\}|\w+|\(.*?\))
            )
        )* # Match zero or more clauses
    )
    """,
    re.DOTALL | re.IGNORECASE | re.VERBOSE
)

def extract_sparql(text):
    code_block_match = re.search(r"```(?:sparql)?\s*(.*?)\s*```", text, re.DOTALL | re.IGNORECASE)
    if code_block_match:
        text_to_search = code_block_match.group(1)
    else:
        text_to_search = text # Search directly if no markdown code block

    match = sparql_pattern_strict.search(text_to_search)
    if match:
        return match.group(0).strip()
    else:
        fallback_match = re.search(r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*?\}", text_to_search, re.DOTALL | re.IGNORECASE)
        if fallback_match:
            return fallback_match.group(0).strip()
    return ""

# --- Example usage ---
question = "Wer war der amerikanische weibliche Angestellte des Barnard College?"
example_context_json_str = '''
{
  "entitäten": {
    "Barnard College": "Q167733",
    "amerikanisch": "Q30",
    "weiblich": "Q6581072",
    "Angestellte": "Q5"
  },
  "beziehungen": {
    "Instanz von": "P31",
    "Arbeitgeber": "P108",
    "Geschlecht": "P21",
    "Land der Staatsbürgerschaft": "P27"
  }
}
'''
# System prompt template for v2 models (German)
system_message_template = """Sie sind ein Experte für die Übersetzung von Text in SPARQL-Anfragen. Benutzer werden Ihnen Fragen auf Deutsch stellen, und Sie werden eine SPARQL-Anfrage basierend auf dem bereitgestellten Kontext generieren, der in ```sparql <Antwortanfrage>``` eingeschlossen ist.
KONTEXT:
{context}"""

# Format the system message with the actual context
formatted_system_message = system_message_template.format(context=example_context_json_str)

messages = [
    {"role": "system", "content": formatted_system_message},
    {"role": "user", "content": question},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,   # return input_ids and attention_mask together
    return_tensors="pt"
).to(model.device)

# Generate the output
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=512, # as in the v2 generation script
        do_sample=True,     # sampling-based decoding
        temperature=0.7,    # common sampling settings
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode only the generated part
generated_text_full = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract only assistant's response (ChatML format specific extraction)
assistant_response_part = ""
if "<|im_start|>assistant" in generated_text_full: # Specific to ChatML after template application
    assistant_response_part = generated_text_full.split("<|im_start|>assistant")[-1].split("<|im_end|>")[0].strip()
elif "assistant\n" in generated_text_full: # More generic if template output varies
     assistant_response_part = generated_text_full.split("assistant\n")[-1].strip()
else: 
    input_length = inputs.input_ids.shape[1]
    assistant_response_part = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True).strip()


cleaned_sparql = extract_sparql(assistant_response_part)

print(f"Frage: {question}")
print(f"Kontext: {example_context_json_str}")
print(f"Generierte SPARQL: {cleaned_sparql}")
print(f"Textausgabe (Assistent): {assistant_response_part}")

Training Data

The model was fine-tuned on a subset of the julioc-p/Question-Sparql dataset consisting of 80,000 German examples.

Training Hyperparameters

The following hyperparameters were used (a configuration sketch follows the list):

  • LoRA Configuration:
    • r (LoRA rank): 256
    • lora_alpha: 128
    • lora_dropout: 0.05
    • target_modules: "all-linear"
    • task_type: "CAUSAL_LM"
  • Training Arguments:
    • num_train_epochs: 3
    • Effective batch size: 6 (per_device_train_batch_size=1, gradient_accumulation_steps=6)
    • optim: "adamw_torch_fused"
    • learning_rate: 2e-4
    • weight_decay: 0.05
    • fp16: True
    • max_grad_norm: 0.3
    • warmup_ratio: 0.03
    • lr_scheduler_type: "constant"
    • packing: True
    • NEFTune noise_alpha: 5
  • BitsAndBytesConfig:
    • load_in_4bit: True
    • bnb_4bit_quant_type: "nf4"
    • bnb_4bit_compute_dtype: torch.float16
    • bnb_4bit_use_double_quant: True
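
The sketch below shows how these hyperparameters map onto peft, transformers, and trl objects. It is an illustration, not the original training script: the dataset split, the text column name, the maximum sequence length, and the prompt formatting are assumptions.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

peft_config = LoraConfig(
    r=256,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="mistral_txt_sparql_de_v2",  # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=6,  # effective batch size 6
    optim="adamw_torch_fused",
    learning_rate=2e-4,
    weight_decay=0.05,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

# Assumed: prompts already formatted into a "text" column; the actual selection of the
# 80,000 German examples and the chat formatting are not shown here.
train_dataset = load_dataset("julioc-p/Question-Sparql", split="train")

trainer = SFTTrainer(
    model=base_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    dataset_text_field="text",  # assumed column name
    max_seq_length=2048,        # assumption; not listed above
    packing=True,
    neftune_noise_alpha=5,
)
trainer.train()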

Speeds, Sizes, Times

  • Mistral German v2 training took approx. 19-20 hours on a single NVIDIA V100 GPU.

Evaluation

Testing Data, Factors & Metrics

Testing Data

  1. QALD-10 test set (German): a standardized benchmark; 394 German questions were evaluated for this model.
  2. v2 Test Set (German): 10,000 held-out German examples from the julioc-p/Question-Sparql dataset, including entity/relationship context.

Metrics

The QALD-standard macro-averaged F1-score, precision, and recall are reported. Non-executable queries are scored P = R = F1 = 0. The percentage of executable queries is also tracked. A sketch of the computation follows.
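
The snippet below is a minimal sketch of how per-question scores are macro-averaged, following this card's reading of the QALD convention (a question where both the gold and the predicted answer sets are empty counts as fully correct); it is not the original evaluation script.

def prf1(gold: set, pred: set) -> tuple[float, float, float]:
    # Both empty: treated as fully correct (the "Both Empty" case in the results)
    if not gold and not pred:
        return 1.0, 1.0, 1.0
    # Missing or non-executable prediction, or empty gold: all scores are 0
    if not gold or not pred:
        return 0.0, 0.0, 0.0
    tp = len(gold & pred)
    precision = tp / len(pred)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def macro_scores(pairs):
    """pairs: iterable of (gold_answer_set, predicted_answer_set), one per question."""
    scores = [prf1(gold, pred) for gold, pred in pairs]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))  # (macro P, macro R, macro F1)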

Results

On QALD-10 (German, N=394):

  • Macro F1-Score: 0.2595
  • Macro Precision: 0.6419
  • Macro Recall: 0.2632
  • Executable Queries: 99.75% (393/394)
  • Correctness (Exact Match + Both Empty): 25.13% (99/394)
    • Correct (Exact Match): 23.86% (94/394)
    • Correct (Both Empty): 1.27% (5/394)

On v2 Test Set (German, N=10000):

  • Macro F1-Score: 0.7183
  • Macro Precision: 0.8362
  • Macro Recall: 0.7198
  • Executable Queries: 97.27% (9727/10000)
  • Correctness (Exact Match + Both Empty): 71.58% (7158/10000)
    • Correct (Exact Match): 62.74% (6274/10000)
    • Correct (Both Empty): 8.84% (884/10000)

Environmental Impact

  • Hardware Type: 1 x NVIDIA V100 32GB GPU
  • Hours used: Approx. 19-20 hours for fine-tuning.
  • Cloud Provider: DFKI HPC Cluster
  • Compute Region: Germany
  • Carbon Emitted: Approx. 2.96 kg CO2eq.

Technical Specifications

Compute Infrastructure

Hardware

  • NVIDIA V100 GPU (32 GB RAM)
  • Approx. 60 GB system RAM

Software

  • Slurm, NVIDIA Enroot, CUDA 11.8.0
  • Python, Hugging Face transformers, peft (0.13.2), bitsandbytes, trl, PyTorch.

More Information

Framework versions

  • PEFT 0.13.2
  • Transformers 4.39.3
  • BitsAndBytes 0.43.0
  • TRL 0.8.6
  • PyTorch 2.1.0