This model is a fine-tuned version of occiglot/occiglot-7b-eu5
for generating SPARQL queries from German natural language questions, specifically targeting the Wikidata knowledge graph.
Model Details
Model Description
The model was fine-tuned using QLoRA with 4-bit quantization. It takes a German natural-language question and the corresponding entity/relationship context as input and aims to produce a SPARQL query for Wikidata. This model is part of a series of experiments investigating continual multilingual pre-training.
- Developed by: Julio Cesar Perez Duran
- Funded by: DFKI
- Model type: Decoder-only Transformer-based language model
- Language(s) (NLP): de (German)
- License: MIT
- Finetuned from model: occiglot/occiglot-7b-eu5
Bias, Risks, and Limitations
- Context Reliant: Performance depends heavily on the provided entity/relationship context mappings.
- Output Format: The model sometimes generates extraneous text after the SPARQL query, requiring post-processing (extraction of the content within ```sparql ... ``` delimiters, as the extract_sparql function below demonstrates).
- EOS Token Generation: Inconsistent end-of-sequence token generation was observed, possibly influenced by dataset packing during training.
How to Get Started with the Model
The following Python script shows how to load the model and tokenizer and generate a SPARQL query. It requires providing entity/relationship context.
```python
import re

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, BitsAndBytesConfig

# Model ID for the Occiglot German v2 model
model_id = "julioc-p/occiglot_txt_sparql_de_v2"

# Configuration for 4-bit quantization (matching the training setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load the adapter together with its quantized base model
model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id


def extract_sparql(text):
    """Extract a SPARQL query from model output, preferring fenced code blocks."""
    code_block_match = re.search(r"```(?:sparql)?\s*(.*?)\s*```", text, re.DOTALL | re.IGNORECASE)
    text_to_search = code_block_match.group(1) if code_block_match else text
    # Greedy match up to the last closing brace so nested braces are not truncated
    match = re.search(r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*\}", text_to_search, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(0).strip()
    return ""


question = "Wer war der amerikanische weibliche Angestellte des Barnard College?"
example_context_json_str = """
{
  "entitäten": {
    "Barnard College": "Q167733",
    "amerikanisch": "Q30",
    "weiblich": "Q6581072",
    "Angestellte": "Q5"
  },
  "beziehungen": {
    "Instanz von": "P31",
    "Arbeitgeber": "P108",
    "Geschlecht": "P21",
    "Land der Staatsbürgerschaft": "P27"
  }
}
"""

# System prompt template (German, as used during fine-tuning)
system_message_template = """Sie sind ein Experte für die Übersetzung von Text in SPARQL-Anfragen. Benutzer werden Ihnen Fragen auf Deutsch stellen, und Sie werden eine SPARQL-Anfrage basierend auf dem bereitgestellten Kontext generieren, der in ```sparql <Antwortanfrage>``` eingeschlossen ist.
KONTEXT:
{context}"""

messages = [
    {"role": "system", "content": system_message_template.format(context=example_context_json_str)},
    {"role": "user", "content": question},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate the output; the prompt is a single unpadded sequence, so an
# all-ones attention mask is correct
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=torch.ones_like(input_ids),
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens (the assistant's response)
assistant_response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
cleaned_sparql = extract_sparql(assistant_response)

print(f"Frage: {question}")
print(f"Kontext: {example_context_json_str}")
print(f"Generierte SPARQL: {cleaned_sparql}")
print(f"Textausgabe (Assistent): {assistant_response}")
```
Training Data
The model was fine-tuned on a subset of the julioc-p/Question-Sparql dataset: 80,000 German examples, each with a context field containing Wikidata entity and relationship ID mappings. A minimal loading sketch is shown below.
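A minimal sketch of loading the dataset with the datasets library; the split name and the "language" filter column are assumptions about the dataset schema, not documented fields:

```python
from datasets import load_dataset

# Load the Question-Sparql dataset from the Hugging Face Hub
dataset = load_dataset("julioc-p/Question-Sparql", split="train")  # split name assumed

# Keep only the German examples; the "language" column is an assumption
# about the dataset schema
german_examples = dataset.filter(lambda ex: ex["language"] == "de")
```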
Training Hyperparameters
The following hyperparameters were used for the v2 Occiglot German fine-tuning (a configuration sketch follows the list):
- LoRA Configuration (v2 models):
  - r (LoRA rank): 256
  - lora_alpha: 128
  - lora_dropout: 0.05
  - target_modules: "all-linear"
- Training Arguments (v2 models):
  - num_train_epochs: 3
  - Effective batch size: 6
  - optim: "adamw_torch_fused"
  - learning_rate: 2e-4
  - fp16: True
  - max_grad_norm: 0.3
  - warmup_ratio: 0.03
  - lr_scheduler_type: "constant"
  - packing: True
  - NEFTune noise_alpha: 5
- BitsAndBytesConfig (v2 models):
  - load_in_4bit: True
  - bnb_4bit_quant_type: "nf4"
  - bnb_4bit_compute_dtype: torch.float16
  - bnb_4bit_use_double_quant: True
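For reference, here is a minimal sketch of how these values map onto a peft/trl training setup. It assumes model, tokenizer, and train_dataset are already prepared; the per-device batch size / gradient-accumulation split, the output path, and the dataset_text_field are assumptions, not recorded values:

```python
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

# LoRA configuration with the v2 values listed above
peft_config = LoraConfig(
    r=256,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Training arguments with the v2 values listed above
training_args = TrainingArguments(
    output_dir="occiglot_sparql_de_v2",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=2,       # assumption: 2 x 3 accumulation steps
    gradient_accumulation_steps=3,       # gives the effective batch size of 6
    optim="adamw_torch_fused",
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
)

trainer = SFTTrainer(
    model=model,                  # quantized base model (see BitsAndBytesConfig above)
    args=training_args,
    train_dataset=train_dataset,  # the 80,000 German training examples
    peft_config=peft_config,
    dataset_text_field="text",    # assumption about the prepared dataset schema
    packing=True,                 # dataset packing, as noted in the limitations
    neftune_noise_alpha=5,
    tokenizer=tokenizer,
)
trainer.train()
```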
Speeds, Sizes, Times
- Fine-tuning took approximately 6 hours on a single NVIDIA V100 GPU.
Evaluation
Testing Data, Factors & Metrics
Testing Data
- QALD-10 test set (German): 394 German questions were evaluated for this model.
- v2 Test Set (German): 10,000 held-out German examples from the julioc-p/Question-Sparql dataset, including context.
Metrics
Standard QALD macro-averaged F1-score, Precision, and Recall. Queries that fail to execute are scored P = R = F1 = 0. A sketch of this scoring scheme is shown below.
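The exact evaluation script is not reproduced here; the following is a minimal sketch of the macro-averaged scoring logic under these conventions, where gold_sets and pred_sets are lists of per-question answer sets and the both-empty case counts as fully correct (matching the "Both Empty" category in the results):

```python
def qald_macro_scores(gold_sets, pred_sets):
    """Macro-averaged precision, recall, and F1 over per-question answer sets."""
    precisions, recalls, f1s = [], [], []
    for gold, pred in zip(gold_sets, pred_sets):
        if not gold and not pred:
            p = r = f = 1.0  # both empty: counted as correct ("Both Empty")
        elif not gold or not pred:
            p = r = f = 0.0  # non-executable/empty predictions score zero
        else:
            overlap = len(gold & pred)
            p = overlap / len(pred)
            r = overlap / len(gold)
            f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(f1s)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```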
Results
On QALD-10 (German, N=394):
- Macro F1-Score: 0.2304
- Macro Precision: 0.6920
- Macro Recall: 0.2340
- Executable Queries: 99.75% (393/394)
- Correctness (Exact Match + Both Empty): 22.34% (88/394)
- Correct (Exact Match): 20.56% (81/394)
- Correct (Both Empty): 1.78% (7/394)
On v2 Test Set (German, N=10000):
- Macro F1-Score: 0.7268
- Macro Precision: 0.8515
- Macro Recall: 0.7278
- Executable Queries: 99.63% (9963/10000)
- Correctness (Exact Match + Both Empty): 72.50% (7250/10000)
- Correct (Exact Match): 63.48% (6348/10000)
- Correct (Both Empty): 9.02% (902/10000)
Environmental Impact
- Hardware Type: 1 x NVIDIA V100 32GB GPU
- Hours used: Approx. 6 hours for fine-tuning.
- Cloud Provider: DFKI HPC Cluster
- Compute Region: Germany
- Carbon Emitted: Approx. 0.89 kg CO2eq.
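
For reference, 0.89 kg CO2eq is consistent with roughly 300 W of GPU draw over 6 hours (about 1.8 kWh) at a grid carbon intensity of roughly 0.49 kg CO2eq/kWh; the power-draw and grid-intensity figures here are assumptions used only to illustrate the arithmetic, not measured values.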
Technical Specifications
Compute Infrastructure
Hardware
- NVIDIA V100 GPU (32 GB VRAM)
- Approx. 60 GB system RAM
Software
- Slurm, NVIDIA Enroot, CUDA 11.8.0
- Python, Hugging Face transformers, peft (0.13.2), bitsandbytes, trl, PyTorch
More Information
- Thesis GitHub: https://github.com/julioc-p/cross-lingual-transferability-thesis
- Dataset: https://huggingface.co/datasets/julioc-p/Question-Sparql
Framework versions
- PEFT 0.13.2
- Transformers 4.39.3
- BitsAndBytes 0.43.0
- TRL 0.8.6
- PyTorch 2.1.0