arxiv:2507.16410

GG-BBQ: German Gender Bias Benchmark for Question Answering

Published on Jul 22

Authors:

Shalaka Satheesh ,

Abstract

Gender bias in German Large Language Models is evaluated using a manually corrected machine-translated dataset, revealing that all models exhibit bias aligned with and against social stereotypes.

AI-generated summary

Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model's predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.

View arXiv page View PDF GitHub 0 Add to collection

Community

stefan-it

10 days ago

We evaluate several LLMs used for German NLP

A bit of an overclaim, as only two different German models were evaluated. And Teuken is not even part of the evaluation.

I mean the main paper is about constructing the benchmark, but the model choice is really bad.

stefan-it

10 days ago

@shalakasatheesh and team: why is the evaluation section so limited?

shalakasatheesh

Paper author 5 days ago

Dear Stefan,

Thank you for your interest in our work. As you correctly note, the core contribution of our work is the dataset. The evaluation we carried out focused on a subset of available models used for German NLP. As stated in the paper:

"We evaluate both pre-trained and instruction-tuned models, publicly available on the Hugging-Face hub that support the German language with varying sizes ranging from 3B to 70B parameters. The models evaluated are [base and instruction-tuned versions of]:
(1) Llama-3.2-3B
(2) DiscoResearch/Llama3-German-8B
(3) Mistral-7B-v0.3
(4) leo-hessianai-13b
(5) Llama-3.1-70B".

We also see the value in evaluating further models.

Best Regards,
Shalaka

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2507.16410 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.16410 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.