Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish
Abstract
The reliance on translated or adapted datasets from English or multilingual resources introduces challenges regarding linguistic and cultural suitability. This study addresses the need for robust and culturally appropriate benchmarks by evaluating the quality of 17 commonly used Turkish benchmark datasets. Using a comprehensive framework that assesses six criteria, both human and LLM-judge annotators provide detailed evaluations to identify dataset strengths and shortcomings. Our results reveal that 70% of the benchmark datasets fail to meet our heuristic quality standards. Correct usage of technical terms is the strongest criterion, but 85% of the criteria are not satisfied in the examined datasets. Although LLM judges demonstrate potential, they are less effective than human annotators, particularly in understanding cultural common-sense knowledge and interpreting fluent, unambiguous text. GPT-4o shows stronger labeling capabilities for grammatical and technical tasks, while Llama3.3-70B excels at evaluating correctness and cultural knowledge. Our findings emphasize the urgent need for more rigorous quality control in creating and adapting datasets for low-resource languages.
Community
From the paper it seems that WikiANN is an awesome, high-quality dataset, which is of course not true.
First, WikiANN is silver-standard and badly sentence-segmented. Looking deeper, the dataset is amazingly bad: I investigated English and German, and the same holds for Turkish. Even after 5 minutes you can find examples like these in the test split:
[ "Türkiye", "Kupası", "'", "''", "(", "4", ")", ":", "'", "''", "2006", ",", "2007", ",", "2009", ",", "2011" ]
And surprisingly, we can now guess what is annotated as ORG:
[ "ORG: Türkiye Kupası", "ORG: 2006", "ORG: 2007", "ORG: 2009", "ORG: 2011" ]
LOCs are also weird:
[ "YÖNLENDİRME", "G3", ":", "Live", "in", "Concert" ]
annotated as:
[ "LOC: G3 : Live in Concert" ]
or:
[ "1970-71", "sezonu", "öncesinde", "Yasin", "Özdenak", "gibi", "oyuncular", "kadroya", "katıldı", "." ]
annotated as:
[ "LOC: 1970-71", "PER: Yasin Özdenak" ]
No matter what an LLM judge says, this dataset is not high-quality.