Navigating Korean LLM Research #2: Evaluation Tools
Continuing from my first post on Korean LLMs, this post will introduce some of the widely used evaluation tools for Korean models. As many of you know, evaluation tools play a crucial role in LLM research. Personally, I think strong benchmarks can shape the direction of model research for some time. Once a benchmark is accepted as a de facto standard, researchers and developers tend to optimize models toward that benchmark (as we’ve seen with datasets like SQuAD, GLUE, MMLU, and others), which can significantly influence the research landscape.
For a long time, Korean LLMs were scarce, and there wasn’t a pressing need for Korean-specific benchmarks. However, with the recent surge in model development, especially from private corporations, having robust evaluation tools to measure and demonstrate progress has become essential.
Evaluation Tools
I categorize Korean benchmarks into two main types:
Re-Implemented Benchmarks: These are adaptations of existing English benchmarks, such as (K-)MMLU or (Ko-)BBQ. They retain the same structure and objectives as their English counterparts but with Korean content.
Native Benchmarks: These are unique benchmarks developed by the Korean community, with no direct English equivalent. They are generally designed to evaluate Korean-specific aspects, such as cultural context and linguistic nuances.
Re-Implemented Benchmarks
The advantages of re-implemented benchmarks are clear—they serve as Korean adaptations of already widely-used English benchmarks, making them easily and intuitively accepted. For example, when discussing KMMLU, even though it might be new to some readers, you can quickly understand its design: a multiple-choice question-answering dataset meant to test the knowledge of LLMs.
There are several well-known re-implemented benchmarks, such as KLUE, KorQuAD, and KorNLI/KorSTS. However, in this post, I’ll focus on more recent benchmarks that are specifically designed for evaluating LLMs, as the previously mentioned ones are no longer commonly used in modern research.
KoBEST: Korean Balanced Evaluation of Significant Tasks
KoBEST is one of the first benchmarks designed to evaluate reasoning in Korean. It consists of five categories, each modeled after an established English benchmark:
BoolQ: A question-answering dataset that presents a passage followed by a yes-or-no question.
COPA: A commonsense reasoning dataset that provides a premise and two alternatives. The model must choose the alternative that is more plausibly related to the premise, either as a cause or effect.
WiC: A semantic benchmark in which two sentences are provided, each containing the same word. The task is to determine whether the word has the same meaning in both contexts.
HellaSwag: A commonsense reasoning dataset with multiple-choice questions. The model is tasked with choosing the correct sentence, from four options, that is most likely to follow a given context.
SentiNeg: A sentiment analysis dataset where the model predicts the polarity of negated sentences, testing its ability to handle complex sentiment scenarios.
One impressive aspect of the dataset is that the authors hired professional Korean linguists during construction to ensure its quality. While benchmarks like BoolQ, COPA, WiC, and SentiNeg are no longer widely used in the English NLP community, HellaSwag remains a popular benchmark, and state-of-the-art LLMs still struggle with the KoBEST version of it. As shown in the table below, even top-performing multilingual models have significant room for improvement.
| Model | KoBEST-HellaSwag Accuracy (%) |
| --- | --- |
| Command-R-Plus | 51.3 |
| Llama-3-70B-Instruct | 49.7 |
| Qwen2-72B-Instruct | 49.2 |
| Aya-23-35B | 47.6 |
| Random Baseline | 25.0 |
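For quick experiments, KoBEST tasks are available through lm-evaluation-harness. Below is a minimal sketch of what such a run could look like with the Python API; the task name `kobest_hellaswag`, the model, and the few-shot setting are assumptions of mine, so check the task list of your installed version before relying on them.

```python
# A minimal sketch of evaluating a Hugging Face model on KoBEST HellaSwag
# with lm-evaluation-harness. Task name, model, and settings are assumptions;
# run `lm-eval --tasks list` to confirm what your installed version ships.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # transformers backend
    model_args="pretrained=EleutherAI/polyglot-ko-1.3b,dtype=float16",
    tasks=["kobest_hellaswag"],
    num_fewshot=5,
    batch_size=8,
)

# Per-task metrics (e.g., accuracy) live under results["results"].
print(results["results"]["kobest_hellaswag"])
```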
KMMLU: Measuring Massive Multitask Language Understanding in Korean
KMMLU is a project I worked on in collaboration with EleutherAI, and I’m proud to say it has become one of the most widely-used datasets in Korea, with over 3 million total downloads on Hugging Face. As the name suggests, it is a knowledge benchmark designed specifically for Korean. While translated versions of MMLU are now common, I believe that re-implementations (not translations) are necessary.
Suppose you are a researcher at a Korean legal-tech startup seeking an LLM proficient in Korean law. Choosing a model based on its performance on the professional law subset of a translated MMLU would be far from ideal, because the translated benchmark doesn't capture proficiency in the Korean legal system. KMMLU fills this gap, offering evaluations across 45 categories of professional knowledge in Korean and providing a much more relevant measure of LLM performance in the local context. Some examples of such questions are shown in the figure below.
Figure 1: Image from "KMMLU: Measuring Massive Multitask Language Understanding in Korean"
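To make the evaluation setup concrete, here is a rough sketch of how a four-option benchmark like KMMLU is typically scored in the log-likelihood setting: the model "chooses" the option whose text it assigns the highest probability given the question. The dataset path, subset name, column names, and the prompt format are assumptions on my part, so adjust them to the actual schema.

```python
# Rough sketch of log-likelihood scoring for a four-option MCQA benchmark.
# Dataset repo, subset, and column names ("question", "A".."D", "answer")
# are assumptions, not the exact KMMLU schema.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/polyglot-ko-1.3b")
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the question."""
    prompt_ids = tok(question + "\n정답: ", return_tensors="pt").input_ids  # "정답" = "answer"
    option_ids = tok(option, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so slice out the option positions.
    logprobs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return logprobs.gather(1, option_ids[0].unsqueeze(1)).sum().item()

ds = load_dataset("HAERAE-HUB/KMMLU", "Criminal-Law", split="test")
ex = ds[0]
options = [ex["A"], ex["B"], ex["C"], ex["D"]]
pred = max(range(4), key=lambda i: option_logprob(ex["question"], str(options[i])))
print("predicted:", "ABCD"[pred], "| gold:", ex["answer"])
```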
LogicKor & Ko-Chatbot-Arena
These two benchmarks are the ones I envied the most when I first encountered them. LogicKor is the Korean adaptation of MT-Bench, while Ko-Chatbot-Arena mirrors the LMSys Chatbot Arena for Korean. LogicKor closely follows MT-Bench but introduces a new category focused on Korean grammar. It has become somewhat saturated, and from what I know, a second version is currently in progress.
Ko-Chatbot-Arena offered a platform for evaluating over 10 LLMs, where users could vote between two responses from different models. Unfortunately, the platform is no longer active, but the preference data it collected is still available on Hugging Face, making it a useful resource for human-annotated preference data.
(KUDGE) LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
This is another one of my own projects. Initially, I set out to create a Korean version of MT-Bench, but I soon realized it was redundant since LogicKor had already been released. So I pivoted towards building a Korean LLM-as-a-Judge, but quickly discovered there weren't any strong benchmarking tools for this specific task. That's when I decided to create my own: KUDGE, currently the first and only meta-evaluation benchmark for Korean.
KUDGE consists of two categories: pointwise and pairwise evaluations. In the pointwise setup, given a (prompt, response) pair, the judge model evaluates the response on a Likert scale. In the pairwise setup, given a (prompt, response A, response B) triplet, the model chooses the better response. To build this, I hired 15 annotators and collected 6K annotations, including prompts, responses, and Likert-scale scores. Human-annotated preference datasets are rare—not just in Korean, but across all languages—so I hope this dataset reaches a wider audience.
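For illustration, here is a minimal sketch of what the two setups look like as judge prompts. The wording, the OpenAI client, and the model name are placeholders of mine, not the actual prompts or judges used in KUDGE.

```python
# Illustrative pointwise and pairwise judge calls. Prompt wording and the
# judge model are placeholders, not the KUDGE setup itself.
from openai import OpenAI

client = OpenAI()

def pointwise_judge(prompt: str, response: str) -> str:
    """Ask the judge to score a single response on a 1-5 Likert scale."""
    instruction = (
        "You are an impartial judge. Rate the following response to the prompt "
        "on a scale of 1 (poor) to 5 (excellent). Reply with the score only.\n\n"
        f"[Prompt]\n{prompt}\n\n[Response]\n{response}"
    )
    out = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": instruction}],
    )
    return out.choices[0].message.content

def pairwise_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Ask the judge to pick the better of two responses ('A' or 'B')."""
    instruction = (
        "You are an impartial judge. Given the prompt and two responses, "
        "answer with 'A' or 'B' to indicate the better response.\n\n"
        f"[Prompt]\n{prompt}\n\n[Response A]\n{response_a}\n\n[Response B]\n{response_b}"
    )
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": instruction}],
    )
    return out.choices[0].message.content
```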
In this work, I deliberately corrupted some of the responses by injecting false information to evaluate whether LLM-as-a-Judge models or reward models (RMs) could accurately detect and penalize responses containing incorrect facts. (Un)Surprisingly, they completely failed to do so.
Native Benchmarks
While re-implemented benchmarks are useful for evaluating general capabilities, native benchmarks are better suited to reflect the specific demands of the Korean community.
Benchmarks on Korean Culture: HAE-RAE Bench, CLIcK, K-Viscuit
A common approach to creating a language-adapted LLM is continual pretraining on native corpora. However, this requires over 10B tokens, which are not easy to collect. As a result, a popular alternative is to translate English corpora and train on the translations. The issue with this approach is that it doesn't guarantee the model will learn culture-specific knowledge. For instance, it's unlikely that comprehensive information about Korean history exists in English texts, so even with perfect translation, it's unclear whether the model would acquire the depth of knowledge expected of a native Korean speaker. To address this gap, benchmarks like HAE-RAE Bench and CLIcK were introduced. Both are multiple-choice question-answering benchmarks that cover unique aspects of Korean culture, ensuring a more culturally relevant evaluation.
Figure 2: Comparison of categories in HAE-RAE Bench and CLIcK
K-Viscuit takes a similar approach but focuses on visual questions, making it the only VQA (Visual Question Answering) dataset specifically designed for Korean culture.
Figure 3: Image from "Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration"
I've always considered creating my own Korean-adapted, visually grounded reasoning benchmark, including questions like the one illustrated in Figure 4.
Figure 4: A sample question I've thought of. The answer is **East**
The model, when presented with this question, should:
- Identify the monument shown in the image.
- Determine whether the image is mirrored or not.
- Recall the fact that the monument (Gwanghwamun) is built to face south.
- Use reasoning to figure out which direction is to the right based on the monument's orientation.
Unfortunately, that was the only question I could think of.
Benchmarks on Korean Social Values
Another unique benchmark is KorNAT, which consists of subjective questions. Since the questions are inherently subjective, there are no fixed answers. Instead, the responses are collected from a large-scale survey of 6,174 unique Korean participants. The goal is to assess how closely LLMs align with native Korean speakers in terms of their values and perspectives.
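KorNAT defines its own alignment metrics, but just to illustrate the general idea, here is one simple way one could compare a model's answer distribution with a survey distribution for a single subjective question. The numbers and the metric are placeholders of mine, not from the paper.

```python
# Illustrative only: compare hypothetical answer distributions for one
# subjective question using total variation distance (0 = identical, 1 = disjoint).
# These numbers are placeholders, not KorNAT data or its official metric.
survey_dist = {"agree": 0.62, "neutral": 0.21, "disagree": 0.17}  # hypothetical survey shares
model_dist = {"agree": 0.80, "neutral": 0.05, "disagree": 0.15}   # hypothetical shares from sampled model answers

tv_distance = 0.5 * sum(abs(survey_dist[k] - model_dist[k]) for k in survey_dist)
print(f"Total variation distance: {tv_distance:.3f}")
```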
Translated Benchmarks
As I’ve mentioned earlier, I don’t place much value on translated benchmarks, particularly for knowledge-based benchmarks, which I believe are somewhat meaningless when translated. However, there are areas where knowledge is more language-agnostic, and in those cases, translations can be perfectly valid. So, I’ve created this section to introduce some translated benchmarks, in case someone finds them useful.
GSM8K-Ko: A machine-translated version of GSM8K. Math is a great example of language-agnostic knowledge, so as long as the translation quality is solid, I think this could be useful. There's also a Korean version of MGSM.
Ko-H5: This is a translation of four English benchmarks: ARC, HellaSwag, MMLU, and TruthfulQA. Notably, the datasets themselves were never released; they were served through a leaderboard that ran evaluations for you while keeping the data private. This highlighted an issue: even though the dataset was private, people were still able to overfit on it. After recognizing this, the team behind the leaderboard launched a second version with a wider range of translated benchmarks, such as GPQA, WinoGrande, GSM8K, EQ-Bench, and IFEval. Unfortunately, the leaderboard is now severely backlogged, with 874 pending evaluations, and since all the datasets are kept private, there's no practical way to use them outside the leaderboard system. Nonetheless, the first version generated massive hype in the Korean community, drawing countless developers to experiment with fine-tuning LLMs.
Conclusion
I've left out several safety-related benchmarks since safety is not my area of expertise, but I'll list their names here for anyone interested in exploring them further (KoBBQ, SQuARe, KoSBi, KOLD). In my final post, I plan to share some benchmarking results from my work on multiple benchmarking papers. My goal is to shed light on how LLMs perform on Korean benchmarks and provide a review of their abilities in the Korean language.