HiTZ
/

gemma-2-9b-it-multi-truth-judge

@@ -1,6 +1,7 @@
 ---
 license: gemma
 language:
 - es
 - ca
 - gl
@@ -26,14 +27,14 @@ This model card is for a judge model fine-tuned to evaluate truthfulness, based
 ### Model Description
-This model is an LLM-as-a-Judge, fine-tuned from `google/gemma-2-9b-it` to assess the truthfulness of text generated by other language models. The evaluation framework and findings are detailed in the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English." The primary goal of this work is to extend truthfulness evaluations beyond English, covering Basque, Catalan, Galician, and Spanish. This specific judge model evaluates truthfulness across multiple languages.
 - **Developed by:** Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri.
 - **Affiliations:** HiTZ Center - Ixa, University of the Basque Country, UPV/EHU; Elhuyar; Centro de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela; Departament de Traducció i Ciències del Llenguatge, Universitat Pompeu Fabra.
 - **Funded by:** MCIN/AEI/10.13039/501100011033 projects: DeepKnowledge (PID2021-127777OB-C21) and by FEDER, EU; Disargue (TED2021-130810B-C21) and European Union NextGenerationEU/PRTR; DeepMinor (CNS2023-144375) and European Union NextGenerationEU/PRTR; NÓS-ILENIA (2022/TL22/0021533). Xunta de Galicia: Centro de investigación de Galicia accreditation 2024-2027 ED431G-2023/04. UPV/EHU PIF22/84 predoc grant (Blanca Calvo Figueras). Basque Government PhD grant PRE_2024_2_0028 (Julen Etxaniz). Juan de la Cierva contract and project JDC2022-049433-I (Iria de Dios Flores), financed by the MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR.
 - **Shared by:** HiTZ Center
 - **Model type:** LLM-as-a-Judge, based on `Gemma2`
-- **Language(s) (NLP):** Fine-tuned to judge outputs in multiple languages (Basque, Catalan, Galician, Spanish). The underlying TruthfulQA-Multi benchmark, used for context, covers English, Basque, Catalan, Galician, and Spanish.
 - **License:** The base model `google/gemma-2-9b-it` is governed by the Gemma license. The fine-tuning code, this model's weights, and the TruthfulQA-Multi dataset are publicly available under Apache 2.0.
 - **Finetuned from model:** `google/gemma-2-9b-it`
@@ -47,7 +48,7 @@ This model is an LLM-as-a-Judge, fine-tuned from `google/gemma-2-9b-it` to asses
 ### Direct Use
-This model is intended for direct use as an LLM-as-a-Judge. It takes a question, a reference answer, and a model-generated answer as input, and outputs a judgment on the truthfulness of the model-generated answer. This is particularly relevant for evaluating models on the TruthfulQA benchmark, specifically for multiple languages (Basque, Catalan, Galician, Spanish).
 ### Downstream Use
@@ -102,7 +103,7 @@ Refer to the project repository (`https://github.com/hitz-zentroa/truthfulqa-mul
 The model was fine-tuned on a dataset derived from the TruthfulQA-Multi benchmark \cite{calvo-etal-2025-truthknowsnolanguage}.
 - **Dataset Link:** `https://huggingface.co/datasets/HiTZ/truthful_judge`
-- **Training Data Specifics:** Trained on data for multiple languages (Basque, Catalan, Galician, Spanish) for truth judging. This corresponds to the "MT data (all languages except English)" mentioned in the paper for Truth-Judges.
 ### Training Procedure
@@ -128,11 +129,11 @@ Inputs were formatted to present the judge model with a question, correct answer
 #### Testing Data
-The model's evaluation methodology is described in "Truth Knows No Language: Evaluating Truthfulness Beyond English," using questions from the TruthfulQA-Multi dataset (Basque, Catalan, Galician, Spanish portions).
 #### Factors
-- **Language:** Multiple languages (Basque, Catalan, Galician, Spanish).
 - **Model Type (of models being judged):** Base and instruction-tuned LLMs.
 - **Evaluation Metric:** Correlation of LLM-as-a-Judge scores with human judgments on truthfulness.
@@ -146,7 +147,7 @@ The model's evaluation methodology is described in "Truth Knows No Language: Eva
 #### Summary
 As reported in "Truth Knows No Language: Evaluating Truthfulness Beyond English" (specifically Table 4 for Truth-Judges):
-- This specific model (`multi_gemma9b_instruct_truth_judge`) is the Truth-Judge fine-tuned on `google/gemma-2-9b-it` using combined multilingual data (Basque, Catalan, Galician, Spanish).
 - Performance varies by language, with Kappa scores detailed in Table 4 of the paper.
 ## Technical Specifications

 ---
 license: gemma
 language:
+- en
 - es
 - ca
 - gl
 ### Model Description
+This model is an LLM-as-a-Judge, fine-tuned from `google/gemma-2-9b-it` to assess the truthfulness of text generated by other language models. The evaluation framework and findings are detailed in the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English." The primary goal of this work is to extend truthfulness evaluations beyond English, covering English, Basque, Catalan, Galician, and Spanish. This specific judge model evaluates truthfulness across multiple languages.
 - **Developed by:** Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri.
 - **Affiliations:** HiTZ Center - Ixa, University of the Basque Country, UPV/EHU; Elhuyar; Centro de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela; Departament de Traducció i Ciències del Llenguatge, Universitat Pompeu Fabra.
 - **Funded by:** MCIN/AEI/10.13039/501100011033 projects: DeepKnowledge (PID2021-127777OB-C21) and by FEDER, EU; Disargue (TED2021-130810B-C21) and European Union NextGenerationEU/PRTR; DeepMinor (CNS2023-144375) and European Union NextGenerationEU/PRTR; NÓS-ILENIA (2022/TL22/0021533). Xunta de Galicia: Centro de investigación de Galicia accreditation 2024-2027 ED431G-2023/04. UPV/EHU PIF22/84 predoc grant (Blanca Calvo Figueras). Basque Government PhD grant PRE_2024_2_0028 (Julen Etxaniz). Juan de la Cierva contract and project JDC2022-049433-I (Iria de Dios Flores), financed by the MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR.
 - **Shared by:** HiTZ Center
 - **Model type:** LLM-as-a-Judge, based on `Gemma2`
+- **Language(s) (NLP):** Fine-tuned to judge outputs in multiple languages (English, Basque, Catalan, Galician, Spanish). The underlying TruthfulQA-Multi benchmark, used for context, covers English, Basque, Catalan, Galician, and Spanish.
 - **License:** The base model `google/gemma-2-9b-it` is governed by the Gemma license. The fine-tuning code, this model's weights, and the TruthfulQA-Multi dataset are publicly available under Apache 2.0.
 - **Finetuned from model:** `google/gemma-2-9b-it`
 ### Direct Use
+This model is intended for direct use as an LLM-as-a-Judge. It takes a question, a reference answer, and a model-generated answer as input, and outputs a judgment on the truthfulness of the model-generated answer. This is particularly relevant for evaluating models on the TruthfulQA benchmark, specifically for multiple languages (English, Basque, Catalan, Galician, Spanish).
 ### Downstream Use
 The model was fine-tuned on a dataset derived from the TruthfulQA-Multi benchmark \cite{calvo-etal-2025-truthknowsnolanguage}.
 - **Dataset Link:** `https://huggingface.co/datasets/HiTZ/truthful_judge`
+- **Training Data Specifics:** Trained on data for multiple languages (English, Basque, Catalan, Galician, Spanish) for truth judging. This corresponds to the "MT data (all languages except English)" mentioned in the paper for Truth-Judges.
 ### Training Procedure
 #### Testing Data
+The model's evaluation methodology is described in "Truth Knows No Language: Evaluating Truthfulness Beyond English," using questions from the TruthfulQA-Multi dataset (English, Basque, Catalan, Galician, Spanish portions).
 #### Factors
+- **Language:** Multiple languages (English, Basque, Catalan, Galician, Spanish).
 - **Model Type (of models being judged):** Base and instruction-tuned LLMs.
 - **Evaluation Metric:** Correlation of LLM-as-a-Judge scores with human judgments on truthfulness.
 #### Summary
 As reported in "Truth Knows No Language: Evaluating Truthfulness Beyond English" (specifically Table 4 for Truth-Judges):
+- This specific model (`multi_gemma9b_instruct_truth_judge`) is the Truth-Judge fine-tuned on `google/gemma-2-9b-it` using combined multilingual data (English, Basque, Catalan, Galician, Spanish).
 - Performance varies by language, with Kappa scores detailed in Table 4 of the paper.
 ## Technical Specifications