juletxara commited on
Commit
96159a5
·
verified ·
1 Parent(s): 9c31704

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +196 -0
README.md ADDED
@@ -0,0 +1,196 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: gemma
3
+ language:
4
+ - es
5
+ - ca
6
+ - gl
7
+ - eu
8
+ tags:
9
+ - truthfulqa
10
+ - llm-judge
11
+ - hitz
12
+ - multilingual
13
+ - gemma
14
+ - multi
15
+ - truth-judge
16
+ datasets:
17
+ - HiTZ/truthful_judge
18
+ base_model: google/gemma-2-9b-it
19
+ ---
20
+
21
+ # Model Card for HiTZ/gemma-2-9b-it-multi-truth-judge
22
+
23
+ This model card is for a judge model fine-tuned to evaluate truthfulness, based on the work "Truth Knows No Language: Evaluating Truthfulness Beyond English".
24
+
25
+ ## Model Details
26
+
27
+ ### Model Description
28
+
29
+ This model is an LLM-as-a-Judge, fine-tuned from `google/gemma-2-9b-it` to assess the truthfulness of text generated by other language models. The evaluation framework and findings are detailed in the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English." The primary goal of this work is to extend truthfulness evaluations beyond English, covering Basque, Catalan, Galician, and Spanish. This specific judge model evaluates truthfulness across multiple languages.
30
+
31
+ - **Developed by:** Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri.
32
+ - **Affiliations:** HiTZ Center - Ixa, University of the Basque Country, UPV/EHU; Elhuyar; Centro de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela; Departament de Traducció i Ciències del Llenguatge, Universitat Pompeu Fabra.
33
+ - **Funded by:** MCIN/AEI/10.13039/501100011033 projects: DeepKnowledge (PID2021-127777OB-C21) and by FEDER, EU; Disargue (TED2021-130810B-C21) and European Union NextGenerationEU/PRTR; DeepMinor (CNS2023-144375) and European Union NextGenerationEU/PRTR; NÓS-ILENIA (2022/TL22/0021533). Xunta de Galicia: Centro de investigación de Galicia accreditation 2024-2027 ED431G-2023/04. UPV/EHU PIF22/84 predoc grant (Blanca Calvo Figueras). Basque Government PhD grant PRE_2024_2_0028 (Julen Etxaniz). Juan de la Cierva contract and project JDC2022-049433-I (Iria de Dios Flores), financed by the MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR.
34
+ - **Shared by:** HiTZ Center
35
+ - **Model type:** LLM-as-a-Judge, based on `Gemma2`
36
+ - **Language(s) (NLP):** Fine-tuned to judge outputs in multiple languages (Basque, Catalan, Galician, Spanish). The underlying TruthfulQA-Multi benchmark, used for context, covers English, Basque, Catalan, Galician, and Spanish.
37
+ - **License:** The base model `google/gemma-2-9b-it` is governed by the Gemma license. The fine-tuning code, this model's weights, and the TruthfulQA-Multi dataset are publicly available under Apache 2.0.
38
+ - **Finetuned from model:** `google/gemma-2-9b-it`
39
+
40
+ ### Model Sources
41
+
42
+ - **Repository (for the project and fine-tuning code):** `https://github.com/hitz-zentroa/truthfulqa-multi`
43
+ - **Paper:** "Truth Knows No Language: Evaluating Truthfulness Beyond English" (`https://arxiv.org/abs/2502.09387`)
44
+ - **Dataset (TruthfulQA-Multi):** `https://huggingface.co/datasets/HiTZ/truthful_judge`
45
+
46
+ ## Uses
47
+
48
+ ### Direct Use
49
+
50
+ This model is intended for direct use as an LLM-as-a-Judge. It takes a question, a reference answer, and a model-generated answer as input, and outputs a judgment on the truthfulness of the model-generated answer. This is particularly relevant for evaluating models on the TruthfulQA benchmark, specifically for multiple languages (Basque, Catalan, Galician, Spanish).
51
+
52
+ ### Downstream Use
53
+
54
+ This judge model could potentially be used as a component in larger systems for content moderation, automated fact-checking research, or as a basis for further fine-tuning on more specific truthfulness-related tasks or domains.
55
+
56
+ ### Out-of-Scope Use
57
+
58
+ This model is not designed for:
59
+ - Generating general-purpose creative text or dialogue.
60
+ - Providing factual information directly (it judges, it doesn't assert).
61
+ - Use in safety-critical applications without thorough validation.
62
+ - Any application intended to deceive or spread misinformation.
63
+ The model's judgments are based on its training and may not be infallible.
64
+
65
+ ## Bias, Risks, and Limitations
66
+
67
+ The model's performance and biases are influenced by its base model (`google/gemma-2-9b-it`) and the TruthfulQA dataset. Key points from "Truth Knows No Language: Evaluating Truthfulness Beyond English":
68
+ - **Language Discrepancies:** LLMs (and thus judges based on them) may perform differently across languages.
69
+ - **Anglocentricity:** The original TruthfulQA benchmark has English-centric knowledge and cultural contexts. This model judges multiple non-English languages.
70
+ - **Context and Time Dependency:** Universal knowledge questions are handled better across languages than context- and time-dependent ones.
71
+ - **Truthfulness:** This is a "truth-judge". Truthfulness is a critical component of overall answer quality.
72
+ - **Imitative Falsehoods:** TruthfulQA focuses on "imitative falsehoods" – statements that are false but sound plausible or mimic common misconceptions.
73
+
74
+ ### Recommendations
75
+
76
+ - Users should be aware of the potential for varying performance across different types of questions (e.g., universal vs. culturally specific) and languages.
77
+ - For critical applications, judgments from this model should be cross-verified, ideally with human oversight.
78
+ - Consider the limitations regarding cultural and temporal variability when interpreting results.
79
+
80
+ ## How to Get Started with the Model
81
+
82
+ This model can be loaded using the Hugging Face `transformers` library.
83
+ ```python
84
+ # Example (conceptual, actual usage depends on task setup)
85
+ from transformers import AutoModelForCausalLM, AutoTokenizer
86
+
87
+ model_name = "HiTZ/gemma-2-9b-it-multi-truth-judge"
88
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
89
+ model = AutoModelForCausalLM.from_pretrained(model_name)
90
+
91
+ # For LLM-as-a-Judge, you would typically format input like:
92
+ # prompt = f"Question: {question}\nReference Answer: {ref_answer}\nModel Answer: {model_answer}\nIs the model answer truthful? Judge:"
93
+ # inputs = tokenizer(prompt, return_tensors="pt")
94
+ # outputs = model.generate(**inputs) # Adjust generation parameters as needed
95
+ # judgment = tokenizer.decode(outputs[0], skip_special_tokens=True)
96
+ ```
97
+ Refer to the project repository (`https://github.com/hitz-zentroa/truthfulqa-multi`) for specific examples of how judge models were used in the evaluation.
98
+
99
+ ## Training Details
100
+
101
+ ### Training Data
102
+
103
+ The model was fine-tuned on a dataset derived from the TruthfulQA-Multi benchmark \cite{calvo-etal-2025-truthknowsnolanguage}.
104
+ - **Dataset Link:** `https://huggingface.co/datasets/HiTZ/truthful_judge`
105
+ - **Training Data Specifics:** Trained on data for multiple languages (Basque, Catalan, Galician, Spanish) for truth judging. This corresponds to the "MT data (all languages except English)" mentioned in the paper for Truth-Judges.
106
+
107
+ ### Training Procedure
108
+
109
+ The model was fine-tuned as an LLM-as-a-Judge. The methodology was adapted from the original TruthfulQA paper \cite{lin-etal-2022-truthfulqa}, where the model learns to predict whether an answer is truthful given a question and reference answers.
110
+
111
+ #### Preprocessing
112
+
113
+ Inputs were formatted to present the judge model with a question, correct answer(s), and the answer to be judged, prompting it to assess truthfulness.
114
+
115
+ #### Training Hyperparameters
116
+
117
+ - **Training regime:** `bfloat16` mixed precision
118
+ - **Base model:** `google/gemma-2-9b-it`
119
+ - **Epochs:** 5
120
+ - **Learning rate:** 0.01
121
+ - **Batch size:** Refer to project code
122
+ - **Optimizer:** Refer to project code
123
+ - **Transformers Version:** `4.44.2`
124
+
125
+ ## Evaluation
126
+
127
+ ### Testing Data, Factors & Metrics
128
+
129
+ #### Testing Data
130
+
131
+ The model's evaluation methodology is described in "Truth Knows No Language: Evaluating Truthfulness Beyond English," using questions from the TruthfulQA-Multi dataset (Basque, Catalan, Galician, Spanish portions).
132
+
133
+ #### Factors
134
+
135
+ - **Language:** Multiple languages (Basque, Catalan, Galician, Spanish).
136
+ - **Model Type (of models being judged):** Base and instruction-tuned LLMs.
137
+ - **Evaluation Metric:** Correlation of LLM-as-a-Judge scores with human judgments on truthfulness.
138
+
139
+ #### Metrics
140
+
141
+ - **Primary Metric:** Spearman correlation between the judge model's scores and human-annotated scores for truthfulness.
142
+ - The paper (Table 4) reports performance for Truth-Judge models. For the Gemma-2-9B-IT base model trained on MT data (all languages except English), the Kappa scores were: Basque (0.49), Catalan (0.52), Galician (0.47), Spanish (0.55).
143
+
144
+ ### Results
145
+
146
+ #### Summary
147
+
148
+ As reported in "Truth Knows No Language: Evaluating Truthfulness Beyond English" (specifically Table 4 for Truth-Judges):
149
+ - This specific model (`multi_gemma9b_instruct_truth_judge`) is the Truth-Judge fine-tuned on `google/gemma-2-9b-it` using combined multilingual data (Basque, Catalan, Galician, Spanish).
150
+ - Performance varies by language, with Kappa scores detailed in Table 4 of the paper.
151
+
152
+ ## Technical Specifications
153
+
154
+ ### Model Architecture and Objective
155
+
156
+ The model is based on the `Gemma2` architecture (`Gemma2ForCausalLM`). It is a Causal Language Model fine-tuned with the objective of acting as a "judge" to predict the truthfulness of answers to questions.
157
+ - **Hidden Size:** `3584`
158
+ - **Intermediate Size:** `14336`
159
+ - **Num Attention Heads:** `16`
160
+ - **Num Hidden Layers:** `42`
161
+ - **Num Key Value Heads:** `8`
162
+ - **Vocab Size:** `256000`
163
+
164
+ ### Compute Infrastructure
165
+
166
+ - **Hardware:** Refer to project for details.
167
+ - **Software:** PyTorch, Transformers `4.44.2`
168
+
169
+ ## Citation
170
+
171
+ **Paper:**
172
+ ```bibtex
173
+ @inproceedings{calvo-etal-2025-truthknowsnolanguage,
174
+ title = "Truth Knows No Language: Evaluating Truthfulness Beyond English",
175
+ author = "Calvo Figueras, Blanca and Sagarzazu, Eneko and Etxaniz, Julen and Barnes, Jeremy and Gamallo, Pablo and De Dios Flores, Iria and Agerri, Rodrigo",
176
+ year={2025},
177
+ eprint={2502.09387},
178
+ archivePrefix={arXiv},
179
+ primaryClass={cs.CL},
180
+ url={https://arxiv.org/abs/2502.09387}
181
+ }
182
+ ```
183
+
184
+ ## More Information
185
+
186
+ For more details on the methodology, dataset, and findings, please refer to the full paper "Truth Knows No Language: Evaluating Truthfulness Beyond English" and the project repository: `https://github.com/hitz-zentroa/truthfulqa-multi`.
187
+
188
+ ## Model Card Authors
189
+
190
+ This model card was generated based on information from the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English" by Blanca Calvo Figueras et al., and adapted from the Hugging Face model card template. Content populated by GitHub Copilot.
191
+
192
+ ## Model Card Contact
193
+
194
+ For questions about the model or the research, please contact:
195
+ - Blanca Calvo Figueras: `[email protected]`
196
+ - Rodrigo Agerri: `[email protected]`