Jlonge4 committed on
Commit f70147f · verified · 1 Parent(s): 6317dda

Update README.md

Files changed (1)
  1. README.md +115 -23
README.md CHANGED
@@ -10,27 +10,123 @@ model-index:
  - name: outputs
  results: []
  ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # outputs
-
- This model is a fine-tuned version of [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) on the None dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
  ### Training hyperparameters
 
@@ -46,10 +142,6 @@ The following hyperparameters were used during training:
  - lr_scheduler_warmup_steps: 10
  - training_steps: 150

- ### Training results
-
-
-
  ### Framework versions

  - PEFT 0.11.1
 
  - name: outputs
  results: []
  ---
+ ---
+ license: mit
+ library_name: peft
+ tags:
+ - trl
+ - sft
+ - generated_from_trainer
+ base_model: microsoft/Phi-3-mini-4k-instruct
+ model-index:
+ - name: outputs
+   results: []
+ ---
+
+ ## Merged Model Performance
+
+ This repository contains our hallucination evaluation PEFT adapter model.
+
+ ### Hallucination Detection Metrics
+
+ Our merged model achieves the following performance on a binary classification task for detecting hallucinations in language model outputs:
+
+ ```
+               precision    recall  f1-score   support
+
+            0       0.90      0.98      0.94       100
+            1       0.98      0.89      0.93       100
+
+     accuracy                           0.94       200
+    macro avg       0.94      0.94      0.93       200
+ weighted avg       0.94      0.94      0.93       200
+ ```
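The report above follows the layout of scikit-learn's `classification_report`, with hallucination as the positive class. As a reference point only, here is a minimal sketch of how such a report can be produced from the model's "yes"/"no" generations; the variable names and the mapping of "yes" to label 1 are assumptions, not details taken from this repository.

```python
# Minimal sketch (assumptions, not this repo's evaluation code):
# map raw "yes"/"no" generations to binary labels and score them.
from sklearn.metrics import classification_report

model_outputs = ["yes", "no", "yes", "no"]  # hypothetical raw generations
gold_labels = [1, 0, 0, 0]                  # hypothetical ground truth (1 = hallucination)

# Treat "yes" as the positive (hallucination) class.
predictions = [1 if out.strip().lower().startswith("yes") else 0 for out in model_outputs]

print(classification_report(gold_labels, predictions, digits=2))
```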
+
+ ### Model Usage
+ For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit):
+
+ ```python
+ def format_input(query, response):
+     """Your query field can be a dialogue or a single query with optional context included"""
+     input = f"""Your job is to evaluate whether a machine learning model has hallucinated or not.
+     A hallucination occurs when the response is coherent but factually incorrect or nonsensical
+     outputs that are not grounded in the provided context.
+     You are given the following information:
+     ####INFO####
+     [Query]: {query}
+     [Model Response]: {response}
+     ####END INFO####
+     Based on the information provided is the model output a hallucination? Respond with only "yes" or "no"
+     """
+     return input
+
+ text = format_input(query='''Based on the following
+ <context>Walrus are the largest mammal</context>
+ answer the question
+ <query> What is the best PC?</query>''',
+                     response='The best PC is the mac')
+
+ messages = [
+     {"role": "user", "content": text}
+ ]
+
+ pipe = pipeline(
+     "text-generation",
+     model=base_model,
+     model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
+     tokenizer=tokenizer,
+ )
+ generation_args = {
+     "max_new_tokens": 2,
+     "return_full_text": False,
+     "temperature": 0.01,
+     "do_sample": True,
+ }
+
+ output = pipe(messages, **generation_args)
+ print(f"Hallucination: {output[0]['generated_text'].strip().lower()}")
+ # Hallucination: yes
+ ```
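The snippet above uses `base_model`, `tokenizer`, and `attn_implementation` without defining them and omits imports. Below is a minimal setup sketch under stated assumptions: the base model comes from the metadata, the adapter repo id is a placeholder, and merging the adapter is only one plausible way to obtain the merged model referred to in this card.

```python
# Setup sketch for the usage snippet above. The adapter id is a hypothetical
# placeholder and the merge step is an assumption, not the documented procedure.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

base_model_id = "microsoft/Phi-3-mini-4k-instruct"  # base model from the metadata
adapter_id = "<this-adapter-repo-id>"               # hypothetical placeholder
attn_implementation = "eager"                       # or "flash_attention_2" if installed

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Load the base model, attach this PEFT adapter, and merge the weights so the
# result can be passed as `base_model` to the pipeline above.
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    attn_implementation=attn_implementation,
)
base_model = PeftModel.from_pretrained(model, adapter_id).merge_and_unload()
```

If `base_model` is instead passed as a Hub id string, the `model_kwargs` in the pipeline call apply the dtype and attention settings at load time, so the explicit merge above is not required.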
+
+ ### Comparison with Other Models
+
+ We compared our merged model's performance on the hallucination detection benchmark against several other state-of-the-art language models:
+
+ | Model                  | Precision | Recall |     F1 |
+ |------------------------|----------:|-------:|-------:|
+ | Our Merged Model       |      0.75 |   0.87 |   0.81 |
+ | GPT-4                  |      0.93 |   0.72 |   0.82 |
+ | GPT-4 Turbo            |      0.97 |   0.70 |   0.81 |
+ | Gemini Pro             |      0.89 |   0.53 |   0.67 |
+ | GPT-3.5                |      0.89 |   0.65 |   0.75 |
+ | GPT-3.5-turbo-instruct |      0.89 |   0.80 |   0.84 |
+ | Palm 2 (Text Bison)    |      1.00 |   0.44 |   0.61 |
+ | Claude V2              |      0.80 |   0.95 |   0.87 |
+
+ As shown in the table, our merged model reaches an F1 of 0.81, on par with GPT-4 (0.82) and GPT-4 Turbo (0.81) and ahead of Gemini Pro, GPT-3.5, and Palm 2 (Text Bison) on this hallucination detection task.
+
+ We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.
+
+ Citations:
+ Comparison scores are from arize/phoenix.
+
+ ### Training Data
+
+ The training data is drawn from the HaluEval benchmark:
+
+ @misc{HaluEval,
+   author  = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
+   title   = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
+   year    = {2023},
+   journal = {arXiv preprint arXiv:2305.11747},
+   url     = {https://arxiv.org/abs/2305.11747}
+ }
+
+ ### Framework versions
+
+ - PEFT 0.11.1
+ - Transformers 4.41.2
+ - Pytorch 2.3.0+cu121
+ - Datasets 2.19.2
+ - Tokenizers 0.19.1

  ### Training hyperparameters

  - lr_scheduler_warmup_steps: 10
  - training_steps: 150

  ### Framework versions

  - PEFT 0.11.1
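The hyperparameters shown here (10 warmup steps, 150 training steps), together with the `trl`, `sft`, and `peft` tags in the metadata, point to a TRL SFT run with a LoRA adapter. The sketch below only illustrates what such a setup could look like under those assumptions; the dataset file, LoRA values, batch size, and column name are placeholders rather than the configuration actually used.

```python
# Hedged sketch of an SFT + LoRA run consistent with the steps above.
# Dataset path, LoRA settings, and batch size are assumptions.
from datasets import load_dataset
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # hypothetical data file

peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05)  # assumed values

args = TrainingArguments(
    output_dir="outputs",
    warmup_steps=10,                 # matches lr_scheduler_warmup_steps above
    max_steps=150,                   # matches training_steps above
    per_device_train_batch_size=1,   # assumed
    logging_steps=10,
)

trainer = SFTTrainer(
    model="microsoft/Phi-3-mini-4k-instruct",
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # assumed column name
    max_seq_length=1024,        # assumed
)
trainer.train()
```

With a `peft_config` attached, saving the trained model writes only the adapter weights, which is consistent with the PEFT artifact hosted in this repository.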