Christopher Glaze committed · dfc3b71 · Parent(s): c894c5e

Update readme

Files changed:
- README.md (+11 -5)
- curating_model_eval.png (+0 -0)

README.md (CHANGED)
@@ -25,16 +25,22 @@

# Model evaluation

Model response quality scores were evaluated with double-blind A/B testing that compared dataset responses against responses generated by ChatGPT (version 3.5 turbo). The evaluation confirmed that the response quality score predicted rater preference for the dataset response over ChatGPT's:

| Model response score | Win rate over ChatGPT |
| -------------------- | --------------------- |
| 0-0.25               | 0.25                  |
| 0.25-0.5             | 0.28                  |
| 0.5-0.75             | 0.43                  |
| 0.75-1.0             | 0.47                  |

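The win rates above are binned: each row pools the A/B trials whose model response score fell in that range. A minimal sketch of that bucketing, with made-up trial records purely for illustration (the underlying evaluation data is not included here):

```python
# Sketch: binned win rates from A/B trial records. The trial data below is
# invented for illustration; it is not the evaluation data behind the table.
from collections import defaultdict

# Hypothetical per-trial records: (model response score, dataset response won?)
trials = [(0.12, False), (0.31, True), (0.62, False), (0.81, True), (0.55, True)]

bins = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
wins, totals = defaultdict(int), defaultdict(int)

for score, won in trials:
    for lo, hi in bins:
        # Half-open bins, with 1.0 folded into the top bin.
        if lo <= score < hi or (hi == 1.0 and score == 1.0):
            totals[(lo, hi)] += 1
            wins[(lo, hi)] += won
            break

for lo, hi in bins:
    n = totals[(lo, hi)]
    rate = wins[(lo, hi)] / n if n else float("nan")
    print(f"{lo}-{hi}: win rate {rate:.2f} (n={n})")
```
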
# Usage

The model can accept either a dictionary or a list of dicts as input. Each dict needs an ```instruction``` field at a bare minimum, in which case the model will simply classify the instruction. If a ```response``` field is included, a response quality score will also be returned. Users can also provide a ```dataset``` field, which changes model predictions only if it names one of the sources the model was trained on (it can otherwise be left blank): dolly, helpful-instructions, or open-assistant.

## Example

Input:

```{'instruction': 'What are ways I can stay energized throughout the day?', 'response': 'Drink lots of coffee!'}```

Model output:

```{'instruction class': 'brainstorming', 'instruction class confidence': 0.9683452, 'response quality': 0.08076164}```

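Note that this example's ```response quality``` of roughly 0.08 falls in the lowest bin of the evaluation table above, i.e. such a response would be expected to beat ChatGPT's only about a quarter of the time. For assembling inputs to this dict-based interface programmatically, a minimal sketch follows; the ```classify``` call is hypothetical, standing in for however the model is actually loaded and invoked, which this section does not document.

```python
# Sketch of building inputs for the model's dict-based interface.
# `classify` is a hypothetical stand-in for the actual model call.
from typing import Optional

def build_input(instruction: str,
                response: Optional[str] = None,
                dataset: Optional[str] = None) -> dict:
    """Assemble one input record; only `instruction` is required."""
    record = {"instruction": instruction}
    if response is not None:
        record["response"] = response  # asks the model for a response quality score
    if dataset is not None:
        # Only changes predictions for trained sources:
        # "dolly", "helpful-instructions", "open-assistant".
        record["dataset"] = dataset
    return record

batch = [
    build_input("What are ways I can stay energized throughout the day?",
                response="Drink lots of coffee!"),
    build_input("Summarize the water cycle in two sentences."),  # instruction only
]
# outputs = classify(batch)  # hypothetical call; returns one dict per input
```
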
curating_model_eval.png
DELETED
Binary file (64.7 kB)