Christopher Glaze committed · Commit ae10b2a · 1 Parent(s): cbaea3e

Update readme

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -40,7 +40,7 @@ The model consists of a chain of two xgboost algorithms, one for instruction cla
 
 # Model evaluation
 ## Instruction classification
-Instruction classification scores were measured with ground-truth developed internally, with an out-of-sample accuracy/macro averaged f1 score of 78%/70%. A the largest error mode appears linked with basic uncertainty as to how to classify an instruction. For example, "What are a few words that can be used to describe running?" could be interpeted as a ```generation``` task to write a brief snippet describing running, a ```brainstorming``` task to simply come up ideas for writing about running, or (as was indicated in metadata associated with the instruction) as an ```open-qa``` task to answer what running is. However, model predictions appear unbiased when comparing the distribution of ground truth classes with the predicted. This, model predictions remain useful for tracking overall instruction diversity and representation.
+Instruction classification scores were measured with ground-truth developed internally, with an out-of-sample accuracy/macro-averaged f1 score of 78%/70%. The largest error mode appears linked with basic uncertainty as to how to classify an instruction. For example, "What are a few words that can be used to describe running?" could be interpreted as a ```generation``` task to write a brief snippet describing running, a ```brainstorming``` task to simply come up with ideas for writing about running, or (as was indicated in metadata associated with the instruction) as an ```open-qa``` task to answer what running is. However, model predictions appear unbiased when comparing the distribution of ground truth classes with the predicted. Thus, the model remains useful for tracking overall instruction diversity and representation.
 
 ## Response quality
 Response quality scores were evaluated with double-blind A/B testing that compared dataset responses against what was generated by ChatGPT (version 3.5 turbo). Our evaluation confirmed that response quality predicted preferences for the dataset response over ChatGPT's:
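For context on the 78%/70% figures in the updated paragraph: accuracy is the fraction of held-out instructions labeled correctly, while macro-averaged F1 computes an F1 score per task class and averages them with equal weight, so rare classes count as much as common ones. A minimal pure-Python sketch of both metrics (the held-out labels below are hypothetical, not drawn from the model's actual evaluation set):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores, averaged with equal class weight."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical held-out labels using three of the task classes named above.
y_true = ["open-qa", "generation", "brainstorming", "open-qa", "generation"]
y_pred = ["open-qa", "brainstorming", "brainstorming", "open-qa", "generation"]

print(accuracy(y_true, y_pred))  # → 0.8
print(macro_f1(y_true, y_pred))  # ≈ 0.778
```

Note that macro F1 (≈0.78 here) can diverge from accuracy (0.8) whenever class sizes are imbalanced, which is why the README reports both.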