Christopher Glaze committed · Commit ae10b2a · 1 Parent(s): cbaea3e

Update readme

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -40,7 +40,7 @@ The model consists of a chain of two xgboost algorithms, one for instruction cla
 
 # Model evaluation
 ## Instruction classification
-Instruction classification scores were measured with ground-truth developed internally, with an out-of-sample accuracy/macro averaged f1 score of 78%/70%. A the largest error mode appears linked with basic uncertainty as to how to classify an instruction. For example, "What are a few words that can be used to describe running?" could be interpeted as a ```generation``` task to write a brief snippet describing running, a ```brainstorming``` task to simply come up ideas for writing about running, or (as was indicated in metadata associated with the instruction) as an ```open-qa``` task to answer what running is. However, model predictions appear unbiased when comparing the distribution of ground truth classes with the predicted. This, model predictions remain useful for tracking overall instruction diversity and representation.
+Instruction classification scores were measured with ground-truth developed internally, with an out-of-sample accuracy/macro-averaged f1 score of 78%/70%. The largest error mode appears linked with basic uncertainty as to how to classify an instruction. For example, "What are a few words that can be used to describe running?" could be interpreted as a ```generation``` task to write a brief snippet describing running, a ```brainstorming``` task to simply come up with ideas for writing about running, or (as was indicated in metadata associated with the instruction) as an ```open-qa``` task to answer what running is. However, model predictions appear unbiased when comparing the distribution of ground truth classes with the predicted. Thus, the model remains useful for tracking overall instruction diversity and representation.
 
 ## Response quality
 Response quality scores were evaluated with double-blind A/B testing that compared dataset responses against what was generated by ChatGPT (version 3.5 turbo). Our evaluation confirmed that response quality predicted preferences for the dataset response over ChatGPT's:
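For context on the 78%/70% figures in the updated paragraph: accuracy is the fraction of held-out instructions labeled correctly, while macro-averaged F1 computes an F1 score per task class and averages them with equal weight, so rare classes count as much as common ones. A minimal pure-Python sketch of both metrics (the held-out labels below are hypothetical, not drawn from the model's actual evaluation set):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores, averaged with equal class weight."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical held-out labels using three of the task classes named above.
y_true = ["open-qa", "generation", "brainstorming", "open-qa", "generation"]
y_pred = ["open-qa", "brainstorming", "brainstorming", "open-qa", "generation"]

print(accuracy(y_true, y_pred))  # → 0.8
print(macro_f1(y_true, y_pred))  # ≈ 0.778
```

Note that macro F1 (≈0.78 here) can diverge from accuracy (0.8) whenever class sizes are imbalanced, which is why the README reports both.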