cmglaze committed · Commit 279205a · 1 Parent(s): ae10b2a

Update README.md

Files changed (1): README.md +9 -2
README.md CHANGED
@@ -24,7 +24,7 @@ tags:
 # Summary
 Instruction tuning has emerged as an important step in developing performant large language models (LLMs) for generative AI tasks. While industry-backed LLMs such as ChatGPT, Bard, Claude, and even the open-source Llama 2 have relied on massive, expensive proprietary datasets unavailable to the public, the open source community has banded together to create similar datasets such as OpenAssistant and Dolly that are available to everyone. However, high variance in the quality and distribution of responses collected by volunteers has limited the quality of resulting open source models.
 
-This model (1) classifies instruction with a standardized schema that can be applied across datasets and (2) scores response quality on a scale of 0-1. The purpose is to measure and track instruction diversity across training sets, and enable filtering based on response quality for more targeted fine-tuning.
+This model (1) classifies instructions with a standardized schema that can be applied across datasets and (2) scores response quality on a scale of 0-1. The purpose is to measure and track instruction diversity across training sets, and enable filtering based on response quality for more targeted fine-tuning.
 
 The instruction classification schema is based on prior work in large language models:
 
@@ -35,8 +35,15 @@ The instruction classification schema is based on prior work in large language m
 * <strong>Summarization</strong>: e.g., “Summarize the main points from this news article”
 * <strong>Other</strong>: e.g., anything that did not fit the previous five categories.
 
+The response quality model was developed as a binary classifier ("is/is not an acceptable response"), with the following goals:
+1. Enable filtering of instruction/response datasets for higher-quality responses.
+2. Enable this filtering while maintaining (or increasing) instruction diversity.
+3. Develop the model with a lightweight architecture and a scalable data labeling process that requires minimal human-hours.
+
+The response model itself was developed with weak supervision in one day by two FTEs at Snorkel AI. The model is under continuous development, and planned work includes additional datasets and more refined labeling functions for curation. We also welcome feedback from the community on any observed patterns of errors!
+
 # Model development
-The model consists of a chain of two xgboost algorithms, one for instruction classification and one for response quality (modeled as a binary "is/is not an acceptable response" classifier). We trained the algorithms with weak supervision techniques and a feature space that includes metadata specific to each dataset, perplexity, measures based on [SimCSE embeddings](https://arxiv.org/pdf/2104.08821.pdf), and attributes involving regex, part-of-speech tagging, and response duration. In order to maintain a lightweight architecture that requires only CPUs for inference, we omitted perplexity from the feature space used to train the xgboost end models (so perplexity was used for weak supervision only).
+The model pipeline currently consists of a chain of two xgboost algorithms, one for instruction classification and one for response quality (modeled as a binary "is/is not an acceptable response" classifier). We trained the algorithms with weak supervision techniques and a feature space that includes metadata specific to each dataset, perplexity, measures based on [SimCSE embeddings](https://arxiv.org/pdf/2104.08821.pdf), and attributes involving regex, part-of-speech tagging, and response duration. In order to maintain a lightweight architecture that requires only CPUs for inference, we omitted perplexity from the feature space used to train the xgboost end models (so perplexity was used for weak supervision only).
 
 # Model evaluation
 ## Instruction classification
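
As a rough illustration of the workflow the updated summary describes (classify each instruction, score each response 0-1, filter on the score), here is a minimal sketch assuming the two xgboost models are loaded from joblib artifacts. The file names, the shared toy featurizer, and the 0.5 threshold are hypothetical, not part of this model card.

```python
# Hypothetical usage sketch: artifact names and features are assumptions,
# not this card's documented API.
import re

import joblib
import numpy as np

# Load the two serialized xgboost models (hypothetical file names).
instruction_clf = joblib.load("instruction_classifier.joblib")
quality_clf = joblib.load("response_quality_classifier.joblib")


def featurize(instruction: str, response: str) -> np.ndarray:
    # Toy stand-in for the card's feature space (dataset metadata, SimCSE
    # measures, regex/POS attributes, response duration).
    return np.array([
        len(response),                                   # response length in chars
        len(response.split()),                           # rough token count
        float(bool(re.search(r"\bsorry\b", response))),  # example regex attribute
        len(instruction.split()),                        # instruction length
    ], dtype=float)


def instruction_category(instruction: str, response: str):
    x = featurize(instruction, response).reshape(1, -1)
    # One of the six schema categories (Generation, ..., Summarization, Other).
    return instruction_clf.predict(x)[0]


def quality_score(instruction: str, response: str) -> float:
    x = featurize(instruction, response).reshape(1, -1)
    # Probability of the "acceptable response" class, used as the 0-1
    # quality score the summary describes.
    return float(quality_clf.predict_proba(x)[0, 1])


# Filter an instruction/response dataset on the quality score.
dataset = [{"instruction": "Summarize this article: ...", "response": "..."}]
kept = [ex for ex in dataset
        if quality_score(ex["instruction"], ex["response"]) >= 0.5]
```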
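Similarly, a minimal sketch of the weak-supervision recipe the "Model development" paragraph outlines: heuristic labeling functions vote on response acceptability, votes are aggregated (a plain majority vote here stands in for a learned label model), and a lightweight xgboost end model is trained on the aggregated labels. Every heuristic and threshold below is an illustrative assumption.

```python
# Illustrative weak-supervision sketch; the heuristics and thresholds are
# assumptions, not Snorkel's actual labeling pipeline.
import numpy as np
from xgboost import XGBClassifier

ABSTAIN, BAD, GOOD = -1, 0, 1


def lf_too_short(ex):
    # Very short responses are rarely acceptable.
    return BAD if len(ex["response"].split()) < 4 else ABSTAIN


def lf_echoes_instruction(ex):
    # A response that merely repeats the instruction is unacceptable.
    return BAD if ex["response"].strip() == ex["instruction"].strip() else ABSTAIN


def lf_reasonable_length(ex):
    return GOOD if 4 <= len(ex["response"].split()) <= 500 else ABSTAIN


LFS = [lf_too_short, lf_echoes_instruction, lf_reasonable_length]


def weak_label(ex):
    votes = [v for v in (lf(ex) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    # Majority vote; a learned label model would instead weight LFs by
    # their estimated accuracies.
    return GOOD if sum(votes) > len(votes) / 2 else BAD


def featurize(ex):
    # Stand-in for the CPU-only end-model feature space (no perplexity).
    r = ex["response"]
    return [len(r), len(r.split()), r.count("\n")]


examples = [
    {"instruction": "Name three primary colors.",
     "response": "Red, blue, and yellow."},
    {"instruction": "Explain gravity.", "response": "Explain gravity."},
]
labeled = [(featurize(e), weak_label(e)) for e in examples]
labeled = [(x, y) for x, y in labeled if y != ABSTAIN]
X = np.array([x for x, _ in labeled], dtype=float)
y = np.array([y for _, y in labeled])
end_model = XGBClassifier(n_estimators=10).fit(X, y)  # lightweight CPU end model
```

In the card's described pipeline, perplexity informs the weak supervision step but is deliberately excluded from the end-model features so that inference needs only CPUs.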