Experiments in large-scale small-scale preference learning.

This one was a failure: it benchmarks horribly, despite responding okay to trivia questions in testing.

falcon-rw-1b trained with PRO (preference ranking optimization, see https://arxiv.org/abs/2306.17492) on SuperMC and PRM800K (only stage 1) for 3 epochs, using my supertrainer2000 framework.
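For readers unfamiliar with PRO, here is a rough sketch of the objective as I understand it from the paper: a ranking term over candidates sorted best-first, plus an SFT term on the top candidate weighted by beta. This is not the supertrainer2000 code. The function name `pro_loss`, its arguments, and the assumption that beta weights the SFT term are illustrative only, and the scores are plain floats standing in for length-normalised log-probabilities from the model.

```python
import math

def pro_loss(avg_logprobs, sft_nll, beta=4.0):
    """Sketch of a PRO-style loss (hypothetical helper, not the trainer's API).

    avg_logprobs: length-normalised log-probabilities of the candidate
                  responses, ordered from most to least preferred.
    sft_nll:      negative log-likelihood of the top-ranked response.
    beta:         weight on the SFT term (assumed meaning of "PRO beta").
    """
    ranking = 0.0
    # For each rank k, contrast candidate k against itself and all worse ones.
    for k in range(len(avg_logprobs) - 1):
        tail = avg_logprobs[k:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(r - m) for r in tail))
        ranking -= avg_logprobs[k] - log_denominator
    return ranking + beta * sft_nll

# A correctly ordered pair is penalised less than a reversed one.
well_ordered = pro_loss([-0.5, -2.0], sft_nll=0.0)
reversed_order = pro_loss([-2.0, -0.5], sft_nll=0.0)
```

In a real training loop the scores would be tensors computed from the policy's token log-probabilities, so gradients can flow through them.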

This is an experimental model.

Benchmarks coming soon.

Hyperparameters:

AdamW, weight decay of 0.01, otherwise default hyperparams
Maximum LR of 1e-5
Cosine schedule with a warmup of 5400 steps
Batch size of 4 (2 real x 2 accumulated)
Maximum of 5 epochs, early stopping (visual observation), stopped after 3
Gradient clipping norm value of 1.0
PRO beta of 4
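The learning-rate settings above can be sketched as a small pure-Python function. This is a minimal illustration, assuming linear warmup and cosine decay to zero, neither of which is stated above. `total_steps` is a hypothetical value, since the real step count depends on dataset size, and `lr_at` is not a supertrainer2000 function.

```python
import math

MAX_LR = 1e-5          # maximum LR from the list above
WARMUP_STEPS = 5400    # warmup length from the list above
MICRO_BATCH = 2        # "real" batch size
GRAD_ACCUM = 2         # accumulation steps
EFFECTIVE_BATCH = MICRO_BATCH * GRAD_ACCUM  # 2 x 2 = 4

def lr_at(step, total_steps, max_lr=MAX_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr (assumption).
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to 0 over the remaining steps (assumption).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Hypothetical run length, for illustration only.
TOTAL_STEPS = 100_000
halfway_through_warmup = lr_at(2700, TOTAL_STEPS)
```

With a warmup this long, a short run spends a large share of its steps below the peak learning rate, which is worth keeping in mind when reading the results.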

Training prompt format:

### Query
[insert instruction here]

### Answer
[insert response here]
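A small helper for building prompts in this format might look like the sketch below. The function name `format_prompt` is made up for illustration, and the exact whitespace (one blank line between the sections, no trailing newline) is an assumption read off the template above.

```python
def format_prompt(instruction, response=""):
    """Fill the Query/Answer template (hypothetical helper).

    Leave `response` empty at inference time so the model continues
    after the "### Answer" header.
    """
    return f"### Query\n{instruction}\n\n### Answer\n{response}"

# Inference-style prompt: the model is expected to write the answer.
prompt = format_prompt("What is the capital of France?")
```

For training examples, pass the target response as the second argument so the full text matches the format shown above.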

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	29.12
AI2 Reasoning Challenge (25-Shot)	25.51
HellaSwag (10-Shot)	25.87
MMLU (5-Shot)	24.80
TruthfulQA (0-shot)	48.28
Winogrande (5-shot)	49.41
GSM8k (5-shot)	0.83
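As a quick sanity check, the reported average is the unweighted mean of the six benchmark scores in the table. The dictionary below just copies those values.

```python
# Scores copied from the table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 25.51,
    "HellaSwag (10-Shot)": 25.87,
    "MMLU (5-Shot)": 24.80,
    "TruthfulQA (0-shot)": 48.28,
    "Winogrande (5-shot)": 49.41,
    "GSM8k (5-shot)": 0.83,
}

# Unweighted mean over the six benchmarks.
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 29.12
```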

Model size: 1B params
Tensor type: F32 (Safetensors)

Paper for euclaise/crow-1b-attempt1

Preference Ranking Optimization for Human Alignment

Paper • 2306.17492 • Published Jun 30, 2023

Evaluation results (Open LLM Leaderboard)

Benchmark	Metric	Split	Value
AI2 Reasoning Challenge (25-Shot)	normalized accuracy	test	25.510
HellaSwag (10-Shot)	normalized accuracy	validation	25.870
MMLU (5-Shot)	accuracy	test	24.800
TruthfulQA (0-shot)	mc2	validation	48.280
Winogrande (5-shot)	accuracy	validation	49.410
GSM8k (5-shot)	accuracy	test	0.830