codebyzeb committed
Commit 65b7721 · verified · 1 Parent(s): ca31674

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,89 @@
+## Experiment Configuration
+```yaml
+callbacks:
+  grad_accum:
+    _target_: src.callbacks.gradient_accumulation.GradientAccumulationScheduler
+    scheduling:
+      0: 2
+  grad_norm:
+    _target_: src.callbacks.grad_norm.GradNorm
+    check_clipping: false
+    group_separator: /
+    histogram_freq: null
+    log_weight_distribution: false
+    norm_type: 2
+    only_total: true
+  lr_monitor:
+    _target_: src.callbacks.lr_monitor.SimpleLearningRateMonitor
+  model_checkpoint:
+    _target_: src.callbacks.model_checkpoint.ModelCheckpoint
+    dirpath: .checkpoints
+    enable_version_counter: false
+    every_n_train_steps: 2000
+    filename: '{step}'
+    save_initial_checkpoint: true
+    save_last: link
+    save_top_k: -1
+    verbose: true
+  speed_monitor:
+    _target_: src.callbacks.speed_monitor.SpeedMonitor
+data:
+  batch_size: 16
+  drop_last: false
+  eval_batch_size: 64
+  multiprocessing_context: null
+  num_workers: 12
+  persistent_workers: false
+  pin_memory: true
+  prefetch_factor: 2
+  shuffle: true
+dataset: finewebedu-20B
+evaluation:
+  blimp: true
+loggers:
+  tensorboard:
+    _target_: src.trainer.TensorBoardLogger
+    name: ''
+    save_dir: ./
+    version: null
+model: fw57M-tied
+optim:
+  lr: 0.0006
+  num_warmup_steps: 2000
+  optim_kwargs:
+    betas:
+    - 0.9
+    - 0.95
+    eps: 1.0e-08
+    fused: true
+  optim_name: adamw
+  scheduler_kwargs:
+    min_lr_ratio: 0.01
+    num_decay_steps: 4000
+    num_stable_steps: 44000
+  scheduler_name: warmup_stable_decay
+  weight_decay: 0.01
+out_parent_folder: model_train
+pwd: /home/zg258/rds/hpc-work/infotokenization
+resume_from_checkpoint: .checkpoints/last.ckpt
+run_folder: .
+save_initial_checkpoint: true
+seed: 42
+tok_name: frequency_64000
+torch_compile: true
+train_data_path: /home/zg258/rds/hpc-work/infotokenization/data/finewebedu-20B/frequency_64000/train
+trainer:
+  accelerator: gpu
+  deterministic: false
+  devices: 4
+  enable_progress_bar: true
+  fast_dev_run: false
+  gradient_clip_algorithm: norm
+  gradient_clip_val: 1.0
+  limit_val_batches: 500
+  log_every_n_steps: 1
+  max_steps: 50000
+  precision: bf16-true
+  val_check_interval: 2000
+val_data_path: /home/zg258/rds/hpc-work/infotokenization/data/finewebedu-20B/frequency_64000/validation
+```
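The `optim` and `trainer` values in the README above can be read together as a learning-rate schedule and a per-step batch budget. The following is only a rough sketch of that reading, not the repository's implementation: the function names `wsd_lr` and `sequences_per_step` are hypothetical, the warmup and decay ramps are assumed to be linear (the config names the `warmup_stable_decay` scheduler but does not state its shape), and `batch_size` is assumed to be per device.

```python
# Values copied from the README config above.
PEAK_LR = 0.0006          # optim.lr
WARMUP_STEPS = 2000       # optim.num_warmup_steps
STABLE_STEPS = 44000      # optim.scheduler_kwargs.num_stable_steps
DECAY_STEPS = 4000        # optim.scheduler_kwargs.num_decay_steps
MIN_LR_RATIO = 0.01       # optim.scheduler_kwargs.min_lr_ratio
MAX_STEPS = 50000         # trainer.max_steps


def wsd_lr(step: int) -> float:
    """Learning rate at `step` under a warmup-stable-decay schedule.

    ASSUMPTION: both ramps are linear and warmup starts from zero; the
    actual scheduler in the training code may use a different shape.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step < WARMUP_STEPS + STABLE_STEPS:
        return PEAK_LR
    # Fraction of the decay phase completed, clamped to 1.0 at the end.
    progress = min(1.0, (step - WARMUP_STEPS - STABLE_STEPS) / DECAY_STEPS)
    return PEAK_LR * (1.0 - (1.0 - MIN_LR_RATIO) * progress)


def sequences_per_step(batch_size: int = 16, devices: int = 4, grad_accum: int = 2) -> int:
    """Sequences consumed per optimizer step.

    ASSUMPTION: data.batch_size is per device, and the scheduling entry
    `0: 2` means two accumulated batches from the start of training.
    """
    return batch_size * devices * grad_accum


if __name__ == "__main__":
    # The three phases add up to trainer.max_steps.
    assert WARMUP_STEPS + STABLE_STEPS + DECAY_STEPS == MAX_STEPS
    for s in (0, 1000, 2000, 46000, 48000, 50000):
        print(s, wsd_lr(s))
    print("sequences per optimizer step:", sequences_per_step())
```

Under these assumptions the three phases sum to exactly `max_steps` (2000 + 44000 + 4000 = 50000), so the decay would finish on the last training step at `lr * min_lr_ratio`.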
blimp_results.json ADDED
@@ -0,0 +1,2965 @@
+{
+  "results": {
+    "blimp": {
+      "acc,none": 0.7874626865671641,
+      "acc_stderr,none": 0.0014187246183603329,
+      "alias": "blimp"
+    },
+    "blimp_adjunct_island": {
+      "alias": " - blimp_adjunct_island",
+      "acc,none": 0.873,
+      "acc_stderr,none": 0.010534798620855757
+    },
+    "blimp_anaphor_gender_agreement": {
+      "alias": " - blimp_anaphor_gender_agreement",
+      "acc,none": 0.909,
+      "acc_stderr,none": 0.009099549538400219
+    },
+    "blimp_anaphor_number_agreement": {
+      "alias": " - blimp_anaphor_number_agreement",
+      "acc,none": 0.983,
+      "acc_stderr,none": 0.004089954489689082
+    },
+    "blimp_animate_subject_passive": {
+      "alias": " - blimp_animate_subject_passive",
+      "acc,none": 0.734,
+      "acc_stderr,none": 0.013979965645145155
+    },
+    "blimp_animate_subject_trans": {
+      "alias": " - blimp_animate_subject_trans",
+      "acc,none": 0.874,
+      "acc_stderr,none": 0.010499249222408018
+    },
+    "blimp_causative": {
+      "alias": " - blimp_causative",
+      "acc,none": 0.727,
+      "acc_stderr,none": 0.014095022868717583
+    },
+    "blimp_complex_NP_island": {
+      "alias": " - blimp_complex_NP_island",
+      "acc,none": 0.547,
+      "acc_stderr,none": 0.015749255189977586
+    },
+    "blimp_coordinate_structure_constraint_complex_left_branch": {
+      "alias": " - blimp_coordinate_structure_constraint_complex_left_branch",
+      "acc,none": 0.601,
+      "acc_stderr,none": 0.015493193313162906
+    },
+    "blimp_coordinate_structure_constraint_object_extraction": {
+      "alias": " - blimp_coordinate_structure_constraint_object_extraction",
+      "acc,none": 0.841,
+      "acc_stderr,none": 0.0115694793682713
+    },
+    "blimp_determiner_noun_agreement_1": {
+      "alias": " - blimp_determiner_noun_agreement_1",
+      "acc,none": 0.981,
+      "acc_stderr,none": 0.00431945108291065
+    },
+    "blimp_determiner_noun_agreement_2": {
+      "alias": " - blimp_determiner_noun_agreement_2",
+      "acc,none": 0.954,
+      "acc_stderr,none": 0.006627814717380719
+    },
+    "blimp_determiner_noun_agreement_irregular_1": {
+      "alias": " - blimp_determiner_noun_agreement_irregular_1",
+      "acc,none": 0.92,
+      "acc_stderr,none": 0.00858333697775365
+    },
+    "blimp_determiner_noun_agreement_irregular_2": {
+      "alias": " - blimp_determiner_noun_agreement_irregular_2",
+      "acc,none": 0.935,
+      "acc_stderr,none": 0.0077997330618320105
+    },
+    "blimp_determiner_noun_agreement_with_adj_2": {
+      "alias": " - blimp_determiner_noun_agreement_with_adj_2",
+      "acc,none": 0.927,
+      "acc_stderr,none": 0.008230354715244073
+    },
+    "blimp_determiner_noun_agreement_with_adj_irregular_1": {
+      "alias": " - blimp_determiner_noun_agreement_with_adj_irregular_1",
+      "acc,none": 0.885,
+      "acc_stderr,none": 0.010093407594904635
+    },
+    "blimp_determiner_noun_agreement_with_adj_irregular_2": {
+      "alias": " - blimp_determiner_noun_agreement_with_adj_irregular_2",
+      "acc,none": 0.925,
+      "acc_stderr,none": 0.00833333333333334
+    },
+    "blimp_determiner_noun_agreement_with_adjective_1": {
+      "alias": " - blimp_determiner_noun_agreement_with_adjective_1",
+      "acc,none": 0.949,
+      "acc_stderr,none": 0.006960420062571412
+    },
+    "blimp_distractor_agreement_relational_noun": {
+      "alias": " - blimp_distractor_agreement_relational_noun",
+      "acc,none": 0.874,
+      "acc_stderr,none": 0.010499249222408023
+    },
+    "blimp_distractor_agreement_relative_clause": {
+      "alias": " - blimp_distractor_agreement_relative_clause",
+      "acc,none": 0.765,
+      "acc_stderr,none": 0.01341472903024713
+    },
+    "blimp_drop_argument": {
+      "alias": " - blimp_drop_argument",
+      "acc,none": 0.776,
+      "acc_stderr,none": 0.013190830072364457
+    },
+    "blimp_ellipsis_n_bar_1": {
+      "alias": " - blimp_ellipsis_n_bar_1",
+      "acc,none": 0.803,
+      "acc_stderr,none": 0.012583693787968118
+    },
+    "blimp_ellipsis_n_bar_2": {
+      "alias": " - blimp_ellipsis_n_bar_2",
+      "acc,none": 0.83,
+      "acc_stderr,none": 0.011884495834541656
+    },
+    "blimp_existential_there_object_raising": {
+      "alias": " - blimp_existential_there_object_raising",
+      "acc,none": 0.819,
+      "acc_stderr,none": 0.012181436179177904
+    },
+    "blimp_existential_there_quantifiers_1": {
+      "alias": " - blimp_existential_there_quantifiers_1",
+      "acc,none": 0.979,
+      "acc_stderr,none": 0.004536472151306512
+    },
+    "blimp_existential_there_quantifiers_2": {
+      "alias": " - blimp_existential_there_quantifiers_2",
+      "acc,none": 0.263,
+      "acc_stderr,none": 0.013929286594259734
+    },
+    "blimp_existential_there_subject_raising": {
+      "alias": " - blimp_existential_there_subject_raising",
+      "acc,none": 0.867,
+      "acc_stderr,none": 0.010743669132397349
+    },
+    "blimp_expletive_it_object_raising": {
+      "alias": " - blimp_expletive_it_object_raising",
+      "acc,none": 0.75,
+      "acc_stderr,none": 0.013699915608779773
+    },
+    "blimp_inchoative": {
+      "alias": " - blimp_inchoative",
+      "acc,none": 0.643,
+      "acc_stderr,none": 0.015158521721486767
+    },
+    "blimp_intransitive": {
+      "alias": " - blimp_intransitive",
+      "acc,none": 0.777,
+      "acc_stderr,none": 0.013169830843425689
+    },
+    "blimp_irregular_past_participle_adjectives": {
+      "alias": " - blimp_irregular_past_participle_adjectives",
+      "acc,none": 0.847,
+      "acc_stderr,none": 0.01138950045966553
+    },
+    "blimp_irregular_past_participle_verbs": {
+      "alias": " - blimp_irregular_past_participle_verbs",
+      "acc,none": 0.839,
+      "acc_stderr,none": 0.01162816469672718
+    },
+    "blimp_irregular_plural_subject_verb_agreement_1": {
+      "alias": " - blimp_irregular_plural_subject_verb_agreement_1",
+      "acc,none": 0.898,
+      "acc_stderr,none": 0.009575368801653885
+    },
+    "blimp_irregular_plural_subject_verb_agreement_2": {
+      "alias": " - blimp_irregular_plural_subject_verb_agreement_2",
+      "acc,none": 0.899,
+      "acc_stderr,none": 0.009533618929340971
+    },
+    "blimp_left_branch_island_echo_question": {
+      "alias": " - blimp_left_branch_island_echo_question",
+      "acc,none": 0.568,
+      "acc_stderr,none": 0.015672320237336203
+    },
+    "blimp_left_branch_island_simple_question": {
+      "alias": " - blimp_left_branch_island_simple_question",
+      "acc,none": 0.649,
+      "acc_stderr,none": 0.015100563798316407
+    },
+    "blimp_matrix_question_npi_licensor_present": {
+      "alias": " - blimp_matrix_question_npi_licensor_present",
+      "acc,none": 0.313,
+      "acc_stderr,none": 0.014671272822977888
+    },
+    "blimp_npi_present_1": {
+      "alias": " - blimp_npi_present_1",
+      "acc,none": 0.591,
+      "acc_stderr,none": 0.015555094373257942
+    },
+    "blimp_npi_present_2": {
+      "alias": " - blimp_npi_present_2",
+      "acc,none": 0.696,
+      "acc_stderr,none": 0.014553205687950436
+    },
+    "blimp_only_npi_licensor_present": {
+      "alias": " - blimp_only_npi_licensor_present",
+      "acc,none": 0.961,
+      "acc_stderr,none": 0.006125072776426101
+    },
+    "blimp_only_npi_scope": {
+      "alias": " - blimp_only_npi_scope",
+      "acc,none": 0.522,
+      "acc_stderr,none": 0.015803979428161946
+    },
+    "blimp_passive_1": {
+      "alias": " - blimp_passive_1",
+      "acc,none": 0.9,
+      "acc_stderr,none": 0.009491579957525038
+    },
+    "blimp_passive_2": {
+      "alias": " - blimp_passive_2",
+      "acc,none": 0.89,
+      "acc_stderr,none": 0.009899393819724439
+    },
+    "blimp_principle_A_c_command": {
+      "alias": " - blimp_principle_A_c_command",
+      "acc,none": 0.669,
+      "acc_stderr,none": 0.014888272588203936
+    },
+    "blimp_principle_A_case_1": {
+      "alias": " - blimp_principle_A_case_1",
+      "acc,none": 1.0,
+      "acc_stderr,none": 0.0
+    },
+    "blimp_principle_A_case_2": {
+      "alias": " - blimp_principle_A_case_2",
+      "acc,none": 0.951,
+      "acc_stderr,none": 0.0068297617561409165
+    },
+    "blimp_principle_A_domain_1": {
+      "alias": " - blimp_principle_A_domain_1",
+      "acc,none": 0.945,
+      "acc_stderr,none": 0.007212976294639237
+    },
+    "blimp_principle_A_domain_2": {
+      "alias": " - blimp_principle_A_domain_2",
+      "acc,none": 0.832,
+      "acc_stderr,none": 0.011828605831454276
+    },
+    "blimp_principle_A_domain_3": {
+      "alias": " - blimp_principle_A_domain_3",
+      "acc,none": 0.646,
+      "acc_stderr,none": 0.015129868238451772
+    },
+    "blimp_principle_A_reconstruction": {
+      "alias": " - blimp_principle_A_reconstruction",
+      "acc,none": 0.391,
+      "acc_stderr,none": 0.015438826294681783
+    },
+    "blimp_regular_plural_subject_verb_agreement_1": {
+      "alias": " - blimp_regular_plural_subject_verb_agreement_1",
+      "acc,none": 0.93,
+      "acc_stderr,none": 0.008072494358323494
+    },
+    "blimp_regular_plural_subject_verb_agreement_2": {
+      "alias": " - blimp_regular_plural_subject_verb_agreement_2",
+      "acc,none": 0.888,
+      "acc_stderr,none": 0.00997775303139725
+    },
+    "blimp_sentential_negation_npi_licensor_present": {
+      "alias": " - blimp_sentential_negation_npi_licensor_present",
+      "acc,none": 0.97,
+      "acc_stderr,none": 0.005397140829099205
+    },
+    "blimp_sentential_negation_npi_scope": {
+      "alias": " - blimp_sentential_negation_npi_scope",
+      "acc,none": 0.662,
+      "acc_stderr,none": 0.014965960710224485
+    },
+    "blimp_sentential_subject_island": {
+      "alias": " - blimp_sentential_subject_island",
+      "acc,none": 0.511,
+      "acc_stderr,none": 0.01581547119529269
+    },
+    "blimp_superlative_quantifiers_1": {
+      "alias": " - blimp_superlative_quantifiers_1",
+      "acc,none": 0.84,
+      "acc_stderr,none": 0.011598902298689009
+    },
+    "blimp_superlative_quantifiers_2": {
+      "alias": " - blimp_superlative_quantifiers_2",
+      "acc,none": 0.705,
+      "acc_stderr,none": 0.01442855443844551
+    },
+    "blimp_tough_vs_raising_1": {
+      "alias": " - blimp_tough_vs_raising_1",
+      "acc,none": 0.655,
+      "acc_stderr,none": 0.015039986742055237
+    },
+    "blimp_tough_vs_raising_2": {
+      "alias": " - blimp_tough_vs_raising_2",
+      "acc,none": 0.882,
+      "acc_stderr,none": 0.010206869264381793
+    },
+    "blimp_transitive": {
+      "alias": " - blimp_transitive",
+      "acc,none": 0.838,
+      "acc_stderr,none": 0.01165726777130441
+    },
+    "blimp_wh_island": {
+      "alias": " - blimp_wh_island",
+      "acc,none": 0.763,
+      "acc_stderr,none": 0.01345407046257795
+    },
+    "blimp_wh_questions_object_gap": {
+      "alias": " - blimp_wh_questions_object_gap",
+      "acc,none": 0.825,
+      "acc_stderr,none": 0.012021627157731972
+    },
+    "blimp_wh_questions_subject_gap": {
+      "alias": " - blimp_wh_questions_subject_gap",
+      "acc,none": 0.955,
+      "acc_stderr,none": 0.00655881224140612
+    },
+    "blimp_wh_questions_subject_gap_long_distance": {
+      "alias": " - blimp_wh_questions_subject_gap_long_distance",
+      "acc,none": 0.916,
+      "acc_stderr,none": 0.008776162089491122
+    },
+    "blimp_wh_vs_that_no_gap": {
+      "alias": " - blimp_wh_vs_that_no_gap",
+      "acc,none": 0.983,
+      "acc_stderr,none": 0.004089954489689092
+    },
+    "blimp_wh_vs_that_no_gap_long_distance": {
+      "alias": " - blimp_wh_vs_that_no_gap_long_distance",
+      "acc,none": 0.987,
+      "acc_stderr,none": 0.0035838308894036368
+    },
+    "blimp_wh_vs_that_with_gap": {
+      "alias": " - blimp_wh_vs_that_with_gap",
+      "acc,none": 0.486,
+      "acc_stderr,none": 0.015813097547730987
+    },
+    "blimp_wh_vs_that_with_gap_long_distance": {
+      "alias": " - blimp_wh_vs_that_with_gap_long_distance",
+      "acc,none": 0.246,
+      "acc_stderr,none": 0.013626065817750634
+    }
+  },
+  "groups": {
+    "blimp": {
+      "acc,none": 0.7874626865671641,
+      "acc_stderr,none": 0.0014187246183603329,
+      "alias": "blimp"
+    }
+  },
+  "group_subtasks": {
+    "blimp": [
+      "blimp_adjunct_island",
+      "blimp_anaphor_gender_agreement",
+      "blimp_anaphor_number_agreement",
+      "blimp_animate_subject_passive",
+      "blimp_animate_subject_trans",
+      "blimp_causative",
+      "blimp_complex_NP_island",
+      "blimp_coordinate_structure_constraint_complex_left_branch",
+      "blimp_coordinate_structure_constraint_object_extraction",
+      "blimp_determiner_noun_agreement_1",
+      "blimp_determiner_noun_agreement_2",
+      "blimp_determiner_noun_agreement_irregular_1",
+      "blimp_determiner_noun_agreement_irregular_2",
+      "blimp_determiner_noun_agreement_with_adj_2",
+      "blimp_determiner_noun_agreement_with_adj_irregular_1",
+      "blimp_determiner_noun_agreement_with_adj_irregular_2",
+      "blimp_determiner_noun_agreement_with_adjective_1",
+      "blimp_distractor_agreement_relational_noun",
+      "blimp_distractor_agreement_relative_clause",
+      "blimp_drop_argument",
+      "blimp_ellipsis_n_bar_1",
+      "blimp_ellipsis_n_bar_2",
+      "blimp_existential_there_object_raising",
+      "blimp_existential_there_quantifiers_1",
+      "blimp_existential_there_quantifiers_2",
+      "blimp_existential_there_subject_raising",
+      "blimp_expletive_it_object_raising",
+      "blimp_inchoative",
+      "blimp_intransitive",
+      "blimp_irregular_past_participle_adjectives",
+      "blimp_irregular_past_participle_verbs",
+      "blimp_irregular_plural_subject_verb_agreement_1",
+      "blimp_irregular_plural_subject_verb_agreement_2",
+      "blimp_left_branch_island_echo_question",
+      "blimp_left_branch_island_simple_question",
+      "blimp_matrix_question_npi_licensor_present",
+      "blimp_npi_present_1",
+      "blimp_npi_present_2",
+      "blimp_only_npi_licensor_present",
+      "blimp_only_npi_scope",
+      "blimp_passive_1",
+      "blimp_passive_2",
+      "blimp_principle_A_c_command",
+      "blimp_principle_A_case_1",
+      "blimp_principle_A_case_2",
+      "blimp_principle_A_domain_1",
+      "blimp_principle_A_domain_2",
+      "blimp_principle_A_domain_3",
+      "blimp_principle_A_reconstruction",
+      "blimp_regular_plural_subject_verb_agreement_1",
+      "blimp_regular_plural_subject_verb_agreement_2",
+      "blimp_sentential_negation_npi_licensor_present",
+      "blimp_sentential_negation_npi_scope",
+      "blimp_sentential_subject_island",
+      "blimp_superlative_quantifiers_1",
+      "blimp_superlative_quantifiers_2",
+      "blimp_tough_vs_raising_1",
+      "blimp_tough_vs_raising_2",
+      "blimp_transitive",
+      "blimp_wh_island",
+      "blimp_wh_questions_object_gap",
+      "blimp_wh_questions_subject_gap",
+      "blimp_wh_questions_subject_gap_long_distance",
+      "blimp_wh_vs_that_no_gap",
+      "blimp_wh_vs_that_no_gap_long_distance",
+      "blimp_wh_vs_that_with_gap",
+      "blimp_wh_vs_that_with_gap_long_distance"
+    ]
+  },
+  "configs": {
+    "blimp_adjunct_island": {
+      "task": "blimp_adjunct_island",
+      "dataset_path": "blimp",
+      "dataset_name": "adjunct_island",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_anaphor_gender_agreement": {
+      "task": "blimp_anaphor_gender_agreement",
+      "dataset_path": "blimp",
+      "dataset_name": "anaphor_gender_agreement",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_anaphor_number_agreement": {
+      "task": "blimp_anaphor_number_agreement",
+      "dataset_path": "blimp",
+      "dataset_name": "anaphor_number_agreement",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_animate_subject_passive": {
+      "task": "blimp_animate_subject_passive",
+      "dataset_path": "blimp",
+      "dataset_name": "animate_subject_passive",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_animate_subject_trans": {
+      "task": "blimp_animate_subject_trans",
+      "dataset_path": "blimp",
+      "dataset_name": "animate_subject_trans",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_causative": {
+      "task": "blimp_causative",
+      "dataset_path": "blimp",
+      "dataset_name": "causative",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_complex_NP_island": {
+      "task": "blimp_complex_NP_island",
+      "dataset_path": "blimp",
+      "dataset_name": "complex_NP_island",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_coordinate_structure_constraint_complex_left_branch": {
+      "task": "blimp_coordinate_structure_constraint_complex_left_branch",
+      "dataset_path": "blimp",
+      "dataset_name": "coordinate_structure_constraint_complex_left_branch",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_coordinate_structure_constraint_object_extraction": {
+      "task": "blimp_coordinate_structure_constraint_object_extraction",
+      "dataset_path": "blimp",
+      "dataset_name": "coordinate_structure_constraint_object_extraction",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_determiner_noun_agreement_1": {
+      "task": "blimp_determiner_noun_agreement_1",
+      "dataset_path": "blimp",
+      "dataset_name": "determiner_noun_agreement_1",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_determiner_noun_agreement_2": {
+      "task": "blimp_determiner_noun_agreement_2",
+      "dataset_path": "blimp",
+      "dataset_name": "determiner_noun_agreement_2",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_determiner_noun_agreement_irregular_1": {
+      "task": "blimp_determiner_noun_agreement_irregular_1",
+      "dataset_path": "blimp",
+      "dataset_name": "determiner_noun_agreement_irregular_1",
+      "validation_split": "train",
+      "doc_to_text": "",
+      "doc_to_target": 0,
+      "unsafe_code": false,
+      "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+      "metadata": {
+        "version": 1.0
+      }
+    },
+    "blimp_determiner_noun_agreement_irregular_2": {
+      "task": "blimp_determiner_noun_agreement_irregular_2",
+      "dataset_path": "blimp",
+      "dataset_name": "determiner_noun_agreement_irregular_2",
+      "validation_split": "train",
+      "doc_to_text": "",
765
+ "doc_to_target": 0,
766
+ "unsafe_code": false,
767
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
768
+ "description": "",
769
+ "target_delimiter": " ",
770
+ "fewshot_delimiter": "\n\n",
771
+ "num_fewshot": 0,
772
+ "metric_list": [
773
+ {
774
+ "metric": "acc",
775
+ "aggregation": "mean",
776
+ "higher_is_better": true
777
+ }
778
+ ],
779
+ "output_type": "multiple_choice",
780
+ "repeats": 1,
781
+ "should_decontaminate": true,
782
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
783
+ "metadata": {
784
+ "version": 1.0
785
+ }
786
+ },
787
+ "blimp_determiner_noun_agreement_with_adj_2": {
788
+ "task": "blimp_determiner_noun_agreement_with_adj_2",
789
+ "dataset_path": "blimp",
790
+ "dataset_name": "determiner_noun_agreement_with_adj_2",
791
+ "validation_split": "train",
792
+ "doc_to_text": "",
793
+ "doc_to_target": 0,
794
+ "unsafe_code": false,
795
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
796
+ "description": "",
797
+ "target_delimiter": " ",
798
+ "fewshot_delimiter": "\n\n",
799
+ "num_fewshot": 0,
800
+ "metric_list": [
801
+ {
802
+ "metric": "acc",
803
+ "aggregation": "mean",
804
+ "higher_is_better": true
805
+ }
806
+ ],
807
+ "output_type": "multiple_choice",
808
+ "repeats": 1,
809
+ "should_decontaminate": true,
810
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
811
+ "metadata": {
812
+ "version": 1.0
813
+ }
814
+ },
815
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
816
+ "task": "blimp_determiner_noun_agreement_with_adj_irregular_1",
817
+ "dataset_path": "blimp",
818
+ "dataset_name": "determiner_noun_agreement_with_adj_irregular_1",
819
+ "validation_split": "train",
820
+ "doc_to_text": "",
821
+ "doc_to_target": 0,
822
+ "unsafe_code": false,
823
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
824
+ "description": "",
825
+ "target_delimiter": " ",
826
+ "fewshot_delimiter": "\n\n",
827
+ "num_fewshot": 0,
828
+ "metric_list": [
829
+ {
830
+ "metric": "acc",
831
+ "aggregation": "mean",
832
+ "higher_is_better": true
833
+ }
834
+ ],
835
+ "output_type": "multiple_choice",
836
+ "repeats": 1,
837
+ "should_decontaminate": true,
838
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
839
+ "metadata": {
840
+ "version": 1.0
841
+ }
842
+ },
843
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
844
+ "task": "blimp_determiner_noun_agreement_with_adj_irregular_2",
845
+ "dataset_path": "blimp",
846
+ "dataset_name": "determiner_noun_agreement_with_adj_irregular_2",
847
+ "validation_split": "train",
848
+ "doc_to_text": "",
849
+ "doc_to_target": 0,
850
+ "unsafe_code": false,
851
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
852
+ "description": "",
853
+ "target_delimiter": " ",
854
+ "fewshot_delimiter": "\n\n",
855
+ "num_fewshot": 0,
856
+ "metric_list": [
857
+ {
858
+ "metric": "acc",
859
+ "aggregation": "mean",
860
+ "higher_is_better": true
861
+ }
862
+ ],
863
+ "output_type": "multiple_choice",
864
+ "repeats": 1,
865
+ "should_decontaminate": true,
866
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
867
+ "metadata": {
868
+ "version": 1.0
869
+ }
870
+ },
871
+ "blimp_determiner_noun_agreement_with_adjective_1": {
872
+ "task": "blimp_determiner_noun_agreement_with_adjective_1",
873
+ "dataset_path": "blimp",
874
+ "dataset_name": "determiner_noun_agreement_with_adjective_1",
875
+ "validation_split": "train",
876
+ "doc_to_text": "",
877
+ "doc_to_target": 0,
878
+ "unsafe_code": false,
879
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
880
+ "description": "",
881
+ "target_delimiter": " ",
882
+ "fewshot_delimiter": "\n\n",
883
+ "num_fewshot": 0,
884
+ "metric_list": [
885
+ {
886
+ "metric": "acc",
887
+ "aggregation": "mean",
888
+ "higher_is_better": true
889
+ }
890
+ ],
891
+ "output_type": "multiple_choice",
892
+ "repeats": 1,
893
+ "should_decontaminate": true,
894
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
895
+ "metadata": {
896
+ "version": 1.0
897
+ }
898
+ },
899
+ "blimp_distractor_agreement_relational_noun": {
900
+ "task": "blimp_distractor_agreement_relational_noun",
901
+ "dataset_path": "blimp",
902
+ "dataset_name": "distractor_agreement_relational_noun",
903
+ "validation_split": "train",
904
+ "doc_to_text": "",
905
+ "doc_to_target": 0,
906
+ "unsafe_code": false,
907
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
908
+ "description": "",
909
+ "target_delimiter": " ",
910
+ "fewshot_delimiter": "\n\n",
911
+ "num_fewshot": 0,
912
+ "metric_list": [
913
+ {
914
+ "metric": "acc",
915
+ "aggregation": "mean",
916
+ "higher_is_better": true
917
+ }
918
+ ],
919
+ "output_type": "multiple_choice",
920
+ "repeats": 1,
921
+ "should_decontaminate": true,
922
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
923
+ "metadata": {
924
+ "version": 1.0
925
+ }
926
+ },
927
+ "blimp_distractor_agreement_relative_clause": {
928
+ "task": "blimp_distractor_agreement_relative_clause",
929
+ "dataset_path": "blimp",
930
+ "dataset_name": "distractor_agreement_relative_clause",
931
+ "validation_split": "train",
932
+ "doc_to_text": "",
933
+ "doc_to_target": 0,
934
+ "unsafe_code": false,
935
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
936
+ "description": "",
937
+ "target_delimiter": " ",
938
+ "fewshot_delimiter": "\n\n",
939
+ "num_fewshot": 0,
940
+ "metric_list": [
941
+ {
942
+ "metric": "acc",
943
+ "aggregation": "mean",
944
+ "higher_is_better": true
945
+ }
946
+ ],
947
+ "output_type": "multiple_choice",
948
+ "repeats": 1,
949
+ "should_decontaminate": true,
950
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
951
+ "metadata": {
952
+ "version": 1.0
953
+ }
954
+ },
955
+ "blimp_drop_argument": {
956
+ "task": "blimp_drop_argument",
957
+ "dataset_path": "blimp",
958
+ "dataset_name": "drop_argument",
959
+ "validation_split": "train",
960
+ "doc_to_text": "",
961
+ "doc_to_target": 0,
962
+ "unsafe_code": false,
963
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
964
+ "description": "",
965
+ "target_delimiter": " ",
966
+ "fewshot_delimiter": "\n\n",
967
+ "num_fewshot": 0,
968
+ "metric_list": [
969
+ {
970
+ "metric": "acc",
971
+ "aggregation": "mean",
972
+ "higher_is_better": true
973
+ }
974
+ ],
975
+ "output_type": "multiple_choice",
976
+ "repeats": 1,
977
+ "should_decontaminate": true,
978
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
979
+ "metadata": {
980
+ "version": 1.0
981
+ }
982
+ },
983
+ "blimp_ellipsis_n_bar_1": {
984
+ "task": "blimp_ellipsis_n_bar_1",
985
+ "dataset_path": "blimp",
986
+ "dataset_name": "ellipsis_n_bar_1",
987
+ "validation_split": "train",
988
+ "doc_to_text": "",
989
+ "doc_to_target": 0,
990
+ "unsafe_code": false,
991
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
992
+ "description": "",
993
+ "target_delimiter": " ",
994
+ "fewshot_delimiter": "\n\n",
995
+ "num_fewshot": 0,
996
+ "metric_list": [
997
+ {
998
+ "metric": "acc",
999
+ "aggregation": "mean",
1000
+ "higher_is_better": true
1001
+ }
1002
+ ],
1003
+ "output_type": "multiple_choice",
1004
+ "repeats": 1,
1005
+ "should_decontaminate": true,
1006
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1007
+ "metadata": {
1008
+ "version": 1.0
1009
+ }
1010
+ },
1011
+ "blimp_ellipsis_n_bar_2": {
1012
+ "task": "blimp_ellipsis_n_bar_2",
1013
+ "dataset_path": "blimp",
1014
+ "dataset_name": "ellipsis_n_bar_2",
1015
+ "validation_split": "train",
1016
+ "doc_to_text": "",
1017
+ "doc_to_target": 0,
1018
+ "unsafe_code": false,
1019
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1020
+ "description": "",
1021
+ "target_delimiter": " ",
1022
+ "fewshot_delimiter": "\n\n",
1023
+ "num_fewshot": 0,
1024
+ "metric_list": [
1025
+ {
1026
+ "metric": "acc",
1027
+ "aggregation": "mean",
1028
+ "higher_is_better": true
1029
+ }
1030
+ ],
1031
+ "output_type": "multiple_choice",
1032
+ "repeats": 1,
1033
+ "should_decontaminate": true,
1034
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1035
+ "metadata": {
1036
+ "version": 1.0
1037
+ }
1038
+ },
1039
+ "blimp_existential_there_object_raising": {
1040
+ "task": "blimp_existential_there_object_raising",
1041
+ "dataset_path": "blimp",
1042
+ "dataset_name": "existential_there_object_raising",
1043
+ "validation_split": "train",
1044
+ "doc_to_text": "",
1045
+ "doc_to_target": 0,
1046
+ "unsafe_code": false,
1047
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1048
+ "description": "",
1049
+ "target_delimiter": " ",
1050
+ "fewshot_delimiter": "\n\n",
1051
+ "num_fewshot": 0,
1052
+ "metric_list": [
1053
+ {
1054
+ "metric": "acc",
1055
+ "aggregation": "mean",
1056
+ "higher_is_better": true
1057
+ }
1058
+ ],
1059
+ "output_type": "multiple_choice",
1060
+ "repeats": 1,
1061
+ "should_decontaminate": true,
1062
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1063
+ "metadata": {
1064
+ "version": 1.0
1065
+ }
1066
+ },
1067
+ "blimp_existential_there_quantifiers_1": {
1068
+ "task": "blimp_existential_there_quantifiers_1",
1069
+ "dataset_path": "blimp",
1070
+ "dataset_name": "existential_there_quantifiers_1",
1071
+ "validation_split": "train",
1072
+ "doc_to_text": "",
1073
+ "doc_to_target": 0,
1074
+ "unsafe_code": false,
1075
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1076
+ "description": "",
1077
+ "target_delimiter": " ",
1078
+ "fewshot_delimiter": "\n\n",
1079
+ "num_fewshot": 0,
1080
+ "metric_list": [
1081
+ {
1082
+ "metric": "acc",
1083
+ "aggregation": "mean",
1084
+ "higher_is_better": true
1085
+ }
1086
+ ],
1087
+ "output_type": "multiple_choice",
1088
+ "repeats": 1,
1089
+ "should_decontaminate": true,
1090
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1091
+ "metadata": {
1092
+ "version": 1.0
1093
+ }
1094
+ },
1095
+ "blimp_existential_there_quantifiers_2": {
1096
+ "task": "blimp_existential_there_quantifiers_2",
1097
+ "dataset_path": "blimp",
1098
+ "dataset_name": "existential_there_quantifiers_2",
1099
+ "validation_split": "train",
1100
+ "doc_to_text": "",
1101
+ "doc_to_target": 0,
1102
+ "unsafe_code": false,
1103
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1104
+ "description": "",
1105
+ "target_delimiter": " ",
1106
+ "fewshot_delimiter": "\n\n",
1107
+ "num_fewshot": 0,
1108
+ "metric_list": [
1109
+ {
1110
+ "metric": "acc",
1111
+ "aggregation": "mean",
1112
+ "higher_is_better": true
1113
+ }
1114
+ ],
1115
+ "output_type": "multiple_choice",
1116
+ "repeats": 1,
1117
+ "should_decontaminate": true,
1118
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1119
+ "metadata": {
1120
+ "version": 1.0
1121
+ }
1122
+ },
1123
+ "blimp_existential_there_subject_raising": {
1124
+ "task": "blimp_existential_there_subject_raising",
1125
+ "dataset_path": "blimp",
1126
+ "dataset_name": "existential_there_subject_raising",
1127
+ "validation_split": "train",
1128
+ "doc_to_text": "",
1129
+ "doc_to_target": 0,
1130
+ "unsafe_code": false,
1131
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1132
+ "description": "",
1133
+ "target_delimiter": " ",
1134
+ "fewshot_delimiter": "\n\n",
1135
+ "num_fewshot": 0,
1136
+ "metric_list": [
1137
+ {
1138
+ "metric": "acc",
1139
+ "aggregation": "mean",
1140
+ "higher_is_better": true
1141
+ }
1142
+ ],
1143
+ "output_type": "multiple_choice",
1144
+ "repeats": 1,
1145
+ "should_decontaminate": true,
1146
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1147
+ "metadata": {
1148
+ "version": 1.0
1149
+ }
1150
+ },
1151
+ "blimp_expletive_it_object_raising": {
1152
+ "task": "blimp_expletive_it_object_raising",
1153
+ "dataset_path": "blimp",
1154
+ "dataset_name": "expletive_it_object_raising",
1155
+ "validation_split": "train",
1156
+ "doc_to_text": "",
1157
+ "doc_to_target": 0,
1158
+ "unsafe_code": false,
1159
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1160
+ "description": "",
1161
+ "target_delimiter": " ",
1162
+ "fewshot_delimiter": "\n\n",
1163
+ "num_fewshot": 0,
1164
+ "metric_list": [
1165
+ {
1166
+ "metric": "acc",
1167
+ "aggregation": "mean",
1168
+ "higher_is_better": true
1169
+ }
1170
+ ],
1171
+ "output_type": "multiple_choice",
1172
+ "repeats": 1,
1173
+ "should_decontaminate": true,
1174
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1175
+ "metadata": {
1176
+ "version": 1.0
1177
+ }
1178
+ },
1179
+ "blimp_inchoative": {
1180
+ "task": "blimp_inchoative",
1181
+ "dataset_path": "blimp",
1182
+ "dataset_name": "inchoative",
1183
+ "validation_split": "train",
1184
+ "doc_to_text": "",
1185
+ "doc_to_target": 0,
1186
+ "unsafe_code": false,
1187
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1188
+ "description": "",
1189
+ "target_delimiter": " ",
1190
+ "fewshot_delimiter": "\n\n",
1191
+ "num_fewshot": 0,
1192
+ "metric_list": [
1193
+ {
1194
+ "metric": "acc",
1195
+ "aggregation": "mean",
1196
+ "higher_is_better": true
1197
+ }
1198
+ ],
1199
+ "output_type": "multiple_choice",
1200
+ "repeats": 1,
1201
+ "should_decontaminate": true,
1202
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1203
+ "metadata": {
1204
+ "version": 1.0
1205
+ }
1206
+ },
1207
+ "blimp_intransitive": {
1208
+ "task": "blimp_intransitive",
1209
+ "dataset_path": "blimp",
1210
+ "dataset_name": "intransitive",
1211
+ "validation_split": "train",
1212
+ "doc_to_text": "",
1213
+ "doc_to_target": 0,
1214
+ "unsafe_code": false,
1215
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1216
+ "description": "",
1217
+ "target_delimiter": " ",
1218
+ "fewshot_delimiter": "\n\n",
1219
+ "num_fewshot": 0,
1220
+ "metric_list": [
1221
+ {
1222
+ "metric": "acc",
1223
+ "aggregation": "mean",
1224
+ "higher_is_better": true
1225
+ }
1226
+ ],
1227
+ "output_type": "multiple_choice",
1228
+ "repeats": 1,
1229
+ "should_decontaminate": true,
1230
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1231
+ "metadata": {
1232
+ "version": 1.0
1233
+ }
1234
+ },
1235
+ "blimp_irregular_past_participle_adjectives": {
1236
+ "task": "blimp_irregular_past_participle_adjectives",
1237
+ "dataset_path": "blimp",
1238
+ "dataset_name": "irregular_past_participle_adjectives",
1239
+ "validation_split": "train",
1240
+ "doc_to_text": "",
1241
+ "doc_to_target": 0,
1242
+ "unsafe_code": false,
1243
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1244
+ "description": "",
1245
+ "target_delimiter": " ",
1246
+ "fewshot_delimiter": "\n\n",
1247
+ "num_fewshot": 0,
1248
+ "metric_list": [
1249
+ {
1250
+ "metric": "acc",
1251
+ "aggregation": "mean",
1252
+ "higher_is_better": true
1253
+ }
1254
+ ],
1255
+ "output_type": "multiple_choice",
1256
+ "repeats": 1,
1257
+ "should_decontaminate": true,
1258
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1259
+ "metadata": {
1260
+ "version": 1.0
1261
+ }
1262
+ },
1263
+ "blimp_irregular_past_participle_verbs": {
1264
+ "task": "blimp_irregular_past_participle_verbs",
1265
+ "dataset_path": "blimp",
1266
+ "dataset_name": "irregular_past_participle_verbs",
1267
+ "validation_split": "train",
1268
+ "doc_to_text": "",
1269
+ "doc_to_target": 0,
1270
+ "unsafe_code": false,
1271
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1272
+ "description": "",
1273
+ "target_delimiter": " ",
1274
+ "fewshot_delimiter": "\n\n",
1275
+ "num_fewshot": 0,
1276
+ "metric_list": [
1277
+ {
1278
+ "metric": "acc",
1279
+ "aggregation": "mean",
1280
+ "higher_is_better": true
1281
+ }
1282
+ ],
1283
+ "output_type": "multiple_choice",
1284
+ "repeats": 1,
1285
+ "should_decontaminate": true,
1286
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1287
+ "metadata": {
1288
+ "version": 1.0
1289
+ }
1290
+ },
1291
+ "blimp_irregular_plural_subject_verb_agreement_1": {
1292
+ "task": "blimp_irregular_plural_subject_verb_agreement_1",
1293
+ "dataset_path": "blimp",
1294
+ "dataset_name": "irregular_plural_subject_verb_agreement_1",
1295
+ "validation_split": "train",
1296
+ "doc_to_text": "",
1297
+ "doc_to_target": 0,
1298
+ "unsafe_code": false,
1299
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1300
+ "description": "",
1301
+ "target_delimiter": " ",
1302
+ "fewshot_delimiter": "\n\n",
1303
+ "num_fewshot": 0,
1304
+ "metric_list": [
1305
+ {
1306
+ "metric": "acc",
1307
+ "aggregation": "mean",
1308
+ "higher_is_better": true
1309
+ }
1310
+ ],
1311
+ "output_type": "multiple_choice",
1312
+ "repeats": 1,
1313
+ "should_decontaminate": true,
1314
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1315
+ "metadata": {
1316
+ "version": 1.0
1317
+ }
1318
+ },
1319
+ "blimp_irregular_plural_subject_verb_agreement_2": {
1320
+ "task": "blimp_irregular_plural_subject_verb_agreement_2",
1321
+ "dataset_path": "blimp",
1322
+ "dataset_name": "irregular_plural_subject_verb_agreement_2",
1323
+ "validation_split": "train",
1324
+ "doc_to_text": "",
1325
+ "doc_to_target": 0,
1326
+ "unsafe_code": false,
1327
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1328
+ "description": "",
1329
+ "target_delimiter": " ",
1330
+ "fewshot_delimiter": "\n\n",
1331
+ "num_fewshot": 0,
1332
+ "metric_list": [
1333
+ {
1334
+ "metric": "acc",
1335
+ "aggregation": "mean",
1336
+ "higher_is_better": true
1337
+ }
1338
+ ],
1339
+ "output_type": "multiple_choice",
1340
+ "repeats": 1,
1341
+ "should_decontaminate": true,
1342
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1343
+ "metadata": {
1344
+ "version": 1.0
1345
+ }
1346
+ },
1347
+ "blimp_left_branch_island_echo_question": {
1348
+ "task": "blimp_left_branch_island_echo_question",
1349
+ "dataset_path": "blimp",
1350
+ "dataset_name": "left_branch_island_echo_question",
1351
+ "validation_split": "train",
1352
+ "doc_to_text": "",
1353
+ "doc_to_target": 0,
1354
+ "unsafe_code": false,
1355
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1356
+ "description": "",
1357
+ "target_delimiter": " ",
1358
+ "fewshot_delimiter": "\n\n",
1359
+ "num_fewshot": 0,
1360
+ "metric_list": [
1361
+ {
1362
+ "metric": "acc",
1363
+ "aggregation": "mean",
1364
+ "higher_is_better": true
1365
+ }
1366
+ ],
1367
+ "output_type": "multiple_choice",
1368
+ "repeats": 1,
1369
+ "should_decontaminate": true,
1370
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1371
+ "metadata": {
1372
+ "version": 1.0
1373
+ }
1374
+ },
1375
+ "blimp_left_branch_island_simple_question": {
1376
+ "task": "blimp_left_branch_island_simple_question",
1377
+ "dataset_path": "blimp",
1378
+ "dataset_name": "left_branch_island_simple_question",
1379
+ "validation_split": "train",
1380
+ "doc_to_text": "",
1381
+ "doc_to_target": 0,
1382
+ "unsafe_code": false,
1383
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1384
+ "description": "",
1385
+ "target_delimiter": " ",
1386
+ "fewshot_delimiter": "\n\n",
1387
+ "num_fewshot": 0,
1388
+ "metric_list": [
1389
+ {
1390
+ "metric": "acc",
1391
+ "aggregation": "mean",
1392
+ "higher_is_better": true
1393
+ }
1394
+ ],
1395
+ "output_type": "multiple_choice",
1396
+ "repeats": 1,
1397
+ "should_decontaminate": true,
1398
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1399
+ "metadata": {
1400
+ "version": 1.0
1401
+ }
1402
+ },
1403
+ "blimp_matrix_question_npi_licensor_present": {
1404
+ "task": "blimp_matrix_question_npi_licensor_present",
1405
+ "dataset_path": "blimp",
1406
+ "dataset_name": "matrix_question_npi_licensor_present",
1407
+ "validation_split": "train",
1408
+ "doc_to_text": "",
1409
+ "doc_to_target": 0,
1410
+ "unsafe_code": false,
1411
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1412
+ "description": "",
1413
+ "target_delimiter": " ",
1414
+ "fewshot_delimiter": "\n\n",
1415
+ "num_fewshot": 0,
1416
+ "metric_list": [
1417
+ {
1418
+ "metric": "acc",
1419
+ "aggregation": "mean",
1420
+ "higher_is_better": true
1421
+ }
1422
+ ],
1423
+ "output_type": "multiple_choice",
1424
+ "repeats": 1,
1425
+ "should_decontaminate": true,
1426
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1427
+ "metadata": {
1428
+ "version": 1.0
1429
+ }
1430
+ },
1431
+ "blimp_npi_present_1": {
1432
+ "task": "blimp_npi_present_1",
1433
+ "dataset_path": "blimp",
1434
+ "dataset_name": "npi_present_1",
1435
+ "validation_split": "train",
1436
+ "doc_to_text": "",
1437
+ "doc_to_target": 0,
1438
+ "unsafe_code": false,
1439
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1440
+ "description": "",
1441
+ "target_delimiter": " ",
1442
+ "fewshot_delimiter": "\n\n",
1443
+ "num_fewshot": 0,
1444
+ "metric_list": [
1445
+ {
1446
+ "metric": "acc",
1447
+ "aggregation": "mean",
1448
+ "higher_is_better": true
1449
+ }
1450
+ ],
1451
+ "output_type": "multiple_choice",
1452
+ "repeats": 1,
1453
+ "should_decontaminate": true,
1454
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1455
+ "metadata": {
1456
+ "version": 1.0
1457
+ }
1458
+ },
1459
+ "blimp_npi_present_2": {
1460
+ "task": "blimp_npi_present_2",
1461
+ "dataset_path": "blimp",
1462
+ "dataset_name": "npi_present_2",
1463
+ "validation_split": "train",
1464
+ "doc_to_text": "",
1465
+ "doc_to_target": 0,
1466
+ "unsafe_code": false,
1467
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1468
+ "description": "",
1469
+ "target_delimiter": " ",
1470
+ "fewshot_delimiter": "\n\n",
1471
+ "num_fewshot": 0,
1472
+ "metric_list": [
1473
+ {
1474
+ "metric": "acc",
1475
+ "aggregation": "mean",
1476
+ "higher_is_better": true
1477
+ }
1478
+ ],
1479
+ "output_type": "multiple_choice",
1480
+ "repeats": 1,
1481
+ "should_decontaminate": true,
1482
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1483
+ "metadata": {
1484
+ "version": 1.0
1485
+ }
1486
+ },
1487
+ "blimp_only_npi_licensor_present": {
1488
+ "task": "blimp_only_npi_licensor_present",
1489
+ "dataset_path": "blimp",
1490
+ "dataset_name": "only_npi_licensor_present",
1491
+ "validation_split": "train",
1492
+ "doc_to_text": "",
1493
+ "doc_to_target": 0,
1494
+ "unsafe_code": false,
1495
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1496
+ "description": "",
1497
+ "target_delimiter": " ",
1498
+ "fewshot_delimiter": "\n\n",
1499
+ "num_fewshot": 0,
1500
+ "metric_list": [
1501
+ {
1502
+ "metric": "acc",
1503
+ "aggregation": "mean",
1504
+ "higher_is_better": true
1505
+ }
1506
+ ],
1507
+ "output_type": "multiple_choice",
1508
+ "repeats": 1,
1509
+ "should_decontaminate": true,
1510
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1511
+ "metadata": {
1512
+ "version": 1.0
1513
+ }
1514
+ },
1515
+ "blimp_only_npi_scope": {
1516
+ "task": "blimp_only_npi_scope",
1517
+ "dataset_path": "blimp",
1518
+ "dataset_name": "only_npi_scope",
1519
+ "validation_split": "train",
1520
+ "doc_to_text": "",
1521
+ "doc_to_target": 0,
1522
+ "unsafe_code": false,
1523
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1524
+ "description": "",
1525
+ "target_delimiter": " ",
1526
+ "fewshot_delimiter": "\n\n",
1527
+ "num_fewshot": 0,
1528
+ "metric_list": [
1529
+ {
1530
+ "metric": "acc",
1531
+ "aggregation": "mean",
1532
+ "higher_is_better": true
1533
+ }
1534
+ ],
1535
+ "output_type": "multiple_choice",
1536
+ "repeats": 1,
1537
+ "should_decontaminate": true,
1538
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1539
+ "metadata": {
1540
+ "version": 1.0
1541
+ }
1542
+ },
1543
+ "blimp_passive_1": {
1544
+ "task": "blimp_passive_1",
1545
+ "dataset_path": "blimp",
1546
+ "dataset_name": "passive_1",
1547
+ "validation_split": "train",
1548
+ "doc_to_text": "",
1549
+ "doc_to_target": 0,
1550
+ "unsafe_code": false,
1551
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1552
+ "description": "",
1553
+ "target_delimiter": " ",
1554
+ "fewshot_delimiter": "\n\n",
1555
+ "num_fewshot": 0,
1556
+ "metric_list": [
1557
+ {
1558
+ "metric": "acc",
1559
+ "aggregation": "mean",
1560
+ "higher_is_better": true
1561
+ }
1562
+ ],
1563
+ "output_type": "multiple_choice",
1564
+ "repeats": 1,
1565
+ "should_decontaminate": true,
1566
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1567
+ "metadata": {
1568
+ "version": 1.0
1569
+ }
1570
+ },
1571
+ "blimp_passive_2": {
1572
+ "task": "blimp_passive_2",
1573
+ "dataset_path": "blimp",
1574
+ "dataset_name": "passive_2",
1575
+ "validation_split": "train",
1576
+ "doc_to_text": "",
1577
+ "doc_to_target": 0,
1578
+ "unsafe_code": false,
1579
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1580
+ "description": "",
1581
+ "target_delimiter": " ",
1582
+ "fewshot_delimiter": "\n\n",
1583
+ "num_fewshot": 0,
1584
+ "metric_list": [
1585
+ {
1586
+ "metric": "acc",
1587
+ "aggregation": "mean",
1588
+ "higher_is_better": true
1589
+ }
1590
+ ],
1591
+ "output_type": "multiple_choice",
1592
+ "repeats": 1,
1593
+ "should_decontaminate": true,
1594
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1595
+ "metadata": {
1596
+ "version": 1.0
1597
+ }
1598
+ },
1599
+ "blimp_principle_A_c_command": {
1600
+ "task": "blimp_principle_A_c_command",
1601
+ "dataset_path": "blimp",
1602
+ "dataset_name": "principle_A_c_command",
1603
+ "validation_split": "train",
1604
+ "doc_to_text": "",
1605
+ "doc_to_target": 0,
1606
+ "unsafe_code": false,
1607
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1608
+ "description": "",
1609
+ "target_delimiter": " ",
1610
+ "fewshot_delimiter": "\n\n",
1611
+ "num_fewshot": 0,
1612
+ "metric_list": [
1613
+ {
1614
+ "metric": "acc",
1615
+ "aggregation": "mean",
1616
+ "higher_is_better": true
1617
+ }
1618
+ ],
1619
+ "output_type": "multiple_choice",
1620
+ "repeats": 1,
1621
+ "should_decontaminate": true,
1622
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1623
+ "metadata": {
1624
+ "version": 1.0
1625
+ }
1626
+ },
1627
+ "blimp_principle_A_case_1": {
1628
+ "task": "blimp_principle_A_case_1",
1629
+ "dataset_path": "blimp",
1630
+ "dataset_name": "principle_A_case_1",
1631
+ "validation_split": "train",
1632
+ "doc_to_text": "",
1633
+ "doc_to_target": 0,
1634
+ "unsafe_code": false,
1635
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1636
+ "description": "",
1637
+ "target_delimiter": " ",
1638
+ "fewshot_delimiter": "\n\n",
1639
+ "num_fewshot": 0,
1640
+ "metric_list": [
1641
+ {
1642
+ "metric": "acc",
1643
+ "aggregation": "mean",
1644
+ "higher_is_better": true
1645
+ }
1646
+ ],
1647
+ "output_type": "multiple_choice",
1648
+ "repeats": 1,
1649
+ "should_decontaminate": true,
1650
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1651
+ "metadata": {
1652
+ "version": 1.0
1653
+ }
1654
+ },
1655
+ "blimp_principle_A_case_2": {
1656
+ "task": "blimp_principle_A_case_2",
1657
+ "dataset_path": "blimp",
1658
+ "dataset_name": "principle_A_case_2",
1659
+ "validation_split": "train",
1660
+ "doc_to_text": "",
1661
+ "doc_to_target": 0,
1662
+ "unsafe_code": false,
1663
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1664
+ "description": "",
1665
+ "target_delimiter": " ",
1666
+ "fewshot_delimiter": "\n\n",
1667
+ "num_fewshot": 0,
1668
+ "metric_list": [
1669
+ {
1670
+ "metric": "acc",
1671
+ "aggregation": "mean",
1672
+ "higher_is_better": true
1673
+ }
1674
+ ],
1675
+ "output_type": "multiple_choice",
1676
+ "repeats": 1,
1677
+ "should_decontaminate": true,
1678
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1679
+ "metadata": {
1680
+ "version": 1.0
1681
+ }
1682
+ },
1683
+ "blimp_principle_A_domain_1": {
1684
+ "task": "blimp_principle_A_domain_1",
1685
+ "dataset_path": "blimp",
1686
+ "dataset_name": "principle_A_domain_1",
1687
+ "validation_split": "train",
1688
+ "doc_to_text": "",
1689
+ "doc_to_target": 0,
1690
+ "unsafe_code": false,
1691
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1692
+ "description": "",
1693
+ "target_delimiter": " ",
1694
+ "fewshot_delimiter": "\n\n",
1695
+ "num_fewshot": 0,
1696
+ "metric_list": [
1697
+ {
1698
+ "metric": "acc",
1699
+ "aggregation": "mean",
1700
+ "higher_is_better": true
1701
+ }
1702
+ ],
1703
+ "output_type": "multiple_choice",
1704
+ "repeats": 1,
1705
+ "should_decontaminate": true,
1706
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1707
+ "metadata": {
1708
+ "version": 1.0
1709
+ }
1710
+ },
1711
+ "blimp_principle_A_domain_2": {
1712
+ "task": "blimp_principle_A_domain_2",
1713
+ "dataset_path": "blimp",
1714
+ "dataset_name": "principle_A_domain_2",
1715
+ "validation_split": "train",
1716
+ "doc_to_text": "",
1717
+ "doc_to_target": 0,
1718
+ "unsafe_code": false,
1719
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1720
+ "description": "",
1721
+ "target_delimiter": " ",
1722
+ "fewshot_delimiter": "\n\n",
1723
+ "num_fewshot": 0,
1724
+ "metric_list": [
1725
+ {
1726
+ "metric": "acc",
1727
+ "aggregation": "mean",
1728
+ "higher_is_better": true
1729
+ }
1730
+ ],
1731
+ "output_type": "multiple_choice",
1732
+ "repeats": 1,
1733
+ "should_decontaminate": true,
1734
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1735
+ "metadata": {
1736
+ "version": 1.0
1737
+ }
1738
+ },
1739
+ "blimp_principle_A_domain_3": {
1740
+ "task": "blimp_principle_A_domain_3",
1741
+ "dataset_path": "blimp",
1742
+ "dataset_name": "principle_A_domain_3",
1743
+ "validation_split": "train",
1744
+ "doc_to_text": "",
1745
+ "doc_to_target": 0,
1746
+ "unsafe_code": false,
1747
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1748
+ "description": "",
1749
+ "target_delimiter": " ",
1750
+ "fewshot_delimiter": "\n\n",
1751
+ "num_fewshot": 0,
1752
+ "metric_list": [
1753
+ {
1754
+ "metric": "acc",
1755
+ "aggregation": "mean",
1756
+ "higher_is_better": true
1757
+ }
1758
+ ],
1759
+ "output_type": "multiple_choice",
1760
+ "repeats": 1,
1761
+ "should_decontaminate": true,
1762
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1763
+ "metadata": {
1764
+ "version": 1.0
1765
+ }
1766
+ },
1767
+ "blimp_principle_A_reconstruction": {
1768
+ "task": "blimp_principle_A_reconstruction",
1769
+ "dataset_path": "blimp",
1770
+ "dataset_name": "principle_A_reconstruction",
1771
+ "validation_split": "train",
1772
+ "doc_to_text": "",
1773
+ "doc_to_target": 0,
1774
+ "unsafe_code": false,
1775
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1776
+ "description": "",
1777
+ "target_delimiter": " ",
1778
+ "fewshot_delimiter": "\n\n",
1779
+ "num_fewshot": 0,
1780
+ "metric_list": [
1781
+ {
1782
+ "metric": "acc",
1783
+ "aggregation": "mean",
1784
+ "higher_is_better": true
1785
+ }
1786
+ ],
1787
+ "output_type": "multiple_choice",
1788
+ "repeats": 1,
1789
+ "should_decontaminate": true,
1790
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1791
+ "metadata": {
1792
+ "version": 1.0
1793
+ }
1794
+ },
1795
+ "blimp_regular_plural_subject_verb_agreement_1": {
1796
+ "task": "blimp_regular_plural_subject_verb_agreement_1",
1797
+ "dataset_path": "blimp",
1798
+ "dataset_name": "regular_plural_subject_verb_agreement_1",
1799
+ "validation_split": "train",
1800
+ "doc_to_text": "",
1801
+ "doc_to_target": 0,
1802
+ "unsafe_code": false,
1803
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1804
+ "description": "",
1805
+ "target_delimiter": " ",
1806
+ "fewshot_delimiter": "\n\n",
1807
+ "num_fewshot": 0,
1808
+ "metric_list": [
1809
+ {
1810
+ "metric": "acc",
1811
+ "aggregation": "mean",
1812
+ "higher_is_better": true
1813
+ }
1814
+ ],
1815
+ "output_type": "multiple_choice",
1816
+ "repeats": 1,
1817
+ "should_decontaminate": true,
1818
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1819
+ "metadata": {
1820
+ "version": 1.0
1821
+ }
1822
+ },
1823
+ "blimp_regular_plural_subject_verb_agreement_2": {
1824
+ "task": "blimp_regular_plural_subject_verb_agreement_2",
1825
+ "dataset_path": "blimp",
1826
+ "dataset_name": "regular_plural_subject_verb_agreement_2",
1827
+ "validation_split": "train",
1828
+ "doc_to_text": "",
1829
+ "doc_to_target": 0,
1830
+ "unsafe_code": false,
1831
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1832
+ "description": "",
1833
+ "target_delimiter": " ",
1834
+ "fewshot_delimiter": "\n\n",
1835
+ "num_fewshot": 0,
1836
+ "metric_list": [
1837
+ {
1838
+ "metric": "acc",
1839
+ "aggregation": "mean",
1840
+ "higher_is_better": true
1841
+ }
1842
+ ],
1843
+ "output_type": "multiple_choice",
1844
+ "repeats": 1,
1845
+ "should_decontaminate": true,
1846
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1847
+ "metadata": {
1848
+ "version": 1.0
1849
+ }
1850
+ },
1851
+ "blimp_sentential_negation_npi_licensor_present": {
1852
+ "task": "blimp_sentential_negation_npi_licensor_present",
1853
+ "dataset_path": "blimp",
1854
+ "dataset_name": "sentential_negation_npi_licensor_present",
1855
+ "validation_split": "train",
1856
+ "doc_to_text": "",
1857
+ "doc_to_target": 0,
1858
+ "unsafe_code": false,
1859
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1860
+ "description": "",
1861
+ "target_delimiter": " ",
1862
+ "fewshot_delimiter": "\n\n",
1863
+ "num_fewshot": 0,
1864
+ "metric_list": [
1865
+ {
1866
+ "metric": "acc",
1867
+ "aggregation": "mean",
1868
+ "higher_is_better": true
1869
+ }
1870
+ ],
1871
+ "output_type": "multiple_choice",
1872
+ "repeats": 1,
1873
+ "should_decontaminate": true,
1874
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1875
+ "metadata": {
1876
+ "version": 1.0
1877
+ }
1878
+ },
1879
+ "blimp_sentential_negation_npi_scope": {
1880
+ "task": "blimp_sentential_negation_npi_scope",
1881
+ "dataset_path": "blimp",
1882
+ "dataset_name": "sentential_negation_npi_scope",
1883
+ "validation_split": "train",
1884
+ "doc_to_text": "",
1885
+ "doc_to_target": 0,
1886
+ "unsafe_code": false,
1887
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1888
+ "description": "",
1889
+ "target_delimiter": " ",
1890
+ "fewshot_delimiter": "\n\n",
1891
+ "num_fewshot": 0,
1892
+ "metric_list": [
1893
+ {
1894
+ "metric": "acc",
1895
+ "aggregation": "mean",
1896
+ "higher_is_better": true
1897
+ }
1898
+ ],
1899
+ "output_type": "multiple_choice",
1900
+ "repeats": 1,
1901
+ "should_decontaminate": true,
1902
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1903
+ "metadata": {
1904
+ "version": 1.0
1905
+ }
1906
+ },
1907
+ "blimp_sentential_subject_island": {
1908
+ "task": "blimp_sentential_subject_island",
1909
+ "dataset_path": "blimp",
1910
+ "dataset_name": "sentential_subject_island",
1911
+ "validation_split": "train",
1912
+ "doc_to_text": "",
1913
+ "doc_to_target": 0,
1914
+ "unsafe_code": false,
1915
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1916
+ "description": "",
1917
+ "target_delimiter": " ",
1918
+ "fewshot_delimiter": "\n\n",
1919
+ "num_fewshot": 0,
1920
+ "metric_list": [
1921
+ {
1922
+ "metric": "acc",
1923
+ "aggregation": "mean",
1924
+ "higher_is_better": true
1925
+ }
1926
+ ],
1927
+ "output_type": "multiple_choice",
1928
+ "repeats": 1,
1929
+ "should_decontaminate": true,
1930
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1931
+ "metadata": {
1932
+ "version": 1.0
1933
+ }
1934
+ },
1935
+ "blimp_superlative_quantifiers_1": {
1936
+ "task": "blimp_superlative_quantifiers_1",
1937
+ "dataset_path": "blimp",
1938
+ "dataset_name": "superlative_quantifiers_1",
1939
+ "validation_split": "train",
1940
+ "doc_to_text": "",
1941
+ "doc_to_target": 0,
1942
+ "unsafe_code": false,
1943
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1944
+ "description": "",
1945
+ "target_delimiter": " ",
1946
+ "fewshot_delimiter": "\n\n",
1947
+ "num_fewshot": 0,
1948
+ "metric_list": [
1949
+ {
1950
+ "metric": "acc",
1951
+ "aggregation": "mean",
1952
+ "higher_is_better": true
1953
+ }
1954
+ ],
1955
+ "output_type": "multiple_choice",
1956
+ "repeats": 1,
1957
+ "should_decontaminate": true,
1958
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1959
+ "metadata": {
1960
+ "version": 1.0
1961
+ }
1962
+ },
1963
+ "blimp_superlative_quantifiers_2": {
1964
+ "task": "blimp_superlative_quantifiers_2",
1965
+ "dataset_path": "blimp",
1966
+ "dataset_name": "superlative_quantifiers_2",
1967
+ "validation_split": "train",
1968
+ "doc_to_text": "",
1969
+ "doc_to_target": 0,
1970
+ "unsafe_code": false,
1971
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1972
+ "description": "",
1973
+ "target_delimiter": " ",
1974
+ "fewshot_delimiter": "\n\n",
1975
+ "num_fewshot": 0,
1976
+ "metric_list": [
1977
+ {
1978
+ "metric": "acc",
1979
+ "aggregation": "mean",
1980
+ "higher_is_better": true
1981
+ }
1982
+ ],
1983
+ "output_type": "multiple_choice",
1984
+ "repeats": 1,
1985
+ "should_decontaminate": true,
1986
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1987
+ "metadata": {
1988
+ "version": 1.0
1989
+ }
1990
+ },
1991
+ "blimp_tough_vs_raising_1": {
1992
+ "task": "blimp_tough_vs_raising_1",
1993
+ "dataset_path": "blimp",
1994
+ "dataset_name": "tough_vs_raising_1",
1995
+ "validation_split": "train",
1996
+ "doc_to_text": "",
1997
+ "doc_to_target": 0,
1998
+ "unsafe_code": false,
1999
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2000
+ "description": "",
2001
+ "target_delimiter": " ",
2002
+ "fewshot_delimiter": "\n\n",
2003
+ "num_fewshot": 0,
2004
+ "metric_list": [
2005
+ {
2006
+ "metric": "acc",
2007
+ "aggregation": "mean",
2008
+ "higher_is_better": true
2009
+ }
2010
+ ],
2011
+ "output_type": "multiple_choice",
2012
+ "repeats": 1,
2013
+ "should_decontaminate": true,
2014
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2015
+ "metadata": {
2016
+ "version": 1.0
2017
+ }
2018
+ },
2019
+ "blimp_tough_vs_raising_2": {
2020
+ "task": "blimp_tough_vs_raising_2",
2021
+ "dataset_path": "blimp",
2022
+ "dataset_name": "tough_vs_raising_2",
2023
+ "validation_split": "train",
2024
+ "doc_to_text": "",
2025
+ "doc_to_target": 0,
2026
+ "unsafe_code": false,
2027
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2028
+ "description": "",
2029
+ "target_delimiter": " ",
2030
+ "fewshot_delimiter": "\n\n",
2031
+ "num_fewshot": 0,
2032
+ "metric_list": [
2033
+ {
2034
+ "metric": "acc",
2035
+ "aggregation": "mean",
2036
+ "higher_is_better": true
2037
+ }
2038
+ ],
2039
+ "output_type": "multiple_choice",
2040
+ "repeats": 1,
2041
+ "should_decontaminate": true,
2042
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2043
+ "metadata": {
2044
+ "version": 1.0
2045
+ }
2046
+ },
2047
+ "blimp_transitive": {
2048
+ "task": "blimp_transitive",
2049
+ "dataset_path": "blimp",
2050
+ "dataset_name": "transitive",
2051
+ "validation_split": "train",
2052
+ "doc_to_text": "",
2053
+ "doc_to_target": 0,
2054
+ "unsafe_code": false,
2055
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2056
+ "description": "",
2057
+ "target_delimiter": " ",
2058
+ "fewshot_delimiter": "\n\n",
2059
+ "num_fewshot": 0,
2060
+ "metric_list": [
2061
+ {
2062
+ "metric": "acc",
2063
+ "aggregation": "mean",
2064
+ "higher_is_better": true
2065
+ }
2066
+ ],
2067
+ "output_type": "multiple_choice",
2068
+ "repeats": 1,
2069
+ "should_decontaminate": true,
2070
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2071
+ "metadata": {
2072
+ "version": 1.0
2073
+ }
2074
+ },
2075
+ "blimp_wh_island": {
2076
+ "task": "blimp_wh_island",
2077
+ "dataset_path": "blimp",
2078
+ "dataset_name": "wh_island",
2079
+ "validation_split": "train",
2080
+ "doc_to_text": "",
2081
+ "doc_to_target": 0,
2082
+ "unsafe_code": false,
2083
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2084
+ "description": "",
2085
+ "target_delimiter": " ",
2086
+ "fewshot_delimiter": "\n\n",
2087
+ "num_fewshot": 0,
2088
+ "metric_list": [
2089
+ {
2090
+ "metric": "acc",
2091
+ "aggregation": "mean",
2092
+ "higher_is_better": true
2093
+ }
2094
+ ],
2095
+ "output_type": "multiple_choice",
2096
+ "repeats": 1,
2097
+ "should_decontaminate": true,
2098
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2099
+ "metadata": {
2100
+ "version": 1.0
2101
+ }
2102
+ },
2103
+ "blimp_wh_questions_object_gap": {
2104
+ "task": "blimp_wh_questions_object_gap",
2105
+ "dataset_path": "blimp",
2106
+ "dataset_name": "wh_questions_object_gap",
2107
+ "validation_split": "train",
2108
+ "doc_to_text": "",
2109
+ "doc_to_target": 0,
2110
+ "unsafe_code": false,
2111
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2112
+ "description": "",
2113
+ "target_delimiter": " ",
2114
+ "fewshot_delimiter": "\n\n",
2115
+ "num_fewshot": 0,
2116
+ "metric_list": [
2117
+ {
2118
+ "metric": "acc",
2119
+ "aggregation": "mean",
2120
+ "higher_is_better": true
2121
+ }
2122
+ ],
2123
+ "output_type": "multiple_choice",
2124
+ "repeats": 1,
2125
+ "should_decontaminate": true,
2126
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2127
+ "metadata": {
2128
+ "version": 1.0
2129
+ }
2130
+ },
2131
+ "blimp_wh_questions_subject_gap": {
2132
+ "task": "blimp_wh_questions_subject_gap",
2133
+ "dataset_path": "blimp",
2134
+ "dataset_name": "wh_questions_subject_gap",
2135
+ "validation_split": "train",
2136
+ "doc_to_text": "",
2137
+ "doc_to_target": 0,
2138
+ "unsafe_code": false,
2139
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2140
+ "description": "",
2141
+ "target_delimiter": " ",
2142
+ "fewshot_delimiter": "\n\n",
2143
+ "num_fewshot": 0,
2144
+ "metric_list": [
2145
+ {
2146
+ "metric": "acc",
2147
+ "aggregation": "mean",
2148
+ "higher_is_better": true
2149
+ }
2150
+ ],
2151
+ "output_type": "multiple_choice",
2152
+ "repeats": 1,
2153
+ "should_decontaminate": true,
2154
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2155
+ "metadata": {
2156
+ "version": 1.0
2157
+ }
2158
+ },
2159
+ "blimp_wh_questions_subject_gap_long_distance": {
2160
+ "task": "blimp_wh_questions_subject_gap_long_distance",
2161
+ "dataset_path": "blimp",
2162
+ "dataset_name": "wh_questions_subject_gap_long_distance",
2163
+ "validation_split": "train",
2164
+ "doc_to_text": "",
2165
+ "doc_to_target": 0,
2166
+ "unsafe_code": false,
2167
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2168
+ "description": "",
2169
+ "target_delimiter": " ",
2170
+ "fewshot_delimiter": "\n\n",
2171
+ "num_fewshot": 0,
2172
+ "metric_list": [
2173
+ {
2174
+ "metric": "acc",
2175
+ "aggregation": "mean",
2176
+ "higher_is_better": true
2177
+ }
2178
+ ],
2179
+ "output_type": "multiple_choice",
2180
+ "repeats": 1,
2181
+ "should_decontaminate": true,
2182
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2183
+ "metadata": {
2184
+ "version": 1.0
2185
+ }
2186
+ },
2187
+ "blimp_wh_vs_that_no_gap": {
2188
+ "task": "blimp_wh_vs_that_no_gap",
2189
+ "dataset_path": "blimp",
2190
+ "dataset_name": "wh_vs_that_no_gap",
2191
+ "validation_split": "train",
2192
+ "doc_to_text": "",
2193
+ "doc_to_target": 0,
2194
+ "unsafe_code": false,
2195
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2196
+ "description": "",
2197
+ "target_delimiter": " ",
2198
+ "fewshot_delimiter": "\n\n",
2199
+ "num_fewshot": 0,
2200
+ "metric_list": [
2201
+ {
2202
+ "metric": "acc",
2203
+ "aggregation": "mean",
2204
+ "higher_is_better": true
2205
+ }
2206
+ ],
2207
+ "output_type": "multiple_choice",
2208
+ "repeats": 1,
2209
+ "should_decontaminate": true,
2210
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2211
+ "metadata": {
2212
+ "version": 1.0
2213
+ }
2214
+ },
2215
+ "blimp_wh_vs_that_no_gap_long_distance": {
2216
+ "task": "blimp_wh_vs_that_no_gap_long_distance",
2217
+ "dataset_path": "blimp",
2218
+ "dataset_name": "wh_vs_that_no_gap_long_distance",
2219
+ "validation_split": "train",
2220
+ "doc_to_text": "",
2221
+ "doc_to_target": 0,
2222
+ "unsafe_code": false,
2223
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2224
+ "description": "",
2225
+ "target_delimiter": " ",
2226
+ "fewshot_delimiter": "\n\n",
2227
+ "num_fewshot": 0,
2228
+ "metric_list": [
2229
+ {
2230
+ "metric": "acc",
2231
+ "aggregation": "mean",
2232
+ "higher_is_better": true
2233
+ }
2234
+ ],
2235
+ "output_type": "multiple_choice",
2236
+ "repeats": 1,
2237
+ "should_decontaminate": true,
2238
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2239
+ "metadata": {
2240
+ "version": 1.0
2241
+ }
2242
+ },
2243
+ "blimp_wh_vs_that_with_gap": {
2244
+ "task": "blimp_wh_vs_that_with_gap",
2245
+ "dataset_path": "blimp",
2246
+ "dataset_name": "wh_vs_that_with_gap",
2247
+ "validation_split": "train",
2248
+ "doc_to_text": "",
2249
+ "doc_to_target": 0,
2250
+ "unsafe_code": false,
2251
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2252
+ "description": "",
2253
+ "target_delimiter": " ",
2254
+ "fewshot_delimiter": "\n\n",
2255
+ "num_fewshot": 0,
2256
+ "metric_list": [
2257
+ {
2258
+ "metric": "acc",
2259
+ "aggregation": "mean",
2260
+ "higher_is_better": true
2261
+ }
2262
+ ],
2263
+ "output_type": "multiple_choice",
2264
+ "repeats": 1,
2265
+ "should_decontaminate": true,
2266
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2267
+ "metadata": {
2268
+ "version": 1.0
2269
+ }
2270
+ },
2271
+ "blimp_wh_vs_that_with_gap_long_distance": {
2272
+ "task": "blimp_wh_vs_that_with_gap_long_distance",
2273
+ "dataset_path": "blimp",
2274
+ "dataset_name": "wh_vs_that_with_gap_long_distance",
2275
+ "validation_split": "train",
2276
+ "doc_to_text": "",
2277
+ "doc_to_target": 0,
2278
+ "unsafe_code": false,
2279
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2280
+ "description": "",
2281
+ "target_delimiter": " ",
2282
+ "fewshot_delimiter": "\n\n",
2283
+ "num_fewshot": 0,
2284
+ "metric_list": [
2285
+ {
2286
+ "metric": "acc",
2287
+ "aggregation": "mean",
2288
+ "higher_is_better": true
2289
+ }
2290
+ ],
2291
+ "output_type": "multiple_choice",
2292
+ "repeats": 1,
2293
+ "should_decontaminate": true,
2294
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2295
+ "metadata": {
2296
+ "version": 1.0
2297
+ }
2298
+ }
2299
+ },
2300
+ "versions": {
2301
+ "blimp": 2.0,
2302
+ "blimp_adjunct_island": 1.0,
2303
+ "blimp_anaphor_gender_agreement": 1.0,
2304
+ "blimp_anaphor_number_agreement": 1.0,
2305
+ "blimp_animate_subject_passive": 1.0,
2306
+ "blimp_animate_subject_trans": 1.0,
2307
+ "blimp_causative": 1.0,
2308
+ "blimp_complex_NP_island": 1.0,
2309
+ "blimp_coordinate_structure_constraint_complex_left_branch": 1.0,
2310
+ "blimp_coordinate_structure_constraint_object_extraction": 1.0,
2311
+ "blimp_determiner_noun_agreement_1": 1.0,
2312
+ "blimp_determiner_noun_agreement_2": 1.0,
2313
+ "blimp_determiner_noun_agreement_irregular_1": 1.0,
2314
+ "blimp_determiner_noun_agreement_irregular_2": 1.0,
2315
+ "blimp_determiner_noun_agreement_with_adj_2": 1.0,
2316
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": 1.0,
2317
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": 1.0,
2318
+ "blimp_determiner_noun_agreement_with_adjective_1": 1.0,
2319
+ "blimp_distractor_agreement_relational_noun": 1.0,
2320
+ "blimp_distractor_agreement_relative_clause": 1.0,
2321
+ "blimp_drop_argument": 1.0,
2322
+ "blimp_ellipsis_n_bar_1": 1.0,
2323
+ "blimp_ellipsis_n_bar_2": 1.0,
2324
+ "blimp_existential_there_object_raising": 1.0,
2325
+ "blimp_existential_there_quantifiers_1": 1.0,
2326
+ "blimp_existential_there_quantifiers_2": 1.0,
2327
+ "blimp_existential_there_subject_raising": 1.0,
2328
+ "blimp_expletive_it_object_raising": 1.0,
2329
+ "blimp_inchoative": 1.0,
2330
+ "blimp_intransitive": 1.0,
2331
+ "blimp_irregular_past_participle_adjectives": 1.0,
2332
+ "blimp_irregular_past_participle_verbs": 1.0,
2333
+ "blimp_irregular_plural_subject_verb_agreement_1": 1.0,
2334
+ "blimp_irregular_plural_subject_verb_agreement_2": 1.0,
2335
+ "blimp_left_branch_island_echo_question": 1.0,
2336
+ "blimp_left_branch_island_simple_question": 1.0,
2337
+ "blimp_matrix_question_npi_licensor_present": 1.0,
2338
+ "blimp_npi_present_1": 1.0,
2339
+ "blimp_npi_present_2": 1.0,
2340
+ "blimp_only_npi_licensor_present": 1.0,
2341
+ "blimp_only_npi_scope": 1.0,
2342
+ "blimp_passive_1": 1.0,
2343
+ "blimp_passive_2": 1.0,
2344
+ "blimp_principle_A_c_command": 1.0,
2345
+ "blimp_principle_A_case_1": 1.0,
2346
+ "blimp_principle_A_case_2": 1.0,
2347
+ "blimp_principle_A_domain_1": 1.0,
2348
+ "blimp_principle_A_domain_2": 1.0,
2349
+ "blimp_principle_A_domain_3": 1.0,
2350
+ "blimp_principle_A_reconstruction": 1.0,
2351
+ "blimp_regular_plural_subject_verb_agreement_1": 1.0,
2352
+ "blimp_regular_plural_subject_verb_agreement_2": 1.0,
2353
+ "blimp_sentential_negation_npi_licensor_present": 1.0,
2354
+ "blimp_sentential_negation_npi_scope": 1.0,
2355
+ "blimp_sentential_subject_island": 1.0,
2356
+ "blimp_superlative_quantifiers_1": 1.0,
2357
+ "blimp_superlative_quantifiers_2": 1.0,
2358
+ "blimp_tough_vs_raising_1": 1.0,
2359
+ "blimp_tough_vs_raising_2": 1.0,
2360
+ "blimp_transitive": 1.0,
2361
+ "blimp_wh_island": 1.0,
2362
+ "blimp_wh_questions_object_gap": 1.0,
2363
+ "blimp_wh_questions_subject_gap": 1.0,
2364
+ "blimp_wh_questions_subject_gap_long_distance": 1.0,
2365
+ "blimp_wh_vs_that_no_gap": 1.0,
2366
+ "blimp_wh_vs_that_no_gap_long_distance": 1.0,
2367
+ "blimp_wh_vs_that_with_gap": 1.0,
2368
+ "blimp_wh_vs_that_with_gap_long_distance": 1.0
2369
+ },
2370
+ "n-shot": {
2371
+ "blimp_adjunct_island": 0,
2372
+ "blimp_anaphor_gender_agreement": 0,
2373
+ "blimp_anaphor_number_agreement": 0,
2374
+ "blimp_animate_subject_passive": 0,
2375
+ "blimp_animate_subject_trans": 0,
2376
+ "blimp_causative": 0,
2377
+ "blimp_complex_NP_island": 0,
2378
+ "blimp_coordinate_structure_constraint_complex_left_branch": 0,
2379
+ "blimp_coordinate_structure_constraint_object_extraction": 0,
2380
+ "blimp_determiner_noun_agreement_1": 0,
2381
+ "blimp_determiner_noun_agreement_2": 0,
2382
+ "blimp_determiner_noun_agreement_irregular_1": 0,
2383
+ "blimp_determiner_noun_agreement_irregular_2": 0,
2384
+ "blimp_determiner_noun_agreement_with_adj_2": 0,
2385
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": 0,
2386
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": 0,
2387
+ "blimp_determiner_noun_agreement_with_adjective_1": 0,
2388
+ "blimp_distractor_agreement_relational_noun": 0,
2389
+ "blimp_distractor_agreement_relative_clause": 0,
2390
+ "blimp_drop_argument": 0,
2391
+ "blimp_ellipsis_n_bar_1": 0,
2392
+ "blimp_ellipsis_n_bar_2": 0,
2393
+ "blimp_existential_there_object_raising": 0,
2394
+ "blimp_existential_there_quantifiers_1": 0,
2395
+ "blimp_existential_there_quantifiers_2": 0,
2396
+ "blimp_existential_there_subject_raising": 0,
2397
+ "blimp_expletive_it_object_raising": 0,
2398
+ "blimp_inchoative": 0,
2399
+ "blimp_intransitive": 0,
2400
+ "blimp_irregular_past_participle_adjectives": 0,
2401
+ "blimp_irregular_past_participle_verbs": 0,
2402
+ "blimp_irregular_plural_subject_verb_agreement_1": 0,
2403
+ "blimp_irregular_plural_subject_verb_agreement_2": 0,
2404
+ "blimp_left_branch_island_echo_question": 0,
2405
+ "blimp_left_branch_island_simple_question": 0,
2406
+ "blimp_matrix_question_npi_licensor_present": 0,
2407
+ "blimp_npi_present_1": 0,
2408
+ "blimp_npi_present_2": 0,
2409
+ "blimp_only_npi_licensor_present": 0,
2410
+ "blimp_only_npi_scope": 0,
2411
+ "blimp_passive_1": 0,
2412
+ "blimp_passive_2": 0,
2413
+ "blimp_principle_A_c_command": 0,
2414
+ "blimp_principle_A_case_1": 0,
2415
+ "blimp_principle_A_case_2": 0,
2416
+ "blimp_principle_A_domain_1": 0,
2417
+ "blimp_principle_A_domain_2": 0,
2418
+ "blimp_principle_A_domain_3": 0,
2419
+ "blimp_principle_A_reconstruction": 0,
2420
+ "blimp_regular_plural_subject_verb_agreement_1": 0,
2421
+ "blimp_regular_plural_subject_verb_agreement_2": 0,
2422
+ "blimp_sentential_negation_npi_licensor_present": 0,
2423
+ "blimp_sentential_negation_npi_scope": 0,
2424
+ "blimp_sentential_subject_island": 0,
2425
+ "blimp_superlative_quantifiers_1": 0,
2426
+ "blimp_superlative_quantifiers_2": 0,
2427
+ "blimp_tough_vs_raising_1": 0,
2428
+ "blimp_tough_vs_raising_2": 0,
2429
+ "blimp_transitive": 0,
2430
+ "blimp_wh_island": 0,
2431
+ "blimp_wh_questions_object_gap": 0,
2432
+ "blimp_wh_questions_subject_gap": 0,
2433
+ "blimp_wh_questions_subject_gap_long_distance": 0,
2434
+ "blimp_wh_vs_that_no_gap": 0,
2435
+ "blimp_wh_vs_that_no_gap_long_distance": 0,
2436
+ "blimp_wh_vs_that_with_gap": 0,
2437
+ "blimp_wh_vs_that_with_gap_long_distance": 0
2438
+ },
2439
+ "higher_is_better": {
2440
+ "blimp": {
2441
+ "acc": true
2442
+ },
2443
+ "blimp_adjunct_island": {
2444
+ "acc": true
2445
+ },
2446
+ "blimp_anaphor_gender_agreement": {
2447
+ "acc": true
2448
+ },
2449
+ "blimp_anaphor_number_agreement": {
2450
+ "acc": true
2451
+ },
2452
+ "blimp_animate_subject_passive": {
2453
+ "acc": true
2454
+ },
2455
+ "blimp_animate_subject_trans": {
2456
+ "acc": true
2457
+ },
2458
+ "blimp_causative": {
2459
+ "acc": true
2460
+ },
2461
+ "blimp_complex_NP_island": {
2462
+ "acc": true
2463
+ },
2464
+ "blimp_coordinate_structure_constraint_complex_left_branch": {
2465
+ "acc": true
2466
+ },
2467
+ "blimp_coordinate_structure_constraint_object_extraction": {
2468
+ "acc": true
2469
+ },
2470
+ "blimp_determiner_noun_agreement_1": {
2471
+ "acc": true
2472
+ },
2473
+ "blimp_determiner_noun_agreement_2": {
2474
+ "acc": true
2475
+ },
2476
+ "blimp_determiner_noun_agreement_irregular_1": {
2477
+ "acc": true
2478
+ },
2479
+ "blimp_determiner_noun_agreement_irregular_2": {
2480
+ "acc": true
2481
+ },
2482
+ "blimp_determiner_noun_agreement_with_adj_2": {
2483
+ "acc": true
2484
+ },
2485
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
2486
+ "acc": true
2487
+ },
2488
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
2489
+ "acc": true
2490
+ },
2491
+ "blimp_determiner_noun_agreement_with_adjective_1": {
2492
+ "acc": true
2493
+ },
2494
+ "blimp_distractor_agreement_relational_noun": {
2495
+ "acc": true
2496
+ },
2497
+ "blimp_distractor_agreement_relative_clause": {
2498
+ "acc": true
2499
+ },
2500
+ "blimp_drop_argument": {
2501
+ "acc": true
2502
+ },
2503
+ "blimp_ellipsis_n_bar_1": {
2504
+ "acc": true
2505
+ },
2506
+ "blimp_ellipsis_n_bar_2": {
2507
+ "acc": true
2508
+ },
2509
+ "blimp_existential_there_object_raising": {
2510
+ "acc": true
2511
+ },
2512
+ "blimp_existential_there_quantifiers_1": {
2513
+ "acc": true
2514
+ },
2515
+ "blimp_existential_there_quantifiers_2": {
2516
+ "acc": true
2517
+ },
2518
+ "blimp_existential_there_subject_raising": {
2519
+ "acc": true
2520
+ },
2521
+ "blimp_expletive_it_object_raising": {
2522
+ "acc": true
2523
+ },
2524
+ "blimp_inchoative": {
2525
+ "acc": true
2526
+ },
2527
+ "blimp_intransitive": {
2528
+ "acc": true
2529
+ },
2530
+ "blimp_irregular_past_participle_adjectives": {
2531
+ "acc": true
2532
+ },
2533
+ "blimp_irregular_past_participle_verbs": {
2534
+ "acc": true
2535
+ },
2536
+ "blimp_irregular_plural_subject_verb_agreement_1": {
2537
+ "acc": true
2538
+ },
2539
+ "blimp_irregular_plural_subject_verb_agreement_2": {
2540
+ "acc": true
2541
+ },
2542
+ "blimp_left_branch_island_echo_question": {
2543
+ "acc": true
2544
+ },
2545
+ "blimp_left_branch_island_simple_question": {
2546
+ "acc": true
2547
+ },
2548
+ "blimp_matrix_question_npi_licensor_present": {
2549
+ "acc": true
2550
+ },
2551
+ "blimp_npi_present_1": {
2552
+ "acc": true
2553
+ },
2554
+ "blimp_npi_present_2": {
2555
+ "acc": true
2556
+ },
2557
+ "blimp_only_npi_licensor_present": {
2558
+ "acc": true
2559
+ },
2560
+ "blimp_only_npi_scope": {
2561
+ "acc": true
2562
+ },
2563
+ "blimp_passive_1": {
2564
+ "acc": true
2565
+ },
2566
+ "blimp_passive_2": {
2567
+ "acc": true
2568
+ },
2569
+ "blimp_principle_A_c_command": {
2570
+ "acc": true
2571
+ },
2572
+ "blimp_principle_A_case_1": {
2573
+ "acc": true
2574
+ },
2575
+ "blimp_principle_A_case_2": {
2576
+ "acc": true
2577
+ },
2578
+ "blimp_principle_A_domain_1": {
2579
+ "acc": true
2580
+ },
2581
+ "blimp_principle_A_domain_2": {
2582
+ "acc": true
2583
+ },
2584
+ "blimp_principle_A_domain_3": {
2585
+ "acc": true
2586
+ },
2587
+ "blimp_principle_A_reconstruction": {
2588
+ "acc": true
2589
+ },
2590
+ "blimp_regular_plural_subject_verb_agreement_1": {
2591
+ "acc": true
2592
+ },
2593
+ "blimp_regular_plural_subject_verb_agreement_2": {
2594
+ "acc": true
2595
+ },
2596
+ "blimp_sentential_negation_npi_licensor_present": {
2597
+ "acc": true
2598
+ },
2599
+ "blimp_sentential_negation_npi_scope": {
2600
+ "acc": true
2601
+ },
2602
+ "blimp_sentential_subject_island": {
2603
+ "acc": true
2604
+ },
2605
+ "blimp_superlative_quantifiers_1": {
2606
+ "acc": true
2607
+ },
2608
+ "blimp_superlative_quantifiers_2": {
2609
+ "acc": true
2610
+ },
2611
+ "blimp_tough_vs_raising_1": {
2612
+ "acc": true
2613
+ },
2614
+ "blimp_tough_vs_raising_2": {
2615
+ "acc": true
2616
+ },
2617
+ "blimp_transitive": {
2618
+ "acc": true
2619
+ },
2620
+ "blimp_wh_island": {
2621
+ "acc": true
2622
+ },
2623
+ "blimp_wh_questions_object_gap": {
2624
+ "acc": true
2625
+ },
2626
+ "blimp_wh_questions_subject_gap": {
2627
+ "acc": true
2628
+ },
2629
+ "blimp_wh_questions_subject_gap_long_distance": {
2630
+ "acc": true
2631
+ },
2632
+ "blimp_wh_vs_that_no_gap": {
2633
+ "acc": true
2634
+ },
2635
+ "blimp_wh_vs_that_no_gap_long_distance": {
2636
+ "acc": true
2637
+ },
2638
+ "blimp_wh_vs_that_with_gap": {
2639
+ "acc": true
2640
+ },
2641
+ "blimp_wh_vs_that_with_gap_long_distance": {
2642
+ "acc": true
2643
+ }
2644
+ },
2645
+ "n-samples": {
2646
+ "blimp_adjunct_island": {
2647
+ "original": 1000,
2648
+ "effective": 1000
2649
+ },
2650
+ "blimp_anaphor_gender_agreement": {
2651
+ "original": 1000,
2652
+ "effective": 1000
2653
+ },
2654
+ "blimp_anaphor_number_agreement": {
2655
+ "original": 1000,
2656
+ "effective": 1000
2657
+ },
2658
+ "blimp_animate_subject_passive": {
2659
+ "original": 1000,
2660
+ "effective": 1000
2661
+ },
2662
+ "blimp_animate_subject_trans": {
2663
+ "original": 1000,
2664
+ "effective": 1000
2665
+ },
2666
+ "blimp_causative": {
2667
+ "original": 1000,
2668
+ "effective": 1000
2669
+ },
2670
+ "blimp_complex_NP_island": {
2671
+ "original": 1000,
2672
+ "effective": 1000
2673
+ },
2674
+ "blimp_coordinate_structure_constraint_complex_left_branch": {
2675
+ "original": 1000,
2676
+ "effective": 1000
2677
+ },
2678
+ "blimp_coordinate_structure_constraint_object_extraction": {
2679
+ "original": 1000,
2680
+ "effective": 1000
2681
+ },
2682
+ "blimp_determiner_noun_agreement_1": {
2683
+ "original": 1000,
2684
+ "effective": 1000
2685
+ },
2686
+ "blimp_determiner_noun_agreement_2": {
2687
+ "original": 1000,
2688
+ "effective": 1000
2689
+ },
2690
+ "blimp_determiner_noun_agreement_irregular_1": {
2691
+ "original": 1000,
2692
+ "effective": 1000
2693
+ },
2694
+ "blimp_determiner_noun_agreement_irregular_2": {
2695
+ "original": 1000,
2696
+ "effective": 1000
2697
+ },
2698
+ "blimp_determiner_noun_agreement_with_adj_2": {
2699
+ "original": 1000,
2700
+ "effective": 1000
2701
+ },
2702
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
2703
+ "original": 1000,
2704
+ "effective": 1000
2705
+ },
2706
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
2707
+ "original": 1000,
2708
+ "effective": 1000
2709
+ },
2710
+ "blimp_determiner_noun_agreement_with_adjective_1": {
2711
+ "original": 1000,
2712
+ "effective": 1000
2713
+ },
2714
+ "blimp_distractor_agreement_relational_noun": {
2715
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_distractor_agreement_relative_clause": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_drop_argument": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_ellipsis_n_bar_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_ellipsis_n_bar_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_existential_there_object_raising": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_existential_there_quantifiers_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_existential_there_quantifiers_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_existential_there_subject_raising": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_expletive_it_object_raising": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_inchoative": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_intransitive": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_irregular_past_participle_adjectives": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_irregular_past_participle_verbs": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_irregular_plural_subject_verb_agreement_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_irregular_plural_subject_verb_agreement_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_left_branch_island_echo_question": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_left_branch_island_simple_question": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_matrix_question_npi_licensor_present": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_npi_present_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_npi_present_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_only_npi_licensor_present": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_only_npi_scope": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_passive_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_passive_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_principle_A_c_command": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_principle_A_case_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_principle_A_case_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_principle_A_domain_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_principle_A_domain_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_principle_A_domain_3": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_principle_A_reconstruction": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_regular_plural_subject_verb_agreement_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_regular_plural_subject_verb_agreement_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_sentential_negation_npi_licensor_present": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_sentential_negation_npi_scope": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_sentential_subject_island": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_superlative_quantifiers_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_superlative_quantifiers_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_tough_vs_raising_1": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_tough_vs_raising_2": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_transitive": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_wh_island": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_wh_questions_object_gap": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_wh_questions_subject_gap": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_wh_questions_subject_gap_long_distance": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_wh_vs_that_no_gap": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_wh_vs_that_no_gap_long_distance": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_wh_vs_that_with_gap": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "blimp_wh_vs_that_with_gap_long_distance": {
+ "original": 1000,
+ "effective": 1000
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=outputs/fw57M-tied/42/frequency_64000/.cache/eval_model",
+ "model_num_parameters": 105785088,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "778f288",
+ "date": 1749552653.540547,
+ "pretty_env_info": "'NoneType' object has no attribute 'splitlines'",
+ "transformers_version": "4.52.4",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|padding|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "1"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 1,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "outputs/fw57M-tied/42/frequency_64000/.cache/eval_model",
+ "model_name_sanitized": "outputs__fw57M-tied__42__frequency_64000__.cache__eval_model",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 99068.118683634,
+ "end_time": 100062.752246584,
+ "total_evaluation_time_seconds": "994.6335629499954"
+ }
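
The timing fields at the end of the results file can be cross-checked against each other. The sketch below is a minimal illustration, not part of the commit: it copies the four values from the JSON above into a hypothetical `results` dict, assumes `date` is a Unix epoch timestamp, and treats `start_time`/`end_time` as readings of a relative clock (they are far too small to be epoch seconds), so only their difference is meaningful.

```python
from datetime import datetime, timezone

# Values copied from the results JSON above; the dict name is illustrative.
results = {
    "date": 1749552653.540547,
    "start_time": 99068.118683634,
    "end_time": 100062.752246584,
    "total_evaluation_time_seconds": "994.6335629499954",  # stored as a string
}

# Elapsed time from the two clock readings.
elapsed = results["end_time"] - results["start_time"]

# The reported total is a string, so convert before comparing.
reported = float(results["total_evaluation_time_seconds"])

# Assumption: "date" is seconds since the Unix epoch.
run_date = datetime.fromtimestamp(results["date"], tz=timezone.utc)

print(f"elapsed: {elapsed:.2f} s ({elapsed / 60:.1f} min)")
print(f"reported: {reported:.2f} s")
print(f"evaluation date (UTC): {run_date.isoformat()}")
```

Under these assumptions the two durations agree to well below a microsecond, and the evaluation is dated 10 June 2025 (UTC).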
hparams.yaml ADDED
@@ -0,0 +1,86 @@
+ loggers:
+   tensorboard:
+     _target_: src.trainer.TensorBoardLogger
+     save_dir: ./
+     name: ''
+     version: null
+ callbacks:
+   lr_monitor:
+     _target_: src.callbacks.lr_monitor.SimpleLearningRateMonitor
+   grad_norm:
+     _target_: src.callbacks.grad_norm.GradNorm
+     norm_type: 2
+     group_separator: /
+     histogram_freq: null
+     check_clipping: false
+     log_weight_distribution: false
+     only_total: true
+   speed_monitor:
+     _target_: src.callbacks.speed_monitor.SpeedMonitor
+   grad_accum:
+     _target_: src.callbacks.gradient_accumulation.GradientAccumulationScheduler
+     scheduling:
+       0: 2
+   model_checkpoint:
+     _target_: src.callbacks.model_checkpoint.ModelCheckpoint
+     dirpath: .checkpoints
+     filename: '{step}'
+     enable_version_counter: false
+     every_n_train_steps: 2000
+     save_top_k: -1
+     save_last: link
+     verbose: true
+     save_initial_checkpoint: true
+ out_parent_folder: model_train
+ tok_name: frequency_64000
+ run_folder: .
+ dataset: finewebedu-20B
+ pwd: /home/zg258/rds/hpc-work/infotokenization
+ train_data_path: /home/zg258/rds/hpc-work/infotokenization/data/finewebedu-20B/frequency_64000/train
+ val_data_path: /home/zg258/rds/hpc-work/infotokenization/data/finewebedu-20B/frequency_64000/validation
+ model: fw57M-tied
+ resume_from_checkpoint: .checkpoints/last.ckpt
+ save_initial_checkpoint: true
+ seed: 42
+ torch_compile: true
+ data:
+   batch_size: 16
+   eval_batch_size: 64
+   shuffle: true
+   drop_last: false
+   num_workers: 12
+   pin_memory: true
+   persistent_workers: false
+   prefetch_factor: 2
+   multiprocessing_context: null
+ optim:
+   optim_name: adamw
+   lr: 0.0006
+   weight_decay: 0.01
+   optim_kwargs:
+     fused: true
+     eps: 1.0e-08
+     betas:
+     - 0.9
+     - 0.95
+   scheduler_name: warmup_stable_decay
+   num_warmup_steps: 2000
+   scheduler_kwargs:
+     num_stable_steps: 44000
+     num_decay_steps: 4000
+     min_lr_ratio: 0.01
+ trainer:
+   accelerator: gpu
+   devices: 4
+   precision: bf16-true
+   deterministic: false
+   log_every_n_steps: 1
+   enable_progress_bar: true
+   fast_dev_run: false
+   gradient_clip_val: 1.0
+   gradient_clip_algorithm: norm
+   val_check_interval: 2000
+   max_steps: 50000
+   limit_val_batches: 500
+ evaluation:
+   blimp: true
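
The `optim` block describes a `warmup_stable_decay` schedule whose three phases (2000 + 44000 + 4000 steps) add up to the `max_steps: 50000` set under `trainer`. The sketch below illustrates how such a schedule could map a step to a learning rate using the values from this config. The function name `wsd_lr` is hypothetical, and the linear shape of both the warmup and the decay is an assumption: the repository's actual `warmup_stable_decay` implementation is not shown in this commit and may use a different curve.

```python
def wsd_lr(
    step: int,
    peak_lr: float = 0.0006,      # optim.lr
    warmup: int = 2000,           # optim.num_warmup_steps
    stable: int = 44000,          # optim.scheduler_kwargs.num_stable_steps
    decay: int = 4000,            # optim.scheduler_kwargs.num_decay_steps
    min_lr_ratio: float = 0.01,   # optim.scheduler_kwargs.min_lr_ratio
) -> float:
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 and linear decay down to
    peak_lr * min_lr_ratio; the real scheduler may differ.
    """
    min_lr = peak_lr * min_lr_ratio
    if step < warmup:
        # Warmup phase: ramp linearly from 0 to the peak.
        return peak_lr * step / warmup
    if step < warmup + stable:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: interpolate linearly from the peak to the floor,
    # clamping so steps past the end stay at the floor.
    progress = min(1.0, (step - warmup - stable) / decay)
    return peak_lr + (min_lr - peak_lr) * progress


for s in (0, 1000, 2000, 46000, 48000, 50000):
    print(s, wsd_lr(s))
```

Under these assumptions the rate holds at 6e-4 from step 2000 to step 46000 and reaches the floor of 6e-6 exactly at step 50000.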
tb_logs.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:19a6326b4af70a8f64f77180f97e0aa6af9ce7e16e0ac7c6e8a692e58d79edf8
+ size 281342
version_0/events.out.tfevents.1749402624.gpu-q-33.2148233.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aa0324597ddee88455ac2a5c4fcf477252f78b6d7a2d7ac1b12db93f500cd159
+ size 29904723
version_0/hparams.yaml ADDED
@@ -0,0 +1,31 @@
+ dataloader_config: !!python/object:src.data.DataloaderConfig
+   batch_size: 16
+   drop_last: false
+   eval_batch_size: 64
+   multiprocessing_context: null
+   num_workers: 12
+   persistent_workers: false
+   pin_memory: true
+   prefetch_factor: 2
+   shuffle: true
+ eod_token_id: 1
+ max_position_embeddings: 2048
+ optim_config: !!python/object:src.trainer.OptimCofig
+   keller_kwargs: {}
+   lr: 0.0006
+   num_warmup_steps: 2000
+   optim_kwargs:
+     betas:
+     - 0.9
+     - 0.95
+     eps: 1.0e-08
+     fused: true
+   optim_name: adamw
+   scheduler_kwargs:
+     min_lr_ratio: 0.01
+     num_decay_steps: 4000
+     num_stable_steps: 44000
+   scheduler_name: warmup_stable_decay
+   weight_decay: 0.01
+ train_data_path: /home/zg258/rds/hpc-work/infotokenization/data/finewebedu-20B/frequency_64000/train
+ val_data_path: /home/zg258/rds/hpc-work/infotokenization/data/finewebedu-20B/frequency_64000/validation
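
Taken together, the two hparams files give enough to estimate the training budget. The arithmetic below is a rough sketch under three assumptions that the commit does not confirm: `batch_size` is per device, every training sequence is packed to the full `max_position_embeddings` length, and `max_steps` counts optimizer steps (after gradient accumulation). If any of these does not hold, the figures are only an upper bound.

```python
# Values taken from hparams.yaml and version_0/hparams.yaml.
per_device_batch = 16    # data.batch_size (assumed to be per device)
devices = 4              # trainer.devices
grad_accum = 2           # callbacks.grad_accum.scheduling {0: 2}
seq_len = 2048           # max_position_embeddings (assumes fully packed sequences)
max_steps = 50000        # trainer.max_steps (assumed to count optimizer steps)

# Sequences contributing to one optimizer update.
sequences_per_step = per_device_batch * devices * grad_accum

# Tokens per optimizer update and over the whole run.
tokens_per_step = sequences_per_step * seq_len
total_tokens = tokens_per_step * max_steps

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"tokens per optimizer step:    {tokens_per_step:,}")
print(f"tokens over the full run:     {total_tokens:,} (~{total_tokens / 1e9:.1f}B)")
```

Under these assumptions the run processes about 13.1B tokens, which would fit within the 20B tokens suggested by the dataset name `finewebedu-20B`.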
version_1/events.out.tfevents.1749467794.gpu-q-8.1336422.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:818c7dd626480a25a4499aa8b94a89c0ef262a8786b6c46be4ff34b3b8fa4479
+ size 4057405
version_1/hparams.yaml ADDED
@@ -0,0 +1,31 @@
+ dataloader_config: !!python/object:src.data.DataloaderConfig
+   batch_size: 16
+   drop_last: false
+   eval_batch_size: 64
+   multiprocessing_context: null
+   num_workers: 12
+   persistent_workers: false
+   pin_memory: true
+   prefetch_factor: 2
+   shuffle: true
+ eod_token_id: 1
+ max_position_embeddings: 2048
+ optim_config: !!python/object:src.trainer.OptimCofig
+   keller_kwargs: {}
+   lr: 0.0006
+   num_warmup_steps: 2000
+   optim_kwargs:
+     betas:
+     - 0.9
+     - 0.95
+     eps: 1.0e-08
+     fused: true
+   optim_name: adamw
+   scheduler_kwargs:
+     min_lr_ratio: 0.01
+     num_decay_steps: 4000
+     num_stable_steps: 44000
+   scheduler_name: warmup_stable_decay
+   weight_decay: 0.01
+ train_data_path: /home/zg258/rds/hpc-work/infotokenization/data/finewebedu-20B/frequency_64000/train
+ val_data_path: /home/zg258/rds/hpc-work/infotokenization/data/finewebedu-20B/frequency_64000/validation