Benchmarks (HellaSwag, IFEval, MMLU, SWAG, XStoryCloze)
Ran a couple of cheap benchmarks that seemed relevant without breaking the bank (skipping GPQA, LongBench, BBH, and MMLU-Pro, since they all cost $$$).
If you guys have any more relevant benchmarks, send them my way!
tl;dr: Results show marginal losses/gains, but keep in mind that I've met my tuning objectives while retaining the base model's strengths.
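The config lines below are standard lm-evaluation-harness output, so anyone who wants to reproduce a run can do it through the harness's Python API. A minimal sketch, assuming a recent lm-eval install (the `lm_eval` CLI with the same arguments works too):

```python
# Minimal sketch of one benchmark run via lm-evaluation-harness.
# Model, task, and batch size are taken from the config lines below.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=64,
)
print(results["results"]["hellaswag"])  # acc, acc_norm, and their stderrs
```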
HellaSwag
Tasks that predict the most plausible ending of stories or scenarios, testing reading comprehension and common sense.
Mistral 24B 3.2
hf (pretrained=anthracite-core/Mistral-Small-3.2-24B-Instruct-2506-Text-Only,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.6396|± |0.0048|
| | |none | 0|acc_norm|↑ |0.8403|± |0.0037|
Cydonia 24B v4.1
hf (pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.6471|± |0.0048|
| | |none | 0|acc_norm|↑ |0.8449|± |0.0036|
Essentially unaffected; if anything, improved slightly, just past the margin of error. Seems like a good metric for reading comprehension and common sense.
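If you want to sanity-check claims like that yourself, here's a rough back-of-the-envelope check (my own helper, not harness output; it assumes the two runs are independent):

```python
import math

def delta_in_stderr(acc_a, se_a, acc_b, se_b):
    """Return the accuracy delta and the combined standard error of two
    independent runs; a delta under ~2x the combined SE is hard to call."""
    return abs(acc_b - acc_a), math.sqrt(se_a**2 + se_b**2)

# HellaSwag acc: base 0.6396 +/- 0.0048 vs. Cydonia 0.6471 +/- 0.0048
delta, se = delta_in_stderr(0.6396, 0.0048, 0.6471, 0.0048)
print(delta, se)  # ~0.0075 vs ~0.0068 -> just over one combined SE
```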
IFEval
Focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times".
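To make "verifiable" concrete: the two example instructions above can be checked deterministically against the output. A toy sketch with hypothetical checkers of my own (IFEval's real checks are stricter):

```python
# Toy checkers for the two "verifiable instructions" quoted above.
# Hypothetical helpers for illustration; not IFEval's actual code.
def min_word_count(response: str, n: int = 400) -> bool:
    """'write in more than 400 words'"""
    return len(response.split()) > n

def keyword_mentions(response: str, keyword: str = "AI", n: int = 3) -> bool:
    """'mention the keyword of AI at least 3 times'"""
    return response.count(keyword) >= n
```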
Mistral 24B 3.2
hf (pretrained=anthracite-core/Mistral-Small-3.2-24B-Instruct-2506-Text-Only,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval| 4|none | 0|inst_level_loose_acc |↑ |0.7374|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.6655|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.6470|± |0.0206|
| | |none | 0|prompt_level_strict_acc|↑ |0.5582|± |0.0214|
Cydonia 24B v4.1
hf (pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval| 4|none | 0|inst_level_loose_acc |↑ |0.7098|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.6571|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.6303|± |0.0208|
| | |none | 0|prompt_level_strict_acc|↑ |0.5638|± |0.0213|
Barely affected, with slightly better prompt-level strict accuracy. I associate IFEval with a model's ability to embody characters accurately.
MMLU
Massive Multitask Language Understanding benchmark for broad domain language evaluation.
Mistral 24B 3.2
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.7739|± |0.0033|
| - humanities | 2|none | |acc |↑ |0.6903|± |0.0063|
| - other | 2|none | |acc |↑ |0.8275|± |0.0065|
| - social sciences| 2|none | |acc |↑ |0.8700|± |0.0060|
| - stem | 2|none | |acc |↑ |0.7520|± |0.0074|
Cydonia 24B v4.1
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.7736|± |0.0033|
| - humanities | 2|none | |acc |↑ |0.6876|± |0.0063|
| - other | 2|none | |acc |↑ |0.8272|± |0.0065|
| - social sciences| 2|none | |acc |↑ |0.8720|± |0.0059|
| - stem | 2|none | |acc |↑ |0.7533|± |0.0073|
Good ol' MMLU is barely affected, well within margins. It tests knowledge and intelligence.
SWAG
Situations With Adversarial Generations: predicting what happens next in a scene, with situations drawn from video captions.
Mistral 24B 3.2
hf (pretrained=anthracite-core/Mistral-Small-3.2-24B-Instruct-2506-Text-Only,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|Tasks|Version|Filter|n-shot| Metric | |Value | |Stderr|
|-----|------:|------|-----:|--------|---|-----:|---|-----:|
|swag | 1|none | 0|acc |↑ |0.5920|± |0.0035|
| | |none | 0|acc_norm|↑ |0.7906|± |0.0029|
Cydonia 24B v4.1
hf (pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|Tasks|Version|Filter|n-shot| Metric | |Value | |Stderr|
|-----|------:|------|-----:|--------|---|-----:|---|-----:|
|swag | 1|none | 0|acc |↑ |0.5950|± |0.0035|
| | |none | 0|acc_norm|↑ |0.7921|± |0.0029|
Within margins, barely affected. Just like HellaSwag, it tests reading comprehension and common sense.
XStoryCloze
Cross-lingual narrative understanding tasks to predict story endings in multiple languages.
Mistral 24B 3.2
hf (pretrained=anthracite-core/Mistral-Small-3.2-24B-Instruct-2506-Text-Only,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|-----------------|------:|------|-----:|------|---|-----:|---|-----:|
|xstorycloze | 1|none | |acc |↑ |0.6510|± |0.0036|
| - xstorycloze_ar| 1|none | 0|acc |↑ |0.6254|± |0.0125|
| - xstorycloze_en| 1|none | 0|acc |↑ |0.8081|± |0.0101|
| - xstorycloze_es| 1|none | 0|acc |↑ |0.7604|± |0.0110|
| - xstorycloze_eu| 1|none | 0|acc |↑ |0.6128|± |0.0125|
| - xstorycloze_hi| 1|none | 0|acc |↑ |0.5758|± |0.0127|
| - xstorycloze_id| 1|none | 0|acc |↑ |0.7048|± |0.0117|
| - xstorycloze_my| 1|none | 0|acc |↑ |0.4772|± |0.0129|
| - xstorycloze_ru| 1|none | 0|acc |↑ |0.7498|± |0.0111|
| - xstorycloze_sw| 1|none | 0|acc |↑ |0.5917|± |0.0126|
| - xstorycloze_te| 1|none | 0|acc |↑ |0.5467|± |0.0128|
| - xstorycloze_zh| 1|none | 0|acc |↑ |0.7081|± |0.0117|
| Groups |Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----------|------:|------|------|------|---|----:|---|-----:|
|xstorycloze| 1|none | |acc |↑ |0.651|± |0.0036|
Cydonia 24B v4.1
hf (pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|-----------------|------:|------|-----:|------|---|-----:|---|-----:|
|xstorycloze | 1|none | |acc |↑ |0.6537|± |0.0036|
| - xstorycloze_ar| 1|none | 0|acc |↑ |0.6373|± |0.0124|
| - xstorycloze_en| 1|none | 0|acc |↑ |0.8081|± |0.0101|
| - xstorycloze_es| 1|none | 0|acc |↑ |0.7571|± |0.0110|
| - xstorycloze_eu| 1|none | 0|acc |↑ |0.6069|± |0.0126|
| - xstorycloze_hi| 1|none | 0|acc |↑ |0.5983|± |0.0126|
| - xstorycloze_id| 1|none | 0|acc |↑ |0.6989|± |0.0118|
| - xstorycloze_my| 1|none | 0|acc |↑ |0.4765|± |0.0129|
| - xstorycloze_ru| 1|none | 0|acc |↑ |0.7518|± |0.0111|
| - xstorycloze_sw| 1|none | 0|acc |↑ |0.5884|± |0.0127|
| - xstorycloze_te| 1|none | 0|acc |↑ |0.5520|± |0.0128|
| - xstorycloze_zh| 1|none | 0|acc |↑ |0.7154|± |0.0116|
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|-----------|------:|------|------|------|---|-----:|---|-----:|
|xstorycloze| 1|none | |acc |↑ |0.6537|± |0.0036|
Within margins, barely affected. A stress test of multilingual capabilities.
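Quantifying that with the delta_in_stderr helper from the HellaSwag section: even the largest per-language swing (Hindi) doesn't clear the usual significance bar.

```python
# Hindi had the biggest swing: 0.5758 +/- 0.0127 -> 0.5983 +/- 0.0126
delta, se = delta_in_stderr(0.5758, 0.0127, 0.5983, 0.0126)
print(delta, se)  # ~0.0225 vs ~0.0179 combined SE -> ~1.3 SE, within noise
```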
Thanks for running those. I wish it were more common practice in the community.
Interesting. I've always considered Cydonia to do a good job of keeping Mistral's abilities more or less intact (at least where it matters), and this more or less proves it.