Benchmarks (HellaSwag, IFEval, MMLU, SWAG, XStoryCloze)
Ran a couple of cheap benchmarks that seemed relevant without breaking the bank (skipping GPQA, LongBench, BBH, and MMLU-Pro, since they all cost $$$).
If you guys have any more relevant benchmarks, send them my way!
tl;dr: Results show marginal losses/gains, but keep in mind that I've met my tuning objectives while retaining the base model's strengths.
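The config lines below are standard lm-evaluation-harness output, so anyone who wants to reproduce a run can do it through the harness's Python API. A minimal sketch, assuming a recent lm-eval install (the `lm_eval` CLI with the same arguments works too):

```python
# Minimal sketch of one benchmark run via lm-evaluation-harness.
# Model, task, and batch size are taken from the config lines below.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=64,
)
print(results["results"]["hellaswag"])  # acc, acc_norm, and their stderrs
```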
HellaSwag
Tasks that predict the most plausible ending of stories or scenarios, testing reading comprehension and common sense.
Mistral 24B 3.2
hf (pretrained=anthracite-core/Mistral-Small-3.2-24B-Instruct-2506-Text-Only,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.6396|± |0.0048|
| | |none | 0|acc_norm|↑ |0.8403|± |0.0037|
Cydonia 24B v4.1
hf (pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.6471|± |0.0048|
| | |none | 0|acc_norm|↑ |0.8449|± |0.0036|
Essentially unaffected; if anything, improved slightly, just past the margin of error. Seems like a good metric for reading comprehension and common sense.
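If you want to sanity-check claims like that yourself, here's a rough back-of-the-envelope check (my own helper, not harness output; it assumes the two runs are independent):

```python
import math

def delta_in_stderr(acc_a, se_a, acc_b, se_b):
    """Return the accuracy delta and the combined standard error of two
    independent runs; a delta under ~2x the combined SE is hard to call."""
    return abs(acc_b - acc_a), math.sqrt(se_a**2 + se_b**2)

# HellaSwag acc: base 0.6396 +/- 0.0048 vs. Cydonia 0.6471 +/- 0.0048
delta, se = delta_in_stderr(0.6396, 0.0048, 0.6471, 0.0048)
print(delta, se)  # ~0.0075 vs ~0.0068 -> just over one combined SE
```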
IFEval
Focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times".
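To make "verifiable" concrete: the two example instructions above can be checked deterministically against the output. A toy sketch with hypothetical checkers of my own (IFEval's real checks are stricter):

```python
# Toy checkers for the two "verifiable instructions" quoted above.
# Hypothetical helpers for illustration; not IFEval's actual code.
def min_word_count(response: str, n: int = 400) -> bool:
    """'write in more than 400 words'"""
    return len(response.split()) > n

def keyword_mentions(response: str, keyword: str = "AI", n: int = 3) -> bool:
    """'mention the keyword of AI at least 3 times'"""
    return response.count(keyword) >= n
```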
Mistral 24B 3.2
hf (pretrained=anthracite-core/Mistral-Small-3.2-24B-Instruct-2506-Text-Only,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval| 4|none | 0|inst_level_loose_acc |↑ |0.7374|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.6655|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.6470|± |0.0206|
| | |none | 0|prompt_level_strict_acc|↑ |0.5582|± |0.0214|
Cydonia 24B v4.1
hf (pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval| 4|none | 0|inst_level_loose_acc |↑ |0.7098|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.6571|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.6303|± |0.0208|
| | |none | 0|prompt_level_strict_acc|↑ |0.5638|± |0.0213|
Barely affected, with slightly better prompt-level strict accuracy. I associate IFEval with a model's ability to embody characters accurately.
MMLU
Massive Multitask Language Understanding benchmark for broad domain language evaluation.
Mistral 24B 3.2
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.7739|± |0.0033|
| - humanities | 2|none | |acc |↑ |0.6903|± |0.0063|
| - other | 2|none | |acc |↑ |0.8275|± |0.0065|
| - social sciences| 2|none | |acc |↑ |0.8700|± |0.0060|
| - stem | 2|none | |acc |↑ |0.7520|± |0.0074|
Cydonia 24B v4.1
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.7736|± |0.0033|
| - humanities | 2|none | |acc |↑ |0.6876|± |0.0063|
| - other | 2|none | |acc |↑ |0.8272|± |0.0065|
| - social sciences| 2|none | |acc |↑ |0.8720|± |0.0059|
| - stem | 2|none | |acc |↑ |0.7533|± |0.0073|
Good ol' MMLU is barely affected, well within margins. It tests knowledge and intelligence.
SWAG
Situations With Adversarial Generations: predicting what happens next in a scene, with situations drawn from video captions.
Mistral 24B 3.2
hf (pretrained=anthracite-core/Mistral-Small-3.2-24B-Instruct-2506-Text-Only,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|Tasks|Version|Filter|n-shot| Metric | |Value | |Stderr|
|-----|------:|------|-----:|--------|---|-----:|---|-----:|
|swag | 1|none | 0|acc |↑ |0.5920|± |0.0035|
| | |none | 0|acc_norm|↑ |0.7906|± |0.0029|
Cydonia 24B v4.1
hf (pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|Tasks|Version|Filter|n-shot| Metric | |Value | |Stderr|
|-----|------:|------|-----:|--------|---|-----:|---|-----:|
|swag | 1|none | 0|acc |↑ |0.5950|± |0.0035|
| | |none | 0|acc_norm|↑ |0.7921|± |0.0029|
Within margins, barely affected. Just like HellaSwag, it tests reading comprehension and common sense.
XStoryCloze
Cross-lingual narrative understanding tasks to predict story endings in multiple languages.
Mistral 24B 3.2
hf (pretrained=anthracite-core/Mistral-Small-3.2-24B-Instruct-2506-Text-Only,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|-----------------|------:|------|-----:|------|---|-----:|---|-----:|
|xstorycloze | 1|none | |acc |↑ |0.6510|± |0.0036|
| - xstorycloze_ar| 1|none | 0|acc |↑ |0.6254|± |0.0125|
| - xstorycloze_en| 1|none | 0|acc |↑ |0.8081|± |0.0101|
| - xstorycloze_es| 1|none | 0|acc |↑ |0.7604|± |0.0110|
| - xstorycloze_eu| 1|none | 0|acc |↑ |0.6128|± |0.0125|
| - xstorycloze_hi| 1|none | 0|acc |↑ |0.5758|± |0.0127|
| - xstorycloze_id| 1|none | 0|acc |↑ |0.7048|± |0.0117|
| - xstorycloze_my| 1|none | 0|acc |↑ |0.4772|± |0.0129|
| - xstorycloze_ru| 1|none | 0|acc |↑ |0.7498|± |0.0111|
| - xstorycloze_sw| 1|none | 0|acc |↑ |0.5917|± |0.0126|
| - xstorycloze_te| 1|none | 0|acc |↑ |0.5467|± |0.0128|
| - xstorycloze_zh| 1|none | 0|acc |↑ |0.7081|± |0.0117|
| Groups |Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----------|------:|------|------|------|---|----:|---|-----:|
|xstorycloze| 1|none | |acc |↑ |0.651|± |0.0036|
Cydonia 24B v4.1
hf (pretrained=TheDrummer/Cydonia-24B-v4.1,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|-----------------|------:|------|-----:|------|---|-----:|---|-----:|
|xstorycloze | 1|none | |acc |↑ |0.6537|± |0.0036|
| - xstorycloze_ar| 1|none | 0|acc |↑ |0.6373|± |0.0124|
| - xstorycloze_en| 1|none | 0|acc |↑ |0.8081|± |0.0101|
| - xstorycloze_es| 1|none | 0|acc |↑ |0.7571|± |0.0110|
| - xstorycloze_eu| 1|none | 0|acc |↑ |0.6069|± |0.0126|
| - xstorycloze_hi| 1|none | 0|acc |↑ |0.5983|± |0.0126|
| - xstorycloze_id| 1|none | 0|acc |↑ |0.6989|± |0.0118|
| - xstorycloze_my| 1|none | 0|acc |↑ |0.4765|± |0.0129|
| - xstorycloze_ru| 1|none | 0|acc |↑ |0.7518|± |0.0111|
| - xstorycloze_sw| 1|none | 0|acc |↑ |0.5884|± |0.0127|
| - xstorycloze_te| 1|none | 0|acc |↑ |0.5520|± |0.0128|
| - xstorycloze_zh| 1|none | 0|acc |↑ |0.7154|± |0.0116|
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|-----------|------:|------|------|------|---|-----:|---|-----:|
|xstorycloze| 1|none | |acc |↑ |0.6537|± |0.0036|
Within margins, barely affected. A stress test of multilingual capabilities.
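Quantifying that with the delta_in_stderr helper from the HellaSwag section: even the largest per-language swing (Hindi) doesn't clear the usual significance bar.

```python
# Hindi had the biggest swing: 0.5758 +/- 0.0127 -> 0.5983 +/- 0.0126
delta, se = delta_in_stderr(0.5758, 0.0127, 0.5983, 0.0126)
print(delta, se)  # ~0.0225 vs ~0.0179 combined SE -> ~1.3 SE, within noise
```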
Thanks for running those. I wish it were more common practice in the community.
Interesting. I've always considered Cydonia to do a good job of keeping Mistral's abilities more or less intact (at least where it matters), and this more or less proves it.