---
license: apache-2.0
language:
- en
- ko
tags:
- text-generation-inference
- conversational
- custom_code
- text-generation
- Motif
---

Last update: 31 Oct. 2025

# Introduction

We are pleased to announce **Motif-2-12.7B-Base**, a 12.7-billion-parameter language model. Detailed information, including a technical report, will be released later.

# Evaluation

All models listed in the table below are **base models**. *The results of Qwen3 and Gemma 3 are <U>sourced directly from their technical reports.</U>*

|Benchmark|Evaluation setting|Motif-2-12.7B|Qwen3-14B|Qwen3-32B|Qwen3-30B-A3B|Gemma-3-12B|Gemma-3-27B|
|---|---|---|---|---|---|---|---|
|MMLU|5-shot|78.1|81.05|83.61|81.38|74.5|78.6|
|MMLU-Redux|5-shot|78.68|79.88|83.41|81.17|-|-|
|MMLU-Pro|5-shot, CoT|66.38|61.03|65.54|61.49|45.3|52.2|
|SuperGPQA|5-shot, CoT|32.68|34.27|39.78|35.72|-|-|
|BBH|3-shot, CoT|81.34|81.07|87.38|81.54|-|-|
|GPQA|5-shot, CoT|42.18|39.9|49.49|43.94|-|-|
|GPQA-Diamond|5-shot, CoT|42.92|-|-|-|25.4|24.3|
|GSM8K|4-shot, CoT|93.85|92.49|93.4|91.81|-|-|
|GSM8K|8-shot, CoT|94.92|-|-|-|71|82.6|
|MATH|4-shot, CoT|73.62|62.02|61.62|59.04|43.3|50|
|EvalPlus|0-shot|72.22|72.23|72.05|71.45|-|-|
|MBPP|3-shot|81.5|73.4|78.2|74.4|60.4|65.6|
|CRUX-O|1-shot|63.1|68.6|72.5|67.2|-|-|
|HumanEval|0-shot|65.9|-|-|-|45.7|48.8|
|DROP|1-shot|69.9|-|-|-|72.2|77.2|
|HellaSwag|10-shot|84|-|-|-|84.2|85.6|
|BoolQ|0-shot|78.5|-|-|-|78.8|82.4|
|PIQA|0-shot|81.6|-|-|-|81.8|83.3|
|SIQA|0-shot|53.8|-|-|-|53.4|54.9|
|TriviaQA|5-shot|72.2|-|-|-|78.2|85.5|
|Natural Questions|5-shot|29.6|-|-|-|31.4|36.1|
|ARC-C|25-shot|69.6|-|-|-|68.9|70.6|
|ARC-E|0-shot|84.1|-|-|-|88.3|89|
|WinoGrande|5-shot|79.6|-|-|-|74.3|78.8|
|BBH|few-shot|81.3|-|-|-|72.6|77.7|

## Averages and improvements of the corresponding benchmark scores

### vs. Gemma 3-Base

||Motif-2-12.7B|Gemma-3-12B|Gemma-3-27B|
|---|---|---|---|
|**Average**|71.53|63.87|67.96|
|**Improvement**||+11.99%|+5.26%|

### vs. Qwen3-Base

||Motif-2-12.7B|Qwen3-14B|Qwen3-32B|Qwen3-30B-A3B|
|---|---|---|---|---|
|**Average**|69.42|67.81|71.54|68.10|
|**Improvement**||+2.37%|-2.96%|+1.94%|

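
The summary rows above can be recomputed from the evaluation table. The sketch below is a reconstruction and not the authors' script. It assumes each average is the arithmetic mean over the benchmark rows where Motif-2-12.7B and every model in the compared family report a score, and that the improvement is the relative difference of the unrounded averages. The helper names `parse_rows` and `compare` are hypothetical.

```python
# Score rows copied from the evaluation table; "-" marks a score that is not reported.
MODELS = ["Motif-2-12.7B", "Qwen3-14B", "Qwen3-32B", "Qwen3-30B-A3B",
          "Gemma-3-12B", "Gemma-3-27B"]
TABLE = """
MMLU|5-shot|78.1|81.05|83.61|81.38|74.5|78.6
MMLU-Redux|5-shot|78.68|79.88|83.41|81.17|-|-
MMLU-Pro|5-shot, CoT|66.38|61.03|65.54|61.49|45.3|52.2
SuperGPQA|5-shot, CoT|32.68|34.27|39.78|35.72|-|-
BBH|3-shot, CoT|81.34|81.07|87.38|81.54|-|-
GPQA|5-shot, CoT|42.18|39.9|49.49|43.94|-|-
GPQA-Diamond|5-shot, CoT|42.92|-|-|-|25.4|24.3
GSM8K|4-shot, CoT|93.85|92.49|93.4|91.81|-|-
GSM8K|8-shot, CoT|94.92|-|-|-|71|82.6
MATH|4-shot, CoT|73.62|62.02|61.62|59.04|43.3|50
EvalPlus|0-shot|72.22|72.23|72.05|71.45|-|-
MBPP|3-shot|81.5|73.4|78.2|74.4|60.4|65.6
CRUX-O|1-shot|63.1|68.6|72.5|67.2|-|-
HumanEval|0-shot|65.9|-|-|-|45.7|48.8
DROP|1-shot|69.9|-|-|-|72.2|77.2
HellaSwag|10-shot|84|-|-|-|84.2|85.6
BoolQ|0-shot|78.5|-|-|-|78.8|82.4
PIQA|0-shot|81.6|-|-|-|81.8|83.3
SIQA|0-shot|53.8|-|-|-|53.4|54.9
TriviaQA|5-shot|72.2|-|-|-|78.2|85.5
Natural Questions|5-shot|29.6|-|-|-|31.4|36.1
ARC-C|25-shot|69.6|-|-|-|68.9|70.6
ARC-E|0-shot|84.1|-|-|-|88.3|89
WinoGrande|5-shot|79.6|-|-|-|74.3|78.8
BBH|few-shot|81.3|-|-|-|72.6|77.7
"""


def parse_rows(table):
    """Turn each table line into a {model: score or None} mapping."""
    rows = []
    for line in table.strip().splitlines():
        cells = line.split("|")
        # cells[0] is the benchmark, cells[1] the setting, the rest are scores.
        scores = [None if c == "-" else float(c) for c in cells[2:]]
        rows.append(dict(zip(MODELS, scores)))
    return rows


def compare(rows, baselines, ours="Motif-2-12.7B"):
    """Average over rows shared by all compared models (assumed selection rule)."""
    models = [ours] + baselines
    shared = [r for r in rows if all(r[m] is not None for m in models)]
    avg = {m: sum(r[m] for r in shared) / len(shared) for m in models}
    # Relative improvement of `ours` over each baseline, in percent.
    gain = {b: (avg[ours] / avg[b] - 1) * 100 for b in baselines}
    return avg, gain


rows = parse_rows(TABLE)
for family in (["Gemma-3-12B", "Gemma-3-27B"],
               ["Qwen3-14B", "Qwen3-32B", "Qwen3-30B-A3B"]):
    avg, gain = compare(rows, family)
    for model, value in avg.items():
        print(f"{model}: average {value:.2f}")
    for model, value in gain.items():
        print(f"vs. {model}: {value:+.2f}%")
```

Under these assumptions the Gemma 3 comparison uses 18 rows and the Qwen3 comparison uses 11, and the printed averages and improvements agree with the two summary tables to two decimal places. The improvements have to be taken from the unrounded averages: dividing the rounded 71.53 by 67.96 gives +5.25% rather than the listed +5.26%.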