Description:
AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language.
Evaluation Metrics
Model evaluations were conducted on established benchmarks (MMLU, Winogrande, HellaSwag, ARC, GSM8K, and DROP) in Kazakh, Russian, and English, covering a range of knowledge, reasoning, and reading-comprehension tasks.
Kazakh Leaderboard
Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP |
---|---|---|---|---|---|---|---|
Yi-Lightning | 0.812 | 0.720 | 0.852 | 0.820 | 0.940 | 0.880 | 0.660 |
DeepSeek V3 37A | 0.715 | 0.650 | 0.628 | 0.640 | 0.900 | 0.890 | 0.580 |
DeepSeek R1 | 0.798 | 0.753 | 0.764 | 0.680 | 0.868 | 0.937 | 0.784 |
Llama-3.1-70b-inst. | 0.639 | 0.610 | 0.585 | 0.520 | 0.820 | 0.780 | 0.520 |
KazLLM-1.0-70B | 0.766 | 0.660 | 0.806 | 0.790 | 0.920 | 0.770 | 0.650 |
GPT-4o | 0.776 | 0.730 | 0.704 | 0.830 | 0.940 | 0.900 | 0.550 |
AlemLLM | 0.826 | 0.757 | 0.837 | 0.775 | 0.949 | 0.917 | 0.719 |
QwQ 32B | 0.628 | 0.591 | 0.613 | 0.499 | 0.661 | 0.826 | 0.576 |
Russian Leaderboard
Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP |
---|---|---|---|---|---|---|---|
Yi-Lightning | 0.834 | 0.750 | 0.854 | 0.870 | 0.960 | 0.890 | 0.680 |
DeepSeek V3 37A | 0.818 | 0.784 | 0.756 | 0.840 | 0.960 | 0.910 | 0.660 |
DeepSeek R1 | 0.845 | 0.838 | 0.811 | 0.827 | 0.972 | 0.928 | 0.694 |
Llama-3.1-70b-inst. | 0.752 | 0.660 | 0.691 | 0.730 | 0.920 | 0.880 | 0.630 |
KazLLM-1.0-70B | 0.748 | 0.650 | 0.806 | 0.860 | 0.790 | 0.810 | 0.570 |
GPT-4o | 0.808 | 0.776 | 0.771 | 0.880 | 0.960 | 0.890 | 0.570 |
AlemLLM | 0.848 | 0.801 | 0.858 | 0.843 | 0.959 | 0.896 | 0.729 |
QwQ 32B | 0.840 | 0.810 | 0.807 | 0.823 | 0.964 | 0.926 | 0.709 |
English Leaderboard
Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP |
---|---|---|---|---|---|---|---|
Yi-Lightning | 0.909 | 0.820 | 0.936 | 0.930 | 0.980 | 0.930 | 0.860 |
DeepSeek V3 37A | 0.880 | 0.840 | 0.790 | 0.900 | 0.980 | 0.950 | 0.820 |
DeepSeek R1 | 0.908 | 0.855 | 0.857 | 0.882 | 0.977 | 0.960 | 0.915 |
Llama-3.1-70b-inst. | 0.841 | 0.770 | 0.718 | 0.880 | 0.960 | 0.900 | 0.820 |
KazLLM-1.0-70B | 0.855 | 0.820 | 0.843 | 0.920 | 0.970 | 0.820 | 0.760 |
GPT-4o | 0.862 | 0.830 | 0.793 | 0.940 | 0.980 | 0.910 | 0.720 |
AlemLLM | 0.921 | 0.874 | 0.928 | 0.909 | 0.978 | 0.926 | 0.911 |
QwQ 32B | 0.914 | 0.864 | 0.886 | 0.897 | 0.969 | 0.969 | 0.896 |
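The leaderboards above come from the authors' own evaluation pipeline, which is not published in this card. As an illustration only, a public harness such as EleutherAI's lm-evaluation-harness can score a Hugging Face checkpoint on the English versions of these benchmarks; the repository id below is a placeholder assumption, and the Kazakh and Russian leaderboards would additionally require custom task configurations for the translated benchmarks.

```python
# Illustrative sketch, not the authors' pipeline. Assumes `pip install lm-eval`
# and a published checkpoint under the hypothetical repo id "astanahub/AlemLLM".
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=astanahub/AlemLLM,dtype=bfloat16",
    tasks=["mmlu", "winogrande", "hellaswag", "arc_challenge", "gsm8k", "drop"],
    batch_size=8,
)

# Per-task metrics (accuracy / exact match) live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```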
Model Specification
- Architecture: Mixture of Experts (MoE)
- Total Parameters: 247B
- Activated Parameters: 22B
- Tokenizer: SentencePiece
- Quantization: BF16
- Vocabulary Size: 100,352
- Number of Layers: 56
- Activation Function: SwiGLU
- Positional Encoding Method: RoPE
- Optimizer: AdamW
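The card does not include an official usage example, so the following is a minimal inference sketch with Hugging Face Transformers. The repository id, the presence of a chat template, and the need for trust_remote_code are assumptions; adjust them to the published checkpoint.

```python
# Minimal inference sketch. "astanahub/AlemLLM" is a placeholder repo id, and
# trust_remote_code=True is assumed in case the MoE architecture ships custom
# modeling code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "astanahub/AlemLLM"  # placeholder; replace with the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 precision listed above
    device_map="auto",
    trust_remote_code=True,
)

# Kazakh prompt: "Tell me about the capital of Kazakhstan."
messages = [{"role": "user", "content": "Қазақстанның астанасы туралы айтып бер."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```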