🎉 License Updated! We are pleased to announce our more flexible licensing terms 🤗
✈️ Try on FriendliAI (licensed for commercial purposes)

📢 EXAONE 4.0 is officially supported by llama.cpp! Please check the guide below

EXAONE-4.0-32B-GGUF

Introduction

We introduce EXAONE 4.0, which integrates a Non-reasoning mode and Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.

The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.

The EXAONE 4.0 architecture introduces the following changes compared to previous EXAONE models:

Hybrid Attention: For the 32B model, we adopt a hybrid attention scheme, which combines local attention (sliding window attention) with global attention (full attention) in a 3:1 ratio. We do not use RoPE (Rotary Positional Embedding) in the global attention layers, which improves global context understanding.
QK-Reorder-Norm: We reorder the LayerNorm position from the traditional Pre-LN scheme by applying LayerNorm directly to the attention and MLP outputs, and we add RMS normalization right after the Q and K projection. It helps yield better performance on downstream tasks despite consuming more computation.
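
To make the 3:1 ratio concrete, here is a minimal sketch of how local and global layers could be laid out across the 64 layers of the 32B model, together with a toy sliding-window mask. The placement of the global layer (every fourth layer) and the `window` parameter are assumptions for illustration; the technical report is the authority on the actual ordering and window size.

```python
# Illustrative sketch only, not the actual EXAONE 4.0 implementation.
NUM_LAYERS = 64          # from the model configuration
LOCAL_PER_GLOBAL = 3     # 3:1 local-to-global ratio


def layer_types(num_layers=NUM_LAYERS, local_per_global=LOCAL_PER_GLOBAL):
    """Return one attention type per layer.

    ASSUMPTION: each group of four layers is local, local, local, global.
    """
    period = local_per_global + 1
    return [
        "global" if (i + 1) % period == 0 else "local"
        for i in range(num_layers)
    ]


def sliding_window_mask(seq_len, window):
    """Causal mask in which query q sees keys k with q - window < k <= q.

    `window` is a hypothetical parameter; the real window size is not
    stated in this document.
    """
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


types = layer_types()
print(types[:8])
print(types.count("local"), "local layers,", types.count("global"), "global layers")
```

Under this assumed layout, 48 of the 64 layers use sliding window attention and 16 use full attention.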

For more details, please refer to our technical report, HuggingFace paper, blog, and GitHub.

Model Configuration

Number of Parameters (without embeddings): 30.95B
Number of Layers: 64
Number of Attention Heads: GQA with 40 query heads and 8 KV heads
Vocab Size: 102,400
Context Length: 131,072 tokens
Quantization: Q8_0, Q6_K, Q5_K_M, Q4_K_M, IQ4_XS in GGUF format (also includes BF16 weights)
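
As a rough illustration of what "GQA with 40 heads and 8 KV heads" implies, the sketch below maps each query head to the KV head it shares. The contiguous grouping (heads 0 to 4 share KV head 0, and so on) is a common convention assumed here, not something this document confirms.

```python
# Illustrative sketch of grouped-query attention (GQA) head sharing.
NUM_Q_HEADS = 40   # from the model configuration
NUM_KV_HEADS = 8   # from the model configuration

# Number of query heads that share one KV head.
GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS


def kv_head_for(q_head):
    """Return the KV head index used by a query head.

    ASSUMPTION: query heads are grouped contiguously.
    """
    if not 0 <= q_head < NUM_Q_HEADS:
        raise ValueError(f"query head index out of range: {q_head}")
    return q_head // GROUP_SIZE


print("query heads per KV head:", GROUP_SIZE)
print([kv_head_for(h) for h in range(NUM_Q_HEADS)])
```

With 8 KV heads instead of 40, the KV cache holds one fifth of the key and value tensors that standard multi-head attention with the same number of query heads would need.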

Quickstart

llama.cpp

You can run EXAONE models locally using llama.cpp by following these steps:

Install the latest version of llama.cpp (version >= b5932). Please check the official installation guide from llama.cpp.

Download the EXAONE 4.0 model weights in GGUF format.

huggingface-cli download LGAI-EXAONE/EXAONE-4.0-32B-GGUF \
    --include "EXAONE-4.0-32B-Q4_K_M.gguf" \
    --local-dir .

If the GGUF model you want to use is split into multiple files, merge them into a single file before running the model.

First, download the GGUF model weights.

huggingface-cli download LGAI-EXAONE/EXAONE-4.0-32B-GGUF \
    --include "EXAONE-4.0-32B-BF16*.gguf" \
    --local-dir .

Merge the split files into a single file.

llama-gguf-split --merge \
    ./EXAONE-4.0-32B-BF16-00001-of-00002.gguf \
    ./EXAONE-4.0-32B-BF16.gguf
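
If you script this step, a small helper can derive the merged file name from the first shard and build the same `llama-gguf-split --merge` command shown above. This is a sketch: the `merge_command` helper and its regular expression are hypothetical and assume the `-NNNNN-of-NNNNN.gguf` shard naming seen in this repository.

```python
import re
import shlex

# Matches shard names such as "model-00001-of-00002.gguf".
SHARD_RE = re.compile(r"^(?P<stem>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def merge_command(first_shard):
    """Build the llama-gguf-split argument list for merging a split GGUF.

    Hypothetical helper: it only constructs the command and does not run it.
    """
    match = SHARD_RE.match(first_shard)
    if match is None:
        raise ValueError(f"not a split GGUF shard name: {first_shard}")
    if int(match["index"]) != 1:
        raise ValueError("pass the first shard (index 00001)")
    merged = f"{match['stem']}.gguf"
    return ["llama-gguf-split", "--merge", first_shard, merged]


cmd = merge_command("./EXAONE-4.0-32B-BF16-00001-of-00002.gguf")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```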

Generation with `llama-cli`

Apply the chat template using transformers.

This step is necessary to avoid issues with the current EXAONE modeling code in llama.cpp. A fix is in progress in our PR, and we will update this guide once the issues are resolved.

from transformers import AutoTokenizer

model_name = "LGAI-EXAONE/EXAONE-4.0-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Let's work together on local system!"}
]
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

print(repr(input_text))
with open("inputs.txt", "w") as f:
    f.write(input_text)

Generate the result with greedy decoding.

llama-cli -m EXAONE-4.0-32B-Q4_K_M.gguf \
    -fa -ngl 65 \
    --temp 0.0 --top-k 1 \
    -f inputs.txt -no-cnv

OpenAI compatible server with `llama-server`

Run llama-server with the EXAONE 4.0 Jinja template. You can find the chat template file in this repository.

llama-server -m EXAONE-4.0-32B-Q4_K_M.gguf \
    -c 131072 -fa -ngl 65 \
    --temp 0.6 --top-p 0.95 \
    --jinja --chat-template-file chat_template.jinja \
    --host 0.0.0.0 --port 8820 \
    -a EXAONE-4.0-32B-Q4_K_M

Use the OpenAI-compatible chat completions endpoint to test the GGUF model.

curl -X POST http://localhost:8820/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "EXAONE-4.0-32B-Q4_K_M",
        "messages": [
            {"role": "user", "content": "Let'\''s work together on server!"}
        ],
        "max_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.95,
        "chat_template_kwargs": {"enable_thinking": false}
    }'
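
The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above (same port, model alias, and sampling values); the helper names `build_chat_request` and `send` are hypothetical, and the response handling assumes the usual OpenAI chat completions response shape.

```python
import json
import urllib.request


def build_chat_request(
    prompt,
    base_url="http://localhost:8820",
    model="EXAONE-4.0-32B-Q4_K_M",
    enable_thinking=False,
):
    """Build (but do not send) a chat completions request for llama-server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.95,
        # Forwarded to the Jinja chat template to toggle reasoning mode.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_chat_request("Let's work together on server!")
print(request.full_url)
# With llama-server running:
#   reply = send(request)
#   print(reply["choices"][0]["message"]["content"])
```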

Performance

The following tables show the evaluation results of each model in reasoning and non-reasoning modes. The evaluation details can be found in the technical report.

✅ denotes that the model has a hybrid reasoning capability and was evaluated in reasoning or non-reasoning mode depending on the purpose.
To assess Korean practical and professional knowledge, we adopt both the KMMLU-Redux and KMMLU-Pro benchmarks. Both datasets are publicly released!
The evaluation results are based on the original models, not the quantized models.

32B Reasoning Mode

	EXAONE 4.0 32B	Phi 4 reasoning-plus	Magistral Small-2506	Qwen 3 32B	Qwen 3 235B	DeepSeek R1-0528
Model Size	32.0B	14.7B	23.6B	32.8B	235B	671B
Hybrid Reasoning	✅			✅	✅
World Knowledge
MMLU-Redux	92.3	90.8	86.8	90.9	92.7	93.4
MMLU-Pro	81.8	76.0	73.4	80.0	83.0	85.0
GPQA-Diamond	75.4	68.9	68.2	68.4	71.1	81.0
Math/Coding
AIME 2025	85.3	78.0	62.8	72.9	81.5	87.5
HMMT Feb 2025	72.9	53.6	43.5	50.4	62.5	79.4
LiveCodeBench v5	72.6	51.7	55.8	65.7	70.7	75.2
LiveCodeBench v6	66.7	47.1	47.4	60.1	58.9	70.3
Instruction Following
IFEval	83.7	84.9	37.9	85.0	83.4	80.8
Multi-IF (EN)	73.5	56.1	27.4	73.4	73.4	72.0
Agentic Tool Use
BFCL-v3	63.9	N/A	40.4	70.3	70.8	64.7
Tau-Bench (Airline)	51.5	N/A	38.5	34.5	37.5	53.5
Tau-Bench (Retail)	62.8	N/A	10.2	55.2	58.3	63.9
Multilinguality
KMMLU-Pro	67.7	55.8	51.5	61.4	68.1	71.7
KMMLU-Redux	72.7	62.7	54.6	67.5	74.5	77.0
KSM	87.6	79.8	71.9	82.8	86.2	86.7
MMMLU (ES)	85.6	84.3	68.9	82.8	86.7	88.2
MATH500 (ES)	95.8	94.2	83.5	94.3	95.1	96.0

32B Non-Reasoning Mode

	EXAONE 4.0 32B	Phi 4	Mistral-Small-2506	Gemma3 27B	Qwen3 32B	Qwen3 235B	Llama-4-Maverick	DeepSeek V3-0324
Model Size	32.0B	14.7B	24.0B	27.4B	32.8B	235B	402B	671B
Hybrid Reasoning	✅				✅	✅
World Knowledge
MMLU-Redux	89.8	88.3	85.9	85.0	85.7	89.2	92.3	92.3
MMLU-Pro	77.6	70.4	69.1	67.5	74.4	77.4	80.5	81.2
GPQA-Diamond	63.7	56.1	46.1	42.4	54.6	62.9	69.8	68.4
Math/Coding
AIME 2025	35.9	17.8	30.2	23.8	20.2	24.7	18.0	50.0
HMMT Feb 2025	21.8	4.0	16.9	10.3	9.8	11.9	7.3	29.2
LiveCodeBench v5	43.3	24.6	25.8	27.5	31.3	35.3	43.4	46.7
LiveCodeBench v6	43.1	27.4	26.9	29.7	28.0	31.4	32.7	44.0
Instruction Following
IFEval	84.8	63.0	77.8	82.6	83.2	83.2	85.4	81.2
Multi-IF (EN)	71.6	47.7	63.2	72.1	71.9	72.5	77.9	68.3
Long Context
HELMET	58.3	N/A	61.9	58.3	54.5	63.3	13.7	N/A
RULER	88.2	N/A	71.8	66.0	85.6	90.6	2.9	N/A
LongBench v1	48.1	N/A	51.5	51.5	44.2	45.3	34.7	N/A
Agentic Tool Use
BFCL-v3	65.2	N/A	57.7	N/A	63.0	68.0	52.9	63.8
Tau-Bench (Airline)	25.5	N/A	36.1	N/A	16.0	27.0	38.0	40.5
Tau-Bench (Retail)	55.9	N/A	35.5	N/A	47.6	56.5	6.5	68.5
Multilinguality
KMMLU-Pro	60.0	44.8	51.0	50.7	58.3	64.4	68.8	67.3
KMMLU-Redux	64.8	50.1	53.6	53.3	64.4	71.7	76.9	72.2
KSM	59.8	29.1	35.5	36.1	41.3	46.6	40.6	63.5
Ko-LongBench	76.9	N/A	55.4	72.0	73.9	74.6	65.6	N/A
MMMLU (ES)	80.6	81.2	78.4	78.7	82.1	83.7	86.9	86.7
MATH500 (ES)	87.3	78.2	83.4	86.8	84.7	87.2	78.7	89.2
WMT24++ (ES)	90.7	89.3	92.2	93.1	91.4	92.9	92.7	94.3

1.2B Reasoning Mode

	EXAONE 4.0 1.2B	EXAONE Deep 2.4B	Qwen 3 0.6B	Qwen 3 1.7B	SmolLM 3 3B
Model Size	1.28B	2.41B	596M	1.72B	3.08B
Hybrid Reasoning	✅		✅	✅	✅
World Knowledge
MMLU-Redux	71.5	68.9	55.6	73.9	74.8
MMLU-Pro	59.3	56.4	38.3	57.7	57.8
GPQA-Diamond	52.0	54.3	27.9	40.1	41.7
Math/Coding
AIME 2025	45.2	47.9	15.1	36.8	36.7
HMMT Feb 2025	34.0	27.3	7.0	21.8	26.0
LiveCodeBench v5	44.6	47.2	12.3	33.2	27.6
LiveCodeBench v6	45.3	43.1	16.4	29.9	29.1
Instruction Following
IFEval	67.8	71.0	59.2	72.5	71.2
Multi-IF (EN)	53.9	54.5	37.5	53.5	47.5
Agentic Tool Use
BFCL-v3	52.9	N/A	46.4	56.6	37.1
Tau-Bench (Airline)	20.5	N/A	22.0	31.0	37.0
Tau-Bench (Retail)	28.1	N/A	3.3	6.5	5.4
Multilinguality
KMMLU-Pro	42.7	24.6	21.6	38.3	30.5
KMMLU-Redux	46.9	25.0	24.5	38.0	33.7
KSM	60.6	60.9	22.8	52.9	49.7
MMMLU (ES)	62.4	51.4	48.8	64.5	64.7
MATH500 (ES)	88.8	84.5	70.6	87.9	87.5

1.2B Non-Reasoning Mode

	EXAONE 4.0 1.2B	Qwen 3 0.6B	Gemma 3 1B	Qwen 3 1.7B	SmolLM 3 3B
Model Size	1.28B	596M	1.00B	1.72B	3.08B
Hybrid Reasoning	✅	✅		✅	✅
World Knowledge
MMLU-Redux	66.9	44.6	40.9	63.4	65.0
MMLU-Pro	52.0	26.6	14.7	43.7	43.6
GPQA-Diamond	40.1	22.9	19.2	28.6	35.7
Math/Coding
AIME 2025	23.5	2.6	2.1	9.8	9.3
HMMT Feb 2025	13.0	1.0	1.5	5.1	4.7
LiveCodeBench v5	26.4	3.6	1.8	11.6	11.4
LiveCodeBench v6	30.1	6.9	2.3	16.6	20.6
Instruction Following
IFEval	74.7	54.5	80.2	68.2	76.7
Multi-IF (EN)	62.1	37.5	32.5	51.0	51.9
Long Context
HELMET	41.2	21.1	N/A	33.8	38.6
RULER	77.4	55.1	N/A	65.9	66.3
LongBench v1	36.9	32.4	N/A	41.9	39.9
Agentic Tool Use
BFCL-v3	55.7	44.1	N/A	52.2	47.3
Tau-Bench (Airline)	10.0	31.5	N/A	13.5	38.0
Tau-Bench (Retail)	21.7	5.7	N/A	4.6	6.7
Multilinguality
KMMLU-Pro	37.5	24.6	9.7	29.5	27.6
KMMLU-Redux	40.4	22.8	19.4	29.8	26.4
KSM	26.3	0.1	22.8	16.3	16.1
Ko-LongBench	69.8	16.4	N/A	57.1	15.7
MMMLU (ES)	54.6	39.5	35.9	54.3	55.1
MATH500 (ES)	71.2	38.5	41.2	66.0	62.4
WMT24++ (ES)	65.9	58.2	76.9	76.7	84.0

Usage Guideline

To achieve the expected performance, we recommend using the following configurations:

For non-reasoning mode, we recommend using a lower temperature value such as temperature<0.6 for better performance.

For reasoning mode (using <think> block), we recommend using temperature=0.6 and top_p=0.95.

If you encounter model degeneration, we recommend using presence_penalty=1.5.

For Korean general conversation with the 1.2B model, we suggest using temperature=0.1 to avoid code switching.
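
These recommendations can be collected in a small helper that returns sampling parameters for a request. This is a sketch under stated assumptions: the function name is hypothetical, the non-reasoning default of 0.3 is just one value satisfying temperature<0.6, and applying the Korean 1.2B setting only in non-reasoning mode is our reading of "general conversation".

```python
def recommended_sampling(reasoning, degenerating=False, korean_chat_1_2b=False):
    """Return sampling parameters following the usage guideline.

    Hypothetical helper; the values come from the guideline except where
    marked as assumptions.
    """
    if reasoning:
        params = {"temperature": 0.6, "top_p": 0.95}
    elif korean_chat_1_2b:
        # Low temperature to avoid code switching with the 1.2B model.
        # ASSUMPTION: applied in non-reasoning mode only.
        params = {"temperature": 0.1}
    else:
        # ASSUMPTION: 0.3 is an arbitrary choice below the 0.6 bound.
        params = {"temperature": 0.3}
    if degenerating:
        params["presence_penalty"] = 1.5
    return params


print(recommended_sampling(reasoning=True))
print(recommended_sampling(reasoning=False, degenerating=True))
```

The returned dictionary can be merged into the JSON body of a chat completions request.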

Limitation

The EXAONE language model has certain limitations and may occasionally generate inappropriate responses. The language model generates responses based on token output probabilities, which are determined during training on the training data. While we have made every effort to exclude personal, harmful, and biased information from the training data, some problematic content may still be included, potentially leading to undesirable responses. Please note that the text generated by the EXAONE language model does not reflect the views of LG AI Research.

Inappropriate answers may be generated, which contain personal, harmful or other inappropriate information.
Biased responses may be generated, which are associated with age, gender, race, and so on.
The generated responses rely heavily on statistics from the training data, which can result in the generation of semantically or syntactically incorrect sentences.
Since the model does not reflect the latest information, the responses may be false or contradictory.

LG AI Research strives to reduce potential risks that may arise from EXAONE language models. Users are not allowed to engage in any malicious activities (e.g., keying in illegal information) that may induce the creation of inappropriate outputs violating LG AI's ethical principles when using EXAONE language models.

License

The model is licensed under the EXAONE AI Model License Agreement 1.2 - NC.

The main differences from the previous version are as follows:

We removed the claim of model output ownership from the license.

We restrict use of the model for developing models that compete with EXAONE.

We allow the model to be used for educational purposes, not just research.

Citation

@article{exaone-4.0,
  title={EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes},
  author={{LG AI Research}},
  journal={arXiv preprint arXiv:2507.11407},
  year={2025}
}

Contact

LG AI Research Technical Support: [email protected]
