AMD is thrilled to introduce Instella-Math, a reasoning-focused language model that marks a major milestone for AMD: as far as we know, it is the first language model trained with long chain-of-thought reinforcement learning entirely on AMD GPUs. Starting from Instella-3B-Instruct, we extended the model’s capabilities through a multi-stage training pipeline (two stages of supervised fine-tuning followed by three stages of reinforcement learning using the VERL framework), executed entirely on AMD Instinct™ MI300X GPUs.
Key Takeaways
- Introducing Instella-Math, the first reasoning-centric language model from AMD, with 3 billion parameters, fully trained on 32 AMD Instinct MI300X GPUs.
- Built on the AMD ROCm software stack, Instella-3B-Math leverages efficient distributed training techniques, including reinforcement learning across 4 MI300X nodes (8 GPUs each), demonstrating the scalability and performance of AMD hardware for cutting-edge AI workloads.
- Instella-Math is an open language model whose architecture, training code, weights, and datasets are publicly available, allowing anyone to inspect, use, modify, or build upon the model.
Instella-Math
Derived from Instella-3B-Instruct with an identical architecture, Instella-3B-Math is optimized for logical reasoning, mathematical problem-solving, and chain-of-thought tasks. The training pipeline features two stages of supervised fine-tuning followed by three reinforcement learning stages using the GRPO algorithm, as shown in figure 1.

Example Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "amd/Instella-3B-Math"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", trust_remote_code=True)

# Single-turn chat prompt that asks for the final answer inside \boxed{}.
prompt = [{"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer within \\boxed{}."}]

# Apply the model's chat template and return the token IDs as a PyTorch tensor.
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)

# Sample a response of up to 1024 new tokens.
tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)
print(tokenizer.decode(tokens[0], skip_special_tokens=False))
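Because the prompt asks for the final answer inside `\boxed{}`, a small post-processing step can pull that answer out of the decoded text. The helper below is a minimal sketch, not part of the Instella-Math codebase: the function name `extract_boxed` and the sample string are hypothetical, and the sample is not real model output.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced, so no complete answer was found


# Hypothetical decoded response, used only to illustrate the helper.
sample = "April: 48 clips, May: 24 clips, so the total is \\boxed{72}."
print(extract_boxed(sample))
```

Tracking brace depth rather than using a simple regular expression keeps answers such as `\boxed{\frac{1}{2}}` intact.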
Supervised Finetuning (SFT)
We perform a two-stage supervised fine-tuning process to gradually enhance the reasoning capabilities of the Instella-3B-Instruct model. In the first stage, we use instruction tuning to broaden mathematical coverage. The second stage enables the model to generate in-depth analyses and structured reasoning steps, which are crucial for tackling complex problems such as Olympiad-level math questions.
Stage 1: Instruction Tuning with OpenMathInstruct-2 for Mathematical Coverage
In the first stage of SFT, we begin with instruction tuning, which teaches the model to follow instructions or prompts properly, especially in a question-answer or problem-solution format. We use the OpenMathInstruct-2 dataset, which consists of 14 million problem-solution pairs generated from the GSM8K and MATH training sets, to train the model to follow mathematical prompts covering a diverse range of topics, from arithmetic and algebra to probability and calculus.
Stage 2: Deep Reasoning with Long-Context Training on AM-DeepSeek-R1-Distilled
In the second SFT stage, we further improve the model’s reasoning capability by training on AM-DeepSeek-R1-Distilled-1.4M, which is a large-scale general reasoning task dataset with high-quality and challenging reasoning problems. In this stage, we increase the context length of the model from 4K to 32K to allow the model to learn from the long chain-of-thought responses distilled from large reasoning models such as DeepSeek-R1.
Reinforcement Learning (GRPO)
Stage 1: GRPO with 8 Rollouts and 8K Output Contexts
Training: In the first stage of reinforcement learning, we apply the Group Relative Policy Optimization (GRPO) algorithm to train the model on Big-Math-RL-Verified, a curated set of complex multi-step math problems. We generate 8 rollouts per prompt, each allowing up to 8K output tokens, to explore diverse reasoning trajectories. The model is trained for 1,200 GRPO steps, using rule-based reward signals designed by Prime-RL that favor correct solutions in the desired format. Training is distributed over 16 MI300X GPUs across 2 nodes, with VERL and vLLM enabling stable and efficient rollout collection, reward evaluation, and policy updates.
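To make the "group relative" part of GRPO concrete, the sketch below scores a group of 8 rollouts for one prompt and normalizes each reward against the group's mean and standard deviation. It follows the commonly described GRPO formulation and is not the VERL implementation. The binary exact-match reward is a simplified stand-in for the Prime-RL rule-based reward, and the function names, the use of the population standard deviation, and the `eps` value are all assumptions.

```python
from statistics import mean, pstdev
from typing import List, Optional


def rule_based_reward(predicted: Optional[str], reference: str) -> float:
    """Hypothetical binary reward: 1.0 if the final answer matches, else 0.0."""
    if predicted is None:  # no answer in the required format
        return 0.0
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Final answers extracted from 8 hypothetical rollouts for one prompt.
answers = ["72", "72", None, "48", "72", "96", "72", None]
rewards = [rule_based_reward(a, "72") for a in answers]
advantages = group_relative_advantages(rewards)

for answer, reward, adv in zip(answers, rewards, advantages):
    print(f"answer={answer!s:>4}  reward={reward:.1f}  advantage={adv:+.3f}")
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one. When every rollout in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.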
Stage 2: GRPO with Extended 16 Rollouts and 16K Output Contexts on DeepMath
Training: To push the limits of long-form reasoning, we conduct a second GRPO stage on DeepMath using 16 rollouts per prompt with up to 16K output tokens. This stage is designed to maximize the model's capacity for deep mathematical reasoning, enabling it to solve problems that require extended derivations, multiple nested logical steps, or structured proof-like outputs. In this stage, training is distributed over 32 MI300X GPUs across 4 nodes, and the model is trained for 600 GRPO steps.
Stage 3: GRPO with Extended 16 Rollouts and 16K Output Contexts on DeepScaleR
Training: To further improve performance on Olympiad-level math questions, we conduct a third GRPO stage on DeepScaleR, which contains original questions from real Olympiad math competitions such as AIME (1984-2023) and AMC (prior to 2023). As in Stage 2, Stage 3 training uses 16 rollouts per prompt with up to 16K output tokens. In this stage, training is distributed over 32 MI300X GPUs across 4 nodes, and the model is trained for 740 GRPO steps.
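The three GRPO stages described above can be summarized side by side. The snippet below is only a bookkeeping sketch: the `GRPOStage` class and its field names are hypothetical (they are not VERL configuration keys), and "8K" and "16K" are assumed to mean 8 × 1024 and 16 × 1024 tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPOStage:
    name: str
    dataset: str
    rollouts_per_prompt: int
    max_output_tokens: int  # assumption: "K" means 1024 tokens
    gpus: int
    nodes: int
    steps: int

    @property
    def max_tokens_per_prompt(self) -> int:
        """Upper bound on generated tokens for one prompt across all rollouts."""
        return self.rollouts_per_prompt * self.max_output_tokens


STAGES = [
    GRPOStage("Stage 1", "Big-Math-RL-Verified", 8, 8 * 1024, 16, 2, 1200),
    GRPOStage("Stage 2", "DeepMath", 16, 16 * 1024, 32, 4, 600),
    GRPOStage("Stage 3", "DeepScaleR", 16, 16 * 1024, 32, 4, 740),
]

for s in STAGES:
    print(f"{s.name}: {s.dataset}, {s.gpus} GPUs on {s.nodes} nodes, "
          f"{s.steps} steps, up to {s.max_tokens_per_prompt} tokens per prompt")

total_steps = sum(s.steps for s in STAGES)
print("Total GRPO steps:", total_steps)
```

Under these assumptions, the per-prompt generation budget grows fourfold from Stage 1 to Stages 2 and 3, which is consistent with the move from 16 to 32 GPUs.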
Results
Model | Size | MATH 500 | GSM8K | GPQA-D | AIME 2024 | AIME 2025 | AMC | Minerva | OlympiadBench | Average |
---|---|---|---|---|---|---|---|---|---|---|
Open Weight Models | ||||||||||
Qwen2.5-Math-1.5B | 1.5B | 57.81 | 66.31 | 15.40 | 7.71 | 3.96 | 35.77 | 15.72 | 25.98 | 28.58 |
DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | 82.58 | 84.06 | 16.48 | 27.50 | 22.50 | 63.48 | 26.52 | 43.00 | 45.76 |
STILL-3-1.5B-preview | 1.5B | 84.59 | 86.57 | 19.48 | 30.63 | 25.21 | 66.72 | 28.58 | 45.29 | 48.38 |
DeepScaleR-1.5B-Preview | 1.5B | 87.43 | 87.34 | 16.45 | 40.63 | 30.83 | 73.19 | 30.06 | 49.89 | 51.98 |
Fully Open Models | ||||||||||
SmolLM3-3B | 3B | 90.16 | 92.26 | 44.85 | 52.50 | 35.83 | 78.69 | 31.76 | 55.35 | 60.18 |
OLMo-2-1124-7B-Instruct | 7B | 32.5 | 80.86 | 11.14 | 1.25 | 0.21 | 12.27 | 10.30 | 8.48 | 19.63 |
Instella-Math SFT | 3B | 77.55 | 88.03 | 23.36 | 20.00 | 18.96 | 53.92 | 18.82 | 43.27 | 42.99 |
Instella-Math RL Stage 1 | 3B | 82.16 | 90.90 | 34.15 | 27.92 | 22.50 | 58.81 | 25.05 | 49.23 | 48.84 |
Instella-Math RL Stage 2 | 3B | 85.84 | 91.72 | 37.37 | 29.58 | 22.92 | 66.72 | 27.53 | 52.67 | 51.79 |
Instella-Math RL Stage 3 | 3B | 86.49 | 92.48 | 37.63 | 35.63 | 27.71 | 69.73 | 27.67 | 53.11 | 53.80 |

Model | oTTT | dTTT | cTTT | sTTT | Average |
---|---|---|---|---|---|
Open Weight Models | | | | | |
Qwen2.5-Math-1.5B | 12.5 | 10.00 | 18.89 | 7.50 | 12.22 |
DeepSeek-R1-Distill-Qwen-1.5B | 22.92 | 10.06 | 18.19 | 3.49 | 13.67 |
STILL-3-1.5B-preview | 24.51 | 12.25 | 19.79 | 3.18 | 14.93 |
DeepScaleR-1.5B-Preview | 23.04 | 16.50 | 22.99 | 8.18 | 17.68 |
Fully Open Models | | | | | |
SmolLM3-3B | 51.22 | 40.06 | 41.32 | 42.34 | 43.74 |
Instella-Math RL Stage 1 | 56.31 | 31.37 | 39.65 | 41.93 | 42.32 |
Instella-Math RL Stage 2 | 66.2 | 37.31 | 39.17 | 44.48 | 46.79 |
Instella-Math RL Stage 3 | 70.25 | 39.56 | 40.28 | 48.96 | 49.76 |
- Following the same evaluation setting as DeepScaleR-1.5B, we report Pass@1 accuracy averaged over 16 responses.
- Instella-Math delivers competitive performance when compared to leading small-scale open-weight models such as DeepSeek-R1-Distill-Qwen-1.5B, STILL-3-1.5B-preview, DeepScaleR-1.5B-Preview, and SmolLM3-3B.
- Beyond achieving competitive average performance across all benchmarks, Instella-Math demonstrates the effectiveness of our RL training recipe: its average score improves over the supervised fine-tuned variant (Instella-Math SFT) by 10.81 points, compared with a 6.22-point improvement for DeepScaleR-1.5B-Preview over its base model (DeepSeek-R1-Distill-Qwen-1.5B).
- Additionally, we test Instella-Math on TTT-Bench, a new benchmark targeting strategic, spatial, and logical reasoning. Remarkably, without any exposure to TTT-Bench–style or similar strategic gaming data during any stage of training, Instella-Math achieves the best average performance among all evaluated models.
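The Pass@1 metric averaged over 16 responses and the quoted improvements can be reproduced with a few lines of arithmetic. The sketch below assumes Pass@1 is computed as the per-problem fraction of correct samples, averaged over problems. The function name and the toy correctness flags are hypothetical, while the four average scores are copied from the results table.

```python
from typing import Dict, List


def pass_at_1(correct_by_problem: Dict[str, List[bool]]) -> float:
    """Mean per-problem fraction of correct samples, as a percentage."""
    per_problem = [sum(flags) / len(flags) for flags in correct_by_problem.values()]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy data: 16 sampled responses for each of two hypothetical problems.
toy = {
    "problem_1": [True] * 12 + [False] * 4,  # 12 of 16 responses correct
    "problem_2": [False] * 16,               # no response correct
}
toy_score = pass_at_1(toy)
print(f"Toy Pass@1: {toy_score:.2f}")

# Average scores taken from the results table.
instella_gain = round(53.80 - 42.99, 2)    # RL Stage 3 vs. Instella-Math SFT
deepscaler_gain = round(51.98 - 45.76, 2)  # DeepScaleR vs. its base model
print("Instella-Math gain from RL:", instella_gain)
print("DeepScaleR gain over base:", deepscaler_gain)
```

Averaging over many samples per problem reduces the variance that a single sampled response would introduce, which matters most on small benchmarks such as AIME.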
Conclusion
The release of the Instella-Math model marks a major step forward in open-source AI, showcasing the potential of reasoning-focused language models and the scalability of AMD hardware for reinforcement learning and fine-tuning. To our knowledge, Instella-Math is the first fully open math reasoning model trained on AMD GPUs. As part of AMD's commitment to open innovation, we’re sharing the full model weights, training setup, codebase, and datasets to foster collaboration, transparency, and progress across the AI community.
We invite researchers, educators, and developers to explore Instella-Math, build on its foundation, and collaborate with us in shaping the next generation of open, interpretable, and high-reasoning language models.
Additional Resources
- Blog: Introducing Instella-Math: Fully Open Language Model with Reasoning Capability
- Code: https://github.com/AMD-AIG-AIMA/Instella-Math
- Models:
Please refer to the following blogs to get started with using these techniques on AMD GPUs:
- Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration
- PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm™
- Accelerating Large Language Models with Flash Attention on AMD GPUs
- Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm™
Bias, Risks, and Limitations
- The models are being released for research purposes only and are not intended for use cases that require high levels of factuality, for safety-critical situations, for health or medical applications, or for generating false information or facilitating toxic conversations.
- Model checkpoints are made accessible without any safety promises. It is crucial for users to conduct comprehensive evaluations and implement safety filtering mechanisms as per their respective use cases.
- It may be possible to prompt the model to generate content that is factually inaccurate, harmful, violent, toxic, biased, or otherwise objectionable. Such content may also be generated by prompts that were not intended to produce it. Users are thus requested to be aware of this and to exercise caution and responsible thinking when using the model.
- The multilingual abilities of the models have not been tested, so the models may misunderstand prompts and generate erroneous responses in languages other than English.
License
The Instella-Math model is licensed for academic and research purposes under a ResearchRAIL license. Refer to the LICENSE and NOTICE files for more information.
Citations
Feel free to cite our Instella models:
@misc{Instella,
    title = {Instella: Fully Open Language Models with Stellar Performance},
    url = {https://huggingface.co/amd/Instella-3B},
    author = {Jiang Liu and Jialian Wu and Xiaodong Yu and Prakamya Mishra and Sudhanshu Ranjan and Zicheng Liu and Chaitanya Manem and Yusheng Su and Pratik Prabhanjan Brahma and Gowtham Ramesh and Ximeng Sun and Ze Wang and Emad Barsoum},
    month = {March},
    year = {2025}
}