Supercharge Edge AI with High Accuracy Reasoning Using Llama Nemotron Nano 4B
Reasoning models that generate “thinking” tokens have been shown to be essential for a variety of agentic AI applications. This year, we’ve seen remarkable progress in this space with smaller, yet higher accuracy models. There's a growing demand for models that are not only capable but also lightweight—Reasoning Small Language Models (SLMs) that can operate efficiently under memory and compute constraints.
We are excited to introduce Llama 3.1 Nemotron Nano 4B v1.1, the latest and most compact member of the Llama Nemotron family. Designed for accuracy and efficiency, Nano 4B v1.1 sets a new standard for lightweight small language models. Despite its smaller size, it delivers higher accuracy and up to 50% greater throughput than other state-of-the-art open models with up to 8 billion parameters, making it ideal for a wide range of applications where speed and accuracy matter.
🚀Advancing Reasoning SLMs for On-device Agentic AI
Nano 4B is built with controllable System 1 and System 2 reasoning capabilities, enabling it to perform both reasoning and non-reasoning tasks more effectively. It’s optimized for low-cost inference, making it a compelling choice for developers and enterprises looking to scale AI with constrained resources.
With just 4 billion parameters, Nano 4B v1.1 is compact enough to be deployed at the edge on NVIDIA Jetson and NVIDIA RTX GPUs. This allows for faster response times, enhanced data privacy, and greater deployment flexibility—all while keeping inference costs low. Whether you're building AI-powered applications in the cloud or at the edge, the Nano model delivers robust performance in a remarkably efficient footprint.
As shown in the figure above, Llama 3.1 Nemotron Nano 4B v1.1 outperforms other leading open models with 8B or fewer parameters across most reasoning, coding, and math benchmarks, demonstrating exceptional accuracy in quantitative, competitive, and logical reasoning tasks. Check out this link for live benchmarks.
Thanks to its smaller size, Nano 4B v1.1 also delivers up to 1.5x higher throughput than popular open-source 8B models.
🧪 Training Recipe for Llama 3.1 Nemotron Nano 4B v1.1
Llama 3.1 Nemotron Nano 4B v1.1 is derived from Llama 3.1 8B through a multi-stage compression, reasoning, and alignment training process.
Nano 4B v1.1 is a fine-tuned version of Llama 3.1 Minitron 4B Width Base, which was created from Llama 3.1 8B Base using NVIDIA’s Minitron LLM compression technique. Details of Llama 3.1 Minitron 4B Width Base, including its efficiency improvements, can be found in this blog.
The model then went through supervised fine-tuning (SFT) stages on mixed reasoning-on/off datasets spanning math, coding, science, safety, and tool calling. Some of these datasets have recently been open-sourced as part of the Llama-Nemotron-Post-Training-Dataset.
For SFT, the dataset was sequence packed using the NeMo-Skills library, which provides up to a 10x training speedup by concatenating shorter samples into a single packed sequence, eliminating sequence padding.
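Conceptually, sequence packing works like the minimal sketch below. This is an illustrative greedy packer, not the NeMo-Skills implementation; the function and variable names are hypothetical.

# Illustrative sketch of greedy sequence packing (hypothetical helper, not the NeMo-Skills API).
# Short tokenized samples are concatenated into packed sequences up to max_seq_len,
# so few padding tokens are wasted during training.
from typing import List

def pack_sequences(samples: List[List[int]], max_seq_len: int = 4096) -> List[List[int]]:
    packs, current = [], []
    for tokens in sorted(samples, key=len, reverse=True):
        if len(tokens) > max_seq_len:
            tokens = tokens[:max_seq_len]      # truncate overly long samples
        if len(current) + len(tokens) <= max_seq_len:
            current.extend(tokens)             # fits: append to the current pack
        else:
            packs.append(current)              # pack is full: start a new one
            current = list(tokens)
    if current:
        packs.append(current)
    return packs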
The model was further fine-tuned with the NeMo-Aligner library through multiple RPO (reward-aware preference optimization) stages, in which on-policy data was used for more stable and effective training. For long-context improvement, RoPE scaling was applied, followed by a lightweight RPO round to recover reasoning accuracy.
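As a rough illustration of the idea behind RoPE scaling (a generic position-interpolation sketch, not the exact recipe used for Nano 4B v1.1), the rotary angles can be computed on positions rescaled by a context-extension factor:

# Generic position-interpolation sketch of RoPE scaling (illustrative only; not the
# exact configuration used for Nano 4B v1.1). Positions are divided by `scale` so that
# a longer context window maps back into the position range seen during training.
import torch

def scaled_rope_angles(positions: torch.Tensor, head_dim: int,
                       base: float = 10000.0, scale: float = 4.0) -> torch.Tensor:
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions.float() / scale, inv_freq)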
For more details on the Llama-Nemotron family, see the Llama-Nemotron technical report and the tech blog, which cover the training recipe and evaluation results for the earlier Llama 3.1 Nemotron Nano 8B v1 model.
💻 How to Use Llama 3.1 Nemotron Nano 4B v1.1
Below are Hugging Face code examples for the reasoning-on and reasoning-off use cases. To enable reasoning, add the phrase “detailed thinking on” at the beginning of the system prompt. You can also append additional content to the system prompt (such as tool specifications); it is recommended to insert \n\n to separate the sections. The recommended generation configuration for “reasoning on” mode is temperature=0.6 and top_p=0.95. We strongly recommend avoiding temperature=0 for this model in “reasoning on” mode.
import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,  # enable sampling so temperature/top_p take effect
    **model_kwargs
)

# Thinking can be "on" or "off"
thinking = "on"

# Users can optionally add an additional system prompt
system_prompt = "You're a helpful assistant."

print(pipeline([
    {"role": "system", "content": f"detailed thinking {thinking}\n\n{system_prompt}"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
]))
To use the reasoning-off mode, simply use “detailed thinking off” in the system prompt. Note that for “reasoning off” mode, the recommended generation configuration is temperature=0 (greedy decoding), and max_new_tokens can be set lower (e.g., 8192) than in “detailed thinking on” mode.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=8192,
    temperature=0.0,
    do_sample=False,  # greedy decoding for "reasoning off" mode
    **model_kwargs
)

# Thinking can be "on" or "off"
thinking = "off"

# Users can optionally add an additional system prompt
system_prompt = "You're a helpful assistant."

print(pipeline([
    {"role": "system", "content": f"detailed thinking {thinking}\n\n{system_prompt}"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
]))
💻 Function Calling with Llama 3.1 Nemotron Nano 4B v1.1
You can also launch a vLLM server with function calling support for this model. The chat template Jinja file (llama_nemotron_nano_generic_tool_calling.jinja) and the tool-call parser (llama_nemotron_nano_toolcall_parser.py) required for native function calling with vLLM are hosted in the HF repo. The example below simply clones the repo to use the downloaded parser and the function-calling chat template. You must use vllm/vllm-openai:v0.6.6 or newer for this model to be supported.
To try the function calling example, download the Jupyter notebook and watch this tutorial on YouTube.
#!/bin/bash
CWD=$(pwd)
PORT=5000

git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1

docker run -it --rm \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -p ${PORT}:${PORT} \
    -v ${CWD}:${CWD} \
    vllm/vllm-openai:v0.6.6 \
    --model $CWD/Llama-3.1-Nemotron-Nano-4B-v1.1 \
    --trust-remote-code \
    --seed 1 \
    --host "0.0.0.0" \
    --port $PORT \
    --served-model-name "Llama-Nemotron-Nano-4B-v1.1" \
    --tensor-parallel-size 1 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --enable-auto-tool-choice \
    --tool-parser-plugin "${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py" \
    --tool-call-parser "llama_nemotron_json" \
    --chat-template "${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja"
After launching the vLLM server, you can call the model with function calling through an OpenAI client. For example:
from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:5000/v1",
    api_key="dummy",
)

completion = client.chat.completions.create(
    model="Llama-Nemotron-Nano-4B-v1.1",
    messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "My bill is $100. What will be the amount for 18% tip?"},
    ],
    tools=[
        {"type": "function", "function": {"name": "calculate_tip", "parameters": {"type": "object", "properties": {"bill_total": {"type": "integer", "description": "The total amount of the bill"}, "tip_percentage": {"type": "integer", "description": "The percentage of tip to be applied"}}, "required": ["bill_total", "tip_percentage"]}}},
        {"type": "function", "function": {"name": "convert_currency", "parameters": {"type": "object", "properties": {"amount": {"type": "integer", "description": "The amount to be converted"}, "from_currency": {"type": "string", "description": "The currency code to convert from"}, "to_currency": {"type": "string", "description": "The currency code to convert to"}}, "required": ["from_currency", "amount", "to_currency"]}}},
    ],
)
completion.choices[0].message.content
# '<think>\nOkay, let\'s see. The user has a bill of $100 and wants to know the amount of a 18% tip. So, I need to calculate the tip amount. The available tools include calculate_tip, which requires bill_total and tip_percentage. The parameters are both integers. The bill_total is 100, and the tip percentage is 18. So, the function should multiply 100 by 18% and return 18.0. But wait, maybe the user wants the total including the tip? The question says "the amount for 18% tip," which could be interpreted as the tip amount itself. Since the function is called calculate_tip, it\'s likely that it\'s designed to compute the tip, not the total. So, using calculate_tip with bill_total=100 and tip_percentage=18 should give the correct result. The other function, convert_currency, isn\'t relevant here. So, I should call calculate_tip with those values.\n</think>\n\n'
completion.choices[0].message.tool_calls
# [ChatCompletionMessageToolCall(id='chatcmpl-tool-2972d86817344edc9c1e0f9cd398e999', function=Function(arguments='{"bill_total": 100, "tip_percentage": 18}', name='calculate_tip'), type='function')]
🚀 Try It Now!
To get started, check out the Llama Nemotron Nano 4B v1.1 Hugging Face repository and download the model checkpoints.
NVIDIA has also packaged Llama Nemotron Nano as an NVIDIA NIM inference microservice, optimized for high throughput and low latency. NVIDIA NIM delivers seamless, scalable AI inferencing, on-premises or in the cloud, leveraging industry-standard APIs.
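Because NIM exposes OpenAI-compatible endpoints, querying the microservice looks much like the vLLM example above. The sketch below assumes a locally deployed NIM listening on port 8000 and a served model name of nvidia/llama-3.1-nemotron-nano-4b-v1.1; adjust both to match your deployment.

# Minimal sketch of calling a locally deployed NIM through its OpenAI-compatible API.
# The base_url, port, and model name are assumptions; match them to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")

completion = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-nano-4b-v1.1",
    messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(completion.choices[0].message.content)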
Try the Llama Nemotron Nano 4B v1.1 NIM here.