duyntnet/Llama-3.1-Nemotron-Nano-8B-v1-imatrix-GGUF

+---
+license: other
+language:
+- en
+pipeline_tag: text-generation
+inference: false
+tags:
+- transformers
+- gguf
+- imatrix
+- Llama-3.1-Nemotron-Nano-8B-v1
+---
+Quantizations of https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1
+### Open source inference clients/UIs
+* [llama.cpp](https://github.com/ggerganov/llama.cpp)
+* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
+* [ollama](https://github.com/ollama/ollama)
+* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
+* [jan](https://github.com/janhq/jan)
+* [GPT4All](https://github.com/nomic-ai/gpt4all)
+### Closed source inference clients/UIs
+* [LM Studio](https://lmstudio.ai/)
+* [Backyard AI](https://backyard.ai/)
+* More will be added...
+---
+# From original readme
+Llama-3.1-Nemotron-Nano-8B-v1 is a large language model (LLM) derived from [Meta Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (AKA the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling.
+Llama-3.1-Nemotron-Nano-8B-v1 offers a great tradeoff between model accuracy and efficiency. It was created from Llama 3.1 8B Instruct and offers improvements in model accuracy. The model fits on a single RTX GPU and can be used locally. It supports a context length of 128K.
+This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and Online RPO checkpoints. Improved using Qwen.
+This model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here:
+[Llama-3.3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3.3-Nemotron-Super-49B-v1)
+This model is ready for commercial use.
+## Quick Start and Usage Recommendations:
+1. Reasoning mode (ON/OFF) is controlled via the system prompt, which must be set as shown in the examples below. All instructions should be contained within the user prompt.
+2. We recommend setting temperature to `0.6` and Top P to `0.95` for Reasoning ON mode.
+3. We recommend using greedy decoding for Reasoning OFF mode.
+4. We have provided a list of prompts to use for evaluation for each benchmark where a specific template is required.
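The recommendations above can be gathered into one small helper. This is a minimal sketch, not part of the original model card: the function name `build_request` and the layout of the returned dictionary are hypothetical. The values are the ones listed above: a system prompt of `detailed thinking on` or `detailed thinking off`, temperature `0.6` with Top P `0.95` for Reasoning ON, and greedy decoding for Reasoning OFF. The keyword names follow the generation arguments used in the Transformers examples in this README.

```python
from typing import Any, Dict


def build_request(user_prompt: str, thinking: bool) -> Dict[str, Any]:
    """Return chat messages and generation settings for one reasoning mode.

    Hypothetical helper for illustration; it is not an API of any library.
    """
    mode = "on" if thinking else "off"
    messages = [
        # The system prompt carries only the reasoning switch.
        {"role": "system", "content": f"detailed thinking {mode}"},
        # All task instructions go in the user prompt.
        {"role": "user", "content": user_prompt},
    ]
    if thinking:
        # Reasoning ON: sample with the recommended temperature and top-p.
        generation_kwargs = {"do_sample": True, "temperature": 0.6, "top_p": 0.95}
    else:
        # Reasoning OFF: greedy decoding.
        generation_kwargs = {"do_sample": False}
    return {"messages": messages, "generation_kwargs": generation_kwargs}


request = build_request("Solve x*(sin(x)+2)=0", thinking=True)
print(request["messages"][0]["content"])
print(request["generation_kwargs"])
```

Keeping the mode switch and the sampling settings in one place makes it harder to pair the Reasoning ON prompt with greedy decoding, or the reverse, by accident.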
+You can try this model out through the preview API, using this link: [Llama-3.1-Nemotron-Nano-8B-v1](https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1).
+See the snippets below for usage with the Hugging Face Transformers library. In each one, the system prompt switches reasoning mode ON or OFF.
+Our code requires the transformers package version to be `4.44.2` or higher.
+### Example of “Reasoning On”:
+```python
+import torch
+import transformers
+model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
+model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
+tokenizer.pad_token_id = tokenizer.eos_token_id
+pipeline = transformers.pipeline(
+   "text-generation",
+   model=model_id,
+   tokenizer=tokenizer,
+   max_new_tokens=32768,
+   do_sample=True,  # enable sampling explicitly so temperature and top_p take effect
+   temperature=0.6,
+   top_p=0.95,
+   **model_kwargs
+)
+# Thinking can be "on" or "off"
+thinking = "on"
+print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))
+```
+### Example of “Reasoning Off”:
+```python
+import torch
+import transformers
+model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
+model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
+tokenizer.pad_token_id = tokenizer.eos_token_id
+pipeline = transformers.pipeline(
+   "text-generation",
+   model=model_id,
+   tokenizer=tokenizer,
+   max_new_tokens=32768,
+   do_sample=False,
+   **model_kwargs
+)
+# Thinking can be "on" or "off"
+thinking = "off"
+print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))
+```
+For some prompts, the model still prefers to think before responding even though thinking is disabled. If desired, users can prevent this by pre-filling the assistant response with an empty thinking block:
+```python
+import torch
+import transformers
+model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
+model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
+tokenizer.pad_token_id = tokenizer.eos_token_id
+# Thinking can be "on" or "off"
+thinking = "off"
+pipeline = transformers.pipeline(
+   "text-generation",
+   model=model_id,
+   tokenizer=tokenizer,
+   max_new_tokens=32768,
+   do_sample=False,
+   **model_kwargs
+)
+print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}, {"role":"assistant", "content":"<think>\n</think>"}]))
+```
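The pre-filled `<think>\n</think>` in the example above suggests that the model wraps its reasoning in `<think>` tags. Assuming that output format, which this README does not state explicitly, the following minimal sketch separates the reasoning from the final answer in a generated string. The helper name `split_reasoning` is hypothetical, and the sample text is made up for illustration rather than taken from real model output.

```python
import re
from typing import Tuple

# Assumption: reasoning, when present, sits in a single <think>...</think> pair.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split generated text into (reasoning, answer).

    Returns an empty reasoning string when no <think> block is found.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything after the closing tag is treated as the final answer.
    answer = text[match.end():].strip()
    return reasoning, answer


# Illustrative string only; not actual model output.
sample = "<think>\nsin(x)+2 is never zero, so x must be 0.\n</think>\nx = 0"
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```

With an empty pre-filled block such as `<think>\n</think>`, the helper returns an empty reasoning string and leaves the answer unchanged.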