---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- Llama-3.1-Nemotron-Nano-8B-v1
---
Quantizations of https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1

### Open source inference clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [ollama](https://github.com/ollama/ollama)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [jan](https://github.com/janhq/jan)
* [GPT4All](https://github.com/nomic-ai/gpt4all)

### Closed source inference clients/UIs
* [LM Studio](https://lmstudio.ai/)
* [Backyard AI](https://backyard.ai/)
* More will be added...
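
For example, to load one of these GGUF files from Python, a minimal sketch using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) (Python bindings for the llama.cpp client listed above) might look like this. The quant filename is illustrative; substitute whichever quantization you downloaded:

```python
from llama_cpp import Llama

# Filename is illustrative: use the quant you actually downloaded from this repo
llm = Llama(
    model_path="Llama-3.1-Nemotron-Nano-8B-v1-Q4_K_M.gguf",
    n_ctx=8192,        # context window; the base model supports up to 128K
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[
        # Reasoning mode is controlled via the system prompt (see below)
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
    ],
    temperature=0.6,   # recommended sampling settings for Reasoning ON
    top_p=0.95,
)
print(out["choices"][0]["message"]["content"])
```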

---

# From original readme

Llama-3.1-Nemotron-Nano-8B-v1 is a large language model (LLM) derived from [Meta Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling.

Llama-3.1-Nemotron-Nano-8B-v1 offers a strong tradeoff between model accuracy and efficiency. It is created from Llama 3.1 8B Instruct and improves on its accuracy. The model fits on a single RTX GPU and can be used locally. It supports a context length of 128K tokens.

This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling, as well as multiple reinforcement learning (RL) stages using REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction following. The final model checkpoint is obtained by merging the final SFT and Online RPO checkpoints. Improved using Qwen.

This model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here:
[Llama-3.3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3.3-Nemotron-Super-49B-v1)

This model is ready for commercial use.

## Quick Start and Usage Recommendations:

1. Reasoning mode (ON/OFF) is controlled via the system prompt, which must be set as shown in the examples below. All instructions should be contained within the user prompt.
2. We recommend setting temperature to `0.6` and Top P to `0.95` for Reasoning ON mode.
3. We recommend using greedy decoding for Reasoning OFF mode (both decoding configurations are captured in the helper sketch after this list).
4. We have provided a list of prompts to use for evaluation for each benchmark where a specific template is required.
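
Points 2 and 3 above amount to two decoding configurations. As a small sketch (the helper name is hypothetical), they can be kept in one place and unpacked into either `transformers.pipeline` or `generate`:

```python
def generation_kwargs(thinking: str) -> dict:
    """Hypothetical helper: map reasoning mode to the recommended decoding settings."""
    if thinking == "on":
        # Reasoning ON: sample with temperature 0.6 and top_p 0.95 (point 2)
        return {"do_sample": True, "temperature": 0.6, "top_p": 0.95}
    # Reasoning OFF: greedy decoding (point 3)
    return {"do_sample": False}
```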

You can try this model out through the preview API, using this link: [Llama-3.1-Nemotron-Nano-8B-v1](https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1).
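
The build.nvidia.com preview endpoints follow an OpenAI-compatible scheme, so a sketch like the one below should work; the base URL and model slug are assumptions, check the API page linked above for the exact values:

```python
from openai import OpenAI

# Base URL and model slug are assumptions; verify them on the build.nvidia.com page
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",  # placeholder: supply your own key
)

completion = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-nano-8b-v1",
    messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(completion.choices[0].message.content)
```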

See the snippet below for usage with the Hugging Face Transformers library. Reasoning mode (ON/OFF) is controlled via the system prompt; please see the example below.
Our code requires the transformers package version to be `4.44.2` or higher.

### Example of “Reasoning On:”

```python
import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Recommended sampling settings for Reasoning ON: temperature 0.6, top_p 0.95
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=32768,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.6,
    top_p=0.95,
    **model_kwargs
)

# Thinking can be "on" or "off"; it is controlled via the system prompt
thinking = "on"

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))
```

### Example of “Reasoning Off:”

```python
import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Recommended decoding for Reasoning OFF: greedy (do_sample=False)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=32768,
    do_sample=False,
    **model_kwargs
)

# Thinking can be "on" or "off"; it is controlled via the system prompt
thinking = "off"

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))
```

For some prompts, the model may still choose to think before responding even though thinking is disabled. If desired, users can prevent this by pre-filling the assistant response with an empty `<think>\n</think>` block, as shown below.

```python
import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Thinking can be "on" or "off"; it is controlled via the system prompt
thinking = "off"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=32768,
    do_sample=False,
    **model_kwargs
)

# Pre-filling the assistant turn with an empty <think> block suppresses emergent thinking
print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}, {"role": "assistant", "content": "<think>\n</think>"}]))
```
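
When thinking is on, the model's reasoning appears inside the same `<think>...</think>` tags used for the pre-fill above, followed by the final answer. A minimal sketch for separating the two, assuming that output format:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer), assuming the
    <think>...</think> format shown in the pre-fill example above."""
    start_tag, end_tag = "<think>", "</think>"
    if start_tag in text and end_tag in text:
        before, _, rest = text.partition(start_tag)
        reasoning, _, answer = rest.partition(end_tag)
        return reasoning.strip(), (before + answer).strip()
    # No think block: treat the whole output as the answer
    return "", text.strip()

reasoning, answer = split_reasoning("<think>\nsin(x)+2 is never zero, so x = 0.\n</think>\nx = 0")
print(answer)  # -> "x = 0"
```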