Update README.md
README.md
CHANGED
@@ -1,54 +1,147 @@
Removed (previous revision):

---
base_model: nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
language:
- en
license: cc-by-nc-4.0
library_name: transformers
tags:
- llama-cpp
- gguf-my-repo
---

# AdvRahul/Axion-Flash-Reasoning-2B-Q8_0-GGUF
This model was converted to GGUF format from [AdvRahul/Axion-Flash-Reasoning-2B](https://huggingface.co/AdvRahul/Axion-Flash-Reasoning-2B) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space. Refer to the original model card for more details on the model.

## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux)

```bash
brew install llama.cpp
```

Invoke the llama.cpp server or the CLI.

### CLI:
```bash
llama-cli --hf-repo AdvRahul/Axion-Flash-Reasoning-2B-Q8_0-GGUF --hf-file axion-flash-reasoning-2B-Q8_0.gguf -p "The meaning to life and the universe is"
```

### Server:
```bash
llama-server --hf-repo AdvRahul/Axion-Flash-Reasoning-2B-Q8_0-GGUF --hf-file axion-flash-reasoning-2B-Q8_0.gguf -c 2048
```
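Once `llama-server` is up, it exposes an OpenAI-compatible HTTP API (on port 8080 by default). A minimal sketch of querying it with Python's standard library, assuming the default host and port:

```python
import json
import urllib.request

# Chat request against the local llama-server started above
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "max_tokens": 64,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])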
Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.
```
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for NVIDIA GPUs on Linux).
```
cd llama.cpp && LLAMA_CURL=1 make
```

Step 3: Run inference through the main binary.
```
./llama-cli --hf-repo AdvRahul/Axion-Flash-Reasoning-2B-Q8_0-GGUF --hf-file axion-flash-reasoning-2B-Q8_0.gguf -p "The meaning to life and the universe is"
```
or
```
./llama-server --hf-repo AdvRahul/Axion-Flash-Reasoning-2B-Q8_0-GGUF --hf-file axion-flash-reasoning-2B-Q8_0.gguf -c 2048
```
Added (new revision):

---
license: cc-by-nc-4.0
base_model: nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
tags:
- qwen
- reasoning
- fine-tuned
- instruction-tuned
- axion
- logic
- math
- code
---
# AdvRahul/Axion-Flash-Reasoning-2B

**An optimized and instruction-tuned model for high-speed, complex reasoning tasks.** 🚀

`Axion-Flash-Reasoning-2B` is a fine-tuned version of NVIDIA's state-of-the-art `Nemotron-Research-Reasoning-Qwen-1.5B` model. This version is specifically adapted to be more instruction-friendly and computationally efficient, making it ideal for integration into applications that need powerful reasoning capabilities without the overhead of larger models.

## 🚀 Model Details

* **Model Creator:** AdvRahul
* **Base Model:** [nvidia/Nemotron-Research-Reasoning-Qwen-1.5B](https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B) (v2 checkpoint)
* **Fine-tuning Focus:** Enhanced Instruction Following & Practical Usability
* **Architecture:** Qwen 1.5
* **License:** Creative Commons Attribution-NonCommercial 4.0 International (`cc-by-nc-4.0`)
---

## 💻 How to Use

This model can be used with the `transformers` library.

### Basic Inference with `pipeline`

The easiest way to get started is with the `text-generation` pipeline.

```python
from transformers import pipeline
import torch

# For optimal performance, use a GPU
pipe = pipeline(
    "text-generation",
    model="AdvRahul/Axion-Flash-Reasoning-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Qwen models use a specific chat template; apply it with the tokenizer before generating.
messages = [
    {"role": "system", "content": "You are a helpful assistant that excels at logical reasoning."},
    {"role": "user", "content": "I have 3 apples and I buy 5 more. I then give 2 apples to my friend. How many apples do I have left?"}
]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)

print(outputs[0]["generated_text"])
```
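The pipeline output above echoes the prompt followed by the completion. If you only want the model's reply, `text-generation` pipelines accept `return_full_text=False`:

```python
# Same call as above, but return only the newly generated text
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7,
               top_k=50, top_p=0.95, return_full_text=False)
print(outputs[0]["generated_text"])
```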
### Optimized Inference (4-bit Quantization)

To achieve "flash" speed and reduce memory usage, you can load the model in 4-bit using `bitsandbytes`.

```bash
pip install transformers torch accelerate bitsandbytes
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "AdvRahul/Axion-Flash-Reasoning-2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # This enables 4-bit quantization, with bfloat16 as the compute dtype
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

messages = [
    {"role": "system", "content": "You are an expert code assistant."},
    {"role": "user", "content": "Write a Python function to calculate the factorial of a number using recursion."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
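For interactive use, you can print tokens as they are generated instead of waiting for the full completion. A minimal sketch using `transformers`' built-in `TextStreamer`, reusing `model`, `tokenizer`, and `inputs` from the snippet above:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=150, streamer=streamer)
```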
## 📝 Model Description

### Fine-Tuning Philosophy

While the base Nemotron-Research-Reasoning model demonstrates world-class capabilities in formal reasoning (math, code, logic), Axion-Flash has been further instruction-tuned to make these powerful abilities more accessible and practical for real-world applications. The goal is to bridge the gap between a pure research model and a deployable, instruction-following assistant that developers can easily integrate into their products.

This fine-tuning enhances the model's ability to understand and follow user instructions in a conversational format, unlocking its reasoning power for a broader range of tasks.

### Key Capabilities

* **Complex Reasoning:** Inherits the base model's strength in solving logic puzzles, scientific questions, and multi-step problems.
* **Code Generation:** Proficient in generating code for various programming challenges and tasks.
* **Mathematical Prowess:** Excels at solving mathematical problems, from basic arithmetic to more complex Olympiad-level questions.
* **Enhanced Instruction Following:** Fine-tuned to better adhere to user instructions and constraints in a chat-like setting (see the short example below).
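As a quick illustration of the last point, a hypothetical smoke test that reuses `pipe` from the quick-start example and imposes an explicit formatting constraint (the prompt and expected reply are illustrative, not from the model card):

```python
messages = [
    {"role": "user", "content": "List three prime numbers greater than 50. Reply with the numbers only, comma-separated."}
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# A well-tuned instruction follower should reply with something like "53, 59, 61"
print(pipe(prompt, max_new_tokens=32, do_sample=False, return_full_text=False)[0]["generated_text"])
```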
## ℹ️ Base Model Information (Nemotron-Research-Reasoning-Qwen-1.5B)

<details>
<summary>Click to expand details on the powerful base model</summary>

Nemotron-Research-Reasoning-Qwen-1.5B is a leading open-weight model for complex reasoning, trained by NVIDIA using the ProRL (Prolonged Reinforcement Learning) algorithm. This advanced training method enables the model to explore reasoning strategies more deeply, leading to significant performance gains.

The base model was trained on a diverse set of datasets, including:

* DeepScaleR-Preview-Dataset
* Eurus-2-RL-Data
* Reasoning-gym
* IFEval
* SCP-116K

It sets a new state-of-the-art standard for models in its size class, outperforming competitors by a large margin on benchmarks for math, coding, logic puzzles, and STEM reasoning. For detailed performance metrics, please refer to the original model card.

</details>
## ⚖️ License and Terms of Use

This model is released under the `cc-by-nc-4.0` license, inheriting the license of its base model. This means it is available for research and non-commercial use only. Please review the license terms before using this model in your projects.

### Citing the Base Model's Research

If you find the underlying methods of this model useful in your research, please cite the ProRL paper:

```bibtex
@article{liu2025prorl,
  author        = {Mingjie Liu and Shizhe Diao and Ximing Lu and Jian Hu and Xin Dong and Yejin Choi and Jan Kautz and Yi Dong},
  title         = {ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models},
  journal       = {arXiv preprint arXiv:2505.24864},
  year          = {2025},
  archivePrefix = {arXiv},
  eprint        = {2505.24864},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2505.24864}
}
```