nm-research committed · verified
Commit 3e9d983 · Parent(s): 9043b00

Update README.md

Files changed (1): README.md (+42 −43)
README.md CHANGED
@@ -11,7 +11,7 @@ library_name: transformers
 # DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic
 
 ## Model Overview
-- **Model Architecture:** DeepSeek-R1-Distill-Llama-8B
+- **Model Architecture:** LlamaForCausalLM
 - **Input:** Text
 - **Output:** Text
 - **Model Optimizations:**
@@ -25,12 +25,15 @@ Quantized version of [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deeps
 
 ### Model Optimizations
 
-This model was obtained by quantizing the weights and activations to FP8 data type, ready for inference with vLLM.
-This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized.
-
-## Deployment
-
-### Use with vLLM
-
+This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) to FP8 data type.
+This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
+
+Only the weights and activations of the linear operators within transformer blocks are quantized.
+Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme.
+[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
+
+
+## Use with vLLM
+
 This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 
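A note on the scheme the new text describes: "dynamic" means the activation scales are not calibrated ahead of time but computed per token at inference. Below is a minimal numeric sketch of that arithmetic, illustrative only and not the llm-compressor implementation; it assumes PyTorch's `float8_e4m3fn` type, whose maximum representable value is 448.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_per_token(x: torch.Tensor):
    # One scale per token (row), computed on the fly at inference time;
    # this is what makes the scheme "dynamic". Clamp guards against
    # all-zero rows.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scale).to(torch.float8_e4m3fn)
    return q, scale

x = torch.randn(4, 16)                 # [tokens, hidden] activations
q, scale = quantize_per_token(x)
dequant = q.to(torch.float32) * scale  # approximate reconstruction
print((x - dequant).abs().max())       # small quantization error
```

Weights get the analogous treatment offline, with one static scale per output channel.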
@@ -38,11 +41,12 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
 
-max_model_len, tp_size = 4096, 1
+number_gpus = 1
 model_name = "neuralmagic/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic"
+
 tokenizer = AutoTokenizer.from_pretrained(model_name)
-llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
 sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
+llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)
 
 messages_list = [
     [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
@@ -64,44 +68,40 @@ This model was created with [llm-compressor](https://github.com/vllm-project/llm
 
 
 ```python
-import argparse
 from transformers import AutoModelForCausalLM, AutoTokenizer
 from llmcompressor.modifiers.quantization import QuantizationModifier
 from llmcompressor.transformers import oneshot
 import os
 
-def main():
-    parser = argparse.ArgumentParser(description='Quantize a transformer model to FP8')
-    parser.add_argument('--model_id', type=str, required=True,
-                        help='The model ID from HuggingFace (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")')
-    parser.add_argument('--save_path', type=str, default='.',
-                        help='Custom path to save the quantized model. If not provided, will use model_name-FP8-dynamic')
-    args = parser.parse_args()
-
-    # Load model
-    model = AutoModelForCausalLM.from_pretrained(
-        args.model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True,
-    )
-    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
-
-    # Configure the quantization algorithm and scheme
-    recipe = QuantizationModifier(
-        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
-    )
-
-    # Apply quantization
-    oneshot(model=model, recipe=recipe)
-
-    save_path = os.path.join(args.save_path, args.model_id.split("/")[1] + "-FP8-dynamic")
-    os.makedirs(save_path, exist_ok=True)
-
-    # Save to disk in compressed-tensors format
-    model.save_pretrained(save_path)
-    tokenizer.save_pretrained(save_path)
-    print(f"Model and tokenizer saved to: {save_path}")
-
-if __name__ == "__main__":
-    main()
+# Load model
+model_stub = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
+model_name = model_stub.split("/")[-1]
+
+model = AutoModelForCausalLM.from_pretrained(
+    model_stub,
+    torch_dtype="auto",
+)
+
+tokenizer = AutoTokenizer.from_pretrained(model_stub)
+
+# Configure the quantization algorithm and scheme
+recipe = QuantizationModifier(
+    targets="Linear",
+    scheme="FP8_DYNAMIC",
+    ignore=["lm_head"],
+)
+
+# Apply quantization
+oneshot(
+    model=model,
+    recipe=recipe,
+)
+
+# Save to disk in compressed-tensors format
+save_path = model_name + "-FP8-dynamic"
+model.save_pretrained(save_path)
+tokenizer.save_pretrained(save_path)
+print(f"Model and tokenizer saved to: {save_path}")
 ```
 
 ## Evaluation
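After running the creation script in this hunk, one way to confirm the checkpoint was serialized in compressed-tensors form is to inspect its config. This is a sketch added for illustration, not part of the card; the exact contents of `quantization_config` depend on the llm-compressor and transformers versions.

```python
from transformers import AutoConfig

# Load the config of the checkpoint saved above and print the
# quantization metadata that llm-compressor serialized into config.json.
config = AutoConfig.from_pretrained("DeepSeek-R1-Distill-Llama-8B-FP8-dynamic")
print(config.quantization_config)  # should describe the FP8_DYNAMIC scheme
```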
@@ -112,7 +112,7 @@ OpenLLM Leaderboard V1:
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic",dtype=auto,max_model_len=4096,enable_chunked_prefill=True \
   --tasks openllm \
   --write_out \
   --batch_size auto \
@@ -124,7 +124,7 @@ OpenLLM Leaderboard V2:
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic",dtype=auto,max_model_len=4096,enable_chunked_prefill=True \
   --apply_chat_template \
   --fewshot_as_multiturn \
   --tasks leaderboard \
@@ -132,7 +132,6 @@ lm_eval \
   --batch_size auto \
   --output_path output_dir \
   --show_config
-
 ```
 
 ### Accuracy
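The `lm_eval` commands above can also be driven from Python through the harness's `simple_evaluate` entry point. A minimal sketch mirroring the Leaderboard V1 invocation follows; flag-for-flag parity with the CLI is not guaranteed.

```python
import lm_eval

# Programmatic equivalent of the V1 command above: vLLM backend,
# same model_args string, the "openllm" task group, auto batching.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/DeepSeek-R1-Distill-Llama-8B-FP8-Dynamic,dtype=auto,max_model_len=4096,enable_chunked_prefill=True",
    tasks=["openllm"],
    batch_size="auto",
)
print(results["results"])
```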