Update README.md
README.md
CHANGED
@@ -25,7 +25,8 @@ datasets:
 - **Model Developers:** Red Hat (Neural Magic)

 This model is a fine-tuned version of the 2:4 sparse model [RedHatAI/Sparse-Llama-3.1-8B-2of4](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-2of4) on the trl-lib/tldr dataset.
-This sparse model
+This sparse model recovers 100% of the BERTScore (0.366) obtained by the dense model [RedHatAI/Llama-3.1-8B-tldr](https://huggingface.co/RedHatAI/Llama-3.1-8B-tldr).
+

 ## Deployment
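For reference, the BERTScore recovery claim above can be illustrated with the Hugging Face `evaluate` package. This is only a minimal sketch, not the evaluation-harness pipeline that produced the reported 0.366; the example strings and the `lang="en"` setting are placeholder assumptions.

```python
# Illustrative BERTScore computation with the Hugging Face `evaluate` package.
# The card's reported score comes from the lm-evaluation-harness task referenced
# later in the card, not from this snippet.
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["A short TL;DR generated by the model."]            # model outputs
references = ["The reference TL;DR from the trl-lib/tldr split."]  # gold summaries

scores = bertscore.compute(predictions=predictions, references=references, lang="en")
print(sum(scores["f1"]) / len(scores["f1"]))  # mean BERTScore F1 across samples
```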
@@ -33,7 +34,7 @@ This model can be deployed efficiently using [vLLM](https://docs.vllm.ai/en/late

 Run the following command to start the vLLM server:
 ```bash
-vllm serve
+vllm serve RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4
 ```

 Once your server is started, you can query the model using the OpenAI API:
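The client-side code sits outside this hunk. As a rough, self-contained sketch of the query described above, assuming the `openai` Python package and vLLM's default OpenAI-compatible endpoint at `http://localhost:8000/v1` (the `api_key` value is a placeholder), it could look like:

```python
# Sketch of querying the vLLM server started above via its OpenAI-compatible API.
# The base_url/port and dummy api_key are assumed vLLM defaults; the prompt
# template mirrors the TL;DR format used elsewhere in this card.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

post = "TITLE: ...\n\nPOST: ..."  # placeholder Reddit post
prompt = f"Give a TL;DR of the following Reddit post.\n<|user|>{post}\nTL;DR:\n<|assistant|>\n"

completion = client.completions.create(
    model="RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4",
    prompt=prompt,
    max_tokens=256,
)
print(completion.choices[0].text)  # generated TL;DR
```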
@@ -56,7 +57,7 @@ TITLE: Training sparse LLMs

 POST: Now you can use the llm-compressor integration with axolotl to train sparse LLMs!

-It's super easy to use. See the example in https://huggingface.co/
+It's super easy to use. See the example in https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4.

 And there's more. You can run 2:4 sparse models on vLLM and get significant speedups on Hopper GPUs!
 """
@@ -64,7 +65,7 @@ And there's more. You can run 2:4 sparse models on vLLM and get significant spee

 prompt = f"Give a TL;DR of the following Reddit post.\n<|user|>{post}\nTL;DR:\n<|assistant|>\n"

 completion = client.completions.create(
-    model="
+    model="RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4",
     prompt=prompt,
     max_tokens=256,
 )
@@ -214,7 +215,7 @@ The model was evaluated on the test split of trl-lib/tldr using the Neural Magic

 One can reproduce these results by using the following command:

 ```bash
-lm_eval --model vllm --model_args "pretrained=
+lm_eval --model vllm --model_args "pretrained=RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4,dtype=auto,add_bos_token=True" --batch_size auto --tasks tldr
 ```

 <table>
@@ -269,3 +270,45 @@ lm_eval --model vllm --model_args "pretrained=nm-testing/Sparse-Llama-3.1-8B-tld
 </td>
 </tr>
 </table>
+
+
+## Inference Performance
+
+We evaluated the inference performance of this model using the first 1,000 samples from the training set of the [trl-lib/tldr](https://huggingface.co/datasets/trl-lib/tldr) dataset.
+Benchmarking was conducted with [vLLM](https://docs.vllm.ai/en/latest/) version `0.9.0.1` and [GuideLLM](https://github.com/neuralmagic/guidellm) version `0.2.1`.
+
+The figure below presents the **mean end-to-end latency per request** across varying request rates.
+Results are shown for this model, as well as three variants:
+- **Dense:** [Llama-3.1-8B-tldr](https://huggingface.co/RedHatAI/Llama-3.1-8B-tldr)
+- **Dense-quantized:** [Llama-3.1-8B-tldr-FP8-dynamic](https://huggingface.co/RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic)
+- **Sparse-quantized:** [Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic)
+
+Although sparsity by itself does not significantly improve performance, combining it with quantization yields up to a 1.6x speedup.
+
+
+
+<details><summary><strong>Reproduction instructions</strong></summary>
+
+To replicate the benchmark:
+
+1. Generate a JSON file containing the first 1,000 training samples:
+```python
+from datasets import load_dataset
+ds = load_dataset("trl-lib/tldr", split="train").take(1000)
+ds.to_json("tldr_1000.json")
+```
+
+2. Start a vLLM server using your target model, e.g.:
+```bash
+vllm serve RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4
+```
+
+3. Run the benchmark with GuideLLM:
+```bash
+GUIDELLM__OPENAI__MAX_OUTPUT_TOKENS=128 guidellm benchmark --target "http://localhost:8000" --rate-type sweep --data tldr_1000.json
+```
+> The average output length is approximately 30 tokens per sample. We capped generation at 128 tokens to reduce performance skew from rare, unusually verbose completions.
+
+</details>
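As a side note on the ~30-token average mentioned in the blockquote above: one rough way to sanity-check that figure is to tokenize the reference completions of the same 1,000 samples. The `completion` column name and the use of this model's tokenizer are assumptions here, not part of the card's pipeline.

```python
# Rough, illustrative check of the ~30-token average completion length.
# Assumes trl-lib/tldr exposes a "completion" column and that this model's
# tokenizer is representative of the serving tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4")
ds = load_dataset("trl-lib/tldr", split="train").take(1000)

lengths = [len(tokenizer(sample["completion"])["input_ids"]) for sample in ds]
print(sum(lengths) / len(lengths))  # mean number of tokens per reference TL;DR
```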