Update README.md
README.md
CHANGED
@@ -25,7 +25,8 @@ datasets:
 - **Model Developers:** Red Hat (Neural Magic)

 This model is a fine-tuned version of the 2:4 sparse model [RedHatAI/Sparse-Llama-3.1-8B-2of4](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-2of4) on the trl-lib/tldr dataset.
-This sparse model
+This sparse model recovers 100% of the BERTScore (0.366) obtained by the dense model [RedHatAI/Llama-3.1-8B-tldr](https://huggingface.co/RedHatAI/Llama-3.1-8B-tldr).
+

 ## Deployment
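For reference, the BERTScore recovery claim above can be illustrated with the Hugging Face `evaluate` package. This is only a minimal sketch, not the evaluation-harness pipeline that produced the reported 0.366; the example strings and the `lang="en"` setting are placeholder assumptions.

```python
# Illustrative BERTScore computation with the Hugging Face `evaluate` package.
# The card's reported score comes from the lm-evaluation-harness task referenced
# later in the card, not from this snippet.
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["A short TL;DR generated by the model."]            # model outputs
references = ["The reference TL;DR from the trl-lib/tldr split."]  # gold summaries

scores = bertscore.compute(predictions=predictions, references=references, lang="en")
print(sum(scores["f1"]) / len(scores["f1"]))  # mean BERTScore F1 across samples
```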
@@ -33,7 +34,7 @@ This model can be deployed efficiently using [vLLM](https://docs.vllm.ai/en/late

 Run the following command to start the vLLM server:
 ```bash
-vllm serve
+vllm serve RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4
 ```

 Once your server is started, you can query the model using the OpenAI API:
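The client-side code sits outside this hunk. As a rough, self-contained sketch of the query described above, assuming the `openai` Python package and vLLM's default OpenAI-compatible endpoint at `http://localhost:8000/v1` (the `api_key` value is a placeholder), it could look like:

```python
# Sketch of querying the vLLM server started above via its OpenAI-compatible API.
# The base_url/port and dummy api_key are assumed vLLM defaults; the prompt
# template mirrors the TL;DR format used elsewhere in this card.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

post = "TITLE: ...\n\nPOST: ..."  # placeholder Reddit post
prompt = f"Give a TL;DR of the following Reddit post.\n<|user|>{post}\nTL;DR:\n<|assistant|>\n"

completion = client.completions.create(
    model="RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4",
    prompt=prompt,
    max_tokens=256,
)
print(completion.choices[0].text)  # generated TL;DR
```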
@@ -56,7 +57,7 @@ TITLE: Training sparse LLMs

 POST: Now you can use the llm-compressor integration with axolotl to train sparse LLMs!

-It's super easy to use. See the example in https://huggingface.co/
+It's super easy to use. See the example in https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4.

 And there's more. You can run 2:4 sparse models on vLLM and get significant speedups on Hopper GPUs!
 """
@@ -64,7 +65,7 @@ And there's more. You can run 2:4 sparse models on vLLM and get significant spee

 prompt = f"Give a TL;DR of the following Reddit post.\n<|user|>{post}\nTL;DR:\n<|assistant|>\n"

 completion = client.completions.create(
-    model="
+    model="RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4",
     prompt=prompt,
     max_tokens=256,
 )
@@ -214,7 +215,7 @@ The model was evaluated on the test split of trl-lib/tldr using the Neural Magic

 One can reproduce these results by using the following command:

 ```bash
-lm_eval --model vllm --model_args "pretrained=
+lm_eval --model vllm --model_args "pretrained=RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4,dtype=auto,add_bos_token=True" --batch_size auto --tasks tldr
 ```

 <table>
@@ -269,3 +270,45 @@ lm_eval --model vllm --model_args "pretrained=nm-testing/Sparse-Llama-3.1-8B-tld
 </td>
 </tr>
 </table>
+
+
+## Inference Performance
+
+We evaluated the inference performance of this model using the first 1,000 samples from the training set of the [trl-lib/tldr](https://huggingface.co/datasets/trl-lib/tldr) dataset.
+Benchmarking was conducted with [vLLM](https://docs.vllm.ai/en/latest/) version `0.9.0.1` and [GuideLLM](https://github.com/neuralmagic/guidellm) version `0.2.1`.
+
+The figure below presents the **mean end-to-end latency per request** across varying request rates.
+Results are shown for this model, as well as three variants:
+- **Dense:** [Llama-3.1-8B-tldr](https://huggingface.co/RedHatAI/Llama-3.1-8B-tldr)
+- **Dense-quantized:** [Llama-3.1-8B-tldr-FP8-dynamic](https://huggingface.co/RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic)
+- **Sparse-quantized:** [Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic)
+
+Although sparsity by itself does not significantly improve performance, combining it with quantization yields up to a 1.6x speedup.
+
+
+
+<details><summary><strong>Reproduction instructions</strong></summary>
+
+To replicate the benchmark:
+
+1. Generate a JSON file containing the first 1,000 training samples:
+```python
+from datasets import load_dataset
+ds = load_dataset("trl-lib/tldr", split="train").take(1000)
+ds.to_json("tldr_1000.json")
+```
+
+2. Start a vLLM server using your target model, e.g.:
+```bash
+vllm serve RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4
+```
+
+3. Run the benchmark with GuideLLM:
+```bash
+GUIDELLM__OPENAI__MAX_OUTPUT_TOKENS=128 guidellm benchmark --target "http://localhost:8000" --rate-type sweep --data tldr_1000.json
+```
+> The average output length is approximately 30 tokens per sample. We capped generation at 128 tokens to reduce performance skew from rare, unusually verbose completions.
+
+</details>
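As a side note on the ~30-token average mentioned in the blockquote above: one rough way to sanity-check that figure is to tokenize the reference completions of the same 1,000 samples. The `completion` column name and the use of this model's tokenizer are assumptions here, not part of the card's pipeline.

```python
# Rough, illustrative check of the ~30-token average completion length.
# Assumes trl-lib/tldr exposes a "completion" column and that this model's
# tokenizer is representative of the serving tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4")
ds = load_dataset("trl-lib/tldr", split="train").take(1000)

lengths = [len(tokenizer(sample["completion"])["input_ids"]) for sample in ds]
print(sum(lengths) / len(lengths))  # mean number of tokens per reference TL;DR
```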