---
tags:
- gguf
---
# TG Benchmarks on OnePlus 13

There is a discrepancy between Qualcomm's reported SOTA speed of 18 t/s for Llama 2 (3.5 GB) and the CPU versions benchmarked below: https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized

TODO:

- benchmark QNN Llama 2 locally
- benchmark T-MAC with group size 128 if needed
- test OpenCL and the available QNN pull requests; assess feasibility of speculative decoding alongside CPU inference

## Model Benchmarks

All results are token-generation (TG) speed in tokens per second (t/s).

### Llama 2

| Quantization  | Benchmark 1 (200 tokens) | Benchmark 2 (50 tokens) |
|---------------|--------------------------|-------------------------|
| Q4_0 (Pure)   | 12.76                    | 13.22                   |
| Q4_0 (Normal) | 12.54                    | 13.03                   |

**Test Command:**

```bash
-p hi -t 6 -s 42 -c 512 -n (200,50) -m llama2
```

(`-n` was 200 for Benchmark 1 and 50 for Benchmark 2.)
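The flags above are llama.cpp CLI flags, with `-n (200,50)` denoting one run per token count. A minimal dry-run sketch of the two invocations (the `llama-cli` binary name is an assumption about the build; the `-m llama2` model path is kept exactly as written above):

```shell
#!/bin/sh
# Dry-run sketch of the two benchmark invocations above.
# Assumptions: a llama.cpp build exposing `llama-cli`; the model path
# `llama2` is kept as written in this README. Remove `echo` to actually run.
for n in 200 50; do
  echo ./llama-cli -p hi -t 6 -s 42 -c 512 -n "$n" -m llama2
done
```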

## Reka-Flash 21B Benchmarks, Q4_0 (Normal)

| Test Configuration | Tokens | Speed (t/s) |
|--------------------|--------|-------------|
| Benchmark 1        | 200    | 4.46        |
| Benchmark 2        | 50     | 4.45        |

------------

## Intermediate Layer Sizes

| Model Architecture | Intermediate Size |
|--------------------|-------------------|
| Llama 2 7B         | 11,008            |
| Llama 3.2 3B       | 8,192             |
| Llama 3 8B         | 14,336            |
| Qwen 2.5 7B        | 18,944            |
| Qwen 2.5 14B       | 13,824            |
| QwQ                | 27,648            |
| Reka-Flash 21B     | 19,648            |
| Mistral 2503       | 32,768            |
| Codestral 22B      | 16,384            |