---
tags:
- gguf
---
# TG Benchmarks on OnePlus 13

There is a discrepancy between Qualcomm's reported SOTA speed of 18 t/s for Llama 2 (3.5 GB) and the CPU versions benchmarked below: https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized

TODO:

- benchmark QNN Llama 2 locally
- benchmark T-MAC with group size 128 if needed
- test OpenCL and the available QNN pull requests; assess feasibility of speculative decoding alongside CPU inference

## Model Benchmarks

All results are token-generation (TG) speed in tokens per second (t/s).

### Llama 2

| Quantization  | Benchmark 1 (200 tokens) | Benchmark 2 (50 tokens) |
|---------------|--------------------------|-------------------------|
| Q4_0 (Pure)   | 12.76                    | 13.22                   |
| Q4_0 (Normal) | 12.54                    | 13.03                   |

**Test Command:**

```bash
-p hi -t 6 -s 42 -c 512 -n (200,50) -m llama2
```

(`-n` was 200 for Benchmark 1 and 50 for Benchmark 2.)
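The flags above are llama.cpp CLI flags, with `-n (200,50)` denoting one run per token count. A minimal dry-run sketch of the two invocations (the `llama-cli` binary name is an assumption about the build; the `-m llama2` model path is kept exactly as written above):

```shell
#!/bin/sh
# Dry-run sketch of the two benchmark invocations above.
# Assumptions: a llama.cpp build exposing `llama-cli`; the model path
# `llama2` is kept as written in this README. Remove `echo` to actually run.
for n in 200 50; do
  echo ./llama-cli -p hi -t 6 -s 42 -c 512 -n "$n" -m llama2
done
```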

## Reka-Flash 21B Benchmarks, Q4_0 (Normal)

| Test Configuration | Tokens | Speed (t/s) |
|--------------------|--------|-------------|
| Benchmark 1        | 200    | 4.46        |
| Benchmark 2        | 50     | 4.45        |

------------

## Intermediate Layer Sizes

| Model Architecture | Intermediate Size |
|--------------------|-------------------|
| Llama 2 7B         | 11,008            |
| Llama 3.2 3B       | 8,192             |
| Llama 3 8B         | 14,336            |
| Qwen 2.5 7B        | 18,944            |
| Qwen 2.5 14B       | 13,824            |
| QwQ                | 27,648            |
| Reka-Flash 21B     | 19,648            |
| Mistral 2503       | 32,768            |
| Codestral 22B      | 16,384            |