---
tags:
- gguf
---
# TG Benchmarks on OnePlus 13

There is a discrepancy between Qualcomm's SOTA 18 t/s Llama 2 (3.5 GB) speed and the CPU versions: https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized

TODO:

- [x] benchmark QNN Llama 2 locally
- [ ] benchmark T-MAC with group size 128 if needed
- [ ] test OpenCL and the available QNN pull requests; check feasibility of speculative decoding alongside CPU inference
- [ ] overclock RAM with a Magisk module
- [ ] potentially check quantization standards: MLC (before the regression), ExecuTorch, QNN


## Model Benchmarks

### Llama 2
| Quantization     | Benchmark 1 (200 tokens, t/s) | Benchmark 2 (50 tokens, t/s) |
|------------------|-------------------------------|------------------------------|
| Q4_0 (Pure)      | 12.76                         | 13.22                        |
| Q4_0 (Normal)    | 12.54                         | 13.03                        |

**Test Command** (model path is a placeholder; `-n 200` for Benchmark 1, `-n 50` for Benchmark 2):
```bash
llama-cli -m llama2.Q4_0.gguf -p "hi" -t 6 -s 42 -c 512 -n 200
```

### Llama 3
| Quantization     | Benchmark 1 (200 tokens, t/s) | Benchmark 2 (50 tokens, t/s) |
|------------------|-------------------------------|------------------------------|
| Q4_0 (Pure)      | 11.54                         | 11.91                        |


## Reka-Flash 21B Benchmarks Q4_0 (Normal)

| Test Configuration | Tokens | Result (t/s) |
|--------------------|--------|--------------|
| Benchmark 1        | 200    | 4.46         |
| Benchmark 2        | 50     | 4.45         |

------------
## Intermediate Sizes
| Model Architecture | Intermediate Size |
|--------------------|-------------------|
| Llama 2 7B         | 11,008            |
| Llama 3.2 3B       | 8,192             |
| Llama 3 8B         | 14,336            |
| Qwen 2.5 7B        | 18,944            |
| Qwen 2.5 14B       | 13,824            |
| QwQ                | 27,648            |
| Reka-Flash 21B     | 19,648            |
| Mistral 2503       | 32,768            |
| Codestral 22B      | 16,384            |
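
These values can be read directly from each model's Hugging Face config rather than collected by hand. A minimal sketch using `transformers.AutoConfig`; the repo ids below are assumptions, substitute the exact checkpoints used:

```python
from transformers import AutoConfig

# Hypothetical repo ids; substitute the checkpoints actually used above.
repos = {
    "Llama 2 7B": "meta-llama/Llama-2-7b-hf",
    "Qwen 2.5 7B": "Qwen/Qwen2.5-7B",
}

for name, repo in repos.items():
    cfg = AutoConfig.from_pretrained(repo)
    # intermediate_size is the FFN hidden dimension shown in the table above.
    print(f"{name}: intermediate_size = {cfg.intermediate_size:,}")
```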

------------

## llama.cpp Q4_K_M scheme vs. T-MAC inference (group size 128?) on x86

| Model                   | Size    | Params | Backend | Threads | Test   | t/s (tokens/sec)     |
|-------------------------|---------|--------|---------|---------|--------|----------------------|
| qwen2 3B Q4_K - Medium  | 1.95 GiB| 3.40 B | CPU     | 4       | pp512  | 67.33 ± 0.10         |
| qwen2 3B Q4_K - Medium  | 1.95 GiB| 3.40 B | CPU     | 4       | tg128  | 22.72 ± 0.04         |
| qwen2 ?B INT_N Q4_K     | 1.70 GiB| 3.40 B | CPU     | 4       | pp512  | 59.66 ± 0.10         |
| qwen2 ?B INT_N Q4_K     | 1.70 GiB| 3.40 B | CPU     | 4       | tg128  | 26.43 ± 0.14         |

**INT_N is not an equivalent quantization, so this is not a fair comparison: it is 16.3% faster at tg128 and about 13% smaller in this scenario.**
- [Issue Link](https://github.com/microsoft/T-MAC/issues/79)
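
The speed and size deltas quoted above follow directly from the table; a quick check using the numbers copied from the rows:

```python
# Values copied from the benchmark table above.
q4km_tg, intn_tg = 22.72, 26.43      # tg128 tokens/sec
q4km_size, intn_size = 1.95, 1.70    # file size in GiB

speedup = (intn_tg / q4km_tg - 1) * 100       # ~16.3% faster token generation
shrink = (1 - intn_size / q4km_size) * 100    # ~12.8% smaller file

print(f"INT_N is {speedup:.1f}% faster and {shrink:.1f}% smaller than Q4_K_M")
```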

AutoGPTQ is used, and by default it uses a group size of 128, which gives fewer bits per weight and a smaller file than llama.cpp's Q4_K_M: https://qwen.readthedocs.io/en/latest/quantization/gptq.html
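
The bits-per-weight gap can be estimated from the file sizes and parameter count in the table; a rough calculation, under the simplifying assumption that essentially all bytes are weight data:

```python
# Sizes and parameter count taken from the table above.
params = 3.40e9                    # qwen2 3B (3.40 B parameters)
q4km_bytes = 1.95 * 1024**3        # Q4_K_M file size
intn_bytes = 1.70 * 1024**3        # INT_N (GPTQ-style, group size 128) file size

bpw = lambda n_bytes: n_bytes * 8 / params
print(f"Q4_K_M ~ {bpw(q4km_bytes):.2f} bpw")   # roughly 4.9 bpw
print(f"INT_N  ~ {bpw(intn_bytes):.2f} bpw")   # roughly 4.3 bpw
```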

- The K-quant series isn't optimized for speed; it is meant for quality.
- Q4_0 uses hardware-accelerated dot-product instructions, quantizing the intermediate activations on the fly to match the quantized weights (see the sketch below).
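
To make that mechanism concrete, here is a minimal numpy sketch of the idea behind Q4_0-style inference: weights stored in blocks of 32 with one scale each, activations quantized to int8 on the fly, and the dot product done on integers. This is an illustrative approximation, not llama.cpp's actual Q4_0 format or kernel.

```python
import numpy as np

BLOCK = 32  # Q4_0 stores weights in blocks of 32 with one scale per block

def quantize_q4(w):
    """Quantize one block of fp32 weights to signed 4-bit ints plus a scale."""
    scale = np.max(np.abs(w)) / 7.0 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def quantize_q8(x):
    """Quantize one block of activations to int8 on the fly."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def q4_dot(w, x):
    """Per-block integer dot product, rescaled back to float at the end."""
    total = 0.0
    for i in range(0, len(w), BLOCK):
        qw, sw = quantize_q4(w[i:i + BLOCK])
        qx, sx = quantize_q8(x[i:i + BLOCK])
        total += sw * sx * int(np.dot(qw.astype(np.int32), qx.astype(np.int32)))
    return total

w = np.random.randn(128).astype(np.float32)
x = np.random.randn(128).astype(np.float32)
print(q4_dot(w, x), float(np.dot(w, x)))  # the quantized result tracks the fp32 one
```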


## Converted Llama 2 7B and ran the 8B

- There is a problem with inference on the converted Llama 2 7B; I tried different older versions too. I can verify it uses 3.5 GiB on disk (3,744,635,480 bytes).
- The next option is benchmarking Llama 3 8B, which is larger.
- In running services, we can observe 4.9 GB used during inference (including the 4096-token cache).
- The result is **13.6 t/s** for the no-context prompt "What is the capital of France?"; the context cache is still processed normally, and a larger cache will add latency.
- The listed and observed size is 4.8 GiB on disk (5,121,998,280 bytes), which matches the size listed on Qualcomm's [website](https://aihub.qualcomm.com/models/llama_v3_1_8b_chat_quantized?searchTerm=llama).
- Llama 3 8B is 1.37x larger than Llama 2 7B, and token generation is roughly memory-bandwidth-bound, so the 7B should reach about 1.37 × 13.6 ≈ 18.6 t/s, in line with Qualcomm's claimed 18 t/s. Since we can infer this, there is no need to run the 7B (see the check below).
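
A quick check of that extrapolation from the measured sizes and the 8B result (numbers from the bullets above; linear size-to-speed scaling is an assumption that holds when generation is memory-bandwidth-bound):

```python
# Sizes and the measured 8B speed from the bullets above.
llama2_7b_bytes = 3_744_635_480
llama3_8b_bytes = 5_121_998_280
tg_8b = 13.6  # measured t/s for the 8B

ratio = llama3_8b_bytes / llama2_7b_bytes   # ~1.37
tg_7b_est = tg_8b * ratio                   # ~18.6 t/s, close to Qualcomm's 18 t/s claim
print(f"size ratio = {ratio:.2f}, estimated 7B speed = {tg_7b_est:.1f} t/s")
```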