|
---
tags:
  - gguf
---
|
# TG Benchmarks on OnePlus 13 |
|
|
|
There is a discrepancy between Qualcomm's claimed SOTA 18 t/s for Llama 2 (3.5 GB) and the CPU versions: https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized
|
|
|
TODO: |
|
|
|
- [x] Benchmark QNN Llama 2 locally

- [ ] Benchmark T-MAC at group size 128 if needed

- [ ] Test OpenCL and the available QNN pull requests; assess feasibility of speculative decoding alongside CPU inference

- [ ] Overclock RAM with a Magisk module

- [ ] Potentially check quantization standards: MLC before the regression, ExecuTorch, QNN
|
|
|
|
|
## Model Benchmarks |
|
|
|
### Llama 2 |
|
| Quantization  | Benchmark 1, 200 tok (t/s) | Benchmark 2, 50 tok (t/s) |
|---------------|----------------------------|---------------------------|
| Q4_0 (Pure)   | 12.76                      | 13.22                     |
| Q4_0 (Normal) | 12.54                      | 13.03                     |
|
|
|
**Test Command:** |
|
```bash
# llama.cpp CLI; the model path is a placeholder, and -n was 200 for benchmark 1, 50 for benchmark 2
./llama-cli -m llama2.gguf -p "hi" -t 6 -s 42 -c 512 -n 200
```
|
|
|
### Llama 3 |
|
| Quantization | Benchmark 1, 200 tok (t/s) | Benchmark 2, 50 tok (t/s) |
|--------------|----------------------------|---------------------------|
| Q4_0 (Pure)  | 11.54                      | 11.91                     |
|
|
|
|
|
## Reka-Flash 21B Benchmarks, Q4_0 (Normal)
|
|
|
| Test Configuration | Tokens | Result (t/s) |
|--------------------|--------|--------------|
| Benchmark 1        | 200    | 4.46         |
| Benchmark 2        | 50     | 4.45         |
|
|
|
------------ |
|
## Intermediate Sizes |
|
| Model Architecture | Intermediate Size |
|--------------------|-------------------|
| Llama 2 7B         | 11,008            |
| Llama 3.2 3B       | 8,192             |
| Llama 3 8B         | 14,336            |
| Qwen 2.5 7B        | 18,944            |
| Qwen 2.5 14B       | 13,824            |
| QwQ 32B            | 27,648            |
| Reka-Flash 21B     | 19,648            |
| Mistral 2503       | 32,768            |
| Codestral 22B      | 16,384            |
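
These values come from each model's `config.json`. A quick way to check one, assuming the standard Hugging Face raw-file layout (the repo id below is just an example):

```bash
# Print intermediate_size from a model's config.json
curl -s https://huggingface.co/Qwen/Qwen2.5-7B/raw/main/config.json \
  | python3 -c "import json, sys; print(json.load(sys.stdin)['intermediate_size'])"   # 18944
```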
|
|
|
------------ |
|
|
|
## llama.cpp Q4_K_M vs. T-MAC INT_N (group size 128?) Inference on x86
|
|
|
| Model                  | Size     | Params | Backend | Threads | Test  | t/s          |
|------------------------|----------|--------|---------|---------|-------|--------------|
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU     | 4       | pp512 | 67.33 ± 0.10 |
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU     | 4       | tg128 | 22.72 ± 0.04 |
| qwen2 ?B INT_N Q4_K    | 1.70 GiB | 3.40 B | CPU     | 4       | pp512 | 59.66 ± 0.10 |
| qwen2 ?B INT_N Q4_K    | 1.70 GiB | 3.40 B | CPU     | 4       | tg128 | 26.43 ± 0.14 |
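
The Q4_K_M rows are in `llama-bench` output format; a run along these lines should reproduce them (the GGUF path is a placeholder):

```bash
# pp512 = prompt processing of 512 tokens, tg128 = text generation of 128 tokens
./llama-bench -m qwen2.5-3b-instruct-q4_k_m.gguf -t 4 -p 512 -n 128
```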
|
|
|
**INT_N is not an exact equivalent, so this is not a fair comparison: in this scenario it is 16.3% faster at tg128 and 13% smaller (see the arithmetic check below).**
|
- [Issue Link](https://github.com/microsoft/T-MAC/issues/79) |
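
Where those percentages come from (tg128 speed and file size from the table above):

```bash
# relative tg128 speedup and relative size reduction
python3 -c "print(26.43 / 22.72 - 1, 1 - 1.70 / 1.95)"   # ≈ 0.163 faster, ≈ 0.128 smaller
```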
|
|
|
AutoGPTQ is used, and by default it uses a group size of 128, giving fewer bits per weight and a smaller file than llama.cpp's Q4_K_M. https://qwen.readthedocs.io/en/latest/quantization/gptq.html
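
A rough back-of-the-envelope for the size gap, assuming 4-bit weights with one fp16 scale and one 4-bit zero point per group of 128 (a sketch of the typical GPTQ layout, not the exact on-disk format):

```bash
# ~4.16 bpw for GPTQ-style 4-bit at group size 128; llama.cpp's Q4_K_M is roughly 4.85 bpw
python3 -c "print(4 + (16 + 4) / 128)"   # 4.15625
```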
|
|
|
- The K-quant series isn't optimized for speed; it is meant for quality.

- Q4_0 uses hardware-accelerated dot-product instructions, quantizing the intermediate activations on the fly to match the quantized weights (block layout sketched below).
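
For reference, Q4_0 packs 32 four-bit weights plus one fp16 scale into an 18-byte block, which is what those dot-product kernels consume:

```bash
# Q4_0 block: 32 x 4-bit weights (16 bytes) + fp16 scale (2 bytes) = 18 bytes per 32 weights
python3 -c "print(18 * 8 / 32)"   # 4.5 bpw
```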
|
|
|
|
|
## Converted Llama 2 7B and Ran Llama 3 8B
|
|
|
- There is a problem with inference of the converted 7B; I tried different older versions too. I can verify it uses 3.5 GiB on disk (3,744,635,480 bytes).

- The next option is benchmarking Llama 3 8B, which is larger.

- In Android's running services we can observe 4.9 GB used during inference (including the 4096-token context cache).

- The result is **13.6 t/s** for the zero-context prompt "What is the capital of France?", but the context cache is still processed as usual, and a larger cache will add latency.

- The observed size on disk is 4.8 GiB (5,121,998,280 bytes), which matches the size listed on Qualcomm's [website](https://aihub.qualcomm.com/models/llama_v3_1_8b_chat_quantized?searchTerm=llama).

- Llama 3 8B is 1.37x larger than Llama 2 7B, so the 7B should reach about 1.37 x 13.6 ≈ 18.6 t/s. Since we can infer this, there is no need to run the 7B (see the check below).
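
The scaling arithmetic, using the two on-disk byte counts above:

```bash
# size ratio 8B/7B, and the 7B speed inferred from the measured 8B speed
python3 -c "r = 5121998280 / 3744635480; print(round(r, 2), round(13.6 * r, 1))"   # 1.37 18.6
```

This lines up with Qualcomm's claimed 18 t/s for the 7B.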