TG Benchmarks on OnePlus 13

There is a discrepancy between Qualcomm's claimed SOTA 18 t/s for Llama 2 7B (3.5 GB) and the CPU-only numbers measured here: https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized

TODO:

  • benchmark the QNN Llama 2 build locally
  • benchmark T-MAC with group size 128 if needed
  • test OpenCL and the available QNN pull requests; assess feasibility of speculative decoding alongside CPU inference
  • overclock RAM with a Magisk module
  • potentially check quantization standards for comparability: MLC (before the regression), ExecuTorch, QNN

Model Benchmarks

Llama 2

| Quantization | Benchmark 1 (200 tokens), t/s | Benchmark 2 (50 tokens), t/s |
|---|---|---|
| Q4_0 (Pure) | 12.76 | 13.22 |
| Q4_0 (Normal) | 12.54 | 13.03 |

Test Command:

-p "hi" -t 6 -s 42 -c 512 -n 200 (Benchmark 1) / -n 50 (Benchmark 2) -m llama2

Llama 3

| Quantization | Benchmark 1 (200 tokens), t/s | Benchmark 2 (50 tokens), t/s |
|---|---|---|
| Q4_0 (Pure) | 11.54 | 11.91 |

Reka-Flash 21B, Q4_0 (Normal)

| Test Configuration | Tokens | t/s |
|---|---|---|
| Benchmark 1 | 200 | 4.46 |
| Benchmark 2 | 50 | 4.45 |

Intermediate Sizes

| Model | Intermediate Size |
|---|---|
| Llama 2 7B | 11,008 |
| Llama 3 3B | 8,192 |
| Llama 3 8B | 14,336 |
| Qwen 2.5 7B | 18,944 |
| Qwen 2.5 14B | 13,824 |
| QwQ | 27,648 |
| Reka-Flash 21B | 19,648 |
| Mistral 2503 | 32,768 |
| Codestral 22B | 16,384 |
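
These intermediate (FFN) sizes come from each model's Hugging Face config. A minimal sketch of pulling them programmatically; the model IDs below are illustrative assumptions, not necessarily the exact checkpoints used:

```python
# Minimal sketch: read the FFN intermediate_size from a Hugging Face config.
# The model IDs are illustrative assumptions, not necessarily the exact
# checkpoints benchmarked in this document.
from transformers import AutoConfig

for model_id in ["Qwen/Qwen2.5-7B", "Qwen/QwQ-32B"]:
    cfg = AutoConfig.from_pretrained(model_id)
    print(model_id, cfg.intermediate_size)
```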

llama.cpp Q4_K_M vs. T-MAC INT_N inference (group size 128?) on x86

| Model | Size | Params | Backend | Threads | Test | t/s |
|---|---|---|---|---|---|---|
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | pp512 | 67.33 ± 0.10 |
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | tg128 | 22.72 ± 0.04 |
| qwen2 ?B INT_N (Q4_K) | 1.70 GiB | 3.40 B | CPU | 4 | pp512 | 59.66 ± 0.10 |
| qwen2 ?B INT_N (Q4_K) | 1.70 GiB | 3.40 B | CPU | 4 | tg128 | 26.43 ± 0.14 |

INT_N is not an exact equivalent of Q4_K_M, so this is not a strictly fair comparison. In this scenario it is 16.3% faster at tg128 and 13% smaller on disk (checked below).
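
The percentages follow directly from the numbers in the table above; a quick check:

```python
# Quick check of the speed/size deltas from the llama-bench table above.
q4_k_tg, int_n_tg = 22.72, 26.43      # tg128, tokens/sec
q4_k_pp, int_n_pp = 67.33, 59.66      # pp512, tokens/sec
q4_k_gib, int_n_gib = 1.95, 1.70      # file size, GiB

print(f"tg128: {int_n_tg / q4_k_tg - 1:+.1%}")    # ~ +16.3%
print(f"pp512: {int_n_pp / q4_k_pp - 1:+.1%}")    # ~ -11.4% (INT_N slower at prompt processing)
print(f"size:  {int_n_gib / q4_k_gib - 1:+.1%}")  # ~ -12.8%, i.e. ~13% smaller
```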

AutoGPTQ is used; by default it uses a group size of 128, which gives fewer bits per weight and a smaller file than llama.cpp's Q4_K_M (a rough bpw comparison is sketched below): https://qwen.readthedocs.io/en/latest/quantization/gptq.html
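
A rough bits-per-weight sketch: the observed bpw is derived from the file sizes in the table above, and the GPTQ estimate assumes one fp16 scale and one 4-bit zero point per group of 128 weights (an assumption about the storage layout; embeddings, norms, etc. are ignored).

```python
# Rough bits-per-weight (bpw) comparison.
# Observed bpw comes from the file sizes in the table above; the GPTQ estimate
# assumes one fp16 scale and one 4-bit zero point per group of 128 weights.
GIB = 2**30
params = 3.40e9

for name, size_gib in [("Q4_K_M (llama.cpp)", 1.95), ("INT_N / GPTQ g128", 1.70)]:
    print(f"{name}: {size_gib * GIB * 8 / params:.2f} bpw observed")  # ~4.93 and ~4.30

gptq_weights_only = 4 + 16 / 128 + 4 / 128
print(f"GPTQ 4-bit, group size 128, weights alone: {gptq_weights_only:.2f} bpw")  # ~4.16
```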

  • The K-quant series isn't optimized for efficiency; it is meant for quality.
  • Q4_0 will use hardware-accelerated dot-product instructions, quantizing the intermediate activations on the fly to int8 to match the quantized weights (the arithmetic is sketched after this list).
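
A minimal numpy sketch of that arithmetic: quantize a block of 32 weights to 4-bit, quantize the activations on the fly to int8, take an integer dot product per block, then rescale. The scale convention is simplified relative to ggml's actual Q4_0 format, and real kernels do the per-block integer sums with SIMD dot-product instructions (e.g. Arm SDOT) rather than numpy.

```python
# Illustration of a Q4_0-style weight x int8-activation dot product.
# Not llama.cpp's actual kernel; scales are kept as floats for clarity.
import numpy as np

BLOCK = 32

def quantize_q4_0(w):
    """4-bit weights in [-8, 7] with one float scale per block of 32."""
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int32)
    return q, scale

def quantize_q8_0(a):
    """8-bit activations in [-127, 127], quantized on the fly per block of 32."""
    a = a.reshape(-1, BLOCK)
    scale = np.abs(a).max(axis=1, keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int32)
    return q, scale

def dot_q4_0_q8_0(w, a):
    qw, sw = quantize_q4_0(w)
    qa, sa = quantize_q8_0(a)
    # The per-block integer sum is what a single SDOT/VNNI-style instruction
    # accumulates; the float scales are applied once per block.
    block_dots = (qw * qa).sum(axis=1, keepdims=True)
    return float((block_dots * sw * sa).sum())

rng = np.random.default_rng(0)
w, a = rng.standard_normal(4096), rng.standard_normal(4096)
print(dot_q4_0_q8_0(w, a), float(w @ a))  # quantized vs exact dot product
```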

Converted Llama 2 7B and ran the Llama 3 8B (QNN)

  • There is a problem with inference on the converted 7B; I tried different older versions too. I can verify it uses 3.5 GiB on disk (3,744,635,480 bytes).
  • The next option is benchmarking Llama 3 8B, which is larger.
  • In the running services view, about 4.9 GB of memory is used during inference (including the 4096-token context cache).
  • The result is 13.6 t/s for the no-context prompt "What is the capital of France?"; the context cache is still processed normally, and a larger cache will add latency.
  • The observed size on disk is 4.8 GiB (5,121,998,280 bytes), which matches the size listed on Qualcomm's website.
  • Llama 3 8B is 1.37x larger than Llama 2 7B, and token generation should scale roughly inversely with model size, so the 7B should reach about 1.37 × 13.6 ≈ 18.6 t/s, consistent with Qualcomm's 18 t/s figure. Since we can infer this, there is no need to run the 7B (see the check after this list).
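
The extrapolation, spelled out (assumption: token-generation speed scales roughly inversely with model size on disk):

```python
# Extrapolating the 7B speed from the measured 8B run (assumption: token
# generation throughput scales roughly inversely with model size on disk).
size_7b = 3_744_635_480   # bytes, converted Llama 2 7B
size_8b = 5_121_998_280   # bytes, Llama 3 8B
tg_8b = 13.6              # measured t/s

ratio = size_8b / size_7b
print(f"size ratio: {ratio:.2f}x")                     # ~1.37
print(f"expected 7B speed: {tg_8b * ratio:.1f} t/s")   # ~18.6, close to Qualcomm's 18 t/s
```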