|
---
tags:
  - gguf
---
|
# TG Benchmarks on OnePlus 13 |
|
|
|
There is a discrepancy between Qualcomm's claimed SOTA 18 t/s for Llama 2 (3.5 GB) and the CPU versions: https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized
|
|
|
TODO: |
|
|
|
- [x] Benchmark QNN Llama 2 locally

- [ ] Benchmark T-MAC at group size 128 if needed

- [ ] Test OpenCL and the available QNN pull requests; assess feasibility of speculative decoding alongside CPU inference

- [ ] Overclock RAM with a Magisk module

- [ ] Potentially check quantization standards: MLC before the regression, ExecuTorch, QNN
|
|
|
|
|
## Model Benchmarks |
|
|
|
### Llama 2 |
|
| Quantization  | Benchmark 1, 200 tok (t/s) | Benchmark 2, 50 tok (t/s) |
|---------------|----------------------------|---------------------------|
| Q4_0 (Pure)   | 12.76                      | 13.22                     |
| Q4_0 (Normal) | 12.54                      | 13.03                     |
|
|
|
**Test Command:** |
|
```bash
# llama.cpp CLI; the model path is a placeholder, and -n was 200 for benchmark 1, 50 for benchmark 2
./llama-cli -m llama2.gguf -p "hi" -t 6 -s 42 -c 512 -n 200
```
|
|
|
### Llama 3 |
|
| Quantization | Benchmark 1, 200 tok (t/s) | Benchmark 2, 50 tok (t/s) |
|--------------|----------------------------|---------------------------|
| Q4_0 (Pure)  | 11.54                      | 11.91                     |
|
|
|
|
|
## Reka-Flash 21B Benchmarks, Q4_0 (Normal)
|
|
|
| Test Configuration | Tokens | Result (t/s) |
|--------------------|--------|--------------|
| Benchmark 1        | 200    | 4.46         |
| Benchmark 2        | 50     | 4.45         |
|
|
|
------------ |
|
## Intermediate Sizes |
|
| Model Architecture | Intermediate Size |
|--------------------|-------------------|
| Llama 2 7B         | 11,008            |
| Llama 3.2 3B       | 8,192             |
| Llama 3 8B         | 14,336            |
| Qwen 2.5 7B        | 18,944            |
| Qwen 2.5 14B       | 13,824            |
| QwQ 32B            | 27,648            |
| Reka-Flash 21B     | 19,648            |
| Mistral 2503       | 32,768            |
| Codestral 22B      | 16,384            |
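
These values come from each model's `config.json`. A quick way to check one, assuming the standard Hugging Face raw-file layout (the repo id below is just an example):

```bash
# Print intermediate_size from a model's config.json
curl -s https://huggingface.co/Qwen/Qwen2.5-7B/raw/main/config.json \
  | python3 -c "import json, sys; print(json.load(sys.stdin)['intermediate_size'])"   # 18944
```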
|
|
|
------------ |
|
|
|
## llama.cpp Q4_K_M vs. T-MAC INT_N (group size 128?) Inference on x86
|
|
|
| Model                  | Size     | Params | Backend | Threads | Test  | t/s          |
|------------------------|----------|--------|---------|---------|-------|--------------|
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU     | 4       | pp512 | 67.33 ± 0.10 |
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU     | 4       | tg128 | 22.72 ± 0.04 |
| qwen2 ?B INT_N Q4_K    | 1.70 GiB | 3.40 B | CPU     | 4       | pp512 | 59.66 ± 0.10 |
| qwen2 ?B INT_N Q4_K    | 1.70 GiB | 3.40 B | CPU     | 4       | tg128 | 26.43 ± 0.14 |
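
The Q4_K_M rows are in `llama-bench` output format; a run along these lines should reproduce them (the GGUF path is a placeholder):

```bash
# pp512 = prompt processing of 512 tokens, tg128 = text generation of 128 tokens
./llama-bench -m qwen2.5-3b-instruct-q4_k_m.gguf -t 4 -p 512 -n 128
```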
|
|
|
**INT_N is not an exact equivalent, so this is not a fair comparison: in this scenario it is 16.3% faster at tg128 and 13% smaller (see the arithmetic check below).**
|
- [Issue Link](https://github.com/microsoft/T-MAC/issues/79) |
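
Where those percentages come from (tg128 speed and file size from the table above):

```bash
# relative tg128 speedup and relative size reduction
python3 -c "print(26.43 / 22.72 - 1, 1 - 1.70 / 1.95)"   # ≈ 0.163 faster, ≈ 0.128 smaller
```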
|
|
|
AutoGPTQ is used, and by default it uses a group size of 128, giving fewer bits per weight and a smaller file than llama.cpp's Q4_K_M. https://qwen.readthedocs.io/en/latest/quantization/gptq.html
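
A rough back-of-the-envelope for the size gap, assuming 4-bit weights with one fp16 scale and one 4-bit zero point per group of 128 (a sketch of the typical GPTQ layout, not the exact on-disk format):

```bash
# ~4.16 bpw for GPTQ-style 4-bit at group size 128; llama.cpp's Q4_K_M is roughly 4.85 bpw
python3 -c "print(4 + (16 + 4) / 128)"   # 4.15625
```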
|
|
|
- The K-quant series isn't optimized for speed; it is meant for quality.

- Q4_0 uses hardware-accelerated dot-product instructions, quantizing the intermediate activations on the fly to match the quantized weights (block layout sketched below).
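
For reference, Q4_0 packs 32 four-bit weights plus one fp16 scale into an 18-byte block, which is what those dot-product kernels consume:

```bash
# Q4_0 block: 32 x 4-bit weights (16 bytes) + fp16 scale (2 bytes) = 18 bytes per 32 weights
python3 -c "print(18 * 8 / 32)"   # 4.5 bpw
```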
|
|
|
|
|
## Converted Llama 2 7B and Ran Llama 3 8B
|
|
|
- There is a problem with inference of the converted 7B; I tried different older versions too. I can verify it uses 3.5 GiB on disk (3,744,635,480 bytes).

- The next option is benchmarking Llama 3 8B, which is larger.

- In Android's running services we can observe 4.9 GB used during inference (including the 4096-token context cache).

- The result is **13.6 t/s** for the zero-context prompt "What is the capital of France?", but the context cache is still processed as usual, and a larger cache will add latency.

- The observed size on disk is 4.8 GiB (5,121,998,280 bytes), which matches the size listed on Qualcomm's [website](https://aihub.qualcomm.com/models/llama_v3_1_8b_chat_quantized?searchTerm=llama).

- Llama 3 8B is 1.37x larger than Llama 2 7B, so the 7B should reach about 1.37 x 13.6 ≈ 18.6 t/s. Since we can infer this, there is no need to run the 7B (see the check below).
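
The scaling arithmetic, using the two on-disk byte counts above:

```bash
# size ratio 8B/7B, and the 7B speed inferred from the measured 8B speed
python3 -c "r = 5121998280 / 3744635480; print(round(r, 2), round(13.6 * r, 1))"   # 1.37 18.6
```

This lines up with Qualcomm's claimed 18 t/s for the 7B.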