Part of the RyzenAI-1.5_LLM_Hybrid_Models collection.
export MODEL_DIR=google/gemma-2-2b   # or a local model checkpoint folder
export MODEL_NAME=gemma-2-2b         # used in the output directory name below
# single GPU
python quantize_quark.py --model_dir $MODEL_DIR \
    --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
    --quant_scheme w_uint4_per_group_asym \
    --num_calib_data 128 \
    --quant_algo awq \
    --dataset pileval_for_awq_benchmark \
    --model_export hf_format \
    --group_size 128 \
    --group_size_per_layer lm_head 32 \
    --data_type float16 \
    --exclude_layers
# CPU (same arguments, plus --device cpu)
python quantize_quark.py --model_dir $MODEL_DIR \
    --output_dir output_dir/$MODEL_NAME-awq-uint4-asym-g128-lmhead-g32-fp16 \
    --quant_scheme w_uint4_per_group_asym \
    --num_calib_data 128 \
    --quant_algo awq \
    --dataset pileval_for_awq_benchmark \
    --model_export hf_format \
    --group_size 128 \
    --group_size_per_layer lm_head 32 \
    --data_type float16 \
    --exclude_layers \
    --device cpu
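Both commands quantize the weights to asymmetric UINT4 with a group size of 128 (32 for the lm_head layer) and keep the data type at float16; the second variant only adds --device cpu so that calibration can run without a GPU.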
Quark provides its own export format, quark_safetensors, which is compatible with AutoAWQ exports.
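As a rough illustration (not part of the original card), an AWQ-compatible export such as the one produced above can typically be loaded through the Hugging Face transformers API. The checkpoint path and the availability of the autoawq and accelerate packages are assumptions here:

# Hypothetical loading sketch; the checkpoint path below must match your --output_dir.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "output_dir/gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16"  # exported checkpoint dir (assumed)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")  # requires autoawq + accelerate

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))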
Quark currently uses perplexity (PPL) as the evaluation metric for accuracy loss before and after quantization. The specific PPL algorithm is implemented in quantize_quark.py. The quantization evaluation results are produced in pseudo-quantization mode, which may differ slightly from actual quantized inference accuracy; they are provided for reference only.
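For readers who want to reproduce a comparable number, the sketch below shows the standard chunked wikitext2 perplexity recipe. It is an assumption of how the metric is typically computed; the authoritative procedure is the one in quantize_quark.py.

# Minimal wikitext2 perplexity sketch (standard recipe, not the exact Quark evaluation code).
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def wikitext2_ppl(model_id, seq_len=2048, device="cpu"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to(device).eval()
    # Concatenate the wikitext-2 test split into one long token stream.
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)
    nll_sum, token_count = 0.0, 0
    # Score fixed-length chunks and accumulate the negative log-likelihood.
    for start in range(0, ids.size(1), seq_len):
        chunk = ids[:, start:start + seq_len]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss  # mean NLL over the shifted tokens in this chunk
        nll_sum += loss.item() * (chunk.size(1) - 1)
        token_count += chunk.size(1) - 1
    return math.exp(nll_sum / token_count)

# e.g. compare wikitext2_ppl("google/gemma-2-2b") with the quantized checkpoint directory.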
| Benchmark | google/gemma-2-2b (float16) | amd/gemma-2-2b-awq-uint4-asym-g128-lmhead-g32-fp16-onnx (this model) |
|---|---|---|
| Perplexity-wikitext2 | 64.41 | 71.43 (evaluated on CPU) |
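Relative to the float16 baseline, this corresponds to an increase of roughly 11% in wikitext2 perplexity under pseudo-quantization.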
Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Base model: google/gemma-2-2b