The Qwen series of models is very difficult to quantize. Is there anything that can be improved?
#9 by shifeiwen - opened
Hi, Qwen team,
I've recently been trying to deploy Qwen-series models on the Qualcomm platform, and I've found that the outputs of the layer-0 decoder attention weight matmul are generally very large, around 1e5, while they return to normal in the other layers. This may be acceptable in floating point, but on the Qualcomm platform we usually use w4int16 with per-tensor activation quantization, which makes the output of this layer very hard to quantize, and the quantized model's accuracy is very poor. Is there anything I can try? Thank you.
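To make the problem concrete, here is a minimal sketch (not Qualcomm tooling, just NumPy) of why a single ~1e5 outlier destroys per-tensor symmetric quantization: one scale must cover the whole tensor, so the outlier inflates the step size and all the "normal" small activations collapse to zero:

```python
import numpy as np

def quantize_per_tensor(x, n_bits=16):
    # Symmetric per-tensor quantization: a single scale for the whole tensor,
    # chosen so the largest absolute value maps to the top quantized level.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantized values

rng = np.random.default_rng(0)
# Typical activations in [-1, 1], plus one outlier like the one observed
x = rng.uniform(-1, 1, 4096).astype(np.float32)
x[0] = 1e5

xq = quantize_per_tensor(x, n_bits=16)

# The quantization step is ~1e5 / 32767 ≈ 3.05, so every value in [-1, 1]
# rounds to 0: the small activations carry no information after quantization.
err = np.abs(xq[1:] - x[1:]).mean()
```

In this toy setup the mean error on the non-outlier values is roughly their own mean magnitude (~0.5), i.e. near-total information loss, which matches the poor accuracy you describe. Common mitigations in this situation are per-channel/per-token activation scales, clipping the outlier channel, or techniques like SmoothQuant that migrate the outlier magnitude from activations into weights.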