Unexpectedly Large Memory Usage of ibm-fms/llama3-8b-accelerator in vLLM

#4 · opened by baizhuoyan

When I use MLPSpeculator in vLLM, I noticed that ibm-fms/llama-13b-accelerator takes only about 1.5665 GB, while ibm-fms/llama3-8b-accelerator takes 4.4649 GB, which seems quite large for a speculator attached to an 8B model. Why is this the case?
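For context, this is roughly how the speculator is attached to the base model in vLLM. A minimal sketch only: argument names have changed across vLLM releases, and this assumes a release that accepts `speculative_model` directly (newer versions take a `speculative_config` dict instead), with the base model name chosen for illustration.

```python
from vllm import LLM, SamplingParams

# Load the base model together with the MLPSpeculator draft model.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # assumed base model
    speculative_model="ibm-fms/llama3-8b-accelerator",
)

outputs = llm.generate(
    ["Speculative decoding lets the draft model propose several tokens at once."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```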

ibm-ai-platform org

Hi @baizhuoyan. This is mainly because the llama3-8b accelerator has more parameters, due to the larger vocabulary of llama3 (128k tokens) vs. llama2 (32k). This speculator also has 4 heads (it predicts 4 steps ahead) vs. 3 in the llama-13b accelerator. We later worked around this by adding a tied-head option to the speculators, but we never trained a llama3-8b model with tied heads, since by then we were focused on the larger llama3 models. (For comparison, the llama3-70b accelerator has ~2.2B parameters vs. ~3.1B for the llama3-8b accelerator.)
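A quick back-of-envelope calculation shows why the vocabulary size and head count dominate. This sketch assumes the MLPSpeculator layout of one token embedding, one state projection, and one LM head per prediction step; the `inner_dim` values are my own assumptions for illustration, not taken from the released configs.

```python
def speculator_params(vocab_size: int, emb_dim: int, inner_dim: int, n_predict: int) -> int:
    """Rough parameter count for an untied MLPSpeculator (norms/biases ignored)."""
    total = 0
    for step in range(n_predict):
        total += vocab_size * inner_dim                              # per-step token embedding
        total += (emb_dim if step == 0 else inner_dim) * inner_dim   # per-step state projection
        total += inner_dim * vocab_size                              # per-step LM head
    return total

# llama-13b accelerator: 32k vocab, 3 heads (emb_dim/inner_dim assumed)
print(f"{speculator_params(32_000, 5_120, 4_096, 3) / 1e9:.2f}B params (rough)")
# llama3-8b accelerator: 128k vocab, 4 heads (emb_dim/inner_dim assumed)
print(f"{speculator_params(128_256, 4_096, 3_072, 4) / 1e9:.2f}B params (rough)")
```

With untied heads, the `2 × vocab_size × inner_dim` term per prediction step dominates, so the 128k vocabulary and the extra fourth head account for most of the size gap; tying the embedding and head weights across steps is what removes that multiplier.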
