Unexpectedly Large Memory Usage of ibm-fms/llama3-8b-accelerator in vLLM

#4 · opened by baizhuoyan

When I use MLPSpeculator in vLLM, I noticed that ibm-fms/llama-13b-accelerator takes only about 1.5665 GB, while ibm-fms/llama3-8b-accelerator takes 4.4649 GB, which seems quite large for a speculator attached to an 8B model. Why is this the case?
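For context, this is roughly how the speculator is attached to the base model in vLLM. A minimal sketch only: argument names have changed across vLLM releases, and this assumes a release that accepts `speculative_model` directly (newer versions take a `speculative_config` dict instead), with the base model name chosen for illustration.

```python
from vllm import LLM, SamplingParams

# Load the base model together with the MLPSpeculator draft model.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # assumed base model
    speculative_model="ibm-fms/llama3-8b-accelerator",
)

outputs = llm.generate(
    ["Speculative decoding lets the draft model propose several tokens at once."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```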

ibm-ai-platform org

Hi @baizhuoyan. This is mainly because the llama3-8b accelerator has more parameters, due to the larger vocabulary of llama3 (128k tokens) vs. llama2 (32k). This speculator also has 4 heads (it predicts 4 steps ahead) vs. 3 in the llama-13b accelerator. We later worked around this by adding a tied-head option to the speculators, but we never trained a llama3-8b model with tied heads, since by then we were focused on the larger llama3 models. (For comparison, the llama3-70b accelerator has ~2.2B parameters vs. ~3.1B for the llama3-8b accelerator.)
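A quick back-of-envelope calculation shows why the vocabulary size and head count dominate. This sketch assumes the MLPSpeculator layout of one token embedding, one state projection, and one LM head per prediction step; the `inner_dim` values are my own assumptions for illustration, not taken from the released configs.

```python
def speculator_params(vocab_size: int, emb_dim: int, inner_dim: int, n_predict: int) -> int:
    """Rough parameter count for an untied MLPSpeculator (norms/biases ignored)."""
    total = 0
    for step in range(n_predict):
        total += vocab_size * inner_dim                              # per-step token embedding
        total += (emb_dim if step == 0 else inner_dim) * inner_dim   # per-step state projection
        total += inner_dim * vocab_size                              # per-step LM head
    return total

# llama-13b accelerator: 32k vocab, 3 heads (emb_dim/inner_dim assumed)
print(f"{speculator_params(32_000, 5_120, 4_096, 3) / 1e9:.2f}B params (rough)")
# llama3-8b accelerator: 128k vocab, 4 heads (emb_dim/inner_dim assumed)
print(f"{speculator_params(128_256, 4_096, 3_072, 4) / 1e9:.2f}B params (rough)")
```

With untied heads, the `2 × vocab_size × inner_dim` term per prediction step dominates, so the 128k vocabulary and the extra fourth head account for most of the size gap; tying the embedding and head weights across steps is what removes that multiplier.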
