Inference speed slow?

#56
by banank1989 - opened

I have loaded this model (gemma-3-27B) in bfloat16 on an A100 (using 54 GB of GPU memory, as expected). It is generating about 10 tokens per second. Is this speed expected, or am I missing something that is making my output slow? The CUDA version on my machine is 11.6, but I installed torch with
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118

I used cu118 while installing since cu116 gives an error.
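For reference, this is roughly how I load the model and measure tokens per second (a minimal sketch; the model id, prompt, and generation settings are illustrative, and depending on your transformers version Gemma 3 may need `Gemma3ForConditionalGeneration` instead of `AutoModelForCausalLM`, see the model card):

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-3-27b-it"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~54 GB of weights on a single A100
    device_map="cuda",
)

inputs = tokenizer(
    "Explain quantum entanglement in one paragraph.",
    return_tensors="pt",
).to(model.device)

start = time.time()
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.time() - start

# Throughput = newly generated tokens divided by wall-clock generation time
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")
```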
