Possible to run with 24GB VRAM?
#14
by happyTonakai - opened
Has anyone managed to run it with 24GB VRAM? Using 4-bit quantization?
Did you quantize it on your own? We have no official quantizations for this model yet. Even with AWQ or 4-bit, we'd need a cluster of two 24GB GPUs to run it.
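For reference, here's a minimal sketch of what a do-it-yourself 4-bit load with transformers + bitsandbytes could look like. The thread doesn't name the checkpoint, so `MODEL_ID` is a placeholder, and `device_map="auto"` simply shards the layers across whatever GPUs are visible (e.g. both 24GB cards):

```python
# Requires: pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "org/model-name"  # placeholder: substitute the actual repo id

# 4-bit NF4 storage with bf16 compute -- the usual bitsandbytes setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # splits layers across multiple GPUs if one isn't enough
)
```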
It's a 3B-per-expert model, so you can run it on a common CPU with 32 GB of RAM without much delay.
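A minimal sketch of CPU-only inference with transformers, again assuming the placeholder `MODEL_ID` above; bf16 halves the memory footprint versus fp32, which is what makes the 32 GB RAM budget plausible:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/model-name"  # placeholder: substitute the actual repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half-size weights; needs a reasonably recent CPU/PyTorch
    device_map="cpu",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```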
Good point! Can you do that with vLLM?
Absolutely. I always use transformers, but I don't see why not.
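If you want to try vLLM, a sketch of the offline API is below; the repo id is still a placeholder, and whether it actually works depends on vLLM supporting this architecture (and, if you quantize, the quantization scheme, e.g. an AWQ checkpoint):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/model-name",  # placeholder: substitute the actual repo id
    tensor_parallel_size=2,  # split across two 24GB GPUs
    quantization="awq",      # only if an AWQ checkpoint exists for this model
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=50))
print(outputs[0].outputs[0].text)
```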