Trying to run Llama 4 Maverick with TGI
Hi, I'm trying to run the model with Docker on an Amazon EC2 p5en.48xlarge instance (8x H200 GPUs), using this command:
```
docker run \
  --gpus all \
  --shm-size 32g \
  -e HUGGINGFACE_CACHE_PATH=/mnt/data \
  -e HF_TOKEN=** \
  -e NCCL_BLOCKING_WAIT=1 \
  -e NCCL_ASYNC_ERROR_HANDLING=1 \
  -e NCCL_DEBUG=INFO \
  -e NCCL_TIMEOUT=3600 \
  -e TORCH_NCCL_TRACE_BUFFER_SIZE=1048576 \
  -p 8090:80 \
  -v /mnt/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --max-input-length 70000 \
  --max-total-tokens 72000 \
  --max-batch-prefill-tokens 70000 \
  --num-shard 8 \
  --sharded true \
  --revision main
```
I'm trying to run Llama 4 Maverick, but I'm facing a GPU issue: it's asking for more memory, and the model is most probably not supported by TGI. How can I run this model fully, and what library do I need to install? Can anyone help me, please?
I’m running into the exact same issue.
Even on an H100 the model won't start. When I enable Flex Attention I get version-mismatch errors; if I remove Flex Attention, the model loads but immediately hits OOM. Could someone share the exact CUDA and torch versions, and the minimum GPU VRAM needed just for inference?
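Not an exact answer, but a rough back-of-envelope sketch may help explain the OOMs. It assumes Meta's published figure of ~400B *total* parameters for Maverick (only 17B are active per token, but all weights must sit in VRAM) and standard bytes-per-parameter for each precision; it ignores KV cache and activation memory, so treat the numbers as lower bounds:

```python
# Rough lower-bound VRAM estimate for holding a large MoE model's weights.
# Assumption: ~400e9 total parameters for Llama 4 Maverick (all experts
# must be resident, even though only 17B are active per token).

def weight_vram_gb(total_params: float, bytes_per_param: float) -> float:
    """VRAM needed just to hold the weights, in GB."""
    return total_params * bytes_per_param / 1e9

params = 400e9  # assumed total (not active) parameter count

bf16_gb = weight_vram_gb(params, 2)  # 2 bytes/param in bf16
fp8_gb = weight_vram_gb(params, 1)   # 1 byte/param in fp8

# Aggregate HBM per node, from published GPU specs:
h200_node_gb = 8 * 141  # 8x H200 = 1128 GB
h100_node_gb = 8 * 80   # 8x H100 = 640 GB

print(f"bf16 weights: {bf16_gb:.0f} GB (fits 8x H200: {bf16_gb < h200_node_gb})")
print(f"fp8  weights: {fp8_gb:.0f} GB (fits 8x H100: {fp8_gb < h100_node_gb})")
```

By this estimate, bf16 weights alone take ~800 GB, which fits across 8x H200 but leaves only ~300 GB of headroom for KV cache at a 70k context, and does not fit on 8x H100 at all without quantization. A single H100 (80 GB) cannot hold the model at any common precision, which would explain the immediate OOM in the reply above.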