Trying to run with TGI - I am trying to run the model with Docker, using 8 H200 GPUs on an Amazon EC2 p5en.48xlarge

#42
by sayak340 - opened

Hi, I am launching the model with this command:
```
docker run \
  --gpus all \
  --shm-size 32g \
  -e HUGGINGFACE_CACHE_PATH=/mnt/data \
  -e HF_TOKEN=** \
  -e NCCL_BLOCKING_WAIT=1 \
  -e NCCL_ASYNC_ERROR_HANDLING=1 \
  -e NCCL_DEBUG=INFO \
  -e NCCL_TIMEOUT=3600 \
  -e TORCH_NCCL_TRACE_BUFFER_SIZE=1048576 \
  -p 8090:80 \
  -v /mnt/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --max-input-length 70000 \
  --max-total-tokens 72000 \
  --max-batch-prefill-tokens 70000 \
  --num-shard 8 \
  --sharded true \
  --revision main
```

I am trying to run Llama 4 Maverick.
But I'm hitting a GPU memory issue: the launcher asks for more memory than is available, and the model is most probably not supported by TGI. How can I run this model fully, and what library do I need to install? Can anyone help me, please?
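For context, here is a rough back-of-envelope memory estimate (a sketch only; it assumes the publicly stated ~400B total parameters for Maverick and bf16 weights, and ignores KV cache and runtime overhead):

```bash
# Rough bf16 weight footprint vs. p5en.48xlarge capacity (8 x H200, ~141 GB each).
# Assumes ~400B total parameters (17B active, 128 experts) at 2 bytes/param in bf16.
PARAMS_B=400   # total parameters, in billions (from Meta's model card)
BYTES=2        # bytes per parameter in bf16
SHARDS=8       # matches --num-shard 8
TOTAL_GB=$((PARAMS_B * BYTES))   # ~800 GB of weights overall
echo "weights: ~${TOTAL_GB} GB total, ~$((TOTAL_GB / SHARDS)) GB per GPU"
```

By that estimate the weights alone occupy roughly 100 GB of each 141 GB H200, leaving on the order of 40 GB per GPU for KV cache, activations, and CUDA overhead, which a 70000-token prefill can plausibly exhaust. Two things worth trying (neither verified on this exact instance): lower `--max-input-length` / `--max-batch-prefill-tokens` first to confirm the weights load at all, and consider Meta's FP8 checkpoint (`meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, if your TGI version supports it), which halves the weight footprint.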

I’m running into the exact same issue.
Even on an H100 the model won't start. When I enable FlexAttention I get version-mismatch errors; if I remove FlexAttention, the model loads but immediately hits OOM. Could someone share the exact CUDA version, torch version, and minimum GPU VRAM needed just for inference?
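One way to get concrete numbers to report (a sketch, assuming the same `ghcr.io/huggingface/text-generation-inference:latest` image as above): override the container entrypoint and query torch directly:

```bash
# Print the torch and CUDA versions bundled in the TGI image being run, plus the
# detected GPU, so version-mismatch reports can include concrete numbers.
docker run --rm --gpus all --entrypoint python3 \
  ghcr.io/huggingface/text-generation-inference:latest \
  -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"
```

Comparing that output against the versions FlexAttention expects should at least pin down where the mismatch comes from.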
