Inference recipe for multi-GPU

#8
by ptrdvn - opened

I have been trying to load this model in my environment (3090 x 8) but cannot seem to load it using vLLM, SGLang, or transformers. The transformers code snippet you give on the model card does not automatically distribute the model over multiple GPUs, so it will OOM on any standard-memory card AFAIK.
Could you share a snippet that you can confirm works on multiple GPUs?

Thanks for the model!

@ptrdvn To run the model at full precision on 4x A100 40 GiB, I just use the following to distribute the layers:

import os

# Restrict visible GPUs before torch initializes CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from accelerate import dispatch_model, infer_auto_device_map

MODEL_ID = "/hpool/Apertus-70B-Instruct-2509"

config = AutoConfig.from_pretrained(MODEL_ID)

model = AutoModelForCausalLM.from_pretrained(
   MODEL_ID,
   config=config,
   torch_dtype=torch.bfloat16,
   device_map=None,  # Don't auto-assign
)

# Generate a suggested device map with constraints
device_map = infer_auto_device_map(
   model,
   max_memory={
       0: "33GiB",
       1: "36GiB",
       2: "36GiB",
       3: "36GiB"
   },
   no_split_module_classes=["ApertusDecoderLayer"]
)

model = dispatch_model(model, device_map=device_map)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
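
After dispatch you can generate as usual; here is a minimal sanity check (the prompt and generation settings below are just an illustrative sketch, not from the model card):

# Tokenize with the chat template, move inputs to the first model device
prompt_ids = tokenizer.apply_chat_template(
   [{"role": "user", "content": "Give me a one-sentence summary of the Alps."}],
   add_generation_prompt=True,
   return_tensors="pt",
)
prompt_ids = prompt_ids.to(next(model.parameters()).device)

with torch.no_grad():
   output = model.generate(prompt_ids, max_new_tokens=128)

# Decode only the newly generated tokens
print(tokenizer.decode(output[0][prompt_ids.shape[-1]:], skip_special_tokens=True))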

Usually I just run it in vLLM because the model runs much faster there, especially with many concurrent prompts: at 256 concurrent prompts the performance difference is enormous. For vLLM you don't really have to do anything, as it will automatically distribute the model if you specify either --tensor-parallel-size 8 (faster, and usually distributes better with less overhead) or --pipeline-parallel-size 8 (slightly slower with less ideal memory distribution, but it lets you use --distributed-executor-backend ray). If you use vLLM, make sure to also specify --gpu-memory-utilization 0.94 or whatever is ideal for your setup. To make vLLM use less memory, I recommend specifying --enforce-eager and --max-num-seqs 10, where 10 is the number of concurrent prompts you plan to run. Also don't forget to specify something like --max-model-len 3400 or however much context you need; context is very memory intensive. In the worst case you can also specify --quantization bitsandbytes to run the model in 4-bit instead of full precision, in which case you can fit it on far fewer GPUs than you have available, or use more parallelism and/or context.
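
For reference, a rough sketch of the same settings through vLLM's offline Python API (the repo id swiss-ai/Apertus-70B-Instruct-2509 is my assumption here; substitute your local path if needed):

from vllm import LLM, SamplingParams

# Offline-inference equivalent of the server flags discussed above
llm = LLM(
   model="swiss-ai/Apertus-70B-Instruct-2509",  # or a local path
   tensor_parallel_size=8,
   gpu_memory_utilization=0.94,
   enforce_eager=True,
   max_num_seqs=10,
   max_model_len=3400,
)

outputs = llm.generate(
   ["What are the official languages of Switzerland?"],
   SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)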

Swiss AI Initiative org

This should work out of the box with vLLM and SGLang.
Closing for now as there was no follow-up question. If needed, you can also create an issue directly on the vLLM or SGLang repos.

mjaggi changed discussion status to closed

Yup, sorry for the delay - now working nicely on vLLM, thanks!
