Could you let me know when the bfloat16 model will be uploaded? I can't run the float32 model!
Yes, we would like to build a bfloat16-compatible version. In the meantime, you can run this model with torch.autocast to save some memory:

with torch.autocast("cuda", enabled=True, dtype=autocast_precision):  # e.g. autocast_precision = torch.bfloat16

We did our evaluations in that setting (float32 weights with autocast enabled).
The current code does not support bfloat16 inference directly, but you can try torch.autocast:
import torch
from transformers import GenerationConfig

# model, processor, and inputs are assumed to be set up as in the README
with torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )
Note that the weights will still be in float32.
Will a bfloat16 version be released at some point in the future, though?
import torch
with torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )
With this approach the problem still exists; I still get OOM!
You can now also convert the model to bfloat16 (see the updated README), although note that I have seen the model produce slightly different outputs when the weights are in bfloat16 instead of float32.
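For reference, the conversion described in the README is roughly the following (a minimal sketch; the README has the exact snippet):

import torch
from transformers import AutoModelForCausalLM

# load as before, then cast the weights to bfloat16
model = AutoModelForCausalLM.from_pretrained(
    'allenai/Molmo-7B-D-0924',
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)
model.to(dtype=torch.bfloat16)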
I've tried to do it the way it is written in the README, but it didn't work for me (on a 24GB 4090). The float32 model didn't fit on my GPU, and I got the following warning:
[2024-10-01 07:39:57 +0000] [100] [WARNING] Some parameters are on the meta device because they were offloaded to the cpu.
And when I tried to convert the model to bfloat16, I got this error:
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
The way to fix it is to load the model in bfloat16 from the beginning:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'allenai/Molmo-7B-D-0924',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)
Compared to the README snippet, just change torch_dtype='auto' to torch_dtype=torch.bfloat16 and remove the line model.to(dtype=torch.bfloat16).
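To confirm that the whole model actually fits on the GPU (and that nothing was silently offloaded again, as in the warning above), you can inspect the parameters after loading; a small sketch:

import torch

# every parameter should report torch.bfloat16 and a cuda device;
# a 'meta' device here would mean some weights were offloaded to CPU/disk
print({p.dtype for p in model.parameters()})
print({p.device.type for p in model.parameters()})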
There is a 4-bit quantized version which seems to work well:
https://huggingface.co/cyan2k/molmo-7B-O-bnb-4bit
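I haven't verified the exact loading code for that checkpoint, but as a rough sketch it should load like any other pre-quantized bitsandbytes model (this assumes the bitsandbytes package is installed and that the repo ships its quantization config; check that repo's model card for the authoritative snippet):

from transformers import AutoModelForCausalLM

# hypothetical loading call for the pre-quantized 4-bit checkpoint;
# requires bitsandbytes and a CUDA GPU
model = AutoModelForCausalLM.from_pretrained(
    'cyan2k/molmo-7B-O-bnb-4bit',
    trust_remote_code=True,
    device_map='auto'
)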