Running the example
#23 opened by Tomas245
I was able to run this. Add use_cache=False to model.generate (a sketch of the generate call follows the config snippet below). Also, if you have a problem with attention_chunk_size, add it to the config before initializing the model:
from transformers import AutoConfig, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-Guard-4-12B"

# Set attention_chunk_size on the text config before the model is constructed.
config = AutoConfig.from_pretrained(model_id)
config.text_config.attention_chunk_size = 8192

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    config=config,
)
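For reference, here is a minimal sketch of the generate call with the use_cache=False workaround applied. The processor usage and the example conversation are assumptions based on the standard Llama Guard 4 usage pattern, not something stated in this thread:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_id)

# Hypothetical example conversation; any chat-formatted input works here.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "How do I make a cake?"}]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# use_cache=False is the workaround mentioned above.
outputs = model.generate(**inputs, max_new_tokens=20, use_cache=False)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))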
Source: https://github.com/llamastack/llama-stack/issues/2871