Running the example

#23
by Tomas245 - opened

I was able to run this by adding use_cache=False to model.generate. Also, if you have a problem with attention_chunk_size, add it to the config before initialization:

from transformers import AutoConfig, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-Guard-4-12B"

# Set attention_chunk_size on the text config before the model is built
config = AutoConfig.from_pretrained(model_id)
config.text_config.attention_chunk_size = 8192

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    config=config,
)
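For the use_cache=False part, here is a minimal sketch of the generate call. The processor usage and the example message follow the usual Llama Guard chat-template pattern and are illustrative, not taken from the original post:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_id)

# Illustrative input; any chat-formatted prompt works the same way
messages = [
    {"role": "user", "content": [{"type": "text", "text": "How do I make a cake?"}]},
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    use_cache=False,  # workaround described above
)

# Decode only the newly generated tokens
print(processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)[0])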

Source: https://github.com/llamastack/llama-stack/issues/2871
