DOA
Am I the only one around here who remembers that just yesterday using consumer-grade GPUs was still a thing?
@MrDevolver Wait for the quants. Touch grass in the meantime.
What does quantization have to do with anything here? I'm talking about the fact that this model is so big that people with 8GB of VRAM (still widely used) won't be able to load it at all, or if they can load it, it's going to be very slow and/or degraded in output quality, and quantization will change absolutely nothing about that. So maybe stop touching grass and take the actual hardware limitations of consumer-grade GPUs into account.
Well, it will run fine on my 128GB MacBook Pro. And there are probably a lot of people in the market for the 128GB AMD Ryzen AI Max+ 395 right now. And NVIDIA partners will be selling those $3k DGX Spark things that have 128GB. So a 4 to 6 bit quant of this model will be perfect for that kind of hardware that has plenty of RAM but a slower GPU and lower bandwidth.
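As a rough sanity check on those numbers, here's a back-of-the-envelope estimate. Scout's ~109B total parameter count is the figure from Meta's announcement; the 10% overhead factor is just a guess, and KV cache is not included:

```python
# Back-of-the-envelope size of a quantized checkpoint (approximate; real GGUF
# sizes vary with the quant mix used for different tensors).

def quant_size_gb(total_params: float, bits_per_weight: float, overhead: float = 1.10) -> float:
    """Approximate in-memory size of a model quantized to the given bit width."""
    return total_params * bits_per_weight / 8 * overhead / 1e9

scout_params = 109e9  # Llama 4 Scout: ~109B total parameters (17B active), per Meta

for bpw in (4, 5, 6):
    print(f"Scout @ {bpw}-bit: ~{quant_size_gb(scout_params, bpw):.0f} GB")
# -> roughly 60, 75, 90 GB, so a 128GB machine still has headroom for context.
```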
Probably there will be other Llama-4 versions in the future at different sizes anyway...
I doubt it and here's why:
Up to Llama 3.2 the trend was that they released models for consumer-grade GPUs early. Llama 3.2 was unexpectedly very small and less capable, and Llama 3.3 was already a fairly big model that only high-end GPUs can work with efficiently; there was no small model for 8GB VRAM and the like. Now Llama 4 is yet another model that is in fact even bigger than Llama 3.3, and again there is no small model. So as a matter of fact, the last Llama you could use on a regular computer with 16GB of RAM and 8GB of VRAM is Llama 3.1 8B (or Llama 3.2, which is technically a downgrade compared to Llama 3.1 due to its smaller size). The way I see it, they set the trend with the previous two versions, effectively leaving users of that kind of hardware out of the AI game.
Many people have already moved on from Llama to alternatives like Mistral Small and Qwen 2.5. With Qwen 3 around the corner, something tells me Qwen is going to steal the show again, and hopefully they won't forget the small-model users, especially not after Meta's move with Llama 4.
I haven't tested it yet, but I would think that a GGUF quant will be great here for loading as much as possible into the GPU and keeping the rest in CPU RAM, specifically because this is a MoE model with much higher efficiency to make up for the slower CPU RAM.
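A minimal sketch of that kind of GPU/CPU split with llama-cpp-python (the GGUF filename below is hypothetical, and the right `n_gpu_layers` depends entirely on how much VRAM you have):

```python
# Partial GPU offload: layers that fit go to VRAM, the rest stay in system RAM.
# Requires: pip install llama-cpp-python (built with GPU support).
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-4-Scout-Q4_K_M.gguf",  # hypothetical filename; use your actual quant
    n_gpu_layers=20,   # offload as many layers as fit in VRAM
    n_ctx=8192,
)

out = llm("Explain mixture-of-experts in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```

Because only a fraction of the experts are active per token, the CPU-resident weights hurt throughput less than they would with a dense model of the same total size.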
@MrDevolver There absolutely will be smaller models.
It's clear that Scout and Maverick were trial runs, scaling up for Behemoth. Later they will release distills of Behemoth. Sharing the in-between stages is a positive.
Hmm. 7B model? What's that distilled into? Qwen 2.5 7B? Or good old Llama 2 7B? 🤔 That would kinda make sense, considering that one of those new secret models on lmarena is saying it's Meta's Llama 2, but at the same time it's way smarter than the original Llama 2 could ever hope to be. 😂
Probably there will be other Llama-4 versions in the future at different sizes anyway...
I doubt it and here's why:
...
Well in the transformers source code for llama4: https://github.com/huggingface/transformers/blob/9bfae2486a7b91dc6d4380b7936e0b2b8c1ed708/src/transformers/models/llama4/modeling_llama4.py#L997
it says this:
model = Llama4ForCausalLM.from_pretrained("meta-llama4/Llama4-2-7b-hf")
which doesn't guarantee anything but still...
Also, Zuck said that the reasoning models are yet to be released, so those may come in other sizes as well.
What's that distilled into? Qwen 2.5 7B? Or good old Llama 2 7B?
Not sure if that's a joke, but it would definitely start from its own Llama 4 base; they aren't exactly resource-starved. I wouldn't be surprised if there's a few weeks' wait, though.
Well, you did mention distills. If that were the case, they would need an existing 7B base model to distill Behemoth into, just like DeepSeek did for their smaller R1 models, where they chose Qwen 2.5 7B for the 7B variant. So not a joke, but an honest question.
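For context on what "distilling into an existing base" means in practice, here is a minimal sketch of one common recipe, logit-level distillation in generic PyTorch (this is not Meta's or DeepSeek's actual pipeline; DeepSeek's R1 distills were reportedly fine-tuned on teacher-generated text, but either way you start from a student base model):

```python
# Minimal logit-distillation loss: blend the teacher's soft targets with the
# usual cross-entropy on hard labels. Generic sketch, not any lab's recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # student_logits / teacher_logits: (batch, vocab); labels: (batch,)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```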
Get a bigger card or use a smaller model, there are plenty of good models that will fit in your card
What's up with that "Get a bigger card" advice? Is that the "git good" equivalent for AI enthusiasts?
Anyway, it's not always as simple as "getting a bigger card", and it's certainly not always an option, for various reasons. But like I already said, many people have already moved on from Llama to alternatives like Mistral Small and Qwen 2.5, and Qwen 3 is around the corner... I'm still hoping to see a smaller model based on Llama 4, something actually usable for mere mortals, but thank God our lives don't depend on it.
There is no HF org named meta-llama4; it's probably just find-and-replace.
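If you want to verify that yourself, here's a quick check with the huggingface_hub client, using the repo id from the docstring above:

```python
# Check whether a repo id actually exists on the Hub.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

api = HfApi()
try:
    api.model_info("meta-llama4/Llama4-2-7b-hf")
    print("repo exists")
except RepositoryNotFoundError:
    print("no such repo - the path in the docstring is just placeholder text")
```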
I can run the latest DeepSeek V3 on CPU, so Llama 4 is a complete bust.
Wait a few days. You'll be able to run Scout and Maverick.
But such extreme quantization is hard work to get right.
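To illustrate why, here's a toy example that round-trips a weight matrix through naive symmetric per-tensor quantization at different bit widths. It's purely illustrative and nothing like the grouped, importance-weighted schemes real GGUF quants use:

```python
# Toy illustration of why very low-bit quantization degrades quality.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # stand-in weight matrix

def quant_rmse(w, bits):
    """Round-trip w through symmetric per-tensor quantization and return the error."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return float(np.sqrt(np.mean((q * scale - w) ** 2)))

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit RMSE: {quant_rmse(w, bits):.5f}")
# The reconstruction error grows steeply as the bit width drops, which is why
# sub-4-bit quants need smarter schemes (grouping, importance matrices) to stay usable.
```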