Whimsical Waffle: The Curious Case of LLMs and Their Linguistic Shenanigans

#4
by mradermacher - opened

yay

mradermacher changed discussion status to closed

Actually, to get the DRYRUN test, all we would have to do is to get rid of the MAP_POPULATE in:

mmap(NULL, 31937041504, PROT_READ, MAP_SHARED|MAP_POPULATE, 4, 0) = 0x7ff79c600000

Because I think with the right switches, we can otherwise avoid touching the memory (alternatively, map /dev/null). Of course, the measurements allowed by DRYRUN are much more worthwhile. Basically, it's the killer feature if we could make it available and it turns out to be feasible. That's the really interesting (to me) todo point: create a script that downloads the gguf header only from huggingface and recreates a dummy gguf. Too bad the gguf file format is so badly designed - you have to decode the whole header incrementally to know how long it is.

(using fuse to mount a file via https is cheating)

btw., in the case of blacksheep, I take the list of quants done from the "quantize" script and patch the job like this:

"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S",

and for the jais models, for example, I removed the *0, *1 and IQ4_NL quants, essentially:

"squants": "x-f16 Q4_K_S Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS",
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M Q4_K_M IQ2_M Q6_K IQ4_XS Q2_K_S IQ1_M Q3_K_S IQ2_XXS Q3_K_L IQ2_XS Q5_K_S IQ2_S IQ1_S Q5_K_M IQ3_XS IQ3_S",

it's in theory possible to do this when adding the job (not via llmc, because reasons), but that requires us to predict with some accuracy that this will happen, so is rarely useful

Actually, to get the DRYRUN test, all we would have to do is to get rid of the MAP_POPULATE in:
mmap(NULL, 31937041504, PROT_READ, MAP_SHARED|MAP_POPULATE, 4, 0) = 0x7ff79c600000

I'm a bit confused. Dryrun doesn't even use mmap. I explicitly disable it and even print "mmap is not supported for dry-run so it is now disabled" as a warning if you don't specify --no-mmap. Why would you even want mmap for dry-run? You are not allocating any memory when loading the model so what would be the point of it?

Because I think with the right switches, we can otherwise avoid touching the memory (alternatively, map /dev/null).

What do you mean by touching memory? No additional RAM or GPU memory should get allocated when loading a model. Obviously llama.cpp requires some memory to function, like any application, but that is so little it can be ignored.

Of course, the measurements allowed by DRYRUN are much more worthwhile. Basically, it's the killer feature if we could make it available and it turns out to be feasible. That's the really interesting (to me) todo point: create a script that downloads the gguf header only from huggingface and recreates a dummy gguf. Too bad the gguf file format is so badly designed - you have to decode the whole header incrementally to know how long it is.

I don't think the header can be that big so you can likely just download enough for the full header to always be present.

btw., in the case of blacksheep, I take the list of quants done from the "quantize" script and patch the job like this
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S"

I assume you are setting this inside llmjob edit.

Wouldn't the scripts synchronize when it is available again?

Altogether it's 3GB, not just scripts, but also, of course, llama.cpp. I added a hack so when removing the disable flag it will sync automatically, but I also update llama.cpp from home, and every node has a different combination of llama.cpp variants (probably the easiest way around is to change that).

But, yeah, that's not effectively automatable.

Yes even for me it would now be inconvenient to switch as I memorized the path so well.

embrace the difference :)

Oh, let's hope for the best. No imatrix failure so far but a lot of imatrix tasks will only be started at 22:00 due to most of them currently being timeofday blocked.

I am pretty sure the dryrun test works - the only way it could fail is if it somehow succeeds despite the model being broken. Likely there are some tests in llama.cpp that are only done at inference time; the question is how many, and are they important :) We will find out.

Just so you know, DRYRUN is supposed to work with every llama.cpp executable that loads a model, so you are not limited to llama-cli.

To... some extent (i.e. tracking allocations)? Surely you have not found a generic way to exit all of these at just the right time.

Then just don't use llama-cli but any other one that doesn't do this.

Haha, "just". Love it :) Anyway, are there any? There is the server, but the server seems to do the same thing.

Nice. No idea why everyone keeps renaming their models, but us having a different name makes our models hard to find, so automated renames would be quite useful.

They rename it because they want to be able to erase it and create a different one without having to come up with a new final name, in case it sucks. Models are also regularly moved, and sometimes even apparently cloned, to other users.

It does make them harder to find, but at least I stopped using the search function by hf and started to use the quantisations link.

That would be amazing! There are quite a lot of factors that influence vram usage but maybe you can find a pattern by playing around with dryrun.

I would allow the user to specify VRAM for 0, 1 or 2 gpus, tensor split, some flags like flash attention, and then probably do a binary search to find the maximum -ngl value.

models always show the date when they were last updated

You'll have to check quant file dates anyway if you need some kind of date. And then, it's pretty useless.

I guess we can at least try to update them in chronological order, so the order stays the same. Or can we?!?

The updates would almost certainly go from newest to oldest, even (or rather, reverse order in how hf lists them for me), with some randomness.

GIT_COMMITTER_DATE and GIT_AUTHOR_DATE environment variables before committing using git

If I can't do it via the api it will not happen. Messing in scripts with git will be a disaster. Besides, will the server-side git really just accept any client-side garbage date when pushed?

as this will hopefully be the last time we ever edit all of them.

The other a-ha moment I had last week was when I realised that this is the problem and must give. I have versioned the model cards now, so we can keep any number of different compatible card formats and update at our own pace.

I don't think with us publishing 100+ repos a day anybody would care about 20000 updates even per day.

I'm a bit confused. Dryrun doesn't even use mmap. I explicitly disable it and even print "mmap is not supported for dry-run so it is now disabled" as a warning if you don't specify --no-mmap. Why would you even want mmap for dry-run? You are not allocating any memory when loading the model so what would be the point of it?

I was talking about an alternative way to achieve just the validity testing without changing llama.cpp. It's entirely hypothetical.

I don't think the header can be that big so you can likely just download enough for the full header to always be present.

The header is pretty massive - tiny if you look at the whole file, but many megabytes in size, enough to warrant an optimisation. My first computer had ~100 octets of usable memory. I saw amazing software written in 20k of memory. When I see a bash process using 2MB of RAM I regularly get dizzy.

Anyway, gguf is very wasteful, for example, every vocabulary entry is 8 bytes string length + string. Also, "likely enough" means you still have to be prepared for it to not be enough in edge cases.

And to be honest, what worries me most is that AWS typically charges for the full file even if only a few bytes of it are being downloaded. But since the gguf parser on the hf page exists, I am sure it doesn't matter :)

To... some extent (i.e. tracking allocations)? Surely you have not found a generic way to exit all of these at just the right time.

It should work for the majority of them. Almost all that load a model are using the same code to do so. I just tested llama-imatrix, llama-perplexity, llama-simple, llama-simple-chat and llama-run, all of which were fully compatible with DRYRUN despite me never testing them before. It's not just that they work - they also tell you how much memory would be required to load the model in a way that fulfills their purpose, as they essentially just load the model with the exact parameters they require.

Haha, "just". Love it :) Anyway, are there any?

No idea. Try the ones I mentioned above and if they all do it then this is likely something in the model loading code, in which case I can take a look at the code and see if we can change this.

I would allow the user to specify VRAM for 0, 1 or 2 gpus, tensor split, some flags like flash attention, and then probably do a binary search to find the maximum -ngl value.

That would be so awesome. This is actually exactly what I'm currently using DRYRUN for myself.

Keep in mind that DRYRUN only tells you the memory required to load the model and allocate enough memory for its context. Memory used during inference for things like attention is not considered but is easy to estimate. In fact, more memory is required to load a model if flash attention is enabled due to additional overheads associated with its implementation.

If I can't do it via the api it will not happen. Messing in scripts with git will be a disaster.

Totally understandable.

will the server-side git really just accept any client-side garbage date when pushed?

All git servers seem to. Git servers kind of trust client-side garbage by design. I had to spoof dates/names/emails for author/committer so many times in the past and not once had a git server refuse the commit. The only thing I'm not sure about is whether HuggingFace uses the time in the git commit like GitHub/GitLab do or if it uses the server time of the push. Now I'm a bit curious, so the next time I upload a model I might try it.

The other a-ha moment I had last week was when I realized that this is the problem and must give. I have versioned the model cards now, so we can keep any number of different compatible card formats and update at our own pace.
I don't think with us publishing 100+ repos a day anybody would care about 20000 updates even per day.

Yes it should be fine unless we hit some kind of rate limit.

The header is pretty massive - tiny if you look at the whole file, but many megabytes in size to warrant an optimization. My first computer had ~100 octets usable memory. I saw amazing software written in 20k of memory. When I see a bash process using 2MB of RAM I regularly get dizzy.

My first "Gameboy" which in fact was a Voyage 200 calculator for school had 188 kB RAM and 2,7 MB ROM and it was enough to play all kind of games. I even had something like Maro Maker on there. I actually had that Voyage 200 calculator 5 years before I had my first mobile phone and used it from everything from reading, writing, programming and gaming.

In case you wonder, my first PC ran Windows 2000 with 13 GB of HDD storage and I think 128 MB of RAM. My first programming language was BlitzBasic, which I used to write PC games, followed by Compact-C, which I used to program C-Control Pro microcontrollers that had 2 KB of usable RAM, 10 KB of usable flash storage, 1 KB EEPROM and a 14.7456 MHz CPU, so I know the feeling.

Anyway, gguf is very wasteful, for example, every vocabulary entry is 8 bytes string length + string.

That is indeed terribly wasteful. 1 byte would have been enough.

Also, "likely enough" means you still have to be prepared for it to not be enough in edge cases.

Which should be fine as llama.cpp was so nice to put stupid limits everywhere so most edge cases likely already failed when we tried converting them into GGUF.

And to be honest, what worries me most is that AWS typically charges for the full file even if only a few bytes of it are being downloaded. But since the gguf parser on the hf page exists, I am sure it doesn't matter :)

S3 only charges for the actually used bandwidth as far as I'm aware. So if you only download the first 10 MB, HuggingFace should only be charged for 10 MB. They do charge a very low amount per 10K API calls, but this doesn't matter at all as we only have around 500K quants. I'm mostly worried that HuggingFace might be using intelligent tiering, in which case us accessing all the quants might cause them to be copied into hot storage, which then would cost them the transfer fee plus 30 days of hot storage. But in any case, there is not much we can do about any of this unless we find a storage usage pattern and can tell, based on one quant, how much all the others require, which I think might be possible.

Memory used during inference for things like attention is not considered but is easy to estimate. In fact, more memory is required to load a model if flash attention is enabled due to additional overheads associated with its implementation.

That's a bummer then... So how would you easily estimate it? And what do you mean by more being required to "load" a model - after loading, flash attention surely uses less memory.

Yes it should be fine unless we hit some kind of rate limit.

That doesn't worry me either - I envisaged some kind of bulk update because I thought versioning the readmes is a bad idea. But, I changed my mind. If we hit a rate limit, it will take a few years to update old repos - so what.

Voyage 200 calculator for school

I got the first HP 48SX in Germany (or so I was actually told by HP). Sigh. HP calculators... were so nice...

Windows 2000

Wow. That is so long after I had switched to GNU/Linux. (I switched from DOS to Linux just before win 3 became ubiquitous (in 1994, with 1.0.2 or something - I was even late to the game, or so it felt))

That is indeed terribly wasteful. 1 byte would have been enough.

Yeah, or 4 octet (or even 8 octet) header length + json/msgpack/cbor/... and yes, one octet would be enough if you limit strings to 127 octets, but to be fair, that's a limit of the encoder, not a limit of the format.

I'd say whoever designed it (well, gerganov) was probably paranoid about running into arbitrary 4GB limits anywhere. Puzzlingly enough, though, the primitive type numbers (there are 13) are stored as 32-bit ints. And no, everything is just octet-aligned, so it's nothing to do with that.

To its defence, the gguf decoder I wrote in Perl is just 80 lines of code. So in that sense, it lends itself to a very simple implementation. But using an existing JSON decoder with that header would just be 3 lines or so...

I think ggerganov has a major fear of external dependencies - even more than me, and I thought I was a bit on the extreme side.

S3 only charges for the actually used bandwidth as far I'm aware.

I admit I am no expert, but it seems to be a well-known attack to request only part of a large file and get billed with much larger transfer costs, because AWS does not bill octets downloaded but octets prepared for download, regardless of how much was actually used (or even requested). So yes, only actually used bandwidth, but it's their internal fantasy made-up bandwidth, not the external customer-measurable bandwidth. It is possible that it only affects some S3 storage products, but it's a concern. Well, it's not a concern, because huggingface does it themselves, and I am happy to cache things...

S3

And don't they also bill GET requests? So there must be some optimal transfer size - probably in the megabyte range?

Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0:

load_tensors: loading model tensors, this can take a while... (mmap = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected  5120, 152064, got  5120, 151665,     1,     1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'Methuselah-PSR-B1620-26b-14B-Exp.gguf'
main: Dryrun compleated!

changed the test to this:

      if DRYRUN= llama llama-cli -m "$SRC.gguf~" --no-warmup -n 0 -t 1 -no-cnv -st </dev/null 2>&1 | tee -a /dev/fd/2 | grep -q ": failed to load model"; then

That's a bummer then... So how would you easily estimate it? And what do you mean by more being required to "load" a model - after loading, flash attention surely uses less memory.

DRYRUN tells you how much memory you need to load a model and reserve the memory required for its context. So if you have as much memory as DRYRUN tells you, you will be able to load the model. However, depending on context and prompt, you might still OOM during inference as some memory is allocated during inference for algorithms like attention. The memory required for attention should more or less be the same for a given context with a given attention method. So you can likely measure it once and add it onto what DRYRUN tells you is required to load the model. Flash attention needs more memory during the initial load, but the attention algorithm itself uses linear instead of quadratic memory for a given context, which for large contexts should be more memory efficient.

That doesn't worry me either - I envisaged some kind of bulk update because I thought versioning the readmes is a bad idea. But, I changed my mind. If we hit a rate limit, it will take a few years to update old repos - so what.

The limit can't be so bad that it will take years. We should try to update them in a reasonable timeframe as the current model card isn’t that good in my opinion.

And don't they also bill GET requests? So there must be some optimal transfer size - probably in the megabyte range?

They do, but it is $0.0004 per 1,000 requests, so if we need 500K of them that is $0.20, which is so low it is almost not worth mentioning.

HuggingFace will be fine:
"There are no retrieval charges in S3 Intelligent-Tiering. If an object in the infrequent access tier is accessed later, it is automatically moved back to the frequent access tier. No additional tiering charges apply when objects are moved between access tiers within the S3 Intelligent-Tiering storage class."

So if they use Intelligent-Tiering they are not getting charged for AWS being stupid, besides paying slightly more for files being in less cold storage for 30 days, which is almost nothing compared to what retrieval charges would be.

In case you wonder, transfer from S3 to Europe (Zurich) is $0.02 per GB, and nothing if it only goes to Amazon CloudFront (which has its own billing for bandwidth). Based on their website they really seem to only count data that actually gets sent to the internet, and intelligent tiering has no retrieval fee, so they really shouldn't bill for the data we don't download unless they found some kind of loophole to trick their customers.

But in any case, there is nothing we can do about any of this.

Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0

That's so stupid. Sorry for this mistake. I forgot about that. I will fix it this evening.

changed the test to this

This will work in the meantime.

DRYRUN tells you how much memory you need

I realise what you mean. I guess it can also be handled by telling the user to reduce ngl a bit when in doubt. It will still be far more useful than the manual trial runs I have to do now.

The limit can't be so bad that it will take years.

I meant to say "even if it takes a few years..." and I didn't expect the repo create limit to be as bad as it is. Or erratic(?) - it still feels weird to get rate limited sometimes, even when we don't crunch through lots of models.

S3

Thanks, helps a lot :)

This will work in the meantime.

We are not in a hurry - assuming that we always get "failed to load model". Eh, even if it would not, it'd still be a great improvement :)

model page

Well, my plan is to get rid of graphs and everything but the download table and the links, and also more or less fully generate the page and move all metadata to yaml. The only hindrance is that it is a lot of work, and even a single misplaced space or fixed typo will cause havoc :) Just not so much fun. But I am slowly working towards making it doable (and gaining motivation by not forcing myself to work on it :)

If you have any concrete input (text fragments, layout) on the model page, I am happy to collect it. The general trend, though, should be to move as much of the info to the external model page, so there is only one place to improve. Unfortunately, the model download page already needs revamping, too, and already goes too much in the direction of web development for my taste :)

Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0:

This should now be fixed in the latest version. I kind of forgot about llama.cpp sometimes using exceptions to jump out of heavily nested functions skipping all the code that would otherwise get executed by following the normal return path. I personally don't really like throwing exceptions somewhere and handling them on a completely different location - it feels like a modern version of goto but without labeling where it jumps to.

I fixed this by adding a dedicated exit point for dry-run inside common.cpp to no longer mess with llama.cpp's exception handling and removing all modifications from main.cpp. This now ensures exceptions skip past the dry-run dedicated exit point and are instead getting properly handled by main.cpp

I also updated the mradermacher branch to latest llama.cpp so we now have Gemma 3 and experimental Gemma 3 vision support.

You guys might find this interesting: https://arxiv.org/abs/2503.03592

Quote from conclusion:

Further, the usage of importance matrices written in non-English does not significantly improve performance on non-English datasets and might in fact slightly harm it. However, this reduction in performance is not statistically significant.

You guys might find this interesting: https://arxiv.org/abs/2503.03592

Thanks a lot for sharing! I looked at the paper and am really surprised by the result. Their testing methodology looks clean and the results tell quite a clear story. This means our primary English imatrix dataset is much better for non-English models than we thought. I now regret having non-English models only queued for static quants.

@nicoboss I assume you queued all/most of the nice lint check models that all fell through the llama loading code check? :)

Here are all the errors (deduplicated), and they do all seem legit (and therefore I have nuked them):

/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
/llmjob/llama.cpp-cuda512/src/llama.cpp:8666: GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor") failed

I suspect these are all pure embeddings and therefore can't be used with llama-imatrix?

Regarding the paper, it's one of the results I expected (either it's no big deal, because a lot about imatrix training data seems irrelevant, or it has a big effect). But finally I can choose between these extremes!

I also feel much better about my training data now, which is pretty incoherent. But given that random tokens seem to work relatively fine, it would actually be surprising if it were so detrimental.

The question is, what does that tell us about how llms store knowledge? and how about IQ quants, which are far, far more sensitive to imatrix weights?

@tdh111 anyway, very much appreciated, I would have never seen this paper without you

I assume you queued all/most of the nice lint check models that all fell through the llama loading code check? :)

I queued quite a lot of trending models, some of which turned out to be bad. Those errors are all legit and can be nuked.

I suspect these are all pure embeddings and therefore can't be used with llama-imatrix?

Yes, exactly. I will improve my trending model discovery scripts to filter out embeddings in the next version. I will also check if there is a way dry-run can detect this. The main issue is that this is a check that occurs at inference time inside llama_decode_impl and not while loading the model.

The last 2 failures you can nuke if you want.

https://huggingface.co/cl-nagoya/ruri-large-v2 likely requires manual GGUF conversion due to ModuleNotFoundError: No module named 'fugashi'

No idea why https://huggingface.co/google/flan-t5-xxl fails to download, but if the just-started redown fails I guess I will provide the GGUF manually there as well.

Edit: Never mind, cl-nagoya/ruri-large-v2 is likely an embedding as well, so I nuked it as we don't care about them.
Edit2: I think redown fixed flan-t5-xxl, so it must have just been some random HuggingFace download error.
Edit3: No flan-t5-xxl failed again: ValueError: Missing or incomplete model files: ['model-00001-of-00005.safetensors']

anyway, very much appreciated, I would have never seen this paper without you

Thanks a lot for sharing!

Glad you both liked it.

The question is, what does that tell us about how llms store knowledge? and how about IQ quants, which are far, far more sensitive to imatrix weights?

Both of those are separate from the paper I linked, but this paper is relevant to your first question: https://arxiv.org/abs/2503.05613 .

Your second question about IQ quants is best answered by ikawrakow, who would most likely answer if asked in a discussion post in ik_llama.cpp. I feel like I know the answer, but I'm not confident enough to give it because I would rather not spread potentially wrong information. Now that you ask, though, I'm curious if the same holds true for his new quant types (IQ_K), which at low bpw offer better performance than I-quants and at higher bpw offer better performance and quality compared to K-quants.

I will also check if there is a way dry-run can detect this.

Be careful - the more checks you add, or rather, move, the more you will diverge from future llama.cpp versions that might do things differently. There is a trade-off here, between catching more things and maybe blocking future roads.

some random HuggingFace download error.

Possible, but unlikely, as hfd retries pretty aggressively. When you open a (s) in audit, the download is printed (it's in MODEL/log, too). If it's a new model, a much more common failure mode is actually not-yet-uploaded files. For example, YOYO-AI loves to make elaborate model cards before actually uploading all files :/

I'm unexpectedly busy (and probably rather tired) for the next few weeks. I'll try to take care, but don't be alarmed if things get a bit more erratic.

also not caught:

llama_model_quantize: failed to quantize: key not found in model: llama.context_length

this is actually caught by quantize, so causes extra human work, but not extra computational work (it's caught during static jobs).

interesting that quantize even bothers...

and clearly, nice level 1200 is the junk class

How do I know if /tmp/quant/Samantha-1.11-70b-i1-GGUF/imatrix.dat is an old or new imatrix? I unfortunately nuked the existing imatrix repo before hash-comparing them. I checked for Samantha-1.1-70b, which is basically the same case, and they were different, so I'm almost certain the imatrix for Samantha-1.11-70b got recomputed as well. It seems like the case where the existing imatrix gets copied after a nuke only happens if it was somewhat recently generated, but not for these 1-year-old repositories where static quants never even existed. In the future I will obviously use nukeall so none of this will be an issue.

and clearly, nice level 1200 is the junk class

I noticed this as well. I nuked so many errors this morning when I woke up. We had almost entire hosts filled with errors.

also not caught:
llama_model_quantize: failed to quantize: key not found in model: llama.context_length

This does get caught using dry-run. Not sure why you think it does not. I even tested one of the models that had this error today to confirm:

llama_model_loader: mmap is not supported for dry-run so it is now disabled
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 12.55 GiB (16.00 BPW) 
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: llama.context_length
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/nico/law-LLM.gguf'
main: error: unable to load model

I'm unexpectedly busy (and probably rather tired) for the next few weeks. I'll try to take care, but don't be alarmed if things get a bit more erratic.

No problem. Now that you gave me all these amazing tools and I got familiar with using them, I should be able to solve most of the issues myself, hopefully letting you focus as much on your job as possible. Just ignore things and only respond to what is important, to save time. I do the same when I'm busy. Feel free to ignore user requests and audits as I can handle them myself.

Interesting, so nico2 has still not turned itself off despite the repository creation issue being long gone and it being 3 hours after 17:00:

nico2    nice size (static/imatrix) -- jobs 3/4-12 maxm 300 free 1479 budget 769 uploads 0 hfd 0 32c
         1200    1 s  Qwen2.5-0.5B-SFT-merge                       blocked/frozen/timeofday repo create (interrupting)
         1200   15 s  mistral-openorca-openplatypus-alpacagpt4     blocked/admin/paused
         1200   17 s  WEPO-llama-3-8b                              blocked/admin/paused

Regarding Snowflake Arctic Instruct the source GGUF is under /apool/snowflake-arctic-instruct.gguf. We only want to regenerate the imatrix and the imatrix quants but keep the static quants. Before you add it you need to nukerepo https://huggingface.co/mradermacher/snowflake-arctic-instruct-i1-GGUF but only this one and NOT the static quants! You also need to archive the current imatrix quants WITHOUT using nukeall as we want to keep the static quants.

How do I know if /tmp/quant/Samantha-1.11-70b-i1-GGUF/imatrix.dat is an old or new imatrix?

If you are lucky, from the mtime. Otherwise, if we have a repo for it, we have it cached. If we don't have a repo for it, we shouldn't have it cached. In this case, it's old, though:

-rw------- 1 root root 4.6M Dec 31 15:16 imatrix-remote/Samantha-1.11-7b.imatrix
-rw------- 1 root root 7.2M Dec 14 13:59 imatrix-remote/Samantha-1.11-13b.imatrix
-rw------- 1 root root 25M Mar 10 08:47 imatrix-remote/Samantha-1.11-70b.imatrix

I assume I should remove it? (done)

If you are lucky, from the mtime. Otherwise, if we have a repo for it, we have it cached. If we don't have a repo for it, we shouldn't have it cached.

Thanks a lot! Can you in this case please check for Samantha-1.1-70b as well and delete it if it is older than a few weeks? I have the feeling it has generated a new one despite https://huggingface.co/mradermacher/Samantha-1.1-70b-i1-GGUF existing, as the sha256 hash of the imatrix file I have locally doesn't match the one inside the HuggingFace repository.

@mradermacher The status page and telnet 10.28.1.1 16732 have already been frozen for an hour, but llmc audit still works without any issue, which is strange - shouldn't it break as well if someone doesn't release the lock? I plan on executing llmc killall9 should the issue still persist in half an hour.

Interesting, so nico2 has still not turned itself off despite the repository creation issue being long gone and it being 3 hours after 17:00:

repo creation is currently not interruptible, and no, the status shows it's frozen, so still ongoing.

there are two issues here: a) jobs should no longer be frozen on nico1 - if a job somehow isn't finished before 17:00, it should probably continue to run and b) if a job is still active, it will not shut down

i removed the cron rules causing job freezes. but repo creates would still keep it on indefinitely atm.

i also don't think snowflake will help us. I think we should push a few of the big 1400 jobs to prio 1200 or so per day. maybe. will have to see.

I couldn't queue this morning, this is probably the root cause.

-rw------- 1 root root 25M Mar 10 05:55 imatrix-remote/Samantha-1.1-70b.imatrix

moved away

if there already is a job i need to manually add the imatrix job

repo creation is currently not interruptible, and no, the status shows it's frozen, so still ongoing.
there are two issues here: a) jobs should no longer be frozen on nico1 - if a job somehow isn't finished before 17:00, it should probably continue to run and b) if a job is still active, it will not shut down
i removed the cron rules causing job freezes. but repo creates would still keep it on indefinitely atm.

Thanks a lot. No worries about the repo creation. It will eventually create it and shut down. The main reason it didn't was likely because of timeofday. The rate limit usually doesn't block a specific task for more than around an hour unless you get very unlucky and always lose the race to create a repo once a slot gets free.

I couldn't queue this morning, this is probably the root cause.

We did over 300 models today, as 19 hours ago we had 1779 models in the queue and now there are 1492, not even considering all the ones we queued. It was crazy. I had to llmc audit sometimes even multiple times per hour as we went through so many models. We need to do a somewhat healthy mix of differently sized models so we don't end up having days where we only do small ones, or we will get rate limited. Next time I will queue some myself earlier.

I think we should push a few of the big 1400 jobs to prio 1200 or so per day. maybe. will have to see.

That sounds like a great idea for days where there are not many great new big models.

i also don't think snowflake will help us.

It will at least keep nico1 busy and it's one of the massive models we had to do anyway. I'm also closely following llama.cpp's decision on which MLA algorithm they settle on. Depending on which one they choose, we may or may not need to requantize all the DeepSeek V2/V3/R1 models.

-rw------- 1 root root 25M Mar 10 05:55 imatrix-remote/Samantha-1.1-70b.imatrix
moved away

Thanks a lot!

if there already is a job i need to manually add the imatrix job

Now that I know the current imatrix was outdated, I will secure the source GGUF, use nukeall and queue them again. That should be the cleanest option and not require you to do anything.

I tried fixing the frozen status page using llmc killall9 but it timed out...

nico1 ~# llmc killall9
nico1
back
leia
nico2
rich1
kaos
marco
rain
Killed llmjob(2720126) with signal 9
Killed llmjob(1136699) with signal 9
Killed llmjob(1136722) with signal 9
Killed llmjob(1137378) with signal 9
Killed llmjob(296290) with signal 9
Killed llmjob(514440) with signal 9
Killed llmjob(3434878) with signal 9
Killed llmjob(661385) with signal 9
llmjob: no process found
Killed llmjob(2256273) with signal 9
nico2: Connection timed out

At least I now know which node is to blame. Guess time to turn on nico2 again to fix this.

With nico2 turned on, llmc killall9 terminated successfully within a second, but the status page is still frozen. This issue really seems quite different from how the frozen status page normally behaves. I turned off nico2 again as turning it on didn't help solve the issue.

Oh, maybe I shouldn't have used llmc killall9. When I now check llmc audit I see many entries like this - I'm not entirely sure if they're related, but they are not something I've seen before:

ionice -c3 chrt -b 0 systemd-run --scope --unit=llmjob-wrap-omega1.3-static-2719966 -G -p MemoryMax=32G
/llmjob/share/bin/quantize: line 295: 2720126 Killed                  llmjob hf-ensure-repo "$DST"
job finished, status 137
job-done<0 omega1.3 static 137>
https://huggingface.co/nicogptai/omega1.3

back from working...

yes, i just had the same experience. i have no clue what causes these deadlocks (basically, the master takes a global lock before connecting to workers, and each worker takes its own local lock, in random order. one would think the relatively new "upcalls" (via llmc) might be an issue, but i don't see a path where llmjob does a blocking llmc call - the only llmc call llmjob does is "push", which does not block if the lock is held. shucks).

killall -9 llmjob is no longer a crude but effective method, because llmjob has become a toolbox for lots of things rather than only the scheduler itself, so killing it kills lots of other stuff and fails jobs. It's relatively simple to clean up for me, so if it means some other job will start instead, do it. The real fix is to fix the deadlock problem...

@nicoboss so, I thought, great opportunity to do the snowflake imatrix quant. I can mlock the Q8_0 (509G) without issue, but llama-imatrix is "Killed" (so probably the oom killer), even with literally nothing else running.

@nicoboss so, I thought, great opportunity to do the snowflake imatrix quant. I can mlock the Q8_0 (509G) without issue, but llama-imatrix is "Killed" (so probably the oom killer), even with literally nothing else running.

It indeed got OOM-killed despite nothing else running. I was aware you wanted to run the Snowflake Arctic imatrix, so I turned off all services on StormPeak. The only thing I forgot was reducing the ZFS ARC cache from 24 GB to less, but the last time we did snowflake arctic base this wasn't required. Here is the kernel log of the OOM event:

Mar 14 02:28:22 StormPeak kernel: llama-imatrix invoked oom-killer: gfp_mask=0x440dc0(GFP_KERNEL_ACCOUNT|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=800
Mem-Info:
active_anon:221810 inactive_anon:462979 isolated_anon:0
active_file:1412 inactive_file:3913 isolated_file:0
unevictable:124205714 dirty:195 writeback:108
slab_reclaimable:389941 slab_unreclaimable:342434
mapped:124210748 shmem:28017 pagetables:385960
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:873272 free_pcp:132 free_cma:0
Node 0 active_anon:1733588kB inactive_anon:1005568kB active_file:5240kB inactive_file:14472kB unevictable:496822856kB isolated(anon):0kB>
Node 0 DMA free:11264kB boost:0kB min:0kB low:12kB high:24kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB i>
lowmem_reserve[]: 0 1432 515181 515181 515181
Node 0 DMA32 free:1520228kB boost:0kB min:252kB low:1716kB high:3180kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_>
lowmem_reserve[]: 0 0 513749 513749 513749
Node 0 Normal free:1958780kB boost:0kB min:91612kB low:617688kB high:1143764kB reserved_highatomic:1867776KB active_anon:1873616kB inact>
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Node 0 DMA32: 9*4kB (UM) 12*8kB (UM) 10*16kB (UM) 8*32kB (UM) 9*64kB (UM) 8*128kB (UM) 8*256kB (UM) 11*512kB (UM) 11*1024kB (UM) 12*2048>
Node 0 Normal: 1292*4kB (UME) 11246*8kB (UME) 14994*16kB (ME) 5833*32kB (ME) 2351*64kB (UME) 264*128kB (UME) 114*256kB (UM) 80*512kB (UM>
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
124203420 total pagecache pages
0 pages in swap cache
Free swap  = 0kB
Total swap = 0kB
134086427 pages RAM
0 pages HighMem/MovableOnly
2179266 pages reserved
0 pages hwpoisoned
Mar 14 02:28:26 StormPeak kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=ns,mems_allowed=0,global_oom,task_memcg=/lxc/108/ns/user.slice/user-0.slice/session-3129.scope,task=llama-imatrix,pid=2313118,uid=100000
Mar 14 02:28:26 StormPeak kernel: Out of memory: Killed process 2313118 (llama-imatrix) total-vm:502778504kB, anon-rss:77292kB, file-rss:280131584kB, shmem-rss:8192kB, UID:100000 pgtables:548784kB oom_score_adj:800
Mar 14 02:28:30 StormPeak kernel: oom_reaper: reaped process 2313118 (llama-imatrix), now anon-rss:0kB, file-rss:0kB, shmem-rss:72kB

I now reduced the ZFS ARC cache from 24 GB to 1 GB. If this is still not enough, please offload layers to both RTX 4090 GPUs and it will fit for sure. StormPeak is now ready for you to use for Snowflake Arctic imatrix computation.

I now joined the waitlist for HuggingFace Xet. Xet is their next generation storage solution replacing S3/Git LFS. If my personal account gets accepted I will let you know if it is any good. You could join using https://huggingface.co/join/xet but I recommend waiting. Xet probably lifts the 50 GB limit so no more splitting/merging required. For our dry-run-all-GGUFs project Xet would be far superior to S3 as, unlike S3, Xet is a block storage, so you likely only need to download a single block per model.

[screenshot: grafik.png]

@mradermacher I tried for the first time to manually run a massive imatrix job and everything seemed well, but something is blocking it. Maybe because I paused the host to prevent other tasks from running, as I had no clue how to put a host in that mode and llmc help had no command for it.

Edit: No, it also got stuck at this exact location when nico1 wasn't paused at all.

[screenshot: grafik.png]

Also, all those commands seem to be broken despite the host no longer being paused:

nico1 /tmp# llmc disable llmjob.nico1
disable.llmjob.nico1+: fail
nico1 /tmp# llmc disable imatrix.nico1
disable.imatrix.nico1+: fail
nico1 /tmp# llmc pause llmjob.nico1
pause.llmjob.nico1+: fail
nico1 /tmp# llmc pause imatrix.nico1
pause.imatrix.nico1+: fail

And resuming the GPUs I paused while playing around also seems to no longer be possible, despite the host not having been paused for a while:

nico1 ~# llmc resume GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc
pause.GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc-: fail
nico1 ~# llmc resume GPU-188a5143-db69-7058-63b5-f2f1d2354f91
pause.GPU-188a5143-db69-7058-63b5-f2f1d2354f91-: fail
nico1 /tmp# llmc enable GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc
disable.GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc-: fail
nico1 /tmp# llmc enable GPU-188a5143-db69-7058-63b5-f2f1d2354f91
disable.GPU-188a5143-db69-7058-63b5-f2f1d2354f91-: fail

I noticed that llmc help is missing the imatrix FLAG files
/tmp/pause to pause the imatrix tasks (which is still used despite the ability to pause GPUs because of legacy scripts and it being super reliable)
/tmp/imatrix.force to ignore "larger than 480GB" imatrix limit
/tmp/max-ngl to set the maximum number of layers allowed to be offloaded to the GPU

I now returned everything as close to normal as I could. Quantization jobs are running again on nico1, and one of the 70B imatrix jobs is running despite both GPUs being paused, as I used the /tmp/pause flag to pause it before pausing the GPUs. The other imatrix jobs will unfortunately be blocked, as I had no idea llmc enable GPU-* would be broken.

It would be great if you could tell me what I did wrong and/or start the snowflake-arctic-instruct imatrix computation yourself once you are available. How did you make sure only one imatrix task is running? The only thing I could think of would be to pause llmjob.nico1 and one of the GPUs, which should guarantee only one imatrix task running. Don't worry about the imatrix queue being so long. This is mainly because nico1 somehow decided to eat all the priority 40 tasks due to me overriding everything, as llmc pause llmjob.nico1 was broken.

Wow, strange - it seems to still be doing imatrix jobs, just with one GPU, despite both being blocked. Cool, I guess, as I wanted to unpause them, but super confusing that it does this.

How is this an error now?

400   17 s  ablation-65-a55.simpo.armorm-shisa-v2-llama-3.1-8b error/255 repo create

If I llmc audit I see this:

HfHubHTTPError("500 Server Error: Internal Server Error for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67d49fb3-377305ff77f775b842cdcecc;fd588126-eff5-4fcf-8792-e655e5a2affc)\n\nInternal Error - We're working hard to fix this as soon as possible!") at /llmjob/share/bin/llmjob line 2715.
        ...propagated at /llmjob/share/bin/llmjob line 2718.
job finished, status 255
job-done<0 ablation-65-a55.simpo.armorm-shisa-v2-llama-3.1-8b static 255>

https://huggingface.co/shisa-ai/ablation-65-a55.simpo.armorm-shisa-v2-llama-3.1-8b

This must be different from the repository creation rate limit, I assume, as the rate limit usually never errors. I selected "retry" for now.

if there already is a job i need to manually add the imatrix job

You will unfortunately need to do so for Samantha-1.11-70b and Samantha-1.1-70b, or tell me how to manually trigger an imatrix job if the scheduler thinks an imatrix already exists and so doesn't do one by itself, as by the time the model was queued we had not yet archived the old imatrix.

Wow, strange - it seems to still be doing imatrix jobs, just with one GPU, despite both being blocked.

You specified the wrong GPU uuid, so only one is blocked. You should be able to block all of them using llmc pause imatrix.nico2.

I did that now, unpause with llmc resume imatrix.nico2

500 Server Error: Internal Server Error for url

yes, repo create just endlessly retries on being rate limited. this is simply hf suckiness, it happens on any request, and not all of them are retryable.

I noticed that llmc help is missing the imatrix FLAG files

There aren't any that you can access, they are all on the host that runs the imatrix jobs (kaos). And they are: .imatrix-hfd (gguf is valid) and .soverride (block this job). Everything else would be in the json job description (e.g. which quant, where to download, and "force", which special-cases quite a few things, even the quantiser).

Xet probably lifts the 50 GB

I hope this will be transparent when using the hub api?

(flags)

Hmm, I reworked the flags stuff a few days ago, probably something is broken. One issue is that at least one of your uuids was not a uuid, but that's not checked by anything - it would simply mean you disabled a card that doesn't even exist, explaining those problems.

I am rather busy atm, but I will look at it later. Skimming through this, your intention was to enable everything again, so I will do that.

gpu resuming was broken, the other flags should work

snowflake instruct didn't exhibit partially covered tensors either. peculiar:


[30]4.5118,[31]4.4902,[32]4.2987,[33]4.1352,[34]4.0640,[35]4.1097,[36]4.0953,[37]3.9571,[38]3.8514,[39]3.8286,
save_imatrix: entry '              blk.0.ffn_down_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '              blk.0.ffn_gate_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '                blk.0.ffn_up_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[40]3.8027,[41]3.7899,[42]3.7907,[43]3.7528,[44]3.7568,[45]3.7423,[46]3.7421,[47]3.7649,[48]3.7736,[49]3.8393,
save_imatrix: entry '              blk.0.ffn_down_exps.weight' has partial data (97.66%)
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
save_imatrix: entry '              blk.0.ffn_gate_exps.weight' has partial data (97.66%)
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
save_imatrix: entry '                blk.0.ffn_up_exps.weight' has partial data (97.66%)
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
[50]3.8619,[51]3.7652,[52]3.6756,[53]3.5995,[54]3.5232,[55]3.4544,[56]3.4564,[57]3.4428,[58]3.4881,[59]3.5413,[60]3.6089,[61]3.5819,[62]3.6202,[63]3.6591,[64]3.6948,[65]3.7287,[66]3.7580,[67]3.8092,[68]3.8528,[69]3.8791,[70]3.9078,[71]3.9304,[72]3.9267,[73]3.9117,[74]3.8934,[75]3.9132,[76]3.9207,[77]3.9402,[78]3.9272,[79]3.9366,[80]3.9516,[81]3.9433,[82]3.9516,[83]3.9424,[84]3.9542,[85]3.9600,[86]3.9625,[87]3.9679,[88]3.9827,[89]3.9785,[90]3.9810,[91]3.9932,[92]3.9877,[93]3.9786,[94]3.9769,[95]3.9490,[96]3.9652,[97]3.9638,[98]3.9659,[99]3.9510,[100]3.9496,[101]3.9704,[102]3.9540,[103]3.9469,[104]3.9432,[105]3.9589,[106]3.9717,[107]3.9966,[108]4.0193,[109]4.0061,[110]3.9952,[111]3.9861,[112]3.9732,[113]3.9614,[114]3.9478,[115]3.9381,[116]3.9272,[117]3.9226,[118]3.9405,[119]3.9552,[120]3.9894,[121]4.0224,[122]4.0629,[123]4.0951,[124]4.1483,[125]4.1925,[126]4.2084,[127]4.2208,[128]4.1951,[129]4.2042,[130]4.2001,[131]4.1913,[132]4.1599,[133]4.1229,[134]4.1424,[135]4.1528,[136]4.1617,[137]4.1609,[138]4.1752,[139]4.1914,[140]4.2076,[141]4.2148,[142]4.2255,[143]4.2306,[144]4.2246,[145]4.2281,[146]4.1897,[147]4.1507,[148]4.1279,[149]4.0930,[150]4.0629,[151]4.0319,[152]4.0548,[153]4.0668,[154]4.1009,[155]4.1359,[156]4.1768,[157]4.2180,[158]4.2545,[159]4.2916,[160]4.3233,[161]4.3644,[162]4.4004,[163]4.4289,[164]4.4621,[165]4.4962,[166]4.5275,[167]4.5578,[168]4.5864,[169]4.6174,[170]4.6465,[171]4.6776,[172]4.7161,[173]4.7472,[174]4.7761,[175]4.8207,[176]4.8486,[177]4.8822,[178]4.9031,[179]4.9323,[180]4.9580,[181]4.9898,[182]5.0146,[183]5.0482,[184]5.0830,[185]5.1043,[186]5.1348,[187]5.1531,[188]5.1795,[189]5.2056,[190]5.2293,[191]5.2568,[192]5.2935,[193]5.3223,[194]5.3406,[195]5.3595,[196]5.3979,[197]5.4154,[198]5.4360,[199]5.4551,[200]5.4766,[201]5.5009,[202]5.5214,[203]5.5368,[204]5.5569,[205]5.5791,[206]5.6068,[207]5.6288,[208]5.6491,[209]5.6769,[210]5.7026,[211]5.7270,[212]5.7459,[213]5.7706,[214]5.7825,[215]5.8032,[216]5.8271,[217]5.8449,[218]5.8689,[219]5.8854,[220]5.9095,[221]5.9244,[222]5.9341,[223]5.9554,[224]5.9779,[225]5.9978,[226]6.0189,[227]6.0359,[228]6.0500,[229]6.0720,[230]6.0962,[231]6.1168,[232]6.1403,[233]6.1576,[234]6.1815,[235]6.2036,[236]6.2261,[237]6.2379,[238]6.2595

Could it be that the patch simply makes tensors valid, so as soon as they are "stored", they no longer count as partial from then on? I haven't looked at the patch, but maybe it would fill partial weights with dummy weights, so on the next round, they would no longer count as partial? Might not be a disastrous thing, but probably the patch shouldn't permanently change weights, because that would slightly change the results for the next rounds - maybe it should modify and save a copy.

save_imatrix: 14 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[20]4.3242,[21]4.4543,[22]4.3112,[23]4.1885,[24]4.1851,[25]4.1865,[26]4.1966,[27]4.1751,[28]4.2599,[29]4.3854,
save_imatrix: 7 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[30]4.5118,[31]4.4902,[32]4.2987,[33]4.1352,[34]4.0640,[35]4.1097,[36]4.0953,[37]3.9571,[38]3.8514,[39]3.8286,
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[40]3.8027,[41]3.7899,[42]3.7907,[43]3.7528,[44]3.7568,[45]3.7423,[46]3.7421,[47]3.7649,[48]3.7736,[49]3.8393,
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware

Thinking about this, we should definitely investigate this, as it will probably affect most MoEs and has a good chance of negatively affecting them, unless the patched weight values are essentially being ignored (I have no clue how the weights are combined between chunks).

We reached the repository creation limit again, so I started to reprioritize and statically assigned some of the, in my opinion, important 1400-priority medium-sized models to rain and leia and some big ones to rich1. I'm now strictly controlling what models nico1 and nico2 are working on to ensure they only do big ones. I kept back, kaos and marco untouched as they are not currently working on background models. Don't be confused by me abusing the pause-host functionality on nico1 and nico2. I realized that I can pause a host and then delete the interrupt files for it to work on specific models without scheduling any new ones. The reason I have to pause the entire host is because the following commands still do NOT work:

nico1 ~# llmc pause imatrix.nico1
pause.imatrix.nico1+: fail
nico1 ~# llmc pause llmjob.nico1
pause.llmjob.nico1+: fail

so I started to reprioritize and statically assigned

That sucks, because I also did this this morning, so us having to do it twice is not a good sign :)

work on specific models without sheduling

If that works for you, that's great. You can also manually start models by setting the .force flag (e.g. via llmc shell) and push'ing. They will immediately be interrupted by ready models higher in the queue, but those can be overridden. I envisage that's useful if some models are stuck in repo create.

On the other hand, how does that work at all? In my experience, the creation limit is something that hits during the day, then stays with us until 2-3am, with some limited softness (i.e. one can get a request through sometimes).

followinfg commands still do NOT work:

Eh, fascinating how many mistakes you can put into a few regexes. And I thought I tested those.

followinfg commands still do NOT work:

I kind of refactored it once more. It's even better than before. Looks fine from my point of view, but I didn't have the time to test everything.

so, gemma3 also has a vision part, and llama already has an extractor for it?

Since I am super swamped, wanna give it a try and help me? The qwen2vl extractor is probably a good base, and it's in "quantize" (you can search for Qwen2VLForConditionalGeneration)

as can be seen from the code, it's not exactly straightforward, mostly due to hard to predict output naming. the way I solved it was by creating an empty temporary directory, and assuming only a single file will be created, which will then be renamed to a known name.

I feel the newest dryrun has issues, or maybe we just ran into this class of problem:

load_tensors: loading model tensors, this can take a while... (mmap = false)
[DRYRUN][CPU]: 6425850112
alloc_tensor_range: failed to allocate CPU buffer of size 6425850112
load_tensors: pretend allocating CPU buffer was successful due to dry-run being enabled
...
[DRYRUN][CPU]: 513024
output_reserve: failed to allocate output buffer of size 0.49 MiB
llama_init_from_model: failed to initialize the context: failed to reserve initial output buffer
common_init_from_params: Dryrun compleated!
dryrun failed

I went back to the grep method, but it seems dryrun testing is completely broken at the moment.

I feel the newest dryrun has issues, or maybe we just ran into this class of problem:

What issues are you experiencing and with what model do they occur? The output you posted looks all great except for you wrongly detecting it as failed. Is it not returning status code 0 even if it is successful, or why is your code still detecting this as a failed dryrun? I don't see how it is possible that it doesn't exit with code 0 if you see the Dryrun compleated! message above, as the code in common.cpp is the following - shouldn't exit(0) immediately terminate the application with exit code 0? Are you sure your exit code check is implemented correctly?

if(getenv("DRYRUN")) {
    LOG_ERR("%s: Dryrun compleated!\n", __func__);
    exit(0);
}

Just to make sure, I tested the latest version myself and, as expected, the exit codes printed using echo $? are correct. Now I'm even more confused about what issues you are talking about. Maybe you just got confused by the expected errors such as failed to allocate and failed to initialize, which are intentional. The only issue I found was this embarrassing typo in the Dryrun compleated! message.

Working model:

[DRYRUN][PINNED]: 122088
output_reserve: failed to allocate output buffer of size 0.12 MiB
llama_init_from_model: failed to initialize the context: failed to reserve initial output buffer
common_init_from_params: Dryrun compleated!
0

Broken model:

llama_model_load: error loading model: error loading model hyperparameters: key not found in model: llama.context_length
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/nico/law-LLM.gguf'
main: error: unable to load model
1

so, gemma3 also has a vision part, and llama already has an extractor for it?

The gemma3 vision extraction is really simple. You just execute examples/llava/gemma3_convert_encoder_to_gguf.py to get the mmproj file as usual. By default the mmproj will be stored under mmproj.gguf, but the --outfile command line argument can be used to specify whatever name you like. Using --outtype you can specify if you want the mmproj as f32, f16, bf16 or q8_0. If you don't specify anything the mmproj will be in f16. Then you just specify the path of your gemma3 model and you're done. Should you encounter any issues you can use --verbose to see exactly what it is doing.

since i am super swamped, wanna give it a try at helping me? the qwen2vl extractor probably is a good base, and it's in "quantize" (you can search for Qwen2VLForConditionalGeneration)

I just looked at the quantize script and the only thing you have to change is likely:

python3 "/llmjob/llama.cpp/examples/llava/qwen2_vl_surgery.py" --data_type fp16 -- ../"$SRC" || exit 33

to

python3 "/llmjob/llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py" --outtype f16 -- ../"$SRC" || exit 33

as can be seen from the code, it's not exactly straightforward, mostly due to hard to predict output naming. the way I solved it was by creating an empty temporary directory, and assuming only a single file will be created, which will then be renamed to a known name.

You could heavily simplify it for gemma3 by just using --outfile to specify whatever output file you want. This unfortunately doesn't seem to be possible for qwen2vl, so you either use the auto outfile detection for both of them or use dedicated code for gemma3, in which case you can remove it and instead just use --outfile.
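
For what it's worth, the temp-directory trick you described could be expressed roughly like this (a Python sketch; the paths and the final name are placeholders, and I'm assuming the surgery script writes its auto-named output into the current working directory):

import shutil
import subprocess
import tempfile
from pathlib import Path

# Sketch of the "empty temp dir + single output file + rename" approach described above.
# The source path and the target name are placeholders, not what the real quantize script uses.
src = Path("/tmp/quant/example-src").resolve()      # hypothetical source model directory
target = Path("/tmp/quant/example.mmproj.gguf")     # hypothetical known output name

with tempfile.TemporaryDirectory() as tmp:
    subprocess.run(
        ["python3", "/llmjob/llama.cpp/examples/llava/qwen2_vl_surgery.py",
         "--data_type", "fp16", "--", str(src)],
        cwd=tmp,        # run in the empty directory so the auto-named output lands here
        check=True,
    )
    created = list(Path(tmp).iterdir())
    # Assume exactly one file got created, whatever it was named.
    assert len(created) == 1, f"expected exactly one output file, got {created}"
    shutil.move(str(created[0]), target)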

i'll have a look at dryrun later - most likely I made a mistake in a hurry.

What's your take on f32 and Q8_0 quants of vision parts? Q8_0 seems attractive to have, and I made sure our naming convention supports that. f32, not so much.

What's your take on f32 and Q8_0 quants of vision parts? Q8_0 seems attractive to have, and I made sure our naming convention supports that. f32, not so much.

It would be awesome if we could offer different mmproj quants as well. qwen2vl supports fp32 and fp16 as --data_type argument while gemma3 supports f32, f16, bf16 and q8_0 as --outtype argument. We should at least offer f16 and q8_0 and maybe even f32 for gemma3.

ok, not sure when i can go about that (gemma3). in the meantime, here is a diff I have used for many months now, for use in the mradermacher branch

diff --git a/gguf-py/gguf/gguf_writer.py b/gguf-py/gguf/gguf_writer.py
index 080d2b9d..d3cbe44f 100644
--- a/gguf-py/gguf/gguf_writer.py
+++ b/gguf-py/gguf/gguf_writer.py
@@ -237,6 +237,10 @@ class GGUFWriter:
             kv_bytes = bytearray()
 
             for key, val in kv_data.items():
+                if val.type != GGUFValueType.ARRAY or len (val.value) < 50:
+                    print("gguf serialising key ", key, "value", val)
+                else:
+                    print("gguf serialising key ", key, "value-suppressed")
                 kv_bytes += self._pack_val(key, GGUFValueType.STRING, add_vtype=False)
                 kv_bytes += self._pack_val(val.value, val.type, add_vtype=True)
 
@@ -269,8 +273,8 @@ class GGUFWriter:
         self.state = WriterState.TI_DATA
 
     def add_key_value(self, key: str, val: Any, vtype: GGUFValueType) -> None:
-        if any(key in kv_data for kv_data in self.kv_data):
-            raise ValueError(f'Duplicated key name {key!r}')
+        #if any(key in kv_data for kv_data in self.kv_data):
+        #    raise ValueError(f'Duplicated key name {key!r}')
 
         self.kv_data[0][key] = GGUFValue(value=val, type=vtype)
 

in the meantime, here is a diff I have used for many months now, for use in the mradermacher branch

Thanks for sharing. I fixed the typo in the "Dryrun completed!" message, applied your diff and updated to the latest llama.cpp despite there not being any changes relevant to us. There is no reason for you to update again unless you did not manually apply your diff the last time.

i made a typo when restoring the old dryrun code. it seems to work now :-)

i also removed the -ofreq 10, assuming this deals with any problems with wrong imatrix weights. that means no feedback for us, but it's rarely useful anyway.

why, oh why, did i follow llama naming conventions and called it fp16 instead of f16 (qwen2vl)

I noticed that for my latest medical models I have quite a few imatrix errors like this:

nico1 /tmp# grep -r "GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed" *.log
Gemma2-2b-IT-FT-medical_qa.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-2b-it-finetuned-medical-qa.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-medical_qa-Finetune-ad-2b.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-medical_qa-Finetune-ja.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-medical_qa-Finetune-v2.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
medical_jargons_simplifier2.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical-mT5-large.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical-mT5-xl-multitask.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
medical_q_a_model.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical_Report_Summarization.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical_Summarization.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed

They don't appear in llmc audit. How to deal with them? nuke or nukeall?

neither, the imatrix ones i have to deal with. queue fewer junk models? :-)
(do these actually work with transformers?)

well, actually nukeall does work in this case

What should I do about this one? nuke and force requeue explicitly to nico1? I think this should work as it should auto-skip already existing quants.

Running scope as unit: llmjob-wrap-gemma-3-4b-persian-v0-noquant-6698.scope
{[[PROGRESS:preparing...]]}
{[[PROGRESS:mmproj extraction]]}
mmproj extraction attempted on unsupported host
job finished, status 72
job-done<0 gemma-3-4b-persian-v0 noquant 72>

https://huggingface.co/mshojaei77/gemma-3-4b-persian-v0

neither, the imatrix ones i have to deal with. queue fewer junk models? :-)

Most of them are not junk but I unfortunately don't have time to test every single one of them before queueing. Many medical finetunes lack a proper model card, which makes judging their quality without actually testing the model almost impossible. We could say no model card means trash, but this doesn't always seem to be true as some authors are just lazy and I have already had multiple good models without a model card.

do these actually work with transformers?

I just tested Gemma2-2b-IT-FT-medical_qa using transformers and it worked. But no worries, the model kind of sucks as it wants you to ask questions formatted exactly like in the medical QA dataset and is so heavily censored that it refuses to answer the majority of them. It seems so stupid to create a medical finetune that refuses to answer medical questions. But it also seems stupid to not write a model card.

well, actually nukeall does work in this case

Great, I will nukeall them myself in the future. I will also try to find a way to recognize and filter such failures before even queueing them. With the latest changes to my script the failure rate already got reduced a lot compared to earlier versions.

What does (worker +cork) mean? I noticed that you queued all of today’s lownice models using that flag.

Edit: Ah interesting that flag is gone now.

I merged the latest llama.cpp into the mradermacher branch adding support for the RWKV v7 architecture and fixing the tensor shape issue of OLMo-2-0325-32B-Instruct (tensor 'blk.0.attn_k_norm.weight' has wrong shape; expected 5120, got 1024)

I highly recommend updating as otherwise all RWKV v7/RWKV v7 Distilled based and many OLMo-2 based models will fail. Once you have updated, please queue the following models:

RWKV v7 Base models (RWKV7ForCausalLM):

RWKV v7 Distilled models (RwkvHybridForCausalLM):

Force requant failed OLMo-2 models (Olmo2ForCausalLM):

(worker +cork)

Sorry, just experimenting - I wanted to queue everything first, so I set an impossible worker name to be changed when I am happy with the queue.

llama.cpp is updated, could you do me a favour and queue the models, maybe a test model first?

llama.cpp is updated, could you do me a favor and queue the models, maybe a test model first?

Thanks a lot! Will do.

Sorry, just experimenting - I wanted to queue everything first, so I set an impossible worker name to be changed when I am happy with the queue.

Ah I see. Now it makes sense. No problem I was just a bit confused at first.

@mradermacher Half an hour ago llama.cpp added support for Mistral3ForConditionalGeneration. Luckily it is a convert_hf_to_gguf.py change only, so I was able to manually provide the GGUF and use our existing llama.cpp version for imatrix computation and quantization. I recommend you again upgrade to the latest version of the mradermacher branch, so this no longer requires manual intervention. We could also hold back Mistral3ForConditionalGeneration based models until the vision extraction for it is implemented, but I would expect this to take days if not weeks for them to implement, so waiting is likely not a feasible option.

updated - but please keep a list of the models you queued so far, so we can re-run these models. new "add"s should automatically log these ("Mistral3ForConditionalGeneration, logging.")

i tried some of the rwkv 7 models that showed up in my list today (e.g. RWKV7-Goose-Pile-168M-HF), but... any idea?

  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 5384, in <module>
    main()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 5378, in main
    model_instance.write()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 440, in write
    self.prepare_metadata(vocab_only=False)
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 433, in prepare_metadata
    self.set_vocab()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 3598, in set_vocab
    self._set_vocab_rwkv_world()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 915, in _set_vocab_rwkv_world
    assert (self.dir_model / "rwkv_vocab_v20230424.txt").is_file()
AssertionError

updated

Thanks a lot for the quick update! :D

please keep a list of the models you queued so far, so we can re-run these models. new "add"s should automatically log these ("Mistral3ForConditionalGeneration, logging.")

The only one I manually converted so far was https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503

i tried some of the rwkv 7 models that showed up in my list today (e.g. RWKV7-Goose-Pile-168M-HF), but... any idea?

All RWKV v7 based models are supposed to have a file named rwkv_vocab_v20230424.txt, as can be seen under any RWKV v7 base model like https://huggingface.co/fla-hub/rwkv7-191M-world/raw/main/rwkv_vocab_v20230424.txt in the case of fla-hub/rwkv7-191M-world. Your RWKV7-Goose-Pile-168M-HF model is missing this file, likely because it got converted from the RWKV v7 format into a HuggingFace transformers compatible model, as can be seen from the model's name. We could try just copying that file into the same folder as the model but I'm not sure if this would work. By the way, fun fact: that file used to allow arbitrary code execution in an earlier, luckily rejected, convert_hf_to_gguf.py implementation by parsing the file using eval(line[line.index(' '):line.rindex(' ')]). ARWKV-7B-Preview-0.1 using RwkvHybridForCausalLM, which you queued, worked fine.
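
To illustrate why that rejected approach was so dangerous (the vocab line format here is just an assumption for the example): eval() executes whatever expression sits between the first and last space, while something like ast.literal_eval() only accepts plain literals:

import ast

# A well-formed vocab line (format assumed for illustration): index, token literal, length.
good_line = "1 'hello' 5"
token = eval(good_line[good_line.index(' '):good_line.rindex(' ')])   # -> 'hello'

# A malicious line: eval() would happily execute the expression instead of reading a token.
evil_line = "1 __import__('os').system('echo pwned') 3"
# eval(evil_line[evil_line.index(' '):evil_line.rindex(' ')])  # would run the shell command

# ast.literal_eval() raises on anything that is not a plain literal.
safe_token = ast.literal_eval(good_line[good_line.index(' '):good_line.rindex(' ')].strip())
try:
    ast.literal_eval(evil_line[evil_line.index(' '):evil_line.rindex(' ')].strip())
except ValueError:
    print("rejected non-literal vocab entry")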

By the way, fun fact: that file used to allow arbitrary code execution in an earlier, luckily rejected

I was under the impression that convert...py always allows arbitrary code execution - for example, in the glm case, I regularly have to patch .py files inside the repo to make it work, which proves that the files get executed. One way is enough...

That is what prompted me to introduce safe-exec btw., because I was also under the impression that it would not execute files from the repo by default. We did have a little chat about that, too, I think...

All RWKV v7 based models are supposed

I guess we can then just skip those, as they are likely (in theory at least) identical to the non-hf version. Problems will arise if these become more popular (as they are by "RWKV")

I was under the impression that convert...py always allows arbitrary code execution - for example, in the glm case, I regularly have to patch .py files inside the repo to make it work, which proves that the files get executed. One way is enough...

It does for some models that are using a custom loader, but there it is quite obvious that the custom loader gets executed to load the model so someone that doesn't mass convert thousands of models would likely take a short look at it before converting to GGUF. Allowing arbitrary code execution to parse a massive text file, on the other hand, is definitely not something any user could ever expect. It is also like the dumbest way to implement a text file parser.

As long as convert_hf_to_gguf.py supports loading any models that are not in safetensors you can easily make it execute arbitrary code anyway. Someone with malicious intent would likely choose to infect the actual model and not the python file that loads it, as that one is easily renewable, but actually doing so in a stealthy way would require a genius, as the automated malware scanner only scans models as far as I'm aware. I'm positively surprised malicious AI models are not a common issue. As far as I'm aware, not a single AI model has tried to infect our infrastructure so far.

That is what prompted me to introduce safe-exec btw., because I was also under the impression that it would not execute files from the repo by default. We did have a little chat about that, too, I think.

We did. Enabling that for sure was a great decision. It would be really annoying to have our infrastructure infected by some random malware. We are at like the highest risk possible of this happening to us as we process thousands of models from often untrustworthy sources shortly after their release, and so before HuggingFace could take them down based on their malware scanner's results. But no worries, as long as nobody burns a Linux kernel exploit or, more likely, an Nvidia driver exploit on me, nothing will get out of my LXC container. I'm always closely monitoring the LXC container so I would probably almost immediately spot any malicious process running inside of it.

I guess we can then just skip those, as they are likely (in theory at least) identical to the non-hf version. Problems will arise if these become more popular (as they are by "RWKV")

No need to do them, but it could indeed become an issue if users start finetuning them instead of the ones in the original RWKV v7 format. But don't worry, if it becomes an issue we can for sure do something to convert them.

It does for some models that are using a custom loader

If it does it for some, it does it for all - the model type is parsed from the files as well.

it is quite obvious that the custom loader gets executed to load the model so someone that doesn't mass convert thousands of models would likely take a short look at it before converting to GGUF.

I think the opposite is the case. You assume everybody using transformers (or llama.cpp) somehow is an expert. I would assume most people would blindly trust it.

As long as convert_hf_to_gguf.py supports loading any models that are not in safetensors you can easily make it execute arbitrary code anyway.

How so? The only alternative would be pytorch, and I don't think that executes code anymore.

automated malware scanner only scans models

As far as I am aware, automated malware scanners don't really exist. They either check outdated signatures, or pretend to check for behaviour and completely fail. Case in point, the hf malware repo scanner... :)

Anyway, I think the deeper issue is that transformers code is written by people who don't understand basic security or even safety practice, so running everything in a jail is the way to go :)

We are at like the highest risk possible of this happening to us as we process thousands of models from often untrustworthy sources

We are also one of the biggest targets for attacks, especially if something can be done with the generated files.

I’m always closely monitoring the LXC container so I would probably almost immediately spot any malicious process running inside of it.

Pride goes before the fall.

[rwkv]

No need to do them, but it could indeed become an issue if users start finetuning them instead of the ones in the original RWKV v7 format. But don't worry, if it becomes an issue

There were also two fla-hub "-pile" models without the "-hf" suffix that had the same issue.

How so? The only alternative would be pytorch, and I don't think that executes code anymore.

What makes you think that PyTorch would no longer be vulnerable to arbitrary code execution? As long as you unpickle, you allow arbitrary code to run. The legacy file formats are insecure by design. I don't see any way one could ever load them in a secure way. Even the poor machine that converts legacy models to safetensors, and so loads them, will inevitably execute whatever arbitrary code is in the non-safe model, but at least the resulting SafeTensor will not contain any arbitrary code.

I found this nice article from December 2024 showing how to backdoor AI models - doing so is surprisingly simple and it really is kind of a miracle no bad actors seem to make use of it: https://snyk.io/articles/python-pickle-poisoning-and-backdooring-pth-files/
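
The core trick from that article boils down to pickle's __reduce__ hook; here is a harmless stand-alone demonstration (nothing model-specific, just plain pickle):

import os
import pickle

class Payload:
    # __reduce__ tells pickle how to "reconstruct" the object: here it says
    # "call os.system with this argument", so simply unpickling runs the command.
    def __reduce__(self):
        return (os.system, ("echo arbitrary code ran during unpickling",))

data = pickle.dumps(Payload())
pickle.loads(data)  # executes the echo command - never unpickle untrusted data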

We are also one of the biggest targets for attacks, especially if something can be done with the generated files.

Well, similar to SafeTensor, GGUFs are secure unless they exploit a security vulnerability inside llama.cpp, of which every few months another one gets responsibly disclosed under https://github.com/ggml-org/llama.cpp/security

I'm mainly concerned about someone stealing our HuggingFace token to nuke our models or using it to push some garbage. Hopefully HuggingFace has a repository delete rate limit. Maybe we should also rate limit the nukerepo/nukeall commands just in case. Having a malicious insider should also be a concern given how much access all our nodes currently have.

Pride goes before the fall.

True, but I have rotating air-gapped offline backups in the worst case. Even if someone could somehow escape the LXC container they would be user 100000 on the host, which can't do anything without a kernel exploit that gives them root. Given how often I update the kernel, especially if there is news about any security vulnerability, it seems quite unlikely someone who gains access to your container could escape it unless NVidia lets me down with their terrible driver security. The worst someone could do inside the LXC container is ruining/sabotaging our operations and bothering my ISP with malicious activity. I have quite tight firewall rules set and use a separate subnet for our cluster, our internet network and my home network, and generally don't have insecure devices in my network, so it is unlikely they could traverse to any other device from within the LXC container besides other LXC containers used for our operation, which have equal security constraints.

The wait queue hasn't been this low since early November. That was almost half a year ago!

964 additional job(s) in wait queue (total estimated size 24.760TB, imatrix 7.501TB, lownice 0):

To celebrate I reconstructed the following historic wait queue size based on manual status page backups I made:

Nov 09 01:05: 1107
Nov 09 10:21: 1456
Nov 10 02:47: 1489
Nov 10 10:14: 1537
Nov 10 11:50: 1589
Nov 10 18:20: 1611
Nov 10 20:58: 1636
Nov 11 14:14: 1637
Nov 11 17:22: 1633
Nov 12 00:10: 1678
Nov 13 11:29: 1738
Nov 13 13:49: 1796
Nov 14 10:32: 2020
Nov 18 19:43: 2077
Nov 19 10:22: 2962
Nov 20 02:22: 3073
Nov 20 02:25: 3107
Nov 20 09:45: 3319
Nov 21 10:26: 3324
Nov 22 16:27: 3329
Nov 22 17:42: 3330
Nov 27 00:00: 3466
Nov 27 02:20: 3468
Nov 27 16:37: 3459
Nov 28 23:09: 3441
Nov 29 10:28: 3440
Dec 01 07:21: 3534
Dec 01 17:17: 3613
Dec 01 20:22: 3729
Dec 02 14:36: 3720
Dec 03 01:15: 3698
Dec 03 17:19: 3848
Dec 04 01:53: 3816
Dec 11 10:49: 3800
Dec 12 13:03: 3830
Dec 13 09:58: 3919
Dec 13 16:46: 3959
Dec 14 23:37: 3977
Dec 15 02:25: 4001
Dec 15 02:51: 4000
Dec 15 06:59: 4052
Dec 15 12:31: 4051
Dec 15 18:25: 4056
Dec 16 15:21: 3987
Dec 16 19:59: 3969
Dec 17 11:30: 3907
Dec 17 13:47: 3881
Dec 17 15:39: 3831
Dec 17 21:57: 3754
Dec 18 05:15: 3731
Dec 18 17:35: 3636
Dec 19 04:42: 3620
Dec 19 11:11: 3556
Dec 20 00:49: 3465
Dec 20 16:06: 3386
Dec 21 05:12: 3379
Dec 21 15:09: 3325
Dec 21 21:43: 3295
Dec 22 16:02: 3183
Dec 23 16:50: 2982
Dec 24 04:15: 2898
Dec 24 15:15: 2769
Dec 25 01:49: 2612
Dec 25 16:08: 2599
Dec 26 03:54: 2598
Jan 03 02:57: 1450
Jan 03 03:06: 1447
Jan 03 03:09: 1446
Jan 03 04:02: 1464
Jan 03 14:47: 1393
Jan 03 23:56: 1299
Jan 04 01:20: 1283
Jan 04 13:34: 1160
Jan 04 19:05: 1094
Jan 05 02:01: 1022
Jan 05 04:43: 973
Jan 20 01:54: 1205
Jan 21 11:33: 1192
Jan 24 12:49: 1097
Feb 02 00:52: 1061
Feb 04 09:48: 1074
Feb 16 02:43: 2145
Feb 18 02:18: 2377
Mar 12 20:32: 1799
Mar 13 04:15: 1779
Mar 13 19:24: 1512
Mar 13 19:58: 1501
Mar 13 20:14: 1496
Mar 13 21:55: 1492
Mar 14 09:24: 1472
Mar 14 17:16: 1380
Mar 14 17:52: 1367
Mar 15 00:22: 1370
Mar 15 14:28: 1242
Mar 15 17:28: 1206
Mar 15 19:54: 1204
Mar 18 17:52: 1132
Mar 18 20:06: 1125
Mar 19 09:16: 1068
Mar 19 14:32: 1007
Mar 19 14:53: 1000
Mar 19 15:11: 995
Mar 19 17:50: 967
Mar 19 19:21: 964

The maximum it ever reached according to my measurements was 4056 on 15th of December 2024! :D

What makes you think that PyTorch would no longer be vulnerable to arbitrary code execution? As long as you unpickle, you allow arbitrary code to run.

I thought I had read that transformers had switched to a restricted unpickle library, but... indeed, that seems not the case.

However, my point was a different one: I think it does not unpickle by default, so pickle isn't the problem for unsuspecting users (it is for us, since we enable it). The problem is that asking for untrusted code execution wrongly implies that it won't execute untrusted code when not told to do so. It would be better to always execute untrusted code instead of giving a false sense of security.

Having a malicious insider should also be a concern given how much access all our nodes currently have.

Like, richard, me, you, marco and....? I mean, specifically the mradermacher nodes themselves, or the host, not other containers.

I am not very concerned about that, but maybe I don't value the repositories as highly, so vandalism is pretty low on my list of fears. But I wouldn't want to be part of a large scale attack on people downloading models :) Other than llama.cpp insecurities, which by necessity I don't care much about.

The wait queue hasn't been this low since early November. That was almost half a year ago!

Time to queue more older models, I suspect. Once I find the time again.

To celebrate I reconstructed the following historic wait queue size based on manual status page backups I made:

I infrequently wished we had such historical backups :_)

Like, richard, me, you, marco and....? I mean, specifically the mradermacher nodes themselves, or the host, not other containers.

I got betrayed by too many friends when I hosted a Minecraft server back in high school, so I probably take security against trusted insiders a bit too seriously. The good thing here is that all of the persons currently involved have invested a ton of money and resources into this, so nobody will do anything malicious for sure. This is mainly a concern should we ever plan on onboarding someone new as giving someone we barely know this level of access is a major risk.

I am not very concerned about that, but maybe I don't value the repositories as highly, so vandalism is pretty low on my list of fears.

We put so much effort, work and resources into them that I value them quite a lot.

But I wouldn't want to be part of a large scale attack on people downloading models :) Other than llama.cpp insecurities, which by necessity I don't care much about.

It is very unlikely someone could use GGUFs to distribute malware. The format is relatively secure.

I infrequently wished we had such historical backups :_)

I will upload mine soon. They are tiny. To make your own you could create a cron task that downloads the status page. I just save the status page with Ctrl+S from time to time so I can better see how well things progress. I'm way too obsessed with the status page.
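
A minimal sketch of such a cron-driven backup (STATUS_URL and the backup directory are placeholders, not our real status page address):

#!/usr/bin/env python3
# Save a timestamped copy of the status page; run this from cron, e.g. once per hour.
import time
import urllib.request
from pathlib import Path

STATUS_URL = "https://example.org/status"       # placeholder, not the real address
BACKUP_DIR = Path.home() / "status-backups"     # placeholder location

BACKUP_DIR.mkdir(parents=True, exist_ok=True)
stamp = time.strftime("%Y-%m-%d_%H-%M")
html = urllib.request.urlopen(STATUS_URL, timeout=60).read()
(BACKUP_DIR / f"status_{stamp}.html").write_bytes(html)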

My list of fears:

  1. HuggingFace imposing a storage limit
  2. HuggingFace bugging us for using too many resources
  3. HuggingFace banning us for some really stupid reason like too many DMCA notices without taking into account the number of models we have, or some abuse report spam or other trash like this
  4. HuggingFace running out of money
  5. llama.cpp deciding to merge a change breaking support for all quants ever created because their maintainers don't value support for legacy quants.
  6. GGUF getting replaced by a far superior format, as happened to GGML which then got replaced by GGUF
  7. Stupid regulation from the USA messing with open AI models, especially with the current president behaving so unpredictably.
  8. The EU being stupid as always and wanting to geoblock some "dangerous" AI models. I'm so glad Switzerland is not part of this organization.
  9. My ISP kicking me out due to using too much bandwidth.
  10. Someone or something sabotaging our operations
  11. HuggingFace Xet storage disrupting our operations. I think they already push 7% of traffic through Xet so we might already use it.
  12. HuggingFace doing stupid rate limits. Richard just got rate limited today:
    grafik.png

While we are at it, one thing Richard doesn't like is him only having 2 tasks on rich1. He sent me this picture 2 hours ago - luckily there are more now. Also he would like the ability to set the number of parallel tasks:

grafik.png

In any case let's just enjoy the moment. There never was a better time to enjoy all these amazing openly available AI models!

I'm enjoying locally running AI models so much I ordered 2x Intel Arc 770 last weekend. They perform better for LLMs than they should and you get 4x better performance/value than for NVidia according to the specifications, and even better performance/value in the unlikely case those unrealistic benchmarks turn out to be true: https://www.tweaktown.com/news/97705/metas-next-gen-llama3-llm-is-here-and-the-intel-arc-a770-outperforms-geforce-rtx-4060/index.html and https://www.plainconcepts.com/maximizing-ai-performance-intel-arc-a770-gpu/

97705_03_ai-llm-performance-on-intel-arc-a770-16gb-gpu-outperforms-geforce-rtx-4060-8gb-using-cuda_full.jpg

perfomance-comparison-chart.png

This is mainly a concern should we ever plan on onboarding someone new as giving someone we barely know this level of access is a major risk.

Agreed. Maybe we can move repo creation more centrally (dryrun was a major step towards that btw.) and maybe have finer-grained tokens for, say, only uploads. At some point.

The format is relatively secure.

The format is completely secure, I think. But I don't trust the gguf parsers one bit.

list of fears

yeah, these are all definitely on the realistic part of the scale. In fact, I am surprised this got as far as it got, and I expect total breakdown daily. Enshittification is a thing, and it already happens with hf as well, although at a surprisingly low level so far.

HuggingFace Xet storage disrupting our operations.

my immediate short term concern, yes :)

Richard just got rate limited today:

Holy shit, what did they call the 5MB/s bandwidth before? unlimited? :-)

While we are at it, one thing Richard doesn't like is him only having 2 tasks on rich1.

Well, nice level 1400 is very far down the list, so the scheduler does reserve resources for higher priority things. Some tuning might always be required (the logic is pure mess, always changing :)

But the real problem with richard will be once we are through the low pri models, which will be soon. Richard and I will have to find a new mode of operations.

Also he would like the ability to set the number of parallel tasks:

As in less than two, or more than two? I suspect once we are through the models, it would make most sense to limit it to two, so he always has guaranteed resources available for himself, since we likely won't need him full time anymore (likewise nico2). rich1 is the fastest box we have that is always available (marco is hampered by disk speed mostly. he was thinking of buying an nvme for just this).

I'm enjoying locally running AI models so much I ordered 2x Intel Arc 770 last weekend.

Yeah, I wondered about intel arc, too, in the beginning, and then they cancelled their promised 64GB model and generally fucked up their drivers again, so I was dissuaded. But things seem to be improving, that is good. If anybody needs more competition, it's nvidia, and if anybody needs better, more competitive products, it's intel at the moment. We saw how long it took AMD to mirror the shitty nvidia price hikes (i.e. instant), and I don't doubt intel's death will immediately cause amd to become the new evil intel. Not to speak of the shady practice of artificially reducing PCIe lanes to sell more server hardware (which amd also copied from intel). Enshittification everywhere.

Although, I must admit, I was dreaming about intel's death ever since I had a dual opteron.

In any case let's just enjoy the moment. There never was a better time to enjoy all these amazing openly available AI models!

Yes, very depressing. Oh, you meant this to be uplifting??

containing all relevant files for a GPTNeoXTokenizerFast tokenizer.

do you know a way of doing something about this? happens with https://huggingface.co/zaas12/pythia-1.4-orpo for example. if it is as simple as installing it, there might be a whole bunch of models with this or similar issues (missing python packages)

btw., so amusing: https://emygervais.github.io/2025/03/15/bytecraft.html (saw the model this morning)

sigh. the amusing bytecraft model caused 100% endless loop on rich, blocking everything.

CPU based prompt processing will soon be so much faster for Q4_K on AVX2 compatible CPUs: https://github.com/ggml-org/llama.cpp/pull/12332

Well, nice level 1400 is very far down the list, so the scheduler does reserve resources for higher priority things. Some tuning might always be required (the logic is pure mess, always changing :)

It would be nice if there were always some models waiting around idle, ready to take over. Especially for Richard who cares way too much about his server getting fully utilized all the time.

But the real problem with richard will be once we are through the low pri models, which will be soon. Richard and I will have to find a new mode of operations.

You will have to keep rich1 busy or he will start using it for his own quants of that list of trash models you gave him. By the way he is doing an awesome job abusing Google Colab and soon his home internet for Native/AWQ/MLX quants.

As in less than two, or more than two? I suspect once we are through the models, it would make most sense to limit it to two, so he always has guaranteed resources available for himself, since we likely won't need him full time anymore (likewise nico2). rich1 is the fastest box we have that is always available (marco is hampered by disk speed mostly. he was thinking of buying an nvme for just this).

What do you expect from Richard? Obviously he wants to run 3. Here is a quote from him:

why rich1 so smol ?
I paid today, I want full load 💀
we need to find something to process on rich1
a third queue or something just to run to hit this 100% cpu
I pay for whole server, I use whole server
can I have a button to switch 2 models concurrently to 3/4 models concurrently?

Usually he would run his own quants on rich1 as well to make sure it is maxed out, but the HuggingFace repository creation rate limit hit him today so he cannot really max it out.

Yeah, I wondered about intel arc, too, in the beginning, and then they cancelled their promised 64GB model and generally fucked up their drivers again, so I was dissuaded. But things seem to be improving, that is good. If anybody needs more competition, it's nvidia, and if anybody needs better, more competitive products, it's intel at the moment. We saw how long it took AMD to mirror the shitty nvidia price hikes (i.e. instant), and I don't doubt intel's death will immediately cause amd to become the new evil intel. Not to speak of the shady practice of artificially reducing PCIe lanes to sell more server hardware (which amd also copied from intel). Enshittification everywhere.

Luckily StormPeak has 128 PCIe 5.0 lanes and 16 PCIe 4 lanes. AMD is quite generous with the latest gen Threadripper Pro as even the cheapest 12-core model for $1400 comes with the full 128 PCIe lanes, offering an absolutely awesome price/PCIe lane ratio.

All manufacturers' latest gen GPUs are shit. Nvidia's 50-series got backported to TSMC 5nm and is really inefficient and basically just a larger 40-series for an insane price. AMD costs way too much for just 16 GB of memory and ROCm is the biggest pain ever to use for AI and basically anything else. Intel Arc's latest gen is decent but only has a 192-bit bus, worse than their previous generation, with only 12 GB of memory and so less bandwidth and far fewer AI cores, but there is hope for an awesome 24 GB clamshell model later this year.

Intel Arc 770 is truly awesome. This is not the latest generation they released this year but what they released 2.5 years ago. They now offer 16 GB of GDDR6 at a 256-bit bus with 560 GB/s bandwidth and 512 AI cores for $280 while NVidia offers a 4060 Ti 16 GB with a 128-bit bus using clamshell for $600. For the price of an RTX 5090 I could buy over 8x Intel ARC 770, which combined would be 128 GB GDDR6 at 4480 GB/s on a 2048-bit bus, totaling 4096 AI cores. Really, the price/performance you currently get with the last gen Intel Arc 770 is insane. It is also worth considering that despite its age it is using TSMC 5 nm like the NVidia 40-series of GPUs. And now, so many years after the initial Intel Arc launch, they finally got the software side of things perfect: PyTorch, llama.cpp, axolotl and vLLM all work without any issues on Intel Arc, both on Windows and Linux. I just hope it doesn't have the audio stuttering or PCIe device reset issues I'm currently experiencing on the RTX 4090, or the random Linux kernel crashes I'm experiencing using Sparkle Intel ARC A310 4GB GPUs at my job. I will for sure let you know how it goes once they arrive. They will probably both go into StormPeak so it then has 2x RTX 4090 + 2x Intel Arc 770, and we can keep the RTX 3080 in CastlePeak and the RTX 2070s in Threadripper for the RPC setup.

Although, I must admit, I was dreaming about intel's death ever since I had a dual opteron.

Regarding CPUs, Intel is dead for me since they removed AVX-512. I'm just not buying any CPU without AVX-512. Doing so would be stupid. I want my fast memset. Jokes aside, there are applications like llama.cpp on CPU and AV1 encoding where AVX-512 makes quite a massive difference. But I'm generally not that happy with AMD64. I really wish we could soon move on and use RISC-V based CPUs for PCs. I'm already using RISC-V based CPUs for use-cases where security matters. I also really miss transactional memory, which Intel promised many times, then messed up and has now just abandoned. With the latest security vulnerability AMD CPUs got a whole lot more interesting anyways. You can now jailbreak them and write your own microcode: https://bughunters.google.com/blog/5424842357473280/zen-and-the-art-of-microcode-hacking

Yes, very depressing. Oh, you meant this to be uplifting??

It was meant to be uplifting but I guess it can also be seen as quite depressing depending on whether you value the now or the future. Things will likely never be as good as they are now, at least for me, as I have basically reached the peak of joy and happiness: I have an awesome job I truly enjoy and which I'm looking forward to every day, and during my spare time I can have fun with all these awesome openly available AI models. There just is no way things stay anywhere close to as good as they currently are. I recommend to just enjoy the moment as long as it lasts. I made sure to back up all the base models and the models I like the most just in case.

do you know a way of doing something about this?

The entire error is:
Can't load tokenizer for '/bpool/pythia-1.4-orpo'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/bpool/pythia-1.4-orpo' is the correct path to a directory containing all relevant files for a GPTNeoXTokenizerFast tokenizer.

https://huggingface.co/zaas12/pythia-1.4-orpo/tree/main does not contain a tokenizer.json or tokenizer.model so the model simply has no tokenizer. To fix this error, just copy the GPTNeoXTokenizerFast tokenizer from a different model into the folder containing the downloaded model.

For this specific model you know, based on the "_name_or_path" inside the config.json, that it was trained based off "EleutherAI/pythia-1.4b"
So you could download:

After which the model successfully converts into a GGUF. I stored the resulting GGUF under /tmp/quant/pythia-1.4-orpo.gguf.
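
For reference, grabbing those tokenizer files could look roughly like this (a sketch using huggingface_hub; the exact set of files is an assumption based on what pythia repositories usually contain):

from huggingface_hub import hf_hub_download

# Copy the GPTNeoXTokenizerFast tokenizer files from the base model into the model folder.
# The file list is an assumption; adjust it to whatever the base repository actually contains.
model_dir = "/bpool/pythia-1.4-orpo"
for filename in ("tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"):
    hf_hub_download(
        repo_id="EleutherAI/pythia-1.4b",
        filename=filename,
        local_dir=model_dir,
    )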

missing python packages

I noticed a ton of models with missing python packages and I was wondering why we keep nuking them instead of installing the proper dependencies. It seems quite stupid that we don't support models where the HF to GGUF conversion depends on a specific python package. I guess now that we maintain our own llama.cpp fork we could add them all to the requirements.txt

btw., so amusing: https://emygervais.github.io/2025/03/15/bytecraft.html (saw the model this morning)

Who would have guessed: "Working in the byte world is extremely challenging because a single wrong byte can break the whole functioning of the file."
This idea is so insane. You basically teach an LLM to write bytes instead of ASM. It might work but damn is it an insane idea. They should have at least used a tokenizer that made some sense for this use case or, even better, just trained a model from scratch, because this is so far from the common use-case of an LLM that starting fresh would likely be justified. What's next? An AI that creates a ROP chain so I can run a game inside a banking application by abusing a buffer overflow?

Not to speak of the shady practice of artificially reducing PCIe lanes to sell more server hardware (which amd also copied from intel). Enshittification everywhere.

Intel has done plenty of shady shit, but this situation is far more nuanced, as PCIe lanes take up valuable die space, PCIe switches have become too expensive for consumer usage, etc.

"PLX was acquired by Avago in 2014 in a deal that valued the company at $300m, and seemingly overnight the cost of these switches increased three-fold according to my sources at the time, making them unpalatable for consumer use." source: https://www.anandtech.com/show/15821/microchips-new-pcie-40-pcie-switches-100-lanes-174-gbps
second source that mentions this increase: https://www.servethehome.com/business-side-plx-acquisition-impediment-nvme-everywhere/

My go to example for intel screwing consumers over is ECC memory, there was no technical limitation on that at all, and consumer platforms having less stability because of it is a legacy we still deal with to this day.

Intel has done plenty of shady shit, but this situation is far more nuanced, as PCIe lanes take up valuable die space

That is why AMD puts a dedicated IO die in their CPUs. But even on a monolithic design PCIe lanes and memory channels are always worth the die space they use. More PCIe lanes means more GPUs and more SSDs, and more memory channels means faster CPU inference performance. I'm so happy StormPeak has octa-channel memory.

PCIe switches have become too expensive for consumer usage

I never really saw the appeal of PCIe switches. What is the advantage of using a PCIe switch compared to just using PCIe bifurcation to split PCIe bandwidth between multiple devices? When I want to plug 4 GPUs into one PCIe x16 slot I'm just using a x4x4x4x4 bifurcation card for $30 and I have something reliable that equally distributes the bandwidth between all the GPUs. But cheap PCIe redrivers would be super useful. My mainboard luckily has some integrated but having them after the PCIe riser cable would likely make way more sense.

My go to example for intel screwing consumers over is ECC memory, there was no technical limitation on that at all, and consumer platforms having less stability because of it is a legacy we still deal with to this day.

ECC memory is huge. Having ECC memory is an absolute must. I need to be able to trust my PC. Without ECC memory I would have to basically do all computations twice and compare results which would be insanely wasteful. This is by the way exactly what I did before I had ECC memory. All my PCs since 2017 have ECC memory. Threadripper has 128 GB ECC UDIMM DDR4 memory, CastlePeak has 256 GB ECC UDIMM DDR4 memory and StormPeak has 512 GB ECC RDIMM DDR5 memory. For anyone telling me that ECC errors are unlikely: No they are not. My ECC RAM puts a kernel log entry every time one happens and they indeed do happen and ECC always manages to correct them. Same thing as bit rot on TLC SSD is something that happens.

That is why AMD puts a dedicated IO die in their CPUs.

Yes, but that is a more recent thing compared to losing out on cheap PCIe switches.

But even on a monolithic design PCIe lanes and memory channels are always worth the die space they use.

Again it really isn't that simple, it has to be beachfront silicon, as it is I/O, and if you want more of that you have to use a bigger chip. It's not like you can just shrink the cores or use fewer cores as that won't give you any more beachfront silicon.

I never really saw the appeal of PCIe switches.

They are incredibly useful, and do far more than what bifurcation can. Even with the dedicated I/O die on zen CPUs the chipsets still are PCIe switches (with extra functionality), look up the block diagrams of the X570 and B650 chipsets and you'll see they are PCIe switches (although again they do offer a bit more than a standard PCIe switch, but the switching functionality is still core and important).

I agree with you on ECC, even if I haven't been fortunate enough to use exclusively ECC computers, my desktop is still not ECC but my NAS and my servers are.

You will have to keep rich1 busy or he will start using it for his own quants of that list with trash models you gave him.

I don't know what that means. Will he take my ability away to use it? That sucks. Will he use the idle time for other purposes? Sounds great to me - why keep it artificially busy. In any case, if for some reason things get too bad (there is no indication of that at the moment :) I'd rather not have rich1 then.

I can add a few more models to the queue, though.

What do you expect from Richard. Obviously he wants to run 3. Here a quote from him:

We've been there before. There is no way to help technically illiterate people. We can run three models once his server has the disk and memory to do so. Right now, it clearly doesn't have the memory, nor the disk, nor the network, for more.

My ECC RAM puts a kernel log entry every time one happens and they indeed do happen and ECC always manages to correct them.

Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.

So, yeah, they do happen, but many other faults are more likely. It is certainly nice to have this fuzzy feeling of extra security, though, but the performance drain is objectively not worth it for most applications.

Same thing as bit rot on TLC SSD is something that happens.

I thought so, too, but all my data happens to be checksummed, and even for ssds that have been stored for years, I've never had a bit error (admittedly, I only have crucial ssds that are that old). But I might simply have been lucky.

When I want to plug 4 GPUs into one PCIe x16 slot I'm just using a x4x4x4x4 bifurcation

and enjoy 25% speed of prompt processing and many other tasks?

Intel has done plenty of shady shit, but this situation is far more nuanced

I disagree with the "far". Intel has reduced the number of pcie lanes in desktop cpus over the years. So, yeah, some nuance, but intel did this for segmentation reasons. change my mind, but that is what intel has been doing for many, many years now.

Actually there must be a bug with the queue on rich1.

Actually, there isn't. There simply isn't enough budget to add more big jobs. The only way out is to reduce the size of the jobs it can accept, greatly reducing its usefulness.

Anyway, the queue will become empty at some point, and other than idiotically wasting cpu cycles, there is no way we can avoid becoming idle at some point.

I've reduced max model size for rich to 100B.

Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.

So, yeah, they do happen, but many other faults are more likely. It is certainly nice to have this fuzzy feeling of extra security, though, but the performance drain is objectively not worth it for most applications.

I don't have great information on the prevalence of errors so I'm not going to argue one way or the other, but ECC offers far more than just a "fuzzy feeling of extra security". Error detection and monitoring is a huge benefit, such as helping you find and deal with faulty hardware before it causes actual problems, and telling you if that is or is not the cause of the issue you are experiencing.

I disagree with the "far". Intel has reduced the number of pcie lanes in desktop cpus over the years. So, yeah, some nuance, but intel did this for segmentation reasons. change my mind, but that is what intel has been doing for many, many years now.

Huh? I agree with you that it is Intel intentionally doing this for market segmentation, but my point was that for consumers, motherboard PCIe lanes went down far more because of the lack of cheap PCIe switches, as most lanes before were provided by PCIe switches, and that is still a thing with modern chipsets still stepping in to offer more lanes than the CPU provides. Memory channels and PCIe lanes are far more costly than cores and market segmentation based on that isn't entirely unreasonable. The scummy shit is them doing stuff like this: "When Intel introduced Haswell-E, it experimented with a new type of product separation: it also varied the number of CPU-host PCIe lanes among the SKUs. This practice continues in Broadwell-E, in an almost identical fashion. The lowest end CPU has 28 PCIe 3.0 lanes, capable of three-way GPU setups (and no 2x16 setups), while the other processors have a full 40 PCIe 3.0 lanes" source https://www.anandtech.com/show/10337/the-intel-broadwell-e-review-core-i7-6950x-6900k-6850k-and-6800k-tested-up-to-10-cores

If you wanted more memory channels or PCIe lanes you went onto the HEDT platforms; they also didn't really do HEDT for a while, but again that is far after the period we are talking about with PCIe lanes going down, and a whole different story.

I'm not trying to change your mind as I'm not really even sure where we disagree. I just think ECC is a far simpler and clearer example of Intel being scummy and locking consumers out of things, and even the mainstream platform being kept at quad cores for far longer than it should have been, and the way they segmented hyper-threading, both have less nuance than the PCIe/memory class segmentation between HEDT and consumer.

such as helping you find and deal with faulty hardware before it causes actual problems

Helping is a relative term. There are many places where you can get bit errors, such as inside your CPU. I've had more cases of faulty cpus and cpu memory controllers than I ever had memory errors.

Point being, ECC is a relatively minor thing. When I hear that without ECC memory, one does all the calculations twice, this is just cargo cult.

Also, it's really fun to pull nico's leg sometimes.

I'm not trying to change your mind as I'm not really even sure where we disagree.

I don't think we are in any significant disagreement :)

Again it really isn't that simple, it has to be beachfront silicon, as it is I/O, and if you want more of that you have to use a bigger chip. It's not like you can just shrink the cores or use fewer cores as that won't give you any more beachfront silicon.

Ah that explains why the I/O die on StormPeak is so physically massive compared to the dies that contain the actual cores. I always wrongly assumed this is the case because they use cheaper older wafers for I/O dies.

They are incredibly useful, and do far more than what bifurcation can. Even with the dedicated I/O die on zen CPUs the chipsets still are PCIe switches (with extra functionality), look up the block diagrams of the X570 and B650 chipsets and you'll see they are PCIe switches (although again they do offer a bit more than a standard PCIe switch, but the switching functionality is still core and important).

You are right. On my WRX90E-SAGE SE mainboard the chipset also acts as a PCIe switch. It serves the two SlimSAS ports, each running at PCIe 4.0 x4, and the 4 SATA ports.

amd-wrx90.png

I agree with you on ECC, even if I haven't been fortunate enough to use exclusively ECC computers, my desktop is still not ECC but my NAS and my servers are.

ECC is awesome! I love every bit of it.

I don't know what that means. Will he take my ability away to use it? That sucks. Will he use the idle time for other purposes? Sounds great to me - why keep it artificially busy. In any case, if for some reason things get too bad (there is no indication of that at the moment :) I'd rather not have rich1 then.

Sorry for being unclear. No, he will obviously not take rich1 away when we don't use it, but he will make use of any idle resources to do models for his own account. He even does this now during the short downtimes we sometimes have due to the repo creation rate limit.

I can add a few more models to the queue, though.

I personally would be happier to see the queue empty out and be like it was before November than to keep things as crazy as they currently are. But if you find great models, please queue them; we don't need to queue garbage just to satisfy Richard. He can do his own quants if he is unsatisfied with rich1 utilization, which he actually already does every time we don't max out his CPU.

We've been there before. There is no way to help technically illiterate people. We can run three models once his server has the disk and memory to do so. Right now, it clearly doesn't have the memory, nor the disk, nor the network, for more.

Yes exactly. I fully agree. He figured it out the hard way today after insisting on me telling him how to increase the default OpenWrt 65536 active connection limit. He increased it to 655390 just to figure out that the limit existed for a reason: above 80K concurrent connections things started to get unstable and he lost connection to the entire server. Sometimes he just has to see things for himself to learn. He is still young and so has to get his experience from somewhere. It's quite funny how he keeps complaining about why everything keeps breaking for him without ever wondering if he might be the reason why. There is a 150-Watt peak power limit for the GPU in his laptop. He thought it was a great idea to remove that and run it 24/7 at 200-Watt normal power. Let's see how long that lasts. He just does all the stupid things a teenager would do, but with computer hardware.

Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.

Let's check the logs on Threadripper and see who is right. I investigated a ton of ECC errors in late 2024 so I for sure wouldn't consider them "exceedingly rare". They are surprisingly common for me.

So, yeah, they do happen, but many other faults are more likely. It is certainly nice to have this fuzzy feeling of extra security, though, but the performance drain is objectively not worth it for most applications.

It really depends on the type of computations you run. For me correctness is way more important than anything else, especially back when I did scientific computing. I guess it matters less now that I mostly do AI. This is also why I have not enabled ECC on the GPUs you use for imatrix computation, as doing so would lead to an over 10% performance decrease for very little benefit for our use case. ECC doesn't matter for my current use cases as much as it did in the past, but it still is an important nice-to-have and worth the hardware and performance cost for sure.

I thought so, too, but all my data happens to be checksummed, and even for ssds that have been stored for years, I've never had a bit error (admittedly, I only have crucial ssds that are that old). But I might simply have been lucky.

It seems to really depend on the SSD controller, storage chips and ECC algorithm. I think so far it was only late PCIe 3 and early PCIe 4 SSDs from Samsung and Kingston from which I experienced bit rot issues. The SSDs you currently use for nico1 are notorious for uncorrectable bit rot, so if you ever store a large file, don't read it for half a year and the host wouldn't run monthly scrubs, it would have a quite high likelihood of being corrupted after that half year. This is one of the main reasons I gave you those specific SSDs. Their bit rot was a massive pain for me and I wasted dozens of hours on it. Corrupted rotten blocks kept breaking my backups, as every time one got encountered the backup got canceled, resulting in me having to spend hours searching for whatever file contains the faulty block, restoring it from backup, trimming all empty space and hoping the issue is gone. I had Windows Server on those SSDs, so no fancy file system like ZFS or BTRFS to tell me which files are broken, so I just had to dd the entire thing, see where it fails and then figure out what file is at that position.

and enjoy 25% speed of prompt processing and many other tasks?

I really should reduce the 4090 GPUs to x4 and see if there is a performance difference for imatrix computation. You are of the opinion that all that matters for imatrix performance is PCIe bandwidth, but I'm almost certain imatrix is RAM bottlenecked, as this is what is getting hot while it is running. Last weekend I even installed a piece of plastic inside StormPeak to direct airflow towards the RAM, because before, every time we did imatrix computation, everyone in the entire house heard the fans and joked that I have a hay drier in the basement, and it was actually so loud that sitting next to it for a long period of time made my ears hurt. Since I did that modification imatrix computation is almost quiet.

I disagree with the "far". Intel has reduced the number of pcie lanes in desktop cpus over the years. So, yeah, some nuance, but intel did this for segmentation reasons. change my mind, but that is what intel has been doing for many, many years now.

And this is why I don't buy Intel or normal AMD CPUs. I mainly care about PCIe lanes and memory channels when buying a CPU as this is what ends up bottlenecking me. I really hope AMD keeps Threadripper around because EPYC server mainboards suck.

Actually there must be a bug with the queue on rich1.
Actually, there isn't. There simply isn't enough budget to add more big jobs. The only way out is to reduce the size of the jobs it can accept, greatly reducing its usefulness.

Ah yes, that explains why only 2 got put there. The current models somehow got massive because we reached the big model part of the priority 1400 models. I still don't get why we sort models by priority. Doing so seems a bit dumb, because then once we are done with them we have to rush manually adding more big ones to not get HuggingFace repo creation limited.

Anyway, the queue will become empty at some point, and other than idiotically wasting cpu cycles, there is no way we can avoid becoming idle at some point.

Which is a good thing as then Richard can use it for his own purposes while we are not using it.

I've reduced max model size for rich to 100B.

Great. That should ensure that there are always some idle models on rich1.

I don't have great information on the prevalence of errors so I'm not going to argue one way or the other

That's why I will now actually check my kernel logs on Threadripper, because I should have the data there: for some reason journalctl keeps all the kernel logs since the 31st of May.

ECC offers far more than just a "fuzzy feeling of extra security". Error detection and monitoring is a huge benefit, such as helping you find and deal with faulty hardware before it causes actual problems, and telling you if that is or is not the cause of the issue you are experiencing.

I absolutely awesome. It already helped me detect many issues. Mainly while building a new PC.

Memory channels and PCIe lanes are far more costly than cores and market segmentation based on that isn't entirely unreasonable.

I never realized that they would be so expensive given that even the cheapest latest-gen Threadripper Pro has 128 PCIe 5.0 lanes and 8 memory channels despite only having 12 cores.

Helping is a relative term. There are many places where you can get bit errors, such as inside your CPU. I've had more cases of faulty cpus and cpu memory controllers than I ever had memory errors.

ECC doesn't just correct memory errors that happen due to faulty bits but also errors that happen during the transfer from memory to CPU. Unless the CPU really has some bug, data integrity should be guaranteed. DDR5 has some in-memory ECC but that is not checking the transfer so even for DDR5 I’m using proper ECC memory. But honestly if you don’t use your PC for anything important DDR5 without ECC will likely be fine as the internal ECC in all DDR5 memory is quite decent at preventing random memory errors due to things like cosmic rays.

Point being, ECC is a relatively minor thing. When I hear that without ECC memory, one does all the calculations twice, this is just cargo cult.

I actually did so back in university for all scientific calculations because I couldn't risk them being wrong.

I don't think we are in any significant disagreement :)

We are not.

Also, it's really fun to pull Nico's leg sometimes.

Or more like make me spend 2 hours reading, researching and replying to the massive wall of text you all wrote today. Joking aside, it actually was a very interesting discussion and this is the first time I closely looked at the ECC error log as a whole instead of investigating specific ECC events.

This is what a typical ECC event looks like:

Aug 30 19:47:21 Threadripper kernel: mce: [Hardware Error]: Machine check events logged
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: Corrected error, no action required.
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: Error Addr: 0x000000019a9b9040
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x0000fd010a400302
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Aug 30 19:47:21 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Dec 24 02:54:42 Threadripper kernel: mce: [Hardware Error]: Machine check events logged
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: Corrected error, no action required.
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: Error Addr: 0x000000019a9b9040
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x0000fd010a400302
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Dec 24 02:54:42 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

And here are all the ECC events that happened on Threadripper with 128 GB of DDR4 ECC memory since the 31st of May:

Aug 30 19:47:21 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 28 14:45:22 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 28 21:24:02 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 29 05:30:06 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 30 07:42:58 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 01 04:28:09 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 03 02:09:44 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 04 02:55:13 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 04 03:00:41 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 04 14:56:07 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 05 14:30:36 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 02:04:11 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 03:53:25 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 03:58:52 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 04:04:20 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 04:09:48 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 08:04:38 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 10:33:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 12:01:16 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 12:06:44 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 12:12:11 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 20:12:47 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 17:56:29 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 17:58:03 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:03:30 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:08:58 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:14:26 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:19:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:25:21 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:30:49 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:36:16 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 20:41:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:21:00 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:26:28 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:31:56 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:37:24 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:42:51 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:48:19 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:53:47 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:59:14 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:04:42 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:10:10 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:15:37 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:21:05 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 12:37:37 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:21:18 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:26:46 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:32:14 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:37:41 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:43:09 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:48:37 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:54:04 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:59:32 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:05:00 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:10:28 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:15:55 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:21:23 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:26:51 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:32:18 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 18:27:08 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 10 21:40:05 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 11 03:02:18 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 11 18:08:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 24 02:54:42 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)

This is a total of 64 corrected ECC errors in less than 10 months. I wouldn't consider this rare. It's also quite surprising that the issue seems to always happen at the same address, so maybe there is some sort of hardware defect that makes this specific address much more likely to have issues.
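
For anyone who wants to reproduce this kind of tally, a minimal sketch (assuming systemd-journald keeps persistent kernel logs back to that date and the EDAC driver is loaded; adjust the date to your own log retention):

# count corrected ECC events reported by EDAC since the end of May
journalctl -k --since "2024-05-31" | grep -c 'EDAC MC0: 1 CE'

# list them with timestamps to spot clusters and repeated addresses
journalctl -k --since "2024-05-31" | grep 'EDAC MC0: 1 CE'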

Your results suggest that there is something seriously wrong with your hardware. I wouldn't trust it if it generates this massive amount of bit errors. insert funny theory about the swiss alps being nearer to space

The only times I ever got ECC errors in the last 20 years (I don't think I had ECC detection hardware before the 2000s) were a hardware erratum in an Intel CPU (to be ignored) and actually faulty RAM sticks. I am seriously distrusting your hardware now. I mean, the RAM -> CPU path is not the only thing that can go wrong, and the failures you have are massive. Once every 5 years would be more acceptable, IMnsHO.

Since it seems to be always the same address (if I read that correctly, which I probably don't), this also indicates that your RAM is indeed faulty. So, yeah, ECC found it, but so would a burn-in with a memory checker.

Ah that explains why the I/O die on StormPeak is so physically massive compared to the dies that contain the actual cores. I always wrongly assumed this is the case because they use cheaper older wafers for I/O dies.

I hate to sound repetitive, but there is more nuance again. They do use older nodes for the I/O die, and that does result in it being larger, but not by that much, because I/O is one of those things that does not scale well with process nodes. That adds to the problem we were talking about before of it taking up valuable die area, as process node shrinks reduce the die area of cores far more than that of I/O.

You are right. On my WRX90E-SAGE SE mainboard the chipset also acts as a PCIe switch. It serves the two SlimSAS ports, each running at PCIe 4.0 x4, and the four SATA ports.

Yep, that is typical. But also, like I said before, it's not just ubiquity; PCIe switches are capable of things that bifurcation isn't.

ECC is awesome! I love every bit of it.

I do like it; memory instability sucks, as someone who has dealt with it in recent times (solved by RMA'ing the RAM).

Yes, exactly. I fully agree. He figured that out the hard way today after insisting that I tell him how to increase OpenWrt's default limit of 65536 active connections. He increased it to 655390 just to figure out that the limit existed for a reason: above 80K concurrent connections things started to get unstable and he lost connection to the entire server. Sometimes he just has to see things for himself to learn. He is still young and so has to get his experience from somewhere. It's quite funny how he keeps complaining that everything keeps breaking for him without ever wondering if he might be the reason why. There is a 150-watt peak power limit for the GPU in his laptop. He thought it was a great idea to remove that and run it 24/7 at 200 watts sustained. Let's see how long that lasts. He just does all the stupid things a teenager would do, but with computer hardware.

His experimenting sounds fun (might have something to do with the fact that I'm not at all impacted by it). You can learn by doing, but you don't find out what you did wrong for every mistake. I still don't know why I couldn't get jumbo frames working on a point-to-point link (so very few things involved, and all of them should support it) a few years ago.

It seems to really depend on the SSD controller, storage chips and ECC algorithm. I think so far it was only late PCIe 3 and early PCIe 4 SSDs from Samsung and Kingston from which I experienced bit rot issues. The SSDs you currently use for nico1 are notorious for uncorrectable bit rot, so if you store a large file and don't read it for half a year, and the host doesn't run monthly scrubs, it has a quite high likelihood of being corrupted after that time. This is one of the main reasons I gave you those specific SSDs. Their bit rot was a massive pain for me and I wasted dozens of hours on it. Corrupted rotten blocks kept breaking my backups: every time one was encountered, the backup got canceled, so I had to spend hours searching for whatever file contained the faulty block, restore it from backup, trim all empty space and hope the issue was gone. I had Windows Server on those SSDs, so no fancy file system like ZFS or BTRFS to tell me which files were broken; I just had to dd the entire thing, see where it failed and then figure out what file sits at that position.

Thank you for this story (and I would love to know more if you don't mind). I'm very picky about buying SSDs (when I can afford to be); quality like you saw varies, but what bothers me a lot is that it is easier to count the companies that don't do the scummy thing of changing internal components without changing the SKU, as it is so commonplace, which makes actually evaluating quality MUCH harder.

I don't have great information on the prevalence of errors so I'm not going to argue one way or the other

That's why I will now actually check my kernel logs on Threadripper, because I should have the data there: for some reason journalctl keeps all the kernel logs since the 31st of May.

[...]

Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.

Let's check the logs on Threadripper and see who is right. I investigated a ton of ECC errors in late 2024, so I for sure wouldn't consider them "exceedingly rare". They are surprisingly common for me.

I've seen this conversation happen so many times which is why I bowed out early, but like always it will be fun for me to hear it happen again.

I absolutely awesome.

??? Lol.

It already helped me detect many issues. Mainly while building a new PC.

If you build a new PC you should do thorough testing, which includes memory testing; that would find those issues regardless of ECC. (Also, on that note, the number of memory checkers that handle ECC intelligently is literally one from what I found, and it is sadly paywalled for the useful version, as I recently found out when dealing with a server and testing it.)

I never realized that they would be so expensive given that even the cheapest latest-gen Threadripper Pro has 128 PCIe 5.0 lanes and 8 memory channels despite only having 12 cores.

The MSRP of the AMD Threadripper PRO 7945WX is $1399, which is well outside consumer CPU pricing, and it requires a motherboard and RAM that are also much more expensive than consumer parts (especially if you want to make use of the eight memory channels). I'm not making a value judgement here, but it is objectively in a different price segment than consumer stuff, as most consumers wouldn't even spend half of that CPU's price on an entire system.

ECC doesn't just correct memory errors that happen due to faulty bits but also errors that happen during the transfer from memory to CPU. Unless the CPU really has some bug, data integrity should be guaranteed. DDR5 has some in-memory ECC but that is not checking the transfer so even for DDR5 I’m using proper ECC memory. But honestly if you don’t use your PC for anything important DDR5 without ECC will likely be fine as the internal ECC in all DDR5 memory is quite decent at preventing random memory errors due to things like cosmic rays.

You are correct about the difference in ECC, but your last sentence is very odd to me. If I'm not using a PC for anything important, anything is fine, but even then I would trust a DDR4 system over a DDR5 one: the in-memory ECC exists because of the extremely high data transfer rates inherent to the standard, memory controllers are generally less mature, and DDR5 is still inherently more challenging to run.

Even the PCIe standard had to add error correction (though for PCIe 6.0 they also switched to PAM4, while DDR5 still uses NRZ like previous PCIe revisions):

"because of the additional signal states a PAM4 signal itself is more fragile than a NRZ signal. And this means that along with PAM4, for the first time in PCIe’s history the standard is also getting Forward Error Correction (FEC). Living up to its name, Forward Error Correction is a means of correcting signal errors in a link by supplying a constant stream of error correction data, and it’s already commonly used in situations where data integrity is critical and there’s no time for a retransmission (such as DisplayPort 1.4 w/DSC). While FEC hasn’t been necessary for PCIe until now, PAM4’s fragility is going to change that. The inclusion of FEC shouldn’t make a noticeable difference to end-users, but for the PCI-SIG it’s another design requirement to contend with. In particular, the group needs to make sure that their FEC implementation is low-latency while still being appropriately robust, as PCIe users won’t want a significant increase in PCIe’s latency.

The upshot of the switch to PAM4 then is that by increasing the amount of data transmitted without increasing the frequency, the signal loss requirements won’t go up. PCIe 6.0 will have the same 36dB loss as PCIe 5.0, meaning that while trace lengths aren’t officially defined by the standard, a PCIe 6.0 link should be able to reach just as far as a PCIe 5.0 link. Which, coming from PCIe 5.0, is no doubt a relief to vendors and engineers alike." Source: https://www.anandtech.com/show/14559/pci-express-bandwidth-to-be-doubled-again-pcie-60-announced-spec-to-land-in-2021

Joking aside, it actually was a very interesting discussion

Same for me.

I've avoided Samsung for other reasons (ignoring FUA), but hearing that is a bit shocking. I have been keeping checksums of most of my files for decades now, so even before filesystems had data checksums (well, just btrfs out of the Linux ones, I think), I knew bitrot was a thing. I haven't caught an SSD doing that, but I have caught ext3 and xfs bugs that way in the past, and of course lots of hardware issues.

In any case, I can hardly believe that Samsungs would actually bitrot just after a few months while the disk is actually powered on (even if powered off it's hard to believe). Surely this would be well known if that were really the case in general, rather than some faulty specimens? I mean, I believe you, but, sheesh, it can't be, can it?

In any case, I hope you run a monthly scrub on my disks then? :)
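
For reference, a monthly scrub can be scheduled with something as simple as the following (a sketch only; the pool name "tank" is a placeholder, and Debian's zfsutils-linux already ships a similar cron job by default):

# /etc/cron.d/zfs-scrub - start a scrub at 03:00 on the first day of every month
0 3 1 * * root /usr/sbin/zpool scrub tank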

Sorry to be a bother but will you be doing this model https://huggingface.co/deepseek-ai/DeepSeek-V3-0324 (the new deepseek checkpoint), and if so when do you think you'd have the imatrix done?

Edit: Sorry, I see it's already in progress here

Sorry to be a bother but will you be doing this model https://huggingface.co/deepseek-ai/DeepSeek-V3-0324 (the new deepseek checkpoint)

Yes for sure. You know very well that I love huge models. We even did FatLlama 1.7T in the past. I recommend you follow https://huggingface.co/mradermacher/model_requests/discussions/797 to get news about our progress on DeepSeek-V3-0324.

if so when do you think you'd have the imatrix done?

I want to do the imatrix using Q8 as I value quality and I won't consider anything below Q8 high quality. Unfortunately doing it in Q8 requires the RPC setup, due to requiring significantly more than 512 GB of RAM. The good thing is that the RPC setup is ready to use, but the bad thing is that using it will mean a total outage of the nico1 and nico2 workers for almost a day, with nico1 being the only worker able to do DeepSeek-V3-0324 static quants. I assume we will do imatrix quants as soon as static quants are done. By then we will hopefully also know if StormPeak is still stable with the two new Intel Arc A770 GPUs I installed during today evening's maintenance window. I built two servers with Intel Arc GPUs at work and they both crash almost once a day due to Intel drivers, so I'm quite skeptical. A crash during RPC imatrix computation would be a disaster as it could mean 20 hours of lost work. Thinking about it, we might be able to use the Intel Arc A770s for the RPC setup, but I'm not sure if you can mix the NVIDIA backend with Vulkan or SYCL; if we can, it would mean 30 GB more RAM, which would allow nico1 to continue working during RPC imatrix computation.
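
For context, the basic shape of such an RPC split with stock llama.cpp looks roughly like this (a minimal sketch, not our actual setup; host addresses, port, file names and the -ngl value are placeholders):

# on each helper node: expose its memory/GPUs to the main host
./rpc-server -p 50052

# on the main host: run imatrix with the remote backends attached
./llama-imatrix -m DeepSeek-V3-0324.Q8_0.gguf -f calibration.txt \
    --rpc 192.168.1.10:50052,192.168.1.11:50052 -ngl 99 -o DeepSeek-V3-0324.imatrix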

Edit: Sorry, I see it's already in queue.

Yes it indeed is as you can see under https://hf.tst.eu/status.html

-7777  689 si DeepSeek-V3-0324                             run/static 3/12,Q2_K [133/1025] (hfu Q4_K_S)

Q4_K_S is done and already uploading while Q2_K is currently being computed: https://huggingface.co/mradermacher/DeepSeek-V3-0324-GGUF
Download Page: https://hf.tst.eu/model#DeepSeek-V3-0324-GGUF

I won't consider anything below Q8 high quality.

Sure. But without data, we won't know if Q8_0 vs. Q2_K even makes a measurable difference. Maybe it does, but imatrices are such crude tools, I would be surprised if we wouldn't find that Q4_K_M or some other quant gives practically indistinguishable results from f16.

I'm moving from kernel 6.8.12-9-pve to 6.11.11-2-pve. The i915 driver sucks for Intel Arc GPUs, so I'm switching to Xe, which on kernel 6.8 is too immature.
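
In case anyone wants to do the same, handing the card to Xe instead of i915 is roughly a matter of kernel parameters along these lines (a sketch only; 56a0 is the usual Arc A770 PCI ID used here as a placeholder, and on Proxmox the bootloader refresh command may differ):

# /etc/default/grub - block i915 from binding the Arc card and let xe probe it
GRUB_CMDLINE_LINUX_DEFAULT="quiet i915.force_probe=!56a0 xe.force_probe=56a0"

# regenerate the bootloader config, then reboot
update-grub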

Sure. But without data, we won't know if Q8_0 vs. Q2_K even makes a measurable difference. Maybe it does, but imatrices are such crude tools, I would be surprised if we wouldn't find that Q4_K_M or some other quant gives practically indistinguishable results from f16.

Someday I will measure it. Maybe if the queue ever runs dry. Until then we should at least use Q8 when possible, as for Q8 we know quite certainly that the difference will be so small it will be impossible to ever tell. I even did my 405B quant quality measurements on quants made using a Q8 imatrix and I have not seen anything that would indicate it performing any worse.
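
When that measurement happens, the straightforward approach would be perplexity runs over the same text for each variant, roughly like this (a sketch; the file names are placeholders and the usual caveats about perplexity as a quality proxy apply):

# perplexity of a quant built from the f16-derived imatrix
./llama-perplexity -m model-imat-f16.Q4_K_M.gguf -f wiki.test.raw

# perplexity of the same quant built from the Q8_0-derived imatrix
./llama-perplexity -m model-imat-q8.Q4_K_M.gguf -f wiki.test.raw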

@nicoboss btw., nico2 is enabled every morning via root's crontab on nico1 - if it pings, it will be enabled, so if we want it down (e.g. when doing rpc imatrix computations), we should comment that out (or you can just keep the nico2 vm down for example, as the enabling is done via that)

@mradermacher How to restart an imatrix task I killed? I forgot to initialize the NVidia GPUs before starting your container, so I had to kill them. llmc audit doesn't work for imatrix tasks. Sorry for this. After every host reboot, I need to remember to execute nvidia-smi on the host before starting your LXC container and if I'm really busy with other things I sometimes forget.

@nicoboss btw., nico2 is enabled every morning via root's crontab on nico1 - if it pings, it will be enabled,

Ah, that's why it didn't start today. I just happened to reboot StormPeak at that time to fix some issues with the new Intel Arc GPUs. I restarted StormPeak way too many times today before asking a Vulkan developer why Vulkan on my Intel Arc GPUs isn't working while everything else like SYCL is. In case anyone wonders: unlike NVIDIA's, Intel's userland drivers don't come with Vulkan, so Mesa must be installed.

we should comment that out (or you can just keep the nico2 vm down for example, as the enabling is done via that)

We can just turn off the nico2 LXC container so nothing can turn off the CastlePeak host. We need the RAM for the CastlePeak RPC server anyways.

How to restart an imatrix task I killed?

By telling me, the only way currently. It's no problem :)

[keep the nico2 vm down] We can just turn off the nico2 LXC container

So.. turning it off but not keeping it down(?)

We need the RAM for the CastlePeak RPC server anyways.

Just say the word - I assume same set-up as before?

I'm now preparing the RPC setup. It will be the same as always. It will be using latest llama.cpp. I already updated the mradermacher branch.

@mradermacher The RPC imatrix computation setup is ready! :D

I configured it to maximize available memory on StormPeak, so let's hope that works without OOM. I think it will, as memory was even tighter for 405B at 16 bit than it is here for R1-sized models at 8 bit, and there it just barely worked.

I already updated the mradermacher branch.

Updating...

The RPC imatrix computation setup is ready! :D

... or not :) Also, reminds me that I will have to think about how to update llama on nico2 when it's down, also. Probably from nico1 when it enables it.

Updating...

Great! I just rebooted StormPeak into low-ARC mode so we get an additional 24 GB of RAM.

... or not :)

Everything is ready!

Also, reminds me that I will have to think about how to update llama on nico2 when it's down, also. Probably from nico1 when it enables it.

For now this is not needed. nico2 will stay turned off during RPC computation as CastlePeak is hosting the RPC server using a different LXC container, but yes, updating it on wake would make sense.

nico1 is currently idle and all remaining lownice tasks seem to not require any imatrix, so now seems like the perfect time to start RPC. Timing-wise we should also both be awake when it finishes if we start now, which is really nice.

Starting it now also has the advantage that I might still be awake in case we OOM while loading the model and could adjust RPC servers accordingly.

For now this is not needed.

It is needed, because when nico2 comes up, it should not run on outdated llama.cpp till I remember to update it maybe in a few weeks :)

Case in point, in my euphoria I started the imatrix job before the update was finished, because I did run the update earlier and forgot that it had failed. Probably would have worked, but would have been a mistake nevertheless.

Thanks a lot for starting the imatrix computation.

Case in point, in my euphoria I started the imatrix job before the update was finished, because I did run the update earlier and forgot that it had failed. Probably would have worked, but would have been a mistake nevertheless.

It probably would have worked, but nice that you caught it. Sorry that I just happened to reboot at the exact time you made the update. I only checked that everything on the status page was idle but forgot about llama.cpp updates. I should have rebooted way earlier when I set up the entire RPC configuration, but I forgot that changing the ZFS ARC cache size requires a reboot; usually it never needs one, but if I want to make it quite low I have to put the value into modprobe, rebuild the initramfs and reboot, or it will be ignored.

It is needed, because when nico2 comes up, it should not run on outdated llama.cpp till I remember to update it maybe in a few weeks :)

No worries I will remind you if you forget.

kaos now has all the llama variants and should be able to update nico2 whenever it comes up again. in theory, of course.

Sorry that I just happened to reboot at the exact time you made the update.

You couldn't know, it's not a big deal. It did remind me to change things around, so whenever nodes become enabled, they will be auto-updated now. Other than the rpc link.

No worries I will remind you if you forget.

Right - and now that I thankfully have my own llama upstream maintainer, do you think you can add the current build number or git revision to the GGUFs in quantize? A simple string in mradermacher.llama_quantize_build or so would suffice. That doesn't tell us what version of convert*.py did the thing, but we often wondered which version of llama.cpp did a certain quant, exactly, and that at least gives us the version at quantize time.

PS: forgot if I asked already, if yes, and it was too annoying, just ignore me. this is not a repeat nudge :)

nite. things should continue as normal after deepseek is done

haha, as we talk about it, the net delivers: https://retr0.blog/blog/llama-rpc-rce (just the rpc server though, apparently)

You couldn't know, it's not a big deal. It did remind me to change things around, so whenever nodes become enabled, they will be auto-updated now. Other than the rpc link.

llmc enable nico2 indeed performed the llama.cpp update but nico2 still shows as disabled no matter how many times I try to enable it.

I'll have a look soon :) In any case, the safe choice should be /root/nico2-resume on nico1, that's the daily cronjob (but it also only does llmc enable)

yup, it was broken (by the rsync, even)

StormPeak crashed again earlier this evening thanks to the Intel graphics drivers, which caused the status page to freeze. Luckily the timing couldn't have been better: DeepSeek-V3-0324 finished just 2 minutes before the kernel gave up after over an hour of struggling. I didn't do killall9 yet so you can investigate why it froze; the local scheduler still seems to be doing an amazing job of keeping nico1 busy.

It did remind me to change things around, so whenever nodes become enabled, they will be auto-updated now. Other than the rpc link.

Which is really cool. I loved watching nico2 update on enable.

nite. things should continue as normal after deepseek is done

It did, which was really cool. I was busy with work when suddenly the fans ramped up and I immediately knew the DeepSeek imatrix computation must be done. The only measurement that matters for fan speed is RAM temperature, so it always ramps up when we do non-RPC imatrix, but since I installed a DVD cover to redirect airflow over the RAM last weekend it does so way less than before.

haha, as we talk about it, the net delivers: https://retr0.blog/blog/llama-rpc-rce (just the rpc server though, apparently)

Thanks for linking. What an interesting article. I really like reading these vulnerability writeups. I was quite active in the Nintendo Switch hacking scene.

yup, it was broken (by the rsync, even)

Thanks for fixing it. Today nico2 started as intended.

Right - and now that I thankfully have my own llama upstream maintainer, do you think you can add the current build number or git revision to the GGUFs in quantize? A simple string in mradermacher.llama_quantize_build or so would suffice. That doesn't tell us what version of convert*.py did the thing, but we often wondered which version of llama.cpp did a certain quant, exactly, and that at least gives us the version at quantize time.

I don't really get why you would need a llama.cpp change for that. Couldn't you just add the version number like all other custom metadata, providing it as a command-line argument to llama.cpp? But sure, I can add it if you prefer. Doing so would be really easy.

If you don't know the version you can get it using .\llama-cli.exe --version but keep in mind that on our fork build numbers and git commits will differ from official llama.cpp so maybe the timestamp of the most recent commit would make more sense.
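
For what it's worth, the pieces for stamping this at quantize time already exist as plain commands (a sketch; the checkout path is a placeholder):

# version string as reported by the build itself
./llama-cli --version

# commit id and commit timestamp of the checkout the binaries were built from
git -C ~/llama.cpp log -1 --format='%H %cI'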

PS: forgot if I asked already, if yes, and it was too annoying, just ignore me. this is not a repeat nudge :)

You already did and I was just too busy to answer and then forgot, so good you reminded me again. Please continue to do so for important things I forget. I'm sometimes really busy with my job and usually try to get back to you once things calm down, but it can happen that I forget about something you asked, so keep reminding me if I don't get back to you within a few days.

I didn't do killall9 yet so you can investigate why it froze

Annoyingly enough, because there was a hung rsh that actually needed a kill -9. The global scheduler was still running, as it was only the rsh process that the status daemon was waiting for.

I really like reading these vulnerability writeups. I was quite active in the Nintendo Switch hacking scene.

Me too (writeups) and cool (hacking scene :)

I don't really get why you would need a llama.cpp change for that. Couldn't you just add the version number like all other custom metadata, providing it as a command-line argument to llama.cpp?

I can't get the version in a race-free way. I either would have to do versioned updates, and then somehow track when I can get rid of the old versions, or stop-the-world-then-update, or llama.cpp adds it itself. Having the version be wrong on some quants would IMHO be worse than not having it at all. In any case, doing it without a llama change is likely orders of magnitude more complicated and buggy than the likely one- or two-line change inside llama.cpp. The only issue is that we don't track the convert version, which is probably even more important, but that requires both the script to set it, as well as quantize to preserve it. (Well, in theory I could extract it and set it manually.)

maybe the timestamp of the most recent commit would make more sense.

Yeah, build numbers are likely irrelevant. The git commit id should be the way to go, imho. Maybe that and the most recent commit timestamp. I think llama.cpp already has the former somewhere.

Better quants coming to llama.cpp/ik_llama.cpp soon: https://github.com/ggml-org/llama.cpp/pull/12557

@nicoboss since you have done a lot of work testing quant quality, this may make some of your findings outdated.

Of particular note to you both is this may finally put to rest issues with Q3_K being unusable.

Also on the topic of bitrot, it can also be caused by a failing HDD; thankfully on ZFS that can be handled gracefully.

My ZFS dashboard shows me these statistics:
Read Errors: 0
Write Errors: 0
Self Healed: ~400k (this most recent scrub was clean, but sometimes more get caught in scrubs)

Only one disk, been anticipating a drive failure, I'm somewhat prepared, but still don't feel like proactively replacing it.
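
The same numbers can be read straight off the pool (the pool name "tank" is a placeholder):

# per-device read/write/checksum error counters plus the last scrub's repair total
zpool status -v tank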

@nicoboss yeah, if unpickling is always unsafe, why does transformers pretend otherwise, e.g:

_pickle.UnpicklingError: Weights only load failed. Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution.Do it only if you get the file from a trusted source. WeightsUnpickler error: Unsupported operand 149

ikawrakow:

Oh, I used ik_llama.cpp to compare. It is possible that has become much faster than mainline (I haven't used mainline for quite some time). I started testing with DeepSeek-Lite, and almost gave up (your IQ4_NL quantization took 302.5 seconds with imatrix). ik_llama.cpp does it in 54.5 seconds.

intriguing...

in any case, my impression (primed by it being explicitly said) that llama.cpp does not care about imatrix quants is reinforced. and something must have happened between ikawrakow and llama.cpp if he even avoids commenting on the llama.cpp issues (https://github.com/ikawrakow/ik_llama.cpp/discussions/288)

and something must have happened between ikawrakow and llama.cpp if he even avoids commenting on the llama.cpp issues (https://github.com/ikawrakow/ik_llama.cpp/discussions/288)

This is leaning toward the drama side of things, which I normally stay away from as it isn't productive, but I feel the desire to correct this, as it isn't true: https://github.com/ggml-org/llama.cpp/issues/10011 shows him helping mainline out well after he forked.

Someone did point blank ask ikawrakow what happened, and he answered https://github.com/ikawrakow/ik_llama.cpp/discussions/256

From my perspective it does just seem like there is a difference in visions that became incompatible. Especially now with how I see them, and what the roadmaps of both look like I'm glad both exist.

@tdh111 thanks for your links.

note that ikawrakow disagrees with the compilade patch actually improving Q3_K quants

Someone did point blank ask ikawrakow what happened, and he answered https://github.com/ikawrakow/ik_llama.cpp/discussions/256

He did not actually answer, in fact, he completely avoided answering that (at least in the issue you provided), which actually supports my suspicion - if everybody were happy, why not say that. I might be biased, having had a bad history with llama.cpp devs as well, but his behaviour, i think objectively matches my theory better than the "no bad blood at all" theory (neither of which is likely the true story).

It's clear, though, that ikawrakow is not interested in talking about it, and I think he shouldn't be prodded.

@nicoboss the backlog is essentially dead - "only" 10TB of static quants left, which will mostly run into the repo creation limit, causing big delays, but no other grief. And 180 70Bs, mostly caused by the sce craze, that I didn't imatrix-quant (or quant at all).

Update: thinking about it, I'll stop submitting to nico2 and stop using it once it's done with its models. No point force-hitting the limit.

It's clear, though, that ikawrakow is not interested in talking about it, and I think he shouldn't be prodded.

I agree completely, I've been curious about it but never asked for that reason.

note that ikawrakow disagrees with the compilade patch actually improving Q3_K quants

his data says otherwise, https://github.com/ikawrakow/ik_llama.cpp/pull/295 in that table he shows Q3_K both his version and compilade's have improvements.

his data says otherwise,

No, it doesn't, his data is for --pure, not the standard Q3_K quant formats that llama uses and everybody generates. --pure formats are not relevant in the real world and perform worse than much smaller mix quants. Same is true for compilade - he didn't test with real world quants apparently, assuming the results would be the same. According to ikawrakow, they aren't. Besides, even compilade agrees that the improvements are only for a single model family, so even if the table were for Q3_K quants, it would not show otherwise.

@nicoboss I've re-queued most of the static-only quants that in the meantime got a couple hundred downloads, which was surprisingly many... but still the minority.

@mradermacher Could you please restart the Pathos-Delta1-LLaMa-70B imatrix task? ArsParadox accidentally crashed it by starting his finetuning too early. I unfortunately still have no way to restart imatrix tasks on my own.

https://github.com/ggml-org/llama.cpp/pull/12634 just got merged. I'm so happy. I actually already used this model to create my own static Q4_K_M quants and ran inference of them for over 12 hours. Let’s update my mradermacher branch and then do all the BailingMoeForCausalLM models.

I updated the mradermacher llama.cpp fork. Please update all the local llama.cpp installations. I will be preparing the SOURCE GGUFs of the 290B models manually in the meantime.

updated llama.cpp. it also enables Qwen2_5_VLForConditionalGeneration and PLMForCausalLM. that'll be lots of models.

@nicoboss Qwen2_5_VLForConditionalGeneration is a vision model, yes? do you happen to know if it works with the same extraction process as Qwen2VLForConditionalGeneration ?

I will try to hold those models back.

llama.cpp quality is so frustrating. is it too much to ask to actually try out PLM model support on the actual PLM model before pushing it out?

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 291, got 290

I will be preparing the SOURCE GGUFs of the 290B models manually in the meantime.

Do they require a special process?

I've set the override file for Ling-plus-base on nico1, remove it once it is ready.

And the bailing support also hasn't apparently been tested with the actual bailing models. Ling-lite-base fails:

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 3072 (input tensor's size at dimension -2), but got split_sizes=[0, 0, 0]

I feel we should delay new model support by a week or so - it's like this every single time.

Try the large one. I put it under /tmp/quant/Ling-plus.gguf - sorry no softlink this time because bpool is full and I had no time to clean it up.

Actually there would now be enough storage on bpool if you want to move it over, but I see you already started it. Also, for the Ling-plus.gguf imatrix you will have to either use Q8 or RPC due to it being 545 GiB. If you need the RPC setup just let me know but Q8 is likely good enough.

Qwen2_5_VLForConditionalGeneration

After reading https://github.com/ggml-org/llama.cpp/issues/11483 I gave it a try, but indeed, qwen2_vl_surgery fails (ValueError: Trying to set a tensor of shape torch.Size([2048]) in "bias" (which has shape torch.Size([1280])), this looks incorrect.)

Not sure what a good way to proceed is.

imatrix you will have to either use Q8 or RPC due to it being 545 GiB. If you need the RPC setup just let me know but Q8 is likely good enough.

Well, didn't we say for "foundation" models we do full precision, if possible? But, yeah, I am fine with Q8_0 :=) But I always also want the base model to get the same treatment.

Try the large one.

If that refers to the bailing models, the large one was already converted by you. You mean I should convert it on my own? I can try that with the base model, unless you are already working on that. I don't think it's something in my set-up, though - I guess it's just another case of llama.cpp not even bothering to test with the original models.

his data says otherwise,

No, it doesn't, his data is for --pure, not the standard Q3_K quant formats that llama uses and everybody generates. --pure formats are not relevant in the real world and perform worse than much smaller mix quants. Same is true for compilade - he didn't test with real world quants apparently, assuming the results would be the same. According to ikawrakow, they aren't. Besides, even compilade agrees that the improvements are only for a single model family, so even if the table were for Q3_K quants, it would not show otherwise.

He merged in his PR which said this:

"In PR 12557 in mainline llama.cpp @compilade uses a (nearly) exhaustive search for optimality, whith correspondingly very long quantization times. One can arrive at about the same result much quicker as follows[...]

[see PR for actual math ]

Because of that, this kind of "first order" approximation is much faster than exhaustive search, as can be seen in the above table by comparing quantization run times between this PR and @compilade 's PR 12557, while achieving effectively the same quantization accuracy as measured by PPL.

Extending the above algorithm to the non-linear quants IQ4_XS and IQ4_NL is trivial.

". He considers it an improvement, and otherwise he would not have merged it in, when testing you use pure to test each quant at all tensors ( except token embeddings and output tensor which are always set to Q8_0 to prevent PPL collapse at low bpw). It is what you want to do if you are fundamentally altering the underlying math behind the quantization process.

https://github.com/ggml-org/llama.cpp/pull/12634 just got merged. I'm so happy. I actually already used this model to create my own static Q4_K_M quants and ran inference of them for over 12 hours. Let’s update my mradermacher branch and then do all the BailingMoeForCausalLM models.

@nicoboss

I see your comments on the PR, how would you rate Ling as a model (in comparison to others you've liked), I may want to run it myself.

@tdh111 I am just the messenger - your guessing at what he might or might not think simply cannot trump what he explicitly wrote:

So, when using --pure, it may appear that one gets an improvement because the new method being tested happens to do better on exactly these tensors, but worse on many others.

Yes, you might test on pure quants, but these quants are useless. What counts is the actual Q3_K* quants, which do not get improved by compilade's patch according to him, because it also affects other quant types.

Please, if you want to continue to argue this, argue it with him. Maybe he lied when he wrote that, maybe he changed his mind without saying so - I find it moot to discuss this - point being that I was right in what I reported, even if you clearly don't like it - but it's not my opinion, and I correctly reported what he wrote.

I would also like to split these discussions into another discussion topic, so we can keep this topic strictly related to the quanting business. I've created a new topic for this:

https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/5

Well, didn't we say for "foundation" models we do full precision, if possible? But, yeah, I am fine with Q8_0 :=) But I always also want the base model to get the same treatment.

True. Let's stick to that rule. Honestly, I would really prefer doing it using F16, but I felt bad bothering you with another complicated RPC setup. Let's do it in F16 using RPC, as the model is awesome and absolutely deserves it. It is also relatively tiny and so it is no problem to quickly run it over RPC. The experts are relatively small, so doing the imatrix in Q8 could have some negative impact.

If that refers to the bailing models, the large one was already converted by you. You mean I should convert it on my own?

Sorry for the confusion. I meant trying to quantize and imatrix it and see if the source GGUF I provided works or has the same issue as that base model which failed.

I can try that with the base model, unless you are already working on that. I don't think it's something in my set-up, though

I think the same. I will provide the large base model later today if it even converts which it probably will not.

I guess it's just another case of llama.cpp not even bothering to test with the original models.

I'm in fact the only reason Ling-plus worked as I tested it and gave feedback that it was broken before the merge request got merged. I had no idea he also didn't test the base models as he only mentioned Ling-plus being too large for him to test. If I had known, I would have tested all of them.

Bailing is one of the foundation models I really enjoy. I ran inference on it for quite a while, generating over 1 MB of text.

@mradermacher The RPC imatrix setup is ready to be used for Ling-plus. I have freed up a lot of memory and you will be able to keep quantization tasks running for sure, as the model only barely exceeds the single-node memory limit. I disabled the CastlePeak shutdown trigger so you don't accidentally turn it off while imatrix computations are running.

No idea why Ling-lite and Ling-Coder-lite are tokenizing the input for such a long time during imatrix computation - I hope they aren't stuck.

both ling models were stuck at tokenizing input, of all things. inspires confidence. I am sorry, I only saw it now. I don't think I will be able to do the required conversions etc. before I get up again :/ I will see though.

I'm in fact the only reason Ling-plus worked as I tested it

I saw that, yes:

I had no idea he also didn't test the base models

As usual, it looks as if they didn't test any of the actual models.

Ah, wait, yes, I don't need to convert, it's already there. I need the base model. If you manage to get it ready till early morning, and if I manage to get ling-plus going soon, it would work out perfectly.

Hmm, maybe all I need to do is update the job and everything is automatic... I'll update scheduling parameters so three quant jobs will run on nico1 (deepseek as bg. and two normal ones), as I hope there will be enough memory. I moved it to nice 2, so it should automatically start doing its thing after most of the imatrix queue is done. Good that everything is automated...

Hmm, I don't use irun for imatrix jobs. Sucks - which one is the right one to kill? (another ling-lite model clogged the queue)

Note to self: fix that tomorrow!

Uuh, and today is the first day I can watch nico2 do a scheduled shutdown.

Ah, right, good that I saw that, it's 17:00... hmmm...

Bailing is one of the foundation models I really enjoy.

I count that as a manual request for the model :)

@mradermacher The RPC imatrix setup is ready to be used for Ling-plus. I have freed up a lot of memory and you will be able to keep quantization tasks running for sure, as the model only barely exceeds the single-node memory limit. I disabled the CastlePeak shutdown trigger so you don't accidentally turn it off while imatrix computations are running.

Uuh, and today is the first day I can watch nico2 do a scheduled shutdown.
Ah, right, good that I saw that, it's 17:00... hmmm...

Oh no, I feel so bad that you stayed up so long just to see it happen, likely forgetting about my previous message.

I count that as a manual request for the model :)

I really want to do them. All the Bailing models are quite fantastic. They are massive and intelligent while being fast to run on the CPU due to being MoE. They have a very nice balance of being intelligent while still being able to run at a reasonable speed despite not fitting in GPU memory. 405B and R1 are just too slow for me to use unless I absolutely have to, and then I prepare all my prompts and run them overnight, while Ling-plus, with its 6 tokens per second, generates text faster than I read, making it just fast enough for me to be willing to use it in real time. I could even speed it up further by using speculative token generation and GPU offloading.
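
As a rough illustration of what that would look like with stock llama.cpp (a sketch only; the quant file names, the draft model and the -ngl value are placeholders):

# partial GPU offload: push as many layers onto the GPU as fit
./llama-cli -m Ling-plus.Q4_K_M.gguf -ngl 20 -p "Hello"

# speculative decoding: a small draft model proposes tokens, the big model verifies them
./llama-speculative -m Ling-plus.Q4_K_M.gguf -md Ling-lite.Q4_K_M.gguf -ngl 20 -p "Hello"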

Hmm, I don't use irun for imatrix jobs. Sucks - which one is the right one to kill? (another ling-lite model clogged the queue)

You can always just check nvtop to see the PID to kill.

Ah, wait, yes, I don't need to convert, it's already there. I need the base model. If you manage to get it ready till early morning, and if I manage to get ling-plus going soon, it would work out perfectly.

I will make it ready. It should be ready in around 2 hours.

both ling models were stuck at tokenizing input, of all things. inspires confidence. I am sorry, I only saw it now. I don't think I will be able to do the required conversions etc. before I get up again :/ I will see though.

They are aware of the issue and are currently working on a fix.

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 3072 (input tensor's size at dimension -2), but got split_sizes=[0, 0, 0]

They are aware of this and the errors reported by the users as well. Please follow https://github.com/ggml-org/llama.cpp/pull/12634 as things are moving along quite fast.

Oh no, I feel so bad that you stayed up so long just to see it happen, likely forgetting about my previous message.

I didn't stay up for that, no :)

All the Bailing models are quite fantastic.

Unfortunately, even the Ling-plus model entered an endless loop during tokenization. We had two reports of another ling* model crashing with "invalid regex", so probably some regex is broken, fails to match, and you get a 0-length token or so. Rinse, repeat (https://huggingface.co/mradermacher/Ling-lite-GGUF/discussions/2#67eaeffe7382053ae1241ee5)

On the other hand, you ran inference with the .gguf, and that seems to have worked, correct?

You can always just check nvtop to see the PID to kill.

Hmm.

nvtop: ./src/extract_gpuinfo_intel.c:228: parse_drm_fdinfo_intel: Assertion `!cache_entry_check && "We should not be processing a client id twice per update"' failed.

(that's at home). Nah. I just did it the old fashioned way with ps and grep...
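
For the record, the old-fashioned way amounts to something like this (a sketch; the binary name llama-imatrix is an assumption about what the process is called here):

# find the running imatrix process and the model it was started with
pgrep -af llama-imatrix

# then kill the offending one by PID
kill <pid>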

They are aware of this

No doubt they will eventually fix it. But they always release it broken. Anyway, everything is still queued. And sorry, it's just not how I release software - I am just exasperated.

Any idea what we should do with qwen 2.5 vision models? supposedly, vision extraction should work with most sizes, but not for me.

I guess log and only do the text for the time being?

@nicoboss I changed nico1 to single-quant job, because it will primarily do 70B's, and they should fit nicely into the arc cache.

Indeed, it basically does zero I/O, other than occasional writing of tensor data. Still, the cpu is often 80% idle when it shouldn't be waiting for I/O - is quantize this inefficient? I was under the impression that it has no trouble keeping cores busy as long as tensor data is in memory - do you have any idea? In the meantime I'll increase to two jobs again.

@nicoboss I changed nico1 to single-quant job, because it will primarily do 70B's, and they should fit nicely into the arc cache.

I assume you mean nico2. nico1 doesn't even use ZFS and so no ARC cache.

Indeed, it basically does zero I/O, other than occasional writing of tensor data. Still, the cpu is often 80% idle when it shouldn't be waiting for I/O - is quantize this inefficient? I was under the impression that it has no trouble keeping cores busy as long as tensor data is in memory - do you have any idea? In the meantime I'll increase to two jobs again.

Sorry, I yet again forgot to manually run my boot.sh this morning. I have no clue why cron doesn't execute it on boot, but without this script running we have almost zero ARC cache, explaining why you barely see any CPU utilisation.

/etc/crontab

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
@reboot /root/boot.sh

/root/boot.sh:

#!/bin/bash
cpupower frequency-info
cpupower frequency-set -u 3200000
cpupower frequency-info
echo 193273528320 > /sys/module/zfs/parameters/zfs_arc_max
numfmt --to iec --format "Set ZFS ARC to %3.2f" $(cat /sys/module/zfs/parameters/zfs_arc_max)
root@CastlePeak:~# stat boot.sh 
  File: boot.sh
  Size: 247             Blocks: 9          IO Block: 512    regular file
Device: 0,27    Inode: 290397      Links: 1
Access: (0777/-rwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-03-10 13:07:34.281358058 +0100
Modify: 2025-03-13 16:20:35.299606772 +0100
Change: 2025-03-13 16:20:46.101826699 +0100
 Birth: 2025-03-10 13:07:34.281358058 +0100

Not only does cron fail me but so does modprobe. Maybe I forgot to rebuild the initramfs, but I'm quite sure I did. It is so ridiculous that I have to manually execute boot.sh every day just because all automation fails me.

/etc/modprobe.d/zfs.conf

options zfs zfs_arc_max="2147483648"

Ah wait, I see. /etc/modprobe.d/zfs.conf has the wrong value. That explains why modprobe doesn't work. So actually the issue is just cron in that case.
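
Presumably I just have to put the same value boot.sh sets into the modprobe config, i.e. something like:

options zfs zfs_arc_max="193273528320"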

Interesting, so cron thinks something is wrong with the syntax of my crontab file:

root@CastlePeak:~# cat /var/log/syslog | grep -w 'cron'
2025-04-01T07:01:05.326155+02:00 CastlePeak systemd[1]: Started cron.service - Regular background program processing daemon.
2025-04-01T07:01:05.328109+02:00 CastlePeak cron[3195]: (CRON) INFO (pidfile fd = 3)
2025-04-01T07:01:05.329151+02:00 CastlePeak cron[3195]: Error: bad command; while reading /etc/crontab
2025-04-01T07:01:05.329181+02:00 CastlePeak cron[3195]: (*system*) ERROR (Syntax error, this crontab file will be ignored)
2025-04-01T07:01:05.330370+02:00 CastlePeak cron[3195]: (CRON) INFO (Running @reboot jobs)

Maybe I'm stupid. I don't see a syntax error. Here is the entire file:

root@CastlePeak:~# cat /etc/crontab
# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields,
# that none of the other crontabs do.

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed
17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly
25 6    * * *   root    test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.daily; }
47 6    * * 7   root    test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.weekly; }
52 6    1 * *   root    test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.monthly; }
#
@reboot /root/boot.sh

Nice how it gives you a line number. Not.

As for your problem, not sure, but a) does your crond support @reboot? and b) I don't think you can leave out the user

And a free admin pro-tip, don't patch /etc/crontab, use your own file in /etc/cron.d (or root's crontab)- that way, you won't have conflicts on upgrades, and you can name your files descriptively, e.g. /etc/cron.d/mradermacher
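
For example, a hypothetical /etc/cron.d/mradermacher could look like this - note the extra user field that system crontabs and cron.d files need:

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
@reboot root /root/boot.sh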

PPS: /etc/rc.local, on debian, is likely the safer way to do that, altogether

assume you mean nico2.

Yup.

we have almost zero ARC cache, explaining why you barely see any CPU utilisation.

Wait, what? If you don't have a big arc cache, ZFS simply idles, with no I/O and no CPU? In what world does that make sense? Or is it just that I can't see it.

I'll reduce the job number again and see if anything has changed.

a) does your crond support @reboot

Yes it does

b) I don't think you can leave out the user

Ah you are right:

Jobs in /etc/cron.d/

       The jobs in cron.d and /etc/crontab are system jobs, which are
       used usually for more than one user, thus, additionally the
       username is needed.  MAILTO on the first line is optional.

Somehow, with two jobs, it's now 90% idle for 10+ second stretches.

(Maybe because it's two static + imatrix jobs now.)

@reboot is funny

Wait, what? If you don't have a big arc cache, ZFS simply idles, with no I/O and no CPU? In what world does that make sense? Or is it just that I can't see it.

I think you just can't see it with whatever tool you use. I can see it on the Proxmox web interface under IO delay. For some reason IO delay (the time the CPU spends waiting for IO) increases but the IO bandwidth stays almost the same.

Somehow, with two jobs, it's now 90% idle for 10+ second stretches.

It takes some time for ARC cache to fill up with the data you need.

No, there doesn't seem to be a change. It's currently uploading and doing more I/O, but it seems llama.cpp really can't keep the cpu busy even when I/O is essentially free. I'll use two jobs. That sucks, because otherwise, one job would be ideal at the moment.

Anyway, thanks for your input again, signing off

I just realized that I did manually run boot.sh late morning today. So it should have worked. So the zero IO you were seeing was real. It really seems like a llama.cpp issue of still being inefficient even if all the data it needs is in RAM.

Wow I got mentioned XD

It really seems like a llama.cpp issue of still being inefficient even if all the data it needs is in RAM.

Yeah, so my mental model of "it loads the tensor, runs x threads in parallel on uniform data with perfect sharing, cleaning up + saving" is wrong. But maybe there is a good reason for that, so no criticism here :)

@reboot sorry, I thought I had looked and determined that you don't exist. I was wrong.

I've replaced the standard rsh-client with rsh-redone-client (which is differently buggy, and I normally avoid it, but I don't use it interactively on kaos, so I don't care). In the long run, we'll have to have some kind of llmjob server. But maybe that one doesn't hang so often. (the statusd was hanging waiting for it once more)

@mradermacher Please update to latest llama.cpp on our fork and restart all failed Bailing imatrix tasks. https://github.com/ggml-org/llama.cpp/pull/12677 got merged which finally fixes all the remaining Bailing issues.

Please make sure to use RPC when restarting Ling-plus. All RPC servers are updated to the latest llama.cpp and ready. Maybe we should also unblock Ling-plus-base and set up its imatrix task to use RPC as well.

I also moved Ling-lite-base from marco to nico1 and manually provided the GGUF as https://huggingface.co/inclusionAI/Ling-lite-base/discussions/2 is not yet merged.

HuggingFace implemented a really cool feature that shows you exactly which quants are compatible with your hardware for every single quant we uploaded as a single file:

grafik.png

grafik.png

I see that Ling-lite-base imatrix computation is currently blocked, which is a good thing as you have not yet updated to our latest llama.cpp version, but please make sure to unblock it once you have done so.

I think we should change timeofday end time to 18:00 as due to the switch to summertime last weekend it is still daytime and sunny outside despite being 17:30.

I'll take care of llama/ling when I am mentally fit enough again.

Ling-lite-base

Very cool of you!

HuggingFace implemented a really cool feature that shows you exactly which quants are compatible with your hardware for every single quant we uploaded as a single file:

All quants are "compatible". And if the question is "will fit", then the feature is useless, imho, because it doesn't take any parameters into account. If all it does is compare quant sizes with some fixed vram limit, I don't see the point.

I think we should change timeofday end time to 18:00

I will (and hopefully not forget nico2 once it is available again. Any ETA on it? I was kind of depending on it...)

I will (and hopefully not forget nico2 once it is available again. Any ETA on it? I was kind of depending on it...)

nico2 was running almost the entire day until you turned it off at 17:00. It just started a bit later than usual as I tried enabling kernel crash dumps which caused it to get stuck during boot and it took a few hours for me to notice. But it was running late morning and the entire afternoon.

Actually just set timeofday end-time to 19:00 as it in fact is still sunny outside - summer time is so strange.

I turned on CastlePeak and started the RPC server there as it got turned off together with CastlePeak at 17:00 but we need it for Ling-plus

nico2 was running almost the entire day until you turned it off at 17:00.

That is good to hear (and no, I don't turn it off myself :)

as it got turned off together with CastlePeak at 17:00 but we need it for Ling-plus

It was a valiant attempt :)

Ling-plus

Failed due to NaNs showing up during imatrix training. Happened a lot in the past, until llama.cpp fixed things; now it is a rare thing. According to the llama.cpp devs, this makes ling-plus either a completely useless model incapable of inferencing, or us liars and deceivers. In case you are now going "what the fuck" - that was before your time, when I dared to report these cases to llama.cpp - I was then publicly accused of tinkering with the model, the code, or the results, or of having "wrong" imatrix training data, or not enough (which obviously is bollocks, because shorter material increases the chances of success) - and was ordered(!!) in no uncertain terms not to distribute such quants. It caused a major ruckus in the social sphere (e.g. reddit, 4chan) where mradermacher was accused of somehow producing broken quants on purpose, backed by the derogatory llama.cpp developer comments. Never received an apology when it turned out to be llama.cpp bugs.

That's the source of my bitterness towards llama.cpp.

And it just all came back, showing how much more it affected me than I thought. At least I didn't throw the towel right then and there, but it was close.

This all sucks.

Don't feel like you have to comment on the above.

Anyway, in the dark past, I used a reduced imatrix training set, because the problems only appeared after 94 chunks (that's the reason I had half and quarter training sets), but I guess the thing to do is to not have imatrix quants of ling-plus at the moment, because clearly something is suboptimal in the source model.

Actually just set timeofday end-time to 19:00 as it in fact is still sunny outside - summer time is so strange.

Well, how about doing something like sunrise + x to sunset - y. Can't be much more complicated than a sine function or so. Maybe with some hard limits in winter. If you think it's an OK idea, I will look into it, because it sounds interesting for a change.
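
A minimal sketch of what I mean (not actual scheduler code - the base time, amplitude and hard limits below are made-up placeholders):

import math, datetime

def timeofday_end(date, base=17.5, amplitude=1.5, earliest=17.0, latest=19.0):
    # crude seasonal curve: peaks near the summer solstice, dips in winter
    phase = 2 * math.pi * (date.timetuple().tm_yday - 80) / 365.25
    end = base + amplitude * math.sin(phase)
    return min(max(end, earliest), latest)  # hard limits, e.g. for winter

print(timeofday_end(datetime.date(2025, 4, 1)))  # ~17.8 in early April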

that's the reason I had half and quarter training sets

Right, that was also the reason behind adding an abort() when ppl becomes a NaN, and having regular autosaves, which I recently disabled. That's how rare that case had become.

I have started the rpc imatrix process for ling-plus-base. It is not uncommon that the NaN problem appears only for the instruct model. Hope I didn't do anything wrong - but I assumed if rpc is working it can be used.

And sorry in general, have too little time for detail work.

I have started the rpc imatrix process for ling-plus-base. It is not uncommon that the NaN problem appears only for the instruct model. Hope I didn't do anything wrong - but I assumed if rpc is working it can be used.

Yes I have not yet increased the ARC cache on CastlePeak and all RPC servers are still running so everything is still ready for RPC.

The big question is why it is so slow (12h). Shouldn't it be way faster? (with luck it will crash after 94 chunks anyway :)

Ling-plus-base failed at the same block, so probably it's actually related to the specific imatrix chunk. I suggest we do static only then :(

We could consider re-adding the static IQ3* quants? Maybe they are fixed...

I will try some other imatrix datasets or make llama.cpp skip NaN chunks. Seems like terrible design that it refuses to compute the imatrix just because of some random NaN.

As a start we should try applying https://github.com/ggml-org/llama.cpp/pull/11773 - if it works, I consider merging this into our fork as it will fix many remaining imatrix NaN cases. NaNs are always an issue on llama.cpp's side - not properly handling certain edge cases - and never us or our imatrix dataset. In the same PR ikawrakow explained how NaNs are caused by llama.cpp's hacky implementation.

I will try some other imatrix datasets or make llama.cpp skip NaN chunks.

That can't work - once NaN, forever NaN. The only way is to not have it run into NaNs in the first place. Again, this is why I did an autosave every 10 chunks, but I don't think we can do that with the patched version, we'd have to fix it first.
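
(Illustration only - the imatrix is accumulated as running sums, and a single NaN chunk poisons them for good:)

acc = 0.0
for chunk in [1.0, float("nan"), 2.0, 3.0]:  # one bad chunk in the stream
    acc += chunk
print(acc)  # nan - later, perfectly fine chunks cannot repair the sum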

But since this is almost certainly a model defect, I think it's probably not worth it.

llama.cpp's side - not properly handling certain edge cases - and never us or our imatrix dataset.

Or, very commonly, broken weights in models. Which likely is the case here.

"I am not interested in maintaining this code."

We know. (slaren)

https://github.com/ggml-org/llama.cpp/pull/11773

I wonder why it isn't applied. But that code isn't executed during imatrix training, so I give a zero^W1e-5 chance of helping?

We can probably search for models that actually failed this way with llmc why, though. I am sure we have a few (but not many).

I am also glad you are queuing models again, although there seems to be a perceived 100% failure rate (I only really see the failures :-)

What would be interesting, and I wonder if you are already doing this, effectively, is to see if we have any non-imatrix models that are popular enough (for some metric) to warrant an imatrix?

@RichardErkhov @nicoboss

We are through the queue. Most of the remaining models in the queue are held back qwen 2.5 vl models where a transformers release is likely needed.

That means rich1 will essentially become idle regularly, and we will have to find a different mode of use - we still depend on rich1, but we need a better way of sharing than "mradermacher takes over".

For example, once done I could reduce the number of jobs to one, and thus guarantee at least half the box's memory will be available. Or something else. Discussion appreciated.

PS: yes, I still want to redo classical pre-2024 models, and I have an idea on how to find broken models and eradicate/replace them, but neither of these will create the massive queue we had for the last half a year.

PPS: and the next thing will be moving nico2 to some other mode, such as permanently off, or on only on (manual) demand

PPPS: wow, this feels good. I wish the queue wasn't clogged with the 2.5 models and was completely empty, just for the psychological effect - couldn't have done it without you :)

nice, well I guess we should prioritise quanting on rich1 whenever we get new models, and then, if the queue becomes too big to handle, spread to other servers. Because nico is paying for the electricity, and I am paying a fixed price, so might as well save nico some money and utilize my already spent money haha. Well, I will continue with my queue, but it obviously can't keep the CPU busy 24/7 just because of my amazing code and because my queue is sorted by size lol, well it just needs a bit of fixing and everything will be fine. Well, that was a nice half a year of 110% cpu load, I guess we should queue 1990-2024 models lmao

or we can imatrix all of the ggufs lmao, just literally abuse nico gpus for a month to imat the whole 20k+ models and store them, and just make rich1 process all of them

We are through the queue.

What a moment to celebrate! Thank you everyone for your massive effort! We all did such an amazing job!

Most of the remaining models in the queue are held back qwen 2.5 vl models where a transformers release is likely needed.

Qwen2.5 VL support for llama.cpp is basically ready and just waiting for review: https://github.com/ggml-org/llama.cpp/pull/12402 - hopefully the llama.cpp team didn't miss that it is ready for review because I did.

PS: yes, I still want to redo classical pre-2024 models, and I have an idea on how to find broken models and eradicate/replace them, but neither of these will create the massive queue we had for the last half a year.

We could dry-run them and see if they still work. But I would say if they are of big historical significance, requantizing might make sense even if they are not broken, just for the sake of having better quality quants.

PPS: and the next thing will be moving nico2 to some other mode, such as permanently off, or on only on (manual) demand

I think we could just make it automatically turn on whenever we have high-priority models to work on.

PPPS: wow, this feels good.

It indeed does. It is so relieving that after over 5 months the queue is finally empty again! Having such a massive backlog was quite overwhelming and I thought many times we might never get them all done.

couldn't have done it without you :)

We couldn't have achieved this without you who put so much work and effort into finding all these awesome models and maintaining this surprisingly complex infrastructure.

well said, @nicoboss , thanks to everyone guys !

the queue is going down in front of me. half an hour ago it was 119
81 additional job(s) in wait queue (total estimated size 7.022TB, imatrix 6.809TB, lownice 79)

I guess we are going to cook everything by tomorrow morning

@mradermacher fun question: what is the size of mradermacher?

We could dry-run them and see if they still work.

That is exactly what I plan to do - write the header-downloader to recreate "stub" models, then dryrun them. The downloader I want anyway for other reasons. Just when is the question :)
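
A rough sketch of just the download part (repo and file name are placeholders, and the 16 MiB is a guess - the real thing would parse the header incrementally to know how much is actually needed, and then fake the tensor data behind it):

import requests

def fetch_gguf_header(repo, filename, nbytes=16 * 1024 * 1024):
    # grab only the first nbytes of the file via an HTTP range request
    url = f"https://huggingface.co/{repo}/resolve/main/{filename}"
    r = requests.get(url, headers={"Range": f"bytes=0-{nbytes - 1}"}, timeout=60)
    r.raise_for_status()
    assert r.content[:4] == b"GGUF", "not a GGUF file"
    return r.content

open("stub.gguf", "wb").write(
    fetch_gguf_header("some-user/Some-Model-GGUF", "Some-Model.Q4_K_M.gguf"))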

1990-

Yeah :)

I think we could just make it automatically turn on whenever we have high-priority models to work on.

Right, except the "have high-priority models to work on" is a bit difficult to formulate. In any case, we'll try with manual queueing for a while, and see how much we will need it.

the queue is going down in front of me. half an hour ago it was 119

Quite a few of the queued models are (invisibly) locked due to being qwen 2.5, and the 7Bs specifically need the next transformers release, apparently.

fun question: what is the size of mradermacher?

Measured in TB, it's 5077.057 (our count, can't find where hf shows it anymore).

As for saving power, essentially we already have a preference system - rain is always preferred over nico1 for models it can support, for example. I suggest we change things so that nico1 only does jobs with nice level < -1400 during the night (right now, it's <= 50). That would exclude daily models. Not sure rich1 can do the work, though, it's considerably slower than nico1, nico2 or marco (in practice, it is probably faster than marco because it has a faster disk). But we can surely try.

Just pushing high priority models to other nodes first will not do, as we will always have bursts with more models than nodes, so nico1 would always get some.

But yeah, my point is, we can surely reduce the load on nico1, e.g. at night, and probably not use nico2 almost all of the time.

This can also be modified, e.g. I could make two "daily" groups of models, let's call them daily junk and daily interesting, e.g. at -1000 and 0. I'll play around with it the next few weeks. See what works.
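
A minimal sketch of the night rule I mean (the function name and the night window are just assumptions for illustration; the thresholds are the ones mentioned above):

def nico1_accepts(nice, hour, night_start=22, night_end=7):
    night = hour >= night_start or hour < night_end
    return nice < -1400 if night else nice <= 50

print(nico1_accepts(-1500, 23))  # True  - high priority still runs at night
print(nico1_accepts(0, 23))      # False - daily models wait for daytime
print(nico1_accepts(0, 12))      # True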

I mean from now on we won't get as many models with high priority, so we don't need asap quants, so the normal queue can just stay on rich1, and anything high priority can go to another server. That way we keep rich1 loaded at all times and nico with more money =)

For some reason I expected more than 5077 TB lol

@mradermacher Text-only llama 4 support got merged: https://github.com/ggml-org/llama.cpp/pull/12791. I updated our llama.cpp fork. Please update the workers. I already have the models locally and plan on queuing them tomorrow to nico1 using manually provided GGUFs.

Llama 4 is affected by https://github.com/ggml-org/llama.cpp/pull/12727 as well. I wonder if we, like bartowski, should risk using this change despite it not being merged yet. It results in much better quants for all MoE models with an expert count other than 8. I think we should, as we will have to redo Llama 4 anyway once its missing features are implemented.

I will now start preparing the Llama 4 source GGUFs. Latest llama.cpp now requires libcurl4-openssl-dev to build so in case it fails to build you know what to install.

I mean from now on we won't get as many models with high priority,

Why would we get fewer high priority models?

I wonder if we, like bartowski, should risk using this change despite it not being merged yet.

I think we should wait till it's merged and the kinks are gone in a "released" version. Why? Because every single time in the past, we had to redo the quants anyway. Different this time? Sure, maybe.

We have to redo them anyway for sure no matter what we do. The current state of Llama 4 support is quite bad. The idea was to go for slightly better quants than what is currently on master so we at least don't have to redo them again until things are stable but I'm fine with using master.

Different this time? Sure, maybe.

No, and initial support has maybe never been as bad as for Llama 4. So many features are missing in the current implementation. Thing is that this model is insanely popular and everyone wants to try it so just waiting a few months for everything to be implemented is not really an option. I also really hate that every time they implement one of the many missing Llama 4 features, users will demand a requant.

Let's wait at least a day, till hopefully somebody had time to review 12727

Sounds good?

So many features are missing in the current implementation.

I am not talking about features, I am talking about the model being outright broken every single time, and being fixed within a week.

Thing is that this model is insanely popular and everyone wants to try it

Well, here we disagree :) I want to deliver good work, based on official llama.cpp releases. I don't want to chase popular models and provide the experimental testbed for them, don't want to invent experimental quants etc. etc.

I am not concerned about missing features. I am not terribly concerned with requants. I am concerned about the quants being outright broken, just as it was the case with llama 3, 3.1, ... and practically every new non-llama architecture.

I also feel bad about not having an obvious improvement like 12727, but I feel terrible about it not even having been reviewed.

I'll let you decide then, to take the risk or not - say the word. But we should at least wait on a resolution of 12727, IMHO.

I just checked and nobody uploaded a GGUF for Llama-4-Maverick-17B-128E-Instruct so far despite the really high demand so let's just do it without 12727 using the latest version of our fork. I'm fine with waiting a day but I see it as unlikely that 12727 will get reviewed in the next day, as the last change was 3 days ago and nobody is assigned as reviewer yet. Llama-4-Maverick-17B-128E-Instruct.gguf will be ready in less than 2 hours.

Slightly related only: since rpc parameters are somewhat standardized by now, I could provide a llmc command to "reconfigure" an imatrix job into a "big rpc imatrix job", and maybe also add some more imatrix management commands. Of course, the timing is a bit bad, but we can discuss - I love making you more independent.

I just checked and nobody uploaded a GGUF for Llama-4-Maverick-17B-128E-Instruct so far despite the really high demand

Maybe some people are more reasonable... Anyway, preparing the llama build now. If you want to do it without 12727 there is no point in waiting for it. Although the likely result might be publishing the first and worst quant...

I will now start preparing the Llama 4 source GGUFs. Latest llama.cpp now requires libcurl4-openssl-dev to build so in case it fails to build you know what to install.

It's not installed on my build machine, and the build succeeded - are you sure it needs openssl specifically? debian defaults to gnutls.

A command to restart failed imatrix tasks would be super useful. Reconfiguring an imatrix job to use RPC would be awesome as well. Maybe a way for me to trigger an update to the latest llama.cpp or have it auto-update would be great too, as currently I always have to ask for it, and with all the Llama 4 changes this will likely require us to update way more often than usual.

I have not forgotten about adding the llama.cpp version metadata. I'm just really busy at the moment and so had no time for it so far, but it seems like we can use the git tags for consistent versioning as they are synced over from llama.cpp upstream.

It's not installed on my build machine, and the build succeeded - are you sure it needs openssl specifically? debian defaults to gnutls.

No, every libcurl-dev will work. I just went with openssl as I was not aware that gnutls is the default.

llama.cpp has been updated, but I didn't test it, for lack of time. best see if new jobs work, then feel free to queue.

Llama-4-Maverick-17B-128E-Instruct is now being quantized. We will need to use the imatrix setup for Llama-4-Maverick-17B-128E-Instruct. Please restart the MM-Thinker-72B imatrix task if you have time. Llama-3_1-Nemotron-Ultra-253B-v1 has quite a strange failure we will need to look into.

Let's summarize important open llama.cpp pull requests so we can better keep track of them:

Support Qwen3 and Qwen3MoE : https://github.com/ggml-org/llama.cpp/pull/12828
DeepSeek V2/V3 MLA implementation: https://github.com/ggml-org/llama.cpp/pull/12801
Update llama-quant.cpp llama_tensor_get_type with DeepSeek friendly modifications: https://github.com/ggml-org/llama.cpp/pull/12727
Add Qwen2.5VL support: https://github.com/ggml-org/llama.cpp/pull/12402

18634 Segmentation fault

[MN-Thinker-72B] Hmm.. what would cause a segfault... other than, say, starting the imatrix process in the split second where rsync does the renames (or a bug in llama.cpp).

hmm, there has been a transformers release, I will test if that fixes the qwen2.5vl problems (text-only).

Llama-3_1-Nemotron-Ultra-253B-v1

Well, I suspect it's a slightly tweaked architecture and/or simply broken or missing support for it in llama.cpp. Cheap to retry, maybe the transformers update helped. Or will help, once I have verified it happened.

wait, why does pip3 give me transformers 4.46.3, which is really old.

aha... so llama.cpp is fine with newer transformers versions, kind of:

/llmjob/llama.cpp/requirements/requirements-convert_legacy_llama.txt:transformers>=4.45.1,<5.0.0

However, when I pip3 install -U -r /llmjob/llama.cpp/requirements.txt, then it downgrades to transformers 4.46 (and uninstalls 4.51), so maybe some other dependency force-downgrades it.

But I don't understand how pip3 works - do dependencies only matter at the time of install, and are afterwards ignored? Because when I pip3 install -U transformers afterwards, it happily upgrades to 4.51 again. It also looks weird when I -r requirements.txt, as if it starts with 4.51, and then it searches down in versions till it finds 4.46 (total python noob here btw., don't assume I understand what I am doing):

Requirement already satisfied: transformers<5.0.0,>=4.45.1 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 3)) (4.46.3)
Collecting transformers<5.0.0,>=4.45.1
  Using cached transformers-4.51.1-py3-none-any.whl (10.4 MB)
Requirement already satisfied: gguf>=0.1.0 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 4)) (0.14.0)
Requirement already satisfied: protobuf<5.0.0,>=4.21.0 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 5)) (4.25.6)
Requirement already satisfied: torch~=2.2.1 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-convert_hf_to_gguf.txt (line 3)) (2.2.2+cpu)
Requirement already satisfied: aiohttp~=3.9.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 1)) (3.9.5)
Requirement already satisfied: pytest~=8.3.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 2)) (8.3.5)
Requirement already satisfied: huggingface_hub~=0.23.2 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 3)) (0.23.5)
Requirement already satisfied: matplotlib~=3.10.0 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 4)) (3.10.1)
Requirement already satisfied: openai~=1.55.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 6)) (1.55.3)
Requirement already satisfied: pandas~=2.2.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 7)) (2.2.3)
Requirement already satisfied: prometheus-client~=0.20.0 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 8)) (0.20.0)
Requirement already satisfied: requests~=2.32.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 9)) (2.32.3)
Requirement already satisfied: wget~=3.2 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 10)) (3.2)
Requirement already satisfied: typer~=0.15.1 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 11)) (0.15.2)
Requirement already satisfied: seaborn~=0.13.2 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 12)) (0.13.2)
Requirement already satisfied: filelock in /llmjob/share/python/lib/python3.11/site-packages (from transformers<5.0.0,>=4.45.1->-r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 3)) (3.17.0)
  Using cached transformers-4.51.0-py3-none-any.whl (10.4 MB)
  Using cached transformers-4.50.3-py3-none-any.whl (10.2 MB)
  Using cached transformers-4.50.2-py3-none-any.whl (10.2 MB)
  Using cached transformers-4.50.1-py3-none-any.whl (10.2 MB)
  Using cached transformers-4.50.0-py3-none-any.whl (10.2 MB)
  Using cached transformers-4.49.0-py3-none-any.whl (10.0 MB)
  Using cached transformers-4.48.3-py3-none-any.whl (9.7 MB)
  Using cached transformers-4.48.2-py3-none-any.whl (9.7 MB)
  Using cached transformers-4.48.1-py3-none-any.whl (9.7 MB)
  Using cached transformers-4.48.0-py3-none-any.whl (9.7 MB)
  Using cached transformers-4.47.1-py3-none-any.whl (10.1 MB)
  Using cached transformers-4.47.0-py3-none-any.whl (10.1 MB)
Requirement already satisfied: packaging>=20.0 in /llmjob/share/python/lib/python3.11/site-packages (from transformers<5.0.0,>=4.45.1->-r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 3)) (24.2)

We will need to use the imatrix setup for Llama-4-Maverick-17B-128E-Instruct.

If you mean rpc setup, I configured the imatrix job accordingly (with max. one hfd and one quant job on nico1), but it's currently in override.

PS: untested, you might be able to use llmc shell kaos and rm /tmp/Llama-4-Maverick-17B-128E-Instruct.soverride to remove the override, followed by llmc push, to enable it. Likewise this can be used to override imatrix jobs. That works because on kaos, the global /tmp is shared to llmc shell (there are no secrets there in /tmp, presumably :)

PPS: false alarm, /tmp of course is +t, so it won't work. will have to move it elsewhere. but that should be easy, just not right now.

Please start the Llama-4-Maverick-17B-128E-Instruct imatrix task. The RPC imatrix setup is ready!

PS: untested, you might be able to use llmc shell kaos and rm /tmp/Llama-4-Maverick-17B-128E-Instruct.soverride to remove the override, followed by llmc push, to enable it. Likewise this can be used to override imatrix jobs. That works because on kaos, the global /tmp is shared to llmc shell (there are no secrets there in /tmp, presumably :)
PPS: false alarm, /tmp of course is +t, so it won't work. will have to move it elsewhere. but that should be easy, just not right now.

As you predicted it doesn't work due to wrong permissions and the Sticky-Bit. It currently has 600 permissions so I can't do anything with it - not even read or write. For others that have 666 permissions I had at least some fun but even those I can't delete due to the Sticky-Bit. The 666 ones I was able to hardlink just to figure out I can't delete the hardlinked copy due to the Sticky-Bit - please delete addtxt_.txt which I created while messing around. Your jail terminal shows all files owned by 65534 no matter who created them so file permissions got a bit confusing. In any case looks secure from the few minutes I messed around with it. Whoever created that jail had the smart idea to read-only mount almost everything heavily limiting any potential attack surface.

Some quite awesome llama.cpp pull requests for convert_hf_to_gguf.py got created today:
convert : ability to lazy-load safetensors remotely without downloading to disk: https://github.com/ggml-org/llama.cpp/pull/12820
convert : write tensors in parallel: https://github.com/ggml-org/llama.cpp/pull/12837

#12820 allows us to convert SafeTensors to source GGUF without even having to store the actual files and without having to download models that immediately fail the conversion.
#12837 allows us to use multiple threads in convert_hf_to_gguf.py, making it much faster

Yesterday night I finished the perf measurement project, freeing up 4 TB of SSD storage. That storage I have now mounted to your container under /dpool. I'm currently creating the Llama-4-Maverick-17B-128E source GGUF on it. It should be done in around 3 hours. Please whitelist dpool so we can softlink to it.

#12820 allows us to convert SafeTensors to source GGUF without even having to store the actual files and without having to download models that immediately fail the conversion.

That is potentially useful, yet... with the amount of download errors and retries, I wonder if it would be a net-win.

#12837 allows us to use multiple threads in convert_hf_to_gguf.py, making it much faster

You mean likely much slower? Why would multiple threads make it faster, when it's already I/O bound everywhere? It would likely make it slower almost everywhere if it used multiple threads.

The obvious optimisation for convert_hf_to_gguf.py would be to not make an extra copy of all tensors on disk.

Please whitelist dpool so we can softlink to it.

done! What are the usage guidelines for that, and... how fast is reading speed from there? :)

In any case looks secure from the few minutes I messed around with it.

Makes me very happy to hear :)

I started with umask 0 for all hf-related things, because I have very few "foreign" users on my systems, and thought it might come in handy eventually. It kind of does, but using /tmp kind of destroyed that. In any case, I will likely provide imatrix-related llmc commands, that seems easier and safer. The llmc shell can also get some obvious improvements (I work with it every day during "audit"), but right now, I have very little time. Hopefully that changes soon-ish.

Also, since, at least between us two, you are the god of python, I am really at a loss as to why pip always downgrades transformers, and that's my most pressing problem. We really need a more up-to-date version for a lot of architectures. Not even pip3 -vvv tells you anything about why it chooses or downloads certain versions, only that it does.

I could force it by reinstalling transformers again after everything is installed, but there must be a reason why it downgrades (but then, why does it upgrade it when installed alone)

Hmm, I was offline an hour and now lots of imatrix jobs are missing, including the llama 4 maverick one. What horrible error cascade/explosion did I miss?

Hmm, I was offline an hour and now lots of imatrix jobs are missing, including the llama 4 maverick one. What horrible error cascade/explosion did I miss?

No idea. It already was like this when I woke up over an hour ago. I was so surprised to see llama 4 maverick gone without doing its RPC imatrix task - it in fact never even connected to the RPC servers. The imatrix RPC setup is still ready so please run it once things are fixed and the imatrix tasks for other models are done.

I turned on the nico2 LXC container but left it disabled for now in case you want to start the RPC imatrix task. If you don't want to do RPC now you can let nico2 work quantization tasks in the meantime. I disabled the nico2 shutdown handler so it will not get turned off by accident while we do imatrix RPC computation.

Edit: I enabled nico2 again so we make use of it in the meantime as it is great weather.

Oh great so llama.cpp fixed Llama-3_1-Nemotron-Ultra-253B-v1 support in: https://github.com/ggml-org/llama.cpp/pull/12843 - cool we still kept the model downloaded. I will give it a try and if it works, we either wait until this PR is merged or merge it into our own branch.

Edit: We will have to wait for the merge as the model will not load using mainline llama.cpp and so it would be unusable for normal users if we do it now.

done! What are the usage guidelines for that, and... how fast is reading speed from there? :)

Usage guidance for ZFS based file volumes is always that you can fill them as much as you want, as I can easily set a storage limit. I recommend to always keep maybe 1 GB free, or performance will get terrible.

dpool uses a 4000G Kingston FURY Renegade PCIe 4.0 NVMe M.2 SSD. It should have 7300MB/s read- and 7000MB/s write speed and 1000000 IOPS. It however currently uses ZFS which will likely be the limiting factor. I might BTRFS format it in the future.

Doing it just before Llama 4 llama.cpp support was great timing.

back and started.

the weird thing is that not only are the jobs gone (stored on kaos), but for two jobs which still existed (gemma), the .gguf file was not in /tmp, despite the transfer having been successful.

as for logs, the last log for the maverick job was yesterday 14:55:41, when the scheduler found there is not enough budget, meaning the job has not been touched since then.

very strange.

/dpool

Wow, that sounds fast. But indeed, it's hardly faster than /bpool at the moment.

And indeed, a lot of imatrix jobs have either been lost, or somehow their gguf file in /tmp was lost. Two hard to connect failure modes.

Oh great so llama.cpp fixed Llama-3_1-Nemotron-Ultra-253B-v1 support in

Looking at that issue, I must say your use of the word "fixed" is refreshingly optimistic. But at least somebody cares, that's very good :)

@mradermacher The imatrix RPC task OOM failed. So computing the imatrix of this model must be really tight even with RPC. It tried to allocate 32 GB of GPU memory on a 24 GB GPU so it wasn't even close. I now switched to the max memory imatrix RPC setup during which StormPeak is almost unusable. Please restart the Llama-4-Maverick-17B-128E-Instruct imatrix task.

Well, on the positive side, despite numerous attempts to oom-crash your nodes, it seems to be quite resilient by now. I've set temporary hfd and quant limits to 0 on nico1.

on a related note, recently, imatrix jobs have started to crash like this:

CUDA error: an illegal memory access was encountered

it's very noticeable, because this has never happened before, and now it affects about a third of the models. I suspect a bug in current llama, or some config change - maybe we should go back to an older llama version?

Could be also the models, but it's too common and appeared too sudden for this to be likely.

compute_imatrix: computing over 314 chunks with batch_size 512
/llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:2480
  cudaStreamSynchronize(cuda_ctx->stream())
[New LWP 3765900]
[New LWP 3765901]
[New LWP 3765902]
[New LWP 3765903]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007c1c69012c17 in __GI___wait4 (pid=3766092, stat_loc=0x7fff5c24162c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007c1c69012c17 in __GI___wait4 (pid=3766092, stat_loc=0x7fff5c24162c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007c1c695ef0b4 in ggml_print_backtrace () at /llmjob/llama.cpp-cuda512/ggml/src/ggml.c:156
156             waitpid(pid, &wstatus, 0);
#2  ggml_abort (file=0x7c1c620e0f40 "/llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu", line=75, fmt=0x7c1c62106e88 "CUDA error") at /llmjob/llama.cpp-cuda512/ggml/src/ggml.c:183
183         ggml_print_backtrace();
#3  0x00007c1c61e8d033 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /llmjob/llama.cpp/build/bin/libggml-cuda.so
#4  0x00007c1c61e8e59a in ggml_backend_cuda_synchronize(ggml_backend*) () from /llmjob/llama.cpp/build/bin/libggml-cuda.so
#5  0x00007c1c6960434c in ggml_backend_sched_compute_splits (sched=0x60e771a7ca00) at /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp:1427
1427    /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp: No such file or directory.
#6  ggml_backend_sched_graph_compute_async (sched=0x60e771a7ca00, graph=<optimized out>) at /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp:1590
1590    in /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp
#7  0x00007c1c697226d9 in llama_context::graph_compute (this=this@entry=0x60e771b9a500, gf=gf@entry=0x7c1c226fb030, batched=<optimized out>) at /usr/include/c++/12/bits/unique_ptr.h:191
191           pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#8  0x00007c1c69725522 in llama_context::decode (this=0x60e771b9a500, inp_batch=...) at /llmjob/llama.cpp-cuda512/src/llama-context.cpp:1329
1329    /llmjob/llama.cpp-cuda512/src/llama-context.cpp: No such file or directory.
#9  0x00007c1c697267ab in llama_decode (ctx=<optimized out>, batch=...) at /llmjob/llama.cpp-cuda512/src/llama-context.cpp:2792
2792    in /llmjob/llama.cpp-cuda512/src/llama-context.cpp
#10 0x000060e7644af309 in compute_imatrix (params=..., ctx=0x60e771b9a500) at /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:554
554     /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp: No such file or directory.
#11 main (argc=<optimized out>, argv=<optimized out>) at /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:686
686     in /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp
[Inferior 1 (process 3765898) detached]

And now leia can't create repositories anymore (403 Forbidden). If I try to refresh the token via the web interface, I also get a 403. Hopefully that's not a sign of things to come.

nope, same on nico1 now.

And now leia can't create repositories anymore (403 Forbidden). If I try to refresh the token via the web interface, I also get a 403. Hopefully that's not a sign of things to come.

Same for nico1

And now leia can't create repositories anymore (403 Forbidden). If I try to refresh the token via the web interface, I also get a 403. Hopefully that's not a sign of things to come.

It's fixed now.

Typical hf day then.

I configured Llama-4-Maverick-17B-128E for rpc as well. nico1 is paused (not a good moment), so I can't easily see if it is configured properly, but it likely is.

I configured Llama-4-Maverick-17B-128E for rpc as well. nico1 is paused (not a good moment), so I can't easily see if it is configured properly, but it likely is.

I paused nico1 and nico2 because quantisation tasks were still running when you started the RPC imatrix task and I didn't want it to OOM again. Memory will be very tight so it is unclear if there will be enough memory left to have any quantisation tasks running while imatrix RPC is running.

I resumed nico1 and instead only paused llmjob.nico1. I should have done so from the beginning, but using host-pause and canceling it before it disables the host is more convenient.

Imatrix RPC setup looks great so far.

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: RPC[192.168.200.201:7201] model buffer size = 221778.05 MiB
load_tensors: RPC[192.168.200.202:7202] model buffer size = 443556.09 MiB
load_tensors: RPC[192.168.200.203:7203] model buffer size =   600.04 MiB
load_tensors: RPC[192.168.200.204:7204] model buffer size = 96420.84 MiB
load_tensors:   CPU_Mapped model buffer size =  1973.12 MiB
...........................

Llama-3_1-Nemotron-Ultra-253B-v1

Well, I suspect it's a slightly tweaked architecture and/or simply broken or missing support for it in llama.cpp. Cheap to retry, maybe the transformers update helped. Or will help, once I have verified it happened.

#12843 is already fixed! We can even keep our current source GGUF. I'm so excited for this. We could already quant it but then nobody could run it using official llama.cpp releases. I really hope that this is getting merged soon. The model is quite amazing for its size beating Llama-4-Maverick and DeepSeek R1 in some benchmarks despite being much smaller. I guess I will just upload my Q4_K_M quant I made for testing to my own HuggingFace account for now so the community has something to test the model in the meantime.

on a related note, recently, imatrix jobs have started to crash like this:
CUDA error: an illegal memory access was encountered
it's very noticeable, because this has never happened before, and now it affects about a third of the models. I suspect a bug in current llama, or some config change - maybe we should go back to an older llama version?
Could be also the models, but it's too common and appeared too sudden for this to be likely.

Guess what crashed the RPC imatrix setup this time. Not OOM but this garbage:

root@RPC-GPU:~# GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./run.sh

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('192.168.200.201') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
Starting RPC server
  endpoint       : 192.168.200.201:7201
  local cache    : n/a
  backend memory : 35 MB
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /root/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2465
  cudaStreamSynchronize(cuda_ctx->stream())
/root/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
./run.sh: line 8:  5960 Aborted                 ./rpc-server -H 192.168.200.201 -p 7201 -m 35.5

This is likely related to https://github.com/ggml-org/llama.cpp/issues/12798
I updated our fork so we have the now merged support for Qwen3 and Qwen3MoE (https://github.com/ggml-org/llama.cpp/pull/12828).
Let's get latest and rebuild with GGML_CUDA_GRAPHS=OFF
In the case of RPC servers this is: cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_CUDA_GRAPHS=OFF

CUDA error: an illegal memory access was encountered

Yup, seems a common error. All failed models failed with that. But it's not universal. Also not related to llama-4. But seems mostly to affect llama.

rebuilding all with -DGGML_CUDA_GRAPHS=off - if possible, drop me a note when this is fixed so I can remove it (or alternatively, we can investigate if this is even useful for us)

The imatrix setup is now ready again. I got the latest llama.cpp and rebuilt with GGML_CUDA_GRAPHS=OFF. Please restart the imatrix RPC task.

rebuilding all with -DGGML_CUDA_GRAPHS=off - if possible, drop me a note when this is fixed so I can remove it (or alternatively, we can investigate if this is even useful for us)

I will monitor that issue. Let's hope the temporary workaround of setting GGML_CUDA_GRAPHS=OFF works. Please restart the RPC imatrix task so it starts after the current imatrix tasks, as I really want to have the RPC imatrix task running within the next few hours as otherwise the timing is quite bad for me. If that doesn't work for you, feel free to even kill the currently running imatrix tasks in favour of RPC imatrix computation.

I started some of the small imatrix tasks that failed before and they work. Unfortunately, a bigger one sneaked in, but after that, the first rpc task should start automatically (~40min).

nite

Nice, this time RPC imatrix computation started successfully. After the 2.5 hours it took to load the model, RPC imatrix computation seems to be much faster than expected.

RAM usage:
CastlePeak: 90.79% (228.36 GiB of 251.53 GiB) (hosts nico2)
StormPeak: 88.67% (446.15 GiB of 503.18 GiB) (hosts nico1)
Threadripper: 88.61% (111.32 GiB of 125.63 GiB) (hosts OpenWrt router)

I think maybe a single quantisation task on nico1 could work if you want to risk it.

It's going so fast you could RPC imatrix compute Llama-4-Maverick-17B-128E after the current instruct one. Not sure if you already configured it to use RPC, but if you did, it's next in the queue and should automatically start once the current one is done.

a good day to wake up to :)

I think we might have run out of nico1 spool storage for a brief period, as both Llama-4-Maverick-17B-128E-Instruct and Llama-4-Maverick-17B-128E randomly failed simultaneously. We are currently at 326 GB free storage at the time of me checking.

The The-Omega-Concession-M-24B-v1.0 imatrix task might have broken due to the storage situation as well; it failed right at the end without any error - the log file just stops, likely when things ran out of storage - so please restart it as well.

I wonder if we have to restart some HuggingFace uploads due to the storage situation, as a HuggingFace upload failed due to no space left in 'Llama-4-Maverick-17B-128E-Instruct-i1-GGUF-Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf*.log':

Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf.part1of4:  83%|████████▎ | 36.6G/44.0G [13:19<02:02, 60.9MB/s]'(ProtocolError('Connection aborted.', OSError(28, 'No space left on device')), '(Request ID: 7fb0487d-b688-43c2-8d71-dd2642d519a7)')' thrown while requesting PUT 

No, it actually restarted itself after the storage situation was fixed, but then failed with this error:

Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf.part1of4: 100%|██████████| 44.0G/44.0G [16:03<00:00, 45.7MB/s]
BadRequestError(' (Request ID: Root=1-67f79659-5ebf7bfd6737a9b26474a95c;a3f6c7cb-2d0c-4828-aed8-182ed473112a)\n\nBad request for commit endpoint:\nYour push was rejected because an LFS pointer pointed to a file that does not exist. For instance, this can happen if you used git push --no-verify to push your changes. Offending file: - Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf.part1of4') at /llmjob/share/bin/llmjob line 2804.

Let's hope that fixed it:

nico1 ~# ils 
hfu 81284 212230 212235 212289 213285 213289 213293 237369 237370 3788005 3850513
hfu-Llama-4-Maverick-17B-128E-Instruct-i1-GGUF 3788005
hfu-Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf 3788005
nico1 ~# ikil 3788005

This killed the faulty upload task but now it’s not trying to upload it again.

i1-IQ3_M for Llama-4-Maverick-17B-128E-Instruct never got uploaded to https://huggingface.co/mradermacher/Llama-4-Maverick-17B-128E-Instruct-i1-GGUF/tree/main as they no longer seem to exist on nico1.

Edit: Nevermind, I somehow missed them. They did get uploaded! Confirmation bias is so real. So everything is perfect. Strange that they did despite me having to kill the upload. I should put more trust in your system automatically doing the right thing. I'm so used to systems not working the way they should.

"Your push was rejected because an LFS pointer pointed to a file that does not exist."

How could this even happen with the hub api?

Edit: Nevermind, I somehow missed them. They did get uploaded!

It's actually quite a common failure mode (for us) to have it uploaded, but not deleted.

In any case, the logic is quite accidental - here is a high-level description:

  1. When a job starts, it first enumerates all running hfu process groups, then downloads the remote file listing and merges them.
  2. The quantizer then skips creating quants that already exist on disk, or are in the previous list.
  3. Whether it was created or not, when it exists, the script will then upload it, and, if successful, delete it.

If a job is interrupted, it still goes through all quants again, so as long as it restarts, uploads will be restarted as needed. That's all both by luck, and of course because I try to design processes to be idempotent, if convenient.

The only hole is if the job is done (i.e. 9999 blocked/nonempty, or the imatrix phase after the static phase) and the uploads get interrupted (e.g. after a crash) - then uploads will not be reattempted. The symptom is that the job will stay forever in blocked/nonempty, until the quants are deleted (preferably after being uploaded). This is kind of a design bug - the "quantize" script does so much work on its own. If all the various subtasks (like uploads) were jobs, they could be scheduled more efficiently. However, I would have never invested the effort into making something so complicated...
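
A very rough sketch of that loop (not the actual "quantize" script - all helper names here are made up):

import os

# hypothetical helpers standing in for the real infrastructure
def list_remote_files(model): return set()         # quants already in the HF repo
def list_running_hfu_uploads(model): return set()  # quants a running hfu is pushing
def quantize(model, q, path): open(path, "wb").close()
def upload(path): return True                      # would return False on failure

def run_job(model, quants):
    done = list_remote_files(model) | list_running_hfu_uploads(model)  # step 1
    for q in quants:
        path = f"{model}.{q}.gguf"
        if q not in done and not os.path.exists(path):  # step 2: skip existing quants
            quantize(model, q, path)
        if os.path.exists(path):                        # step 3: upload, delete on success
            if upload(path):
                os.remove(path)

run_job("Example-Model.i1", ["Q4_K_M", "Q6_K"])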

[the rest]

Yeah, I noticed. No clue why it was full, but it was most certainly full at some point. Fortunately after the rpc imatrix was done...

Regarding my python troubles, this must be a bug. First of all, there seems to be no way whatsoever to get a useful debug log out of pip3 - you can get download logging to no end, but not a single line explaining why pip downgrades a package.

But it makes no sense. llama.cpp's top-level requirements.txt looks like this:

-r ./requirements/requirements-convert_legacy_llama.txt
-r ./requirements/requirements-convert_hf_to_gguf.txt
-r ./requirements/requirements-convert_hf_to_gguf_update.txt
-r ./requirements/requirements-convert_llama_ggml_to_gguf.txt
-r ./requirements/requirements-convert_lora_to_gguf.txt
-r ./requirements/requirements-tool_bench.txt

I can pip install -U -r every single one of these files individually and I get transformers 4.51. If I install them in one go via the top-level requirements.txt, it downgrades to 4.46.

Ok, apparently, this dependency causes it: huggingface_hub~=0.23.2

And yes, that looks to be a design bug, as these dependencies are not checked anymore once a package is installed, so they are absolutely pointless.
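For reference, ~= is the "compatible release" operator from PEP 440, so huggingface_hub~=0.23.2 pins to the 0.23.x series; a quick check with the packaging library (just an illustration, not part of the setup):

from packaging.specifiers import SpecifierSet

# "~=0.23.2" expands to ">=0.23.2, <0.24.0"
spec = SpecifierSet("~=0.23.2")
print("0.23.5" in spec)   # True  - stays within the 0.23 series
print("0.30.0" in spec)   # False - a newer huggingface_hub is rejected

Presumably that is why installing everything in one go backtracks to a transformers release that is still happy with huggingface_hub 0.23.x, i.e. 4.46.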

@nicoboss both llamas 4 maverick quants stopped with the same problem:

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.1.ffn_gate_exps.weight in a very low-bit quantization

Normally, that's an imatrix coverage problem, but in this case, I think it's simply a bug in llama 4 support. We could provide "the other" imatrix quants (basically non-IQ1/2/3 and IQ3_M or so, I have a list).

But that might mean that the other imatrix quants will have lower quality, depending on the nature of the problem.

Suggestions?

In other news, I started the ling-plus* imatrix quants with whatever I found in /tmp on nico1. I have no clue why there were imatrix files there, and I have no clue what is in them :) The truth will come when we hit the first IQ2 or so quant.

I am not sure that's a good idea, obviously, but the only other option would be to declare the Ling-plus models as broken and ignore them (ok, the third option would be to resurrect imatrix-data-quarter, but I also don't feel like using the rpc setup for that).

pip always tries to satisfy the dependencies of a newly installed package even if that requires upgrading/downgrading dependencies of previously installed packages. This is because in a Python environment you can only have a single version installed for a specific package. If the situation is totally unresolvable, pip shows a dependency conflict. I personally find pip being unable to resolve dependencies way more annoying than it breaking previously installed packages, as there is usually no way to resolve such conflicts. With pip doing something and trying its best, there usually is a decent chance of everything working. Package maintainers and developers are mostly to blame for this mess. Some just specify that they require a very specific version of a package for no reason other than that's what they tested their software with, instead of specifying that everything of a specific major/minor version or anything newer than a specific version is supported. Most do that because they don't want to be blamed if whoever maintains that dependency doesn't follow sane versioning guidelines and introduces breaking API changes or catastrophic bugs in a bugfix release. In any case this behavior just results in nobody taking version requirements seriously anymore, as they are just annoying if they are used more as recommendations than as true requirements.

I wouldn't be surprised if we indeed didn't cover all 128 experts of Llama 4 Maverick, but given that we use our algorithm to store them anyway, as long as we cover most experts this should have worked. It would be interesting to check the log and see what happened if you still have that somewhere. It could be a bug in the relatively new Llama 4 implementation, but it's hard to say without looking at the logs.

Regarding the Ling-plus models, there we had the NaN crash. The imatrix files in /tmp are probably the last autosave before the NaN crash. The clean solution would be llama.cpp fixing the NaN issue on their side, you editing your imatrix training data and deleting the chunk of training data which caused the NaN crash for that specific model, or using different imatrix training data for this model, like the imatrix dataset under /root. I planned on trying some other imatrix training datasets on some smaller quant and, if they work, doing them using the RPC imatrix setup. Then unfortunately Llama 4, Nemotron Ultra 253B and a ton of llama.cpp PRs to monitor, like the MLA implementation and Qwen 3 support, came in combination with a busy time at work, leaving me with no time to reinvestigate Ling-plus. I guess providing imatrix quants with the imatrix we have is fine. Just keep in mind that this was only trained on like a quarter of the data we usually use for imatrix training.

Until 25th of April I will be limited in my ability to respond to unimportant HuggingFace comments like model requests and user questions, queue models, execute llmc audit, investigate issues with certain models, manually provide GGUFs of large models or start the RPC imatrix setup, so you might have to do a bit more on your own than usual during this time. It's not that I can't do those things, just not as much as usual. On the bright side I will now finally have time to work on our llama.cpp fork and add the long-requested versioning metadata.

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.1.ffn_gate_exps.weight in a very low-bit quantization

Normally, that's an imatrix coverage problem, but in this case, I think it's simply a bug in llama 4 support.

Unsloth also reports similar

[image: Unsloth's screenshot showing per-block expert coverage percentages]

They also provide how much calibration data they used:

We tried adding more uncommon languages to our calibration dataset, and tried using more tokens (1 million) vs Scout's 250K tokens for calibration, but we still found issues.

They got around it by:

We decided to leave these MoE layers as 3bit and 4bit.

I think there is an issue with the tensors in block 45, as it is below 50%. Block 3 is above 95% and thus gets saved even if it is partial. Block 1 is not above 95%, so it does not get saved, hence why you hit the error; it is above 90% though, so it doesn't seem that strange that it isn't saved. Block 45 being below 50% is the strange one. I think you can force all three through anyway and test, by adjusting the cutoff or adding conditions to the changes you used here: https://github.com/nicoboss/llama.cpp/pull/1

If the calibration doesn't hit those experts, then inference has a good chance of not hitting them either, which is why it might be worth the test if you guys want to. Or you can do it like Unsloth and just adjust those tensors in block 45 (and block 1, if you don't want to set the cutoff to 90%) to be saved at a higher bit width.
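As far as I can tell from the linked patch, the save decision amounts to something like this (a Python rendition purely for illustration; the per-block coverage numbers below are rough values read off Unsloth's screenshot, not measured):

def tensor_is_saved(n_experts: int, n_uncovered: int) -> bool:
    # Mirrors "bad_experts.size() < round(kv.second.n_as * 0.05)":
    # keep the partially covered tensor only if fewer than ~5% of experts are uncovered.
    return n_uncovered < round(n_experts * 0.05)

# 128 routed experts; covered counts are guesses matching ~93%, ~96% and <50%.
for blk, covered in [(1, 119), (3, 123), (45, 60)]:
    print(f"blk {blk}: saved={tensor_is_saved(128, 128 - covered)}")
# blk 1: saved=False, blk 3: saved=True, blk 45: saved=False

which lines up with block 3 getting through while blocks 1 and 45 do not.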

Also on pip I tend to use the dry run option documented here: https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-dry-run which makes pip a little more bearable to use.

hi @mradermacher , how are you? a small oops happened on rich1 and we deadlocked the status page and potentially the scheduler. I need a tiny bit of help to recover without making a huge mess =) sorry for the inconvenience and thank you for the help =)

And I guess it would be a great idea to make a command to force release a lock without killall9

Llmc audit still works, so just the status page is dead

And it's very interesting what broke, so if you fix it, please tell me what happened =)

pip always tries to satisfy the dependencies of a newly installed package even if that requires upgrading/downgrading dependencies of previously installed packages.

Right, but it seems to me very broken to have dependencies, only to have them ignored. Sure, if a package creator says transformers<4.46 needlessly, the creator is to blame (which is probably the case here). But if that were the only reason for these dependencies to exist, they should not exist. And in the cases where these dependencies are actually required for a package to work, it makes no sense for pip to break them. Puzzling.

This is because in a Python environment you can only have a single version installed for a specific package.

That's actually very much sane :)

I wouldn't be surprised if we indeed didn't cover all 128 experts of Llama 4 Maverick, but given that we use our algorithm to store them anyway, as long as we cover most experts this should have worked.

This is a case of a tensor not stored at all. Either our patch ignores 0% covered tensors, or there is another issue. Since llama-4 has some duplicated tensor issues, maybe this is it.

It would be interesting to check the log and see what happened if you still have that somewhere.

Nope :)

Regarding the Ling-plus models, there we had the NaN crash. The imatrix files in /tmp are probably the last autosave before the NaN crash.

It looks like that, but autosaves are off, because of the potential problem I reported with the weight patch. Since we also don't get messages anymore when it's off, I thought it really was off.

If it is on, then this is worrying, because I then think our imatrix weight patch might slightly corrupt the imatrix data.

The clean solution would be llama.cpp fixing the NaN issue on their side,

Assuming (without evidence) that this is not a model issue.

Until 25th of April I will be limited in my ability to respond to unimportant HuggingFace comments like model requests and user questions, queue models, execute llmc audit, investigate issues with certain models,

Good, that means we are both busy :) Maybe we need to press somebody else into work :) @RichardErkhov what do you do in your ample free time? :)

@tdh111

Unsloth also reports similar

Yes, similar, but I think it's different - we still save partially covered tensors (or should), and this is a tensor completely missing, which is why I think it's not the usual "some expert not fully covered" issue.

to be saved in higher bit.

I don't think this can easily be done with llama.cpp. I think we should either wait a bit longer for something magical to happen (such as an upstream fix), or skip those quants. Or both, in that order.

Somebody has helpfully reported the llama 4 issue btw.: https://github.com/ggml-org/llama.cpp/issues/12913

@RichardErkhov what do you do in your ample free time? :)

Who said I have free time haha? I'm in my last year of school, I have like 20 exams and then preparation to move to another country for uni lol. And everything is mixed with writing code of course

What do you want though? Maybe I will have some time to work in parallel haha

@RichardErkhov

What do you want though?

Well, a great help would be to go through the mradermacher model requests, queue them, and report back to the requester. Typically, it looks like this:

https://huggingface.co/mradermacher/model_requests/discussions/828#67f577bfba17fca922c920b9

It would already be a great help if you just queued some normal/simple requests and simply ignored everything too complex.

You can queue models like this from rich1 (e.g. for the example above):

llmc add -2000 si https://huggingface.co/deepcogito/cogito-v1-preview-qwen-32B

It would be great help because I can sometimes only check once per day or even less often, so that would reduce latency a lot.

I'm in my last year of school, I have like 20 exams and then preparation to move to another country for uni lol.

Haha, as if anybody cares for school grades :)

(Well, yeah, depending on what you want to do, they are important indeed :)

And it's very interesting what broke, so if you fix it, please tell me what happened =)

I don't know, but typically it results in rsh hanging and still keeping the socket open that the status daemon wants to see closed.

Don't worry about things breaking though, at least, as long as there was a reason to.

killall9

It could be made less crude, but there is little alternative to killing everything that holds a lock when there is a real deadlock (but there hasn't been one for a long while). Maybe I can make the llmstatusd separately restartable, because that is always safe to restart.

i will try to queue them, and will ping you if something is wrong. I hope I will not break anything, as I tend towards "if anything can go wrong, it will go wrong", but I will try

Haha, as if anybody cares for school grades :)
(Well, yeah, depending on what you want to do, they are important indeed :)
with my amazing school I need to grind to get something above D on a level exam, even with my 98.6 average...

And yes, please make a command to release a lock haha, or else you will wake up every day from a deadlock just because I failed to properly add a model

@tdh111

Unsloth also reports similar

Yes, similar, but I think it's different - we still save partially covered tensors (or should), and this is a tensor completely missing, which is why I think it's not the usual "some expert not fully covered" issue.

You only save partial tensors when they are above 95%; like I showed with Unsloth's image, block 1 is at 93%, so it is not covered by your patch. Your patch does catch block 3 though, which Unsloth does not save but yours should. Block 45 is the problematic one at under 50% though.

@nicoboss nice audit action :)

+                if (bad_experts.size() < round(kv.second.n_as * 0.05)) {

@tdh111 Hmm, I was misinformed about what that patch does then :( Thanks for pointing it out.

@RichardErkhov

i will try to queue them. I hope I will not break anything, as I tend towards "if anything can go wrong, it will go wrong", but I will try

Don't worry, I know you well enough by now. But it should be relatively foolproof (you are just queueing a few models per day at most :). Also, maybe I was too sloppy earlier: of course school goes first, but I assume you know that, just covering my bases :)

Anyway, thanks for your help, once more.

will ping you if something is wrong.

I normally go through all model requests (because I can't see if any went wrong anyway, or have been successfully acted upon). If not sure, just ignore it, you don't have to specifically ping me. Likewise, if you see that a model failed, feel free to report back to the original submitter in the issue, but don't feel you need to alert me.

  •            if (bad_experts.size() < round(kv.second.n_as * 0.05)) {
    

@tdh111 Hmm, I was misinformed about what that patch does then :( Thanks for pointing it out.

I'm sorry for the poor communication in the past, but yes, that is what the patch does; if it allowed anything through, then it would hide situations where you do not have enough coverage for the model (which, with the modern MoEs, is actually difficult).

Well, the open question is why we do have "auto-saves" despite -ofreq not being specified. Because my fear is that the patch changes the imatrix data in-place, which then goes into the next computation. When I dropped -ofreq, I also didn't get any messages anymore about tensor coverage, except at the end. Very strange.

@nicoboss I think you queued a bunch of glm4 models - they are not supported by our llama.cpp version - fortunately, as the current upstream supports it, but apparently silently corrupts the data when quantizing (which is weird, but hey).

@x0wllaar helpfully pointed this out in https://huggingface.co/mradermacher/model_requests/discussions/844#67fe8fb6046aebf7c8dcdcf5

update: i hate this broken @ completion thingy. gets me every single time

@nicoboss and while I have you, this is also interesting: https://huggingface.co/mradermacher/model_requests/discussions/847#67ffc38637b9f0d1152b9d45

basically, the rather common llama_vocab_get_add_eos crash might actually be a fixed llama.cpp bug

@nicoboss the GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed issue seems to be a bug in our fork only - quants that fail like this seem to work fine with normal llama.cpp (and e.g., koboldcpp). (i have not tested the exact same versions, though)

it was almost certainly introduced with the dryrun changes

yup, verified. that seems to affect 161 models :) and is the most common killer for things that survive DRYRUN

@nicoboss also, it seems our build numbers are unrelated to llama.cpp build numbers. that is unfortunate.

for testing, i'll switch back to llama.cpp

nope, happens with upstream llama as well, so it must be our compile options. but it happens with all variants (cuda, nocuda, noavx...)

sucks that we both don't have time :(

i debugged a bit, so add_eos is true for me, and it seems it is so because the model has: tokenizer.ggml.add_eos_token bool = true

and this happens in main:

if (!llama_model_has_encoder(model)) {
    GGML_ASSERT(!llama_vocab_get_add_eos(vocab));
}

has_encoder is false, because arch is llama, and therefore, assertion failure.
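If anyone wants to quickly check whether a suspect GGUF carries that flag, something like this should do it (assuming the gguf Python package that ships with llama.cpp; the value indexing follows what gguf_dump.py does for scalar fields, so treat it as a sketch):

from gguf import GGUFReader

reader = GGUFReader("model.gguf")   # placeholder path
field = reader.fields.get("tokenizer.ggml.add_eos_token")
if field is None:
    print("add_eos_token not set")
else:
    # for scalar fields the last part holds the value
    print("tokenizer.ggml.add_eos_token =", bool(field.parts[-1][0]))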

Good thing we waited with MLA:

Well... It's always like this. And the only thing to avoid it is to let it cook for a while and let people test it - it's normal to have bugs, especially with llama.cpp's essentially zero testing/quality approach.

I think we might have glm4 support upstream now: https://github.com/ggml-org/llama.cpp/pull/12957 https://github.com/ggml-org/llama.cpp/pull/13021

I updated our fork with latest llama.cpp so you can update and try it out. Please keep in mind that GLM4-0414 is currently probably still broken and will be fixed once https://github.com/ggml-org/llama.cpp/pull/13099 gets merged but all other GLM-4 models should be fine.

Well... It's always like this. And the only thing to avoid it is to let it cook for a while and let people test it - it's normal to have bugs, especially with llama.cpp's essentially zero testing/quality approach.

Yes, I know, and my code is also not always perfect, but I usually do a lot of automated and manual testing to at least ensure a certain quality standard. llama.cpp even mentions inside https://github.com/ggml-org/llama.cpp/blob/master/docs/development/HOWTO-add-model.md that main, imatrix, quantize and server should be tested when adding a new architecture. I would expect the same when making significant changes to the implementation of an existing architecture. But in the end all llama.cpp contributors just do so in their spare time so I guess we can all be happy that they contribute at all and things breaking from time to time is not that bad as we can always wait a bit for things to mature unless a model is urgent.

Please keep in mind that GLM4-0414 is currently probably still broken

I see, I see... I guess we should wait then, because the queued models are essentially all glm4-0414 based.

But in the end all llama.cpp contributors just do so in their spare time so I guess we can all be happy that they contribute at all

I've been doing free software development for decades now. My opinion, backed by experience with lots of GNU projects, is that it's better to have no maintainer than a bad maintainer. A bad maintainer keeps other people from joining and doing a better job. Likewise, if you can't do a job properly, it's almost always better to not do it at all.

Furthermore, doing a bad job because "it's just my spare time" would be extremely disrespectful towards the other people consuming the results and wasting their "spare time".

@nicoboss that is a lot of red. but it looks valid.

Question: do you do any kind of result control? Because I am happy to clean stuff up (well, queueing models that are known to not be supported by llama.cpp is a bit extreme, but missing tokenizers etc. is mostly unavoidable), but I wonder if I should do so because it might mean you won't even notice.

The background is, I am wondering about some kind of automatic diagnosis, for example, I normally want to know which models have failed and why, and when some failures get automatically nuked and just end up in a log file somewhere, I surely will not notice. That's why I have so far refrained from automatic clean up. But today's situation is a bit extreme and suggests automating it. Any thoughts on that?

@nicoboss another data point: 26 of the 27 failed imatrix gens are due to the llama_vocab_get_add_eos assertion which seemingly only exists in llama-cli and llama-imatrix, but not in llama-server, and where spot tests have shown that the model actually works. (I have cleaned those up because you have now "seen" it :)

Also, I have not checked their size, but does it even make sense to queue imatrix for small models (3b or smaller)? I practically never do.

@nicoboss that is a lot of red. but it looks valid.

It's even more red now: 74 quant errors and 2 imatrix errors, but I also queued a lot of models, so this is expected. I queued everything in the top 50 pages of trending models supported by llama.cpp that we have not yet done and that have not failed in the past.

Question: do you do any kind of result control?

If it's a common error I'm just nuking. I only investigate it if it is a user requested model or an interesting special error.

Because I am happy to clean stuff up (well, queueing models that are known to not be supported by llama.cpp is a bit extreme, but missing tokenizers etc. is mostly unavoidable), but I wonder if I should do so because it might mean you won't even notice.

I can always check llmc why, so no worries, go ahead and clean them all up.

The background is, I am wondering about some kind of automatic diagnosis, for example, I normally want to know which models have failed and why, and when some failures get automatically nuked and just end up in a log file somewhere, I surely will not notice. That's why I have so far refrained from automatic clean up. But today's situation is a bit extreme and suggests automating it. Any thoughts on that?

For me it would be fine to automatically nuke common errors occurring in low priority models. Auditing like 100 failed models is somewhat tedious, especially when I'm on holiday and have to do everything from my phone with limited mobile internet. Luckily I will be back home again in a few hours.

Also, I have not checked their size, but does it even make sense to queue imatrix for small models (3b or smaller)? I practically never do.

For trending models, I usually always queue with imatrix no matter their size, unless it's a big non-English model or something I personally consider garbage. I think doing imatrix for small models might be justified, as that's what everyone runs on their phones, but it only makes sense if the model is popular enough to justify the additional resources required to create imatrix quants.

If it's a common error I'm just nuking. I only investigate it if it is a user requested model or an interesting special error.

That means no, because you can only nuke when you see it, so you essentially would have no feedback on failure rates etc.

I can always check llmc why, so no worries, go ahead and clean them all up.

My concern is not that you can check it, but that there is no learning effect, no possibly correcting feedback. For example, you queued models that simply aren't supported, which should not have been possible, as you said you filter for supported architectures. Without you seeing it, you would never know.

And the indirect problem is that cleaning these models up is a lot of work, because it requires a conscious decision every time. It also makes it hard for me to see the feedback on models where I am concerned (and where I might want to manually go in and fix things if possible).

Maybe the solution is to create a junk class of models (another flag hack really) which are simply cleaned up either on any error, or on specific errors.

if the model is popular enough to justify the additional resources required to create imatrix quants.

Well, for tiny models, the additional resources are also tiny, so this would be justified almost always, even for non-popular models.

Unless you think of human resources (e.g. you and me :) who will have to handle extra failures. But I think the imatrix failures specifically are something we should investigate further. I hope you can find some time at some point to look into this and maybe support or correct my findings, as it affects a lot of non-junk models - 29 of the 31 imatrix failures so far today (I actually just paste a one-liner to clean those up because it is the #1 reason for imatrix gen to fail).

I've added a "besteffort" token that can be used when submitting a model. completely untested, of course. it also doesn't do anything other than setting a flag yet, but if you use it next time you do mass-queueing, i can probably do something useful with it. just add it somewhere after the url when adding.

That means no, because you can only nuke when you see it, so you essentially would have no feedback on failure rates etc.

I usually run llmc audit and go through all of them, but if I see it is bpe-pt missing, missing spiece.model, ValueError Unprocessed experts or missing tokenizer.model, I'm not investigating the issue further and am just nuking it.

It is not true that you need to actually run llmc audit to nuke models. Sometimes I'm also just nuking based on what I see on the status page. You can copy the model page into a text editor and delete all lines not containing some specific common errors and then just llmc nuke them in mass.

My concern is not that you can check it, but that there is no learning effect, no possibly correcting feedback

That's why I recommend to only automatically nuke common errors like the ones I would manually nuke without closely looking at them.

For example, you queued models that simply aren't supported, which should not have been possible, as you said you filter for supported architectures. Without you seeing it, you would never know.

They all seem to be supported, so I'm quite confused they are not. Maybe because the latest version of our fork doesn't match the version we currently have on our servers, as I updated it yesterday, but because of https://github.com/ggml-org/llama.cpp/pull/13099 we didn't update the workers?

Idefics3ForConditionalGeneration and SmolVLMForConditionalGeneration https://github.com/nicoboss/llama.cpp/blob/3092176a929c1e4f4c88d95b7dbec4587a8fe4eb/convert_hf_to_gguf.py#L1950
LlavaForConditionalGeneration: https://github.com/nicoboss/llama.cpp/blob/3092176a929c1e4f4c88d95b7dbec4587a8fe4eb/convert_hf_to_gguf.py#L1901

Maybe the solution is to create a junk class of models (another flag hack really) which are simply cleaned up either on any error, or on specific errors.

That would be awesome.

I usually run llmc audit and go through all of them, but if I see it is bpe-pt missing, missing spiece.model, ValueError Unprocessed experts or missing tokenizer.model, I'm not investigating the issue further and am just nuking it.

That is a good algorithm, I think.

It is not true that you need to actually run llmc audit to nuke models. Sometimes I'm also just nuking based on what I see on the status page.

Sure, but only llmc audit will log the error, as the other tools are simply basic management commands that are used, e.g. when you want to queue a job elsewhere, you nuke, and don't want a permanent failure record.

That's why I recommend to only automatically nuke common errors like the ones I would manually nuke without closely looking at them.

I totally agree, but we clearly need a separate class for that, as I don't want automatic nuking on almost everything I queue (mostly because I have a much lower failure rate, so it is manageable). And of course you need to manually investigate such cases to find out when it is reasonable to automatically nuke.

They all seem to be supported, so I'm quite confused they are not. Maybe because the latest version of our fork doesn't match the version we currently have on our servers, as I updated it yesterday

That smells like a bug - you should not extract models from the llama version you made, but from the llama version the workers have installed. Indeed, Idefics3 was newly supported, and I did update llama.cpp, but I first have to be awake and find time to do it, so it can take a day sometimes. I see I have been too responsive in the past ;)

It is still a good example - no matter why it wasn't supported, without feedback we wouldn't have caught it. And this can happen for missing tokenizer.model or unsupported bpe-pretokenizer as well, i.e. the things we normally nuke without much thinking.

create a junk class of models (another flag hack really) which are simply cleaned up either on any error, or on specific errors.
That would be awesome.

I haven't pointed this out clearly, but the besteffort flag is exactly that (or rather, supposed to be that).

Hmm, scripts that worked fine for a year. Two days ago:

exporting up to 1000 of 877 models

Yesterday:

exporting up to 1000 of 931 models

Today:

no models to export

Problems I don't need in my morning routine.

Update: found it, my method that predates convert's ability to tell me its supported models was good enough to pass visual inspection (I got a list of supported archs) but wasn't actually good enough for my script. Things I absolutely don't need in the morning.

@nicoboss right, of course, we never made a llava mmproj extractor. also, idefics3 is a vision model. do you happen to know if there is an extractor already?

@mradermacher Can you please fix the status page? I think it likely crashed due to nico1 losing internet connectivity while still in the pausing process.

I've added a "besteffort" token that can be used when submitting a model. completely untested, of course. it also doesn't do anything other than setting a flag yet, but if you use it next time you do mass-queueing, i can probably do something useful with it. just add it somewhere after the url when adding.

Amazing. I will for sure use that flag in the future!

Sure, but only llmc audit will log the error, as the other tools are simply basic management commands that are used, e.g. when you want to queue a job elsewhere, you nuke, and don't want a permanent failure record.

Good to know. I wasn't aware that directly nuking doesn't log the error. I was only aware that nukeall does not, which is the main reason I often end up accidentally queuing the same imatrix failures again and again.

I totally agree, but we clearly need a separate class for that, as I don't want automatic nuking on almost everything I queue (mostly because I have a much lower failure rate, so it is manageable). And of course you need to manually investigate such cases to find out when it is reasonable to automatically nuke.

Makes perfect sense and I prefer this approach as for user requested models I want to always investigate the error and report back to the person who requested the model.

That smells like a bug - you should not extract models from the llama version you made, but from the llama version the workers have installed. Indeed, Idefics3 was newly supported, and I did update llama.cpp, but I first have to be awake and find time to do it, so it can take a day sometimes. I see I have been too responsive in the past ;)

I'm aware of it. I will soon change my script to instead use /llmjob/llama.cpp/convert_hf_to_gguf.py to get the supported models.

It is still a good example - no matter why it wasn't supported, without feedback we wouldn't have caught it. And this can happen for missing tokenizer.model or unsupported bpe-pretokenizer as well, i.e. the things we normally nuke without much thinking.

We definitely don't want to auto-nuke unsupported models, as this usually means that a model supposed to be compatible with llama.cpp somehow was not.

Regarding missing tokenizer.model, I am aware of the issue and know how to generate tokenizer.model based on tokenizer.json, but doing so is a lossy process, as tokenizer.model does not contain all tokens that were inside tokenizer.json, so I better get this perfect and do extensive testing before we automate that. It would be a shame if we do this wrong and then generate hundreds of models that aren't perfect. I know it is probably better than not generating them at all, but I'm too much of a perfectionist to go with a potentially non-perfect solution, as I know I would regret it.

Unsupported bpe-pretokenizer is similar. We could add our own mapping to our fork to make almost all of them work, but there is no guarantee that whatever got changed in the pre-tokenizer doesn't affect the model in any way unless we manually review the pre-tokenizer changes for every mapping we add and think about it, for which we probably won't have time. But to be fair the official convert_hf_to_gguf_update.py also does not do any validation, so it likely would be fine, but not perfect.

Hmm, scripts that worked fine for a year

It's surprising it lasted for a year. Mine usually break much quicker because either HuggingFace or llama.cpp change something I relied on. For example, just a few days ago llama.cpp broke my script by changing @Model.register to @ModelBase.register inside convert_hf_to_gguf.py.

@nicoboss right, of course, we never made a llava mmproj extractor. also, idefics3 is a vision model. do you happen to know if there is an extractor already?

You could try the newly added --mmproj argument of convert_hf_to_gguf.py and see if they work. If not, they are likely not yet supported. I'm quite happy that llama.cpp developers finally integrated mmproj extraction into convert_hf_to_gguf.py as using model specific extractors was a massive pain. Keep in mind that this option is still considered experimental so there might be models with a model specific extractor not yet supported by it or some other issues with it.

Oh I see in audit many errors like /llmjob/share/python/bin/python3: can't open file '/llmjob/llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py': [Errno 2] No such file or directory - this probably got removed as you are now supposed to use convert_hf_to_gguf.py instead of model-specific mmproj extractors for many vision models.

I also checked and can confirm that support for SmolVLM (version 1 and 2) got implemented in https://github.com/ggml-org/llama.cpp/pull/13050 which should be able to handle Idefics3ForConditionalGeneration and SmolVLMForConditionalGeneration if I'm not mistaken. Vision extraction for LlavaForConditionalGeneration

The status page is still broken but things don't seem to be soft locked, so killall9 would likely be an overreaction to fix an issue that only seems to break the status page.

--mmproj argument of convert_hf_to_gguf.py

I see... but how to specify the quant type? and does it only create the mmproj file? love how they force the filename, too :( I suspect you need an external quantizer to quantize mmproj. Makes more sense, IMHO, but also generates more work than I am able to easily do right now.

Unfortunately, that is likely too much to handle for me right now; I'll "freeze" gemma3conditional in the queue for the time being.

--mmproj              (Experimental)

Yeah baby.

The status page is still broken but things don't seem to be soft locked, so killall9 would likely be an overreaction to fix an issue that only seems to break the status page.

In each case it was a hanging rsh, which probably affects llmstatusd because it connects every minute. Don't see a fix coming very soon, but I have one in the pipeline.

which should be able to handle Idefics3ForConditionalGeneration and SmolVLMForConditionalGeneration if I'm not mistaken. Vision extraction for LlavaForConditionalGeneration

Must be why they are all failing (well, smolvlm). We probably haven't found the private model they have tested it with.

--mmproj argument of convert_hf_to_gguf.py

Anyway, your input is invaluable, as always :) You wouldn't know of a way to extract supported vision model architectures? Or should we just blindly try to extract mmproj for every model (I wonder what the memory usage will be, and if it requires cuda).

(partial answer: https://github.com/ggml-org/llama.cpp/blob/master/examples/llava/README.md lists gemma3, smolvlm* and pixtral, so not qwen, for mmproj, but maybe that's because it is the llava page. for most models, it seems you still need to call other converters)

final answer: convert_hf_to_gguf.py --print-supported-models actually prints it, nice. now the question is how to find out if a specific model actually has vision support (e.g. gemma3 does not always have)

actually not final answer: just because it supports that arch does not mean it supports mmproj extraction though
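Still, for getting the list of what the installed converter supports, leaning on that flag instead of scraping the decorators could look roughly like this (the output parsing is an assumption about the format, so adjust it to whatever the flag actually prints):

import subprocess

out = subprocess.run(
    ["python3", "/llmjob/llama.cpp/convert_hf_to_gguf.py", "--print-supported-models"],
    capture_output=True, text=True, check=True,
).stdout

# crude heuristic: keep lines that look like HF architecture class names
supported = {line.strip() for line in out.splitlines()
             if line.strip().endswith(("ForCausalLM", "ForConditionalGeneration"))}
print(len(supported), "architectures")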

I added "llmc status", which should give you the same style of output as telnet to the llmstatusd. if that doesn't hang, you know it's "only" llmstatusd hanging.

I added "llmc restart-llmstatusd", which does systemctl restart llmstatusd. feel free to try when it hangs. note that it sometimes times out because the systemd timeout is slower than the disk in kaos (it kills the service, but then fails to restart it), in which case you must run it again for systemd to attempt a restart, because systemd sucks that way.

This is very unintrusive, other than me having to restart my status window.

Qwen2.5VL support https://github.com/ggml-org/llama.cpp/pull/12402 and its fix https://github.com/ggml-org/llama.cpp/pull/13133 just got merged. I updated our llama.cpp branch. Please update and requeue all previously skipped Qwen2.5VL models.

I see... but how to specify the quant type?

Not sure if this is implemented yet. Try using --outtype, but I have not tested if this applies to mmproj or only works for the normal GGUF. The options are: ["f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "auto"], default="f16"

and does it only create the mmproj file?

I think it only exports the mmproj if specified but I have not tried it.

love how they force the filename, too :(

If the output filename contains "mmproj", it should use the specified filename instead:

if "mmproj" not in fname_out.name:
    fname_out = ModelBase.add_prefix_to_filename(fname_out, "mmproj-")

final answer: convert_hf_to_gguf.py --print-supported-models actually prints it, nice. now the question is how to find out if a specific model actually has vision support (e.g. gemma3 does not always have)

Look at the architecture suffix. ForCausalLM has no vision while ForConditionalGeneration has vision.

actually not final answer: just because it supports that arch does not mean it supports mmproj extraction though

I would just try and see if it works. Seems like the easiest way to implement it, as that way you don't have to update your code when they add support for it.

I added "llmc status", which should give you the same style of output as telnet to the llmstatusd. if that doesn't hang, you know it's "only" llmstatusd hanging.

Awesome! Thanks a lot. Telnet was always a bit of a hassle while this command is super convenient in comparison.

I added "llmc restart-llmstatusd", which does systemctl restart llmstatusd. feel free to try when it hangs. note that it sometimes times out because the systemd timeout is slower than the disk in kaos (it kills the service, but then fails to restart it), in which case you must run it again for systemd to attempt a restart, because systemd sucks that way. This is very unintrusive, other than me having to restart my status window.

Amazing! I will for sure try it the next time it freezes. Thanks a lot!

@mradermacher Please restart the Qwen2.5-Coder-32B-ftjob-3653382d559c and Qwen2.5-32B-YOYO-V2 imatrix tasks. I accidentally rebooted your container due to entering the command into the wrong window. Sorry, I intended to reboot @RichardErkhov 's container as it was hogging all the GPU resources.

Thanks for your inputs!

If the output filename contains "mmproj", it should use the specified filename instead:

Wow, that's even worse. Couldn't even document that properly.

Not sure if this is implemented yet. [quantize]

In the latter case, you'd have to use clip-quantize or some such thing.

Look at the architecture suffix. ForCausalLM has no vision while ForConditionalGeneration has vision.

I was obviously talking about support for it in convert_hf_to_gguf.py :)

I would just try and see if it works. Seems like the easiest way to implement it, as that way you don't have to update your code when they add support for it.

That would mean only nico1 can do conversions from now on.

Awesome! Thanks a lot. Telnet was always a bit of a hassle while this command is super convenient in comparison.

Wow, if you think so, great for you :-)

It seems most or at least a large majority of the GLM-4 models generate nan's during imatrix training. I guess that... hasn't been tested upstream? Sigh.

Please update and requeue all previously skipped Qwen2.5VL models.

Hmm, did we skip 2.5 models? You mean ones that failed? (I'll read the issues now).

Oh, right, we skipped all q2.5 models. I also have to re-enable mmproj extraction for that, hmm.

Wow, that's even worse. Couldn't even document that properly.

No idea why this isn't documented anywhere.

That would mean only nico1 can do conversions from now on.

Maybe put all ForConditionalGeneration architectures on nico1 and only try mmproj extraction for those. I don't think there are any ForCausalLM with vision support.

It seems most or at least a large majority of the GLM-4 models generate nan's during imatrix training. I guess that... hasn't been tested upstream? Sigh.

Yes, seems like it.

Hmm, did we skip 2.5 models? You mean ones that failed? (I'll read the issues now).
Oh, right, we skipped all q2.5 models. I also have to re-enable mmproj extraction for that, hmm.

Would be awesome if you could do so if you find time for it.

Wow, if you think so, great for you :-)

It seems like llmc status doesn't show the imatrix part of the status page while telnet 10.28.1.1 16732 does. It would be cool if you could add that if doing so is easily possible.

@nicoboss q2.5 doesn't seem to be supported:

+ python3 /llmjob/llama.cpp/examples/llava/qwen2_vl_surgery.py --data_type fp16 -- ../Qwen2.5-VL-7B-Instruct
You are using a model of type qwen2_5_vl to instantiate a model of type qwen2_vl. This is not supported for all configurations of models and can yield errors.
        size mismatch for bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([1280]).

There doesn't seem to be a 2.5-specific surgery script. I tested with Qwen2.5-VL-7B-Instruct

Maybe put all ForConditionalGeneration architectures on nico1 and only try mmproj extraction for those. I don't think there are any ForCausalLM with vision support.

I don't think that works, either, as most models need special treatment in the form of some external surgery script. I think going by the official list output by convert* is the way to go. And I guess I can live with f16 only for the time being.

It seems like llmc status doesn't show the imatrix part of the status page while telnet 10.28.1.1 16732 does. It would be cool if you could add that if doing so is easily possible.

That's a good point, it only does the llmjob status. It also won't print any flags. At least the former is easy to add. (update: or not so easy, llmc exited with the first exit status it received)

Also, it has to connect (and lock) to all workers, so, uhm the web or telnet are still better for keeping some ongoing display.

q2.5 doesn't seem to be supported:
There doesn't seem to be a 2.5-specific surgery script. I tested with Qwen2.5-VL-7B-Instruct

Strange. In the now merged PR https://github.com/ggml-org/llama.cpp/pull/12402 they implemented support for it inside qwen2_vl_surgery.py. Have you specified the correct --model_type? This is how it should be called:

python3 examples/llava/qwen2_vl_surgery.py "/path/to/model" --data_type fp16 --model_type "qwen2.5vl"

I will test myself and let you know.

@mradermacher Qwen2.5-VL-7B-Instruct mmproj extraction worked for me. Just make sure to pip install --upgrade transformers and pip install accelerate after installing the normal llama.cpp requirements and call it like venv/bin/python examples/llava/qwen2_vl_surgery.py "/bpool/Qwen2.5-VL-7B-Instruct" --data_type fp16 --model_type "qwen2.5vl"

root@AI:/apool/llama.cpp# venv/bin/python examples/llava/qwen2_vl_surgery.py "/bpool/Qwen2.5-VL-7B-Instruct" --data_type fp16 --model_type "qwen2.5vl"
model_name:  /bpool/Qwen2.5-VL-7B-Instruct
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████| 5/5 [00:12<00:00,  2.53s/it]
Qwen2_5_VLVisionConfig {
  "depth": 32,
  "fullatt_block_indexes": [
    7,
    15,
    23,
    31
  ],
  "hidden_act": "silu",
  "hidden_size": 1280,
  "in_channels": 3,
  "in_chans": 3,
  "intermediate_size": 3420,
  "model_type": "qwen2_5_vl",
  "num_heads": 16,
  "out_hidden_size": 3584,
  "patch_size": 14,
  "spatial_merge_size": 2,
  "spatial_patch_size": 14,
  "temporal_patch_size": 2,
  "tokens_per_second": 2,
  "torch_dtype": "float16",
  "transformers_version": "4.51.3",
  "window_size": 112
}

[vl25][to_gguf_name] vision_model.blocks.0.norm1.weight --> v.blk.0.ln1.weight
[vl25][to_gguf_name] vision_model.blocks.0.norm2.weight --> v.blk.0.ln2.weight
[vl25][to_gguf_name] vision_model.blocks.0.attn.qkv.weight --> v.blk.0.attn_qkv.weight
[vl25][to_gguf_name] vision_model.blocks.0.attn.qkv.weight --> v.blk.0.attn_qkv.weight
[vl25][to_gguf_name] vision_model.blocks.0.attn.qkv.weight --> v.blk.0.attn_qkv.weight
[vl25][to_gguf_name] vision_model.blocks.0.attn.qkv.bias --> v.blk.0.attn_qkv.bias
[vl25][to_gguf_name] vision_model.blocks.0.attn.qkv.bias --> v.blk.0.attn_qkv.bias
[vl25][to_gguf_name] vision_model.blocks.0.attn.qkv.bias --> v.blk.0.attn_qkv.bias
[vl25][to_gguf_name] vision_model.blocks.0.attn.proj.weight --> v.blk.0.attn_out.weight
[vl25][to_gguf_name] vision_model.blocks.0.attn.proj.bias --> v.blk.0.attn_out.bias
[vl25][to_gguf_name] vision_model.blocks.0.mlp.gate_proj.weight --> v.blk.0.ffn_gate.weight
[vl25][to_gguf_name] vision_model.blocks.0.mlp.gate_proj.bias --> v.blk.0.ffn_gate.bias
[vl25][to_gguf_name] vision_model.blocks.0.mlp.up_proj.weight --> v.blk.0.ffn_up.weight
[vl25][to_gguf_name] vision_model.blocks.0.mlp.up_proj.bias --> v.blk.0.ffn_up.bias
[vl25][to_gguf_name] vision_model.blocks.0.mlp.down_proj.weight --> v.blk.0.ffn_down.weight
[vl25][to_gguf_name] vision_model.blocks.0.mlp.down_proj.bias --> v.blk.0.ffn_down.bias
(...)
[vl25][to_gguf_name] vision_model.blocks.31.norm1.weight --> v.blk.31.ln1.weight
[vl25][to_gguf_name] vision_model.blocks.31.norm2.weight --> v.blk.31.ln2.weight
[vl25][to_gguf_name] vision_model.blocks.31.attn.qkv.weight --> v.blk.31.attn_qkv.weight
[vl25][to_gguf_name] vision_model.blocks.31.attn.qkv.weight --> v.blk.31.attn_qkv.weight
[vl25][to_gguf_name] vision_model.blocks.31.attn.qkv.weight --> v.blk.31.attn_qkv.weight
[vl25][to_gguf_name] vision_model.blocks.31.attn.qkv.bias --> v.blk.31.attn_qkv.bias
[vl25][to_gguf_name] vision_model.blocks.31.attn.qkv.bias --> v.blk.31.attn_qkv.bias
[vl25][to_gguf_name] vision_model.blocks.31.attn.qkv.bias --> v.blk.31.attn_qkv.bias
[vl25][to_gguf_name] vision_model.blocks.31.attn.proj.weight --> v.blk.31.attn_out.weight
[vl25][to_gguf_name] vision_model.blocks.31.attn.proj.bias --> v.blk.31.attn_out.bias
[vl25][to_gguf_name] vision_model.blocks.31.mlp.gate_proj.weight --> v.blk.31.ffn_gate.weight
[vl25][to_gguf_name] vision_model.blocks.31.mlp.gate_proj.bias --> v.blk.31.ffn_gate.bias
[vl25][to_gguf_name] vision_model.blocks.31.mlp.up_proj.weight --> v.blk.31.ffn_up.weight
[vl25][to_gguf_name] vision_model.blocks.31.mlp.up_proj.bias --> v.blk.31.ffn_up.bias
[vl25][to_gguf_name] vision_model.blocks.31.mlp.down_proj.weight --> v.blk.31.ffn_down.weight
[vl25][to_gguf_name] vision_model.blocks.31.mlp.down_proj.bias --> v.blk.31.ffn_down.bias
[vl25][to_gguf_name] merger.mlp.0.weight --> mm.0.weight
[vl25][to_gguf_name] merger.mlp.0.bias --> mm.0.bias
[vl25][to_gguf_name] merger.mlp.2.weight --> mm.2.weight
[vl25][to_gguf_name] merger.mlp.2.bias --> mm.2.bias
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
save model as:  qwen2.5-vl-7b-instruct-vision.gguf

Thanks for looking into this!

Have you specified the correct --model_type

Nope, didn't know, hope that that is it. So it detects the model correctly itself but insists you specify it yourself anyway. No, it doesn't even do that, it just fails without even telling you what you did wrong. Wow.

the python venv is rebuilt when llama is updated, but indeed, I install the versions that llama.cpp explicitly (or implicitly) requests rather than the newest ones. But I don't think that is it.

Update: yup, that was it.

Soon things should no longer be as painful, as they are integrating this into convert_hf_to_gguf in https://github.com/ggml-org/llama.cpp/pull/13209, but it looks like that useless but required --model_type argument will stay.

Update: yup, that was it.

Awesome that you got it working!

Any ideas on GLM-4 and the qwen3-30b models? Just give up?

Qwen is commonly releasing broken models, so that's probably the way to go. As for GLM, the koboldcpp release notes mention it requires a "blasbatchsize of 16" for GLM-4 to be "coherent". Worrying, but probably the model is just bad as well, and the way to go is to only do static quants.

There is now a new "skip_mmproj" token for llmc add - which skips mmproj extraction in static quant jobs.

And just fyi, the new daily models (that I look at) have dropped to ~500 from ~900 in the last few days, and that isn't the normal weekly rhythm. That is very noticeable.

Any ideas on GLM-4 and the qwen3-30b models? Just give up?

Qwen3-30B-A3B failing makes me really worried about Qwen3-235B-A22B, which is the one I'm most looking forward to. I guess I will provide the source GGUF this evening and we will see. It is likely worth waiting a few days before we give up on them, as Qwen3 was just released yesterday and it is what I wished Llama 4 to be, so I'm really excited about it.

GLM-4-9B-0414 is currently a bit of a shit show, so it's not just us having issues with it: https://github.com/ggml-org/llama.cpp/issues/12946. I would say just do only static quants for now and revisit it once its issues are fixed. I just hope we won't forget, as this is one of the models I see myself forgetting about relatively quickly.

There is now a new "skip_mmproj" token for llmc add - which skips mmproj extraction in static quant jobs.

Great. Maybe I will find use for that flag one day but generally having the mmproj for a vision model is important.

And just fyi, the new daily models (that I look at) have dropped to ~500 from ~900 in the last few days, and that isn't the normal weekly rhythm. That is very noticeable.

That is very interesting. Maybe everything is now busy working on Qwen3 models.

Maybe I will find use for that flag one day but generally having the mmproj for a vision model is important.

You will, likely, in the form of the "m" choice I just added to llmc audit, which restarts the job with this flag set. It's quite common for vision models to lack the actual config or weights. In the past, I just nuked.

GLM-4

sad, but I suspect it will be quickly forgotten by everybody.

As for GLM-4, the fp16 shouldn't even be the issue, as we force mmq for imatrix generation.

How is this one working but not the original model?!?

0   61 Josiefied-Qwen3-30B-A3B-abliterated-v1        run/imatrix (GPU-18) 16/48 5.71s/c 7.9/30.3m(18.3-20.8) [120/318] 7.9912

In any case I now queued https://huggingface.co/Qwen/Qwen3-235B-A22B using a manually provided GGUF. Let's see if that one works because I care more about 235B than 30B.

How did llama.cpp break DeepSeek-V2-Lite-Chat - this is my own model and I know it should be supported by llama.cpp - seems like they never tested MLA for DeepSeek-V2-Lite-Chat or something - imatrix failed and if I try to load the GGUF it complains about the chat template:

-2000   31 DeepSeek-V2-Lite-Chat-Uncensored-Unbiased-Reasoner error/1 (GPU-2d) (status: failure)
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
terminate called after throwing an instance of 'std::runtime_error'
  what():  this custom template is not supported
Aborted

Regarding the README.md update project, today I was able to confirm my theory that the update date shown by HuggingFace is just the date of the last git commit and so can easily be spoofed. I had to do so myself, as I accidentally uploaded my model into an already existing model repository. I also realized that HuggingFace offers a nice UI to delete LFS files.
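For the record, a backdated commit is all it takes; a sketch using git's standard date environment variables (the repo path and date are made up):

import os
import subprocess

# GIT_AUTHOR_DATE / GIT_COMMITTER_DATE are standard git environment variables;
# HuggingFace's "Updated" field then shows whatever the last commit claims.
env = dict(os.environ,
           GIT_AUTHOR_DATE="2024-01-01T12:00:00",
           GIT_COMMITTER_DATE="2024-01-01T12:00:00")
subprocess.run(
    ["git", "commit", "--allow-empty", "-m", "backdate the update timestamp"],
    cwd="/path/to/local-model-repo", env=env, check=True)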

That budget check is a bit stupid and thinks it needs to download the model despite it already being downloaded and softlinked:

-7000  471 si Qwen3-235B-A22B                              budget/hfd/1294G

I set Qwen3-235B-A22B.force and Qwen3-235B-A22B.hfd-success and it still didn't start, but llmc status proved useful and showed more information than the status website:

Qwen3-235B-A22B: no gguf file found
 remote error from '10.28.1.6', propagated at /llmjob/share/bin/llmjob line 538, <GEN11> line 3.
        main::__ANON__() called at /llmjob/share/bin/llmjob line 2131
rsh: Broken pipe

I see what is going on. I forgot that I need to specify an absolute source path for symbolic links, and because it pointed at itself it didn't show up as a broken symbolic link in ls. So all my fault, and your system is awesome as always.

nico1 /tmp/quant# cat Qwen3-235B-A22B.gguf
cat: Qwen3-235B-A22B.gguf: Too many levels of symbolic links
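For next time, creating the link from an absolute, resolved source path avoids the self-referencing loop (paths below are just placeholders):

import os

src = os.path.realpath("/tmp/downloads/Qwen3-235B-A22B.gguf")   # placeholder source
dst = "/tmp/quant/Qwen3-235B-A22B.gguf"

# A relative source is interpreted relative to the link's own directory, which is
# how a link can end up pointing at itself; an absolute path sidesteps that.
os.symlink(src, dst)
assert os.path.realpath(dst) == src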

It worked now! :D

-7000  471 si Qwen3-235B-A22B                              ready/static

How is this one working but not the original model?!?

Maybe it hasn't reached that point yet. Or maybe the changed weights somehow work. We always knew censorship is bad for performance.

I also realized that HuggingFace offers a nice UI to delete LFS files.

It does? What happens to the git history if you do?

It worked now! :D

Yes, but did you do anything else? Because pavement7bv1 failed in the weirdest way:

what():  filesystem error: status: Too many levels of symbolic links [/tmp/quant/Qwen3-235B-A22B.gguf]

I tend to download models to test to /tmp, and llama crashes when there is a non-utf-8-encoded filename in /tmp (the cwd) for example (why do they keep doing this shit). Maybe it also crashes when there is a symlink loop. That would explain it.

How is this one working but not the original model?!?

        1   62  I Josiefied-Qwen3-30B-A3B-abliterated-v1       error/47 8/24,IQ2_M [515/579]

Why did you even worry :)

@mradermacher Why are nico1 quantisation tasks stuck on hfu-drain since noon? I think something broke as this is not normal at all. Also Llama_3.x_70b_Tristar_V2.1 is stuck at run/hfd (from nico2) for over a day now, as it decided to do the transfer from nico2 to nico1 at the exact time we automatically shut down nico2.

I rebooted nico1 as I saw a ton of stuck upload tasks. Let's see if that fixed it.

It seems like a reboot partially fixed it. Llama_3.x_70b_Tristar_V2.1 is now blocked/paused and probably corrupted. But Qwen3-235B-A22B Q3_K_L hfu is now actually uploading something and quantisation tasks continued instead of being blocked by hfu-drain for no reason.

We now have vision support for Mistral Small 3.1 (Mistral3ForConditionalGeneration) on latest llama.cpp using convert_hf_to_gguf.py.
So once you implement convert_hf_to_gguf.py mmproj extraction and update to latest llama.cpp, we should do it for https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503, which will resolve https://huggingface.co/mradermacher/Mistral-Small-3.1-24B-Instruct-2503-i1-GGUF/discussions/1

Why are nico1 quantisation tasks stuck on hfu-drain since noon?

I don't know. If you have a specific model, I can look at the upload log. But hfu-drain means there are too many active uploads - how many uploads did you see?

now actually uploading something and quantisation tasks continued instead of being blocked by hfu-drain for no reason.

A reboot obviously would fix it, as it kills all uploads. It also requires a lot of manual cleanup for any uploads after the model is done, so... be less trigger happy and let me have a look next time.

Also Llama_3.x_70b_Tristar_V2.1

Yeah, there was no code dealing with that, but it should not get stuck but time out. I've added some code, but it will stay a race condition.

So once you implement convert_hf_to_gguf.py mmproj extraction and update to latest llama.cpp

I am experimenting, but it doesn't work so well. It claims to support Llava*, but all models fail complaining about an unsupported model_type (e.g. siglip_vision_model or clip_vision_model). Any ideas?

    if self.hparams["model_type"] == "pixtral":

Great. Seems when they say "llava" they mean pixtral and pixtral only.

The good news, for a change, is that mmproj extraction via convert_hf seems to honor use_temp_file, so can be attempted on other hosts than nico1.

python3 2206468 root    3u   REG  0,158 4414730240   4476380 /tmp/tmp9z8l0hha (deleted)

Update: well, it copies everything to a temp file, then reads in all vision tensors anyway. Still better than loading the whole model (the vision parts are usually just a few GB), but this is almost malicious.

llama updated and mistral mmproj extraction seems to work fine. what a surprise.

As for killing possibly stuck hfu's, the python process is safe to kill (and then should take perl with it. The other way round will probably leave a python process hanging). The sh process will then retry. I could probably create a new process group name for those, so you can use ikill. Not sure about the name though, hfufu-MODEL? :)

# ils
hfu-Planetoid_27B_V.2.Q6_K.gguf 1069924 1069935 1069973
# ps v 1069924 1069935 1069973
    PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
1069924 ?        SN     0:00      0    74  2505  1076  0.0 /bin/sh /llmjob/share/bin/hfu Planetoid_27B_V.2-GGUF --includ
1069935 ?        SN     0:00      0  1616 16943 15536  0.0 /usr/bin/perl /llmjob/share/bin/llmjob hf-upload-folder Plane
1069973 ?        SNl    0:39      0  2770 120925 39224  0.0 /llmjob/share/python/bin/python3 -c #!/usr/bin/python3  impo

I don't know. If you have a specific model, I can look at the upload log. But hfu-drain means there are too many active uploads - how many uploads did you see?

The status page only showed 3 uploads running for 3 different models and there was zero upload bandwidth usage. One of the stuck uploads was Qwen3-235B-A22B Q3_K_L hfu. What was strange is that when I tried to pause nico1 before rebooting, it showed so many running tasks that it had to ... shorten the PIDs it displayed despite nothing really running, so there might have been many more stuck hfu tasks running than the ones it showed on the status page.

A reboot obviously would fix it, as it kills all uploads. It also requires a lot of manual cleanup for any uploads after the model is done, so... be less trigger happy and let me have a look next time.

There was nothing running as it was all either hfu-drain or timeofday blocked. I even ran the pause script until it only waited for stuck hfu tasks and then disabled the host before rebooting so it should have only killed the stuck hfu tasks and the faulty nico2 to nico1 transfer. I probably should have waited, but I had already waited like 9 hours without the issue getting fixed. Then again, I also didn't really inform you, as I was too busy with work and you usually monitor the status page. I will try to notify you more quickly next time.

Yeah, there was no code dealing with that, but it should not get stuck but timeout. I've added some code, but it will stay a race condition.

Great! Thanks a lot. It was admittedly super rare, as nico2 to nico1 transfers are really quick and thus unlikely to happen at the exact time it turns off, but I'm still glad that you added code to cover this case.

Great. Seems when they say "llava" they mean pixtral and pixtral only.

That's kind of stupid on their part, but it makes some sense as they all share the same vision code. I really don't get why they insist on having the user specify the model_type instead of automatically deducing it. The architecture alone should tell them what vision model_type to use.

Update: well, it copies everything to a temp file, then reads in all vision tensors anyway. Still better than loading the whole model (the vision parts are usually just a few GB), but this is almost malicious.

It's at least an improvement over how they did it before.

llama updated and mistral mmproj extraction seems to work fine. what a surprise.

Thanks a lot for implementing that! I'm super excited to try it out. Thanks a lot for your effort!

As for killing possibly stuck hfu's, the python process is safe to kill (and then should take perl with it. The other way round will probably leave a python process hanging). The sh process will then retry. I could probably create a new process group name for those, so you can use ikill. Not sure about the name though, hfufu-MODEL? :)

That would be useful, as the best I currently can do with ikill is ikill -9 llmjob. This does kill the uploads but likely causes a bigger mess than pausing, disabling and rebooting. I guess I could just do something like killall -9 python in the meantime.

Qwen3-235B-A22B

[ 311/1131]          blk.25.ffn_down_exps.weight - [ 1536,  4096,   128,     1], type =    f16, converting to q2_K .. ggml_validate_row_data: found nan value at block 23
ggml_validate_row_data: found nan value at block 53
ggml_validate_row_data: found nan value at block 41
ggml_validate_row_data: found nan value at block 59
ggml_validate_row_data: found nan value at block 65
ggml_validate_row_data: found nan value at block 65
llama_model_quantize: failed to quantize: quantized data validation failed
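
For context, the validation that trips here conceptually just scans the freshly quantized data for non-finite values. A rough illustration of that idea (not the actual ggml_validate_row_data code):

#include <cmath>
#include <cstdio>
#include <vector>

// Conceptual stand-in for the validation step: flag any block that contains
// a non-finite value. The real check runs on the quantized block data.
static bool validate_row(const std::vector<float> &row, size_t block_size) {
    bool ok = true;
    for (size_t i = 0; i < row.size(); ++i) {
        if (!std::isfinite(row[i])) {
            std::fprintf(stderr, "found nan/inf value at block %zu\n", i / block_size);
            ok = false;
        }
    }
    return ok;
}

int main() {
    std::vector<float> row(1024, 0.5f);
    row[600] = std::nanf("");   // inject a NaN for the demo
    if (!validate_row(row, 16))
        std::fprintf(stderr, "failed to quantize: quantized data validation failed\n");
}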

Can we just skip Q2_K_S? I can confirm that all our Qwen3-235B-A22B i1 quants are working perfectly fine, so that must just be llama.cpp being stupid.

I will try to create a custom llama.cpp build with https://github.com/ggml-org/llama.cpp/pull/11773 merged and retry the Qwen3-235B-A22B Q2_K_S quants, because I really hate it when quantisation fails because of llama.cpp bugs.

hfu-drain simply counts the process groups matching ^hfu-.*.([^.]+)\.gguf\z

nico1 has a limit of 19 concurrent uploads, and I am relatively confident the counting is not broken, so I would find it very surprising if there hadn't been that many uploads.
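
To make the drain rule concrete, here is a rough C++ sketch of the counting - not the actual code (which lives on the perl/shell side); the sample group names are taken from the ils output shown in this thread, the limit of 19 is nico1's, and the Perl \z anchor is written as $:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    // pattern from above, with the Perl \z end-of-string anchor written as $
    const std::regex hfu_re("^hfu-.*.([^.]+)\\.gguf$");
    const int limit = 19;   // nico1's concurrent-upload limit

    // stand-in for the real process-group listing (names as shown by ils)
    const std::vector<std::string> groups = {
        "hfu-Planetoid_27B_V.2.Q6_K.gguf",
        "llmjob-Qwen-3-32B-Medical-Reasoning",
    };

    int uploads = 0;
    for (const auto &g : groups)
        if (std::regex_search(g, hfu_re))
            ++uploads;

    std::cout << uploads << " uploads, " << (uploads >= limit ? "hfu-drain" : "ok") << "\n";
}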

nothing really running, so there might have been many more stuck hfu tasks running than the ones it showed on the status page.

Likely. But keep in mind that jobs being stuck is just a guess at this point, not an established fact. For example, back recently had a 30GB upload that took two days(!) due to the many, many retries.

There was nothing running as it was all either hfu-drain or timeofday blocked. I even ran the pause script until it only waited for stuck hfu tasks and then disabled the host before rebooting so it should have only killed the stuck hfu tasks and the faulty nico2 to nico1 transfer.

The problem is interrupting the uploads when they have no more job holder that would retry them, as killing all three processes in the upload also kills the one controlling retries.

Notifying me can be hard, but yes, it's even harder if I am not notified at all. And sometimes I am asleep, too :)

The architecture alone should tell them what vision model_type to use.

The architecture is not chosen by llama.cpp, so they can't implement that; they have to work with the architectures given. Since Llava* maps to many vision model types, there needs to be a distinction elsewhere. But maybe they could deduce it automatically somehow.

Thanks a lot for implementing that! I'm super excited to try it out. Thanks a lot for your effort!

Strictly speaking I wasn't even implementing it, it was just a result of convert_hf_to_gguf.py now being a generic way for mmproj extraction (well, for a minority of vision models, but you have to start somewhere).

That would be useful, as the best I currently can do with ikill is ikill -9 llmjob

That should not kill uploads at all but only the quantize/noquant/download jobs (and also cause files that need to be cleaned up manually, as the deletion of the quant file on disk is not the task of the hfu part, so after a successful hfu the files will stay). The best you could do is wait. Or you can now use the method I provided - both would preclude improving things, or investigating things, though. You could also strace or debug the python process to see what it does (e.g. if it hangs in a retry sleep or in the event loop, etc.).

And/or run netstat to see if a tcp connection exists - in the past, a very common reason for hangs on nico1 was that it had a tcp connection open, but the other side didn't, and python doesn't have any timeouts for that case. I use an LD_PRELOAD wrapper around socket() that forcibly enables keepalive, which seems to have worked around this problem, but that is of course sensitive to how the python code works, so might not work in all cases.
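
For reference, such a wrapper is conceptually just this - a minimal sketch, not the exact one in use, and the probe timings below are made-up values:

// build: g++ -shared -fPIC -O2 -o keepalive.so keepalive.cpp -ldl
// use:   LD_PRELOAD=./keepalive.so <command>
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// noexcept matches glibc's C++ declaration of socket()
extern "C" int socket(int domain, int type, int protocol) noexcept {
    using socket_fn = int (*)(int, int, int);
    static socket_fn real_socket =
        reinterpret_cast<socket_fn>(dlsym(RTLD_NEXT, "socket"));

    int fd = real_socket(domain, type, protocol);
    // mask out SOCK_NONBLOCK/SOCK_CLOEXEC before checking the socket type;
    // the TCP options simply fail harmlessly on non-TCP stream sockets
    if (fd >= 0 && (type & 0xff) == SOCK_STREAM) {
        int on = 1, idle = 60, intvl = 10, cnt = 6;   // made-up probe timings
        setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof on);
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof idle);
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl);
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof cnt);
    }
    return fd;
}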

There are also logs in /tmp/hfu-log with the output and any error messages, not sure if I ever mentioned that.

Overall, the best, especially with rare issues, is to investigate and see what is going wrong. The rarer an issue is, the more important it gets to not just kill, but investigate.

Well, next time :)

I'm not complaining about you making the wrong decision - where wood is planed, shavings fall :)

The hf-upload-folder llmjob group is now called hf-upload-folder-"$repo,$2" - I don't know what this will show up as, actually :)

Update: hf-upload-folder-Qwen-3-32B-Medical-Reasoning-i1-GGUF,Qwen-3-32B-Medical-Reasoning.i1-IQ2_XS.gguf* 1310425 1310507

Hmm.. right, I don't actually have the quant name anymore in hfu. Well, should be good enough. We all know how to quote shell globs :)

Update 2: changed the name slightly to use "," instead of "-" to separate args.

Update 3: or maybe just use the glob arg for the name? Details matter?

Update 4: I think I overdid it. hope it helps. Doesn't cost much, fortunately, other than visual overload in ils.

Thanks a lot. It looks really cool.

nico1 ~# ils
hf-upload-folder-Anathema-V8-LLaMA-70B-i1-GGUF,Anathema-V8-LLaMA-70B.i1-Q5_K_M.gguf* 1379902 1379907
hf-upload-folder-Qwen-3-32B-Medical-Reasoning-i1-GGUF,Qwen-3-32B-Medical-Reasoning.i1-Q5_K_S.gguf* 1369770 1369776
hfu 1369766 1369770 1369776 1379898 1379902 1379907
hfu-Anathema-V8-LLaMA-70B-i1-GGUF 1379898 1379902 1379907
hfu-Anathema-V8-LLaMA-70B.i1-Q5_K_M.gguf 1379898 1379902 1379907
hfu-Qwen-3-32B-Medical-Reasoning-i1-GGUF 1369766 1369770 1369776
hfu-Qwen-3-32B-Medical-Reasoning.i1-Q5_K_S.gguf 1369766 1369770 1369776
llmjob 1369764 1379896
llmjob-Anathema-V8-LLaMA-70B 1379896
llmjob-Anathema-V8-LLaMA-70B-imatrix 1379896
llmjob-Qwen-3-32B-Medical-Reasoning 1369764
llmjob-Qwen-3-32B-Medical-Reasoning-imatrix 1369764
llmwrap-Anathema-V8-LLaMA-70B-imatrix 1379896
llmwrap-Qwen-3-32B-Medical-Reasoning-imatrix 1369764

Regarding Qwen, I think we had the same issues with the older Qwen A* models.

Regarding Qwen, I think we had the same issues with the older Qwen A* models.

It doesn't seem to crash with 'sl/more-imatrix-nan-fixes' merged
Edit: No, it just did, which was expected

grafik.png

I'm now generating Q2_K_S quants without an imatrix because the llama.cpp developers claim that such errors are often because of NaN's inside the imatrix, but I'm quite confident it's because of bad math. llama.cpp really doesn't want me to try this. It asked me to "Please do not use IQ1_S, IQ1_M, IQ2_S, IQ2_XXS, IQ2_XS or Q2_K_S quantization without an importance matrix" and now keeps printing "Missing importance matrix for tensor XXX in a very low-bit quantization The result will be garbage, so bailing out", but I don't care, as all I want to see is whether there are still NaNs.

@nicoboss when you paste llmc why output, be aware that it can contain private data, such as hf access tokens (either in the job, or when something dumps env variables in the log).

but I don't care, as all I want to see is whether there are still NaNs.

llama.cpp claims all kinds of shit, blaming the model, us, etc.

And indeed, they blamed me multiple times in the past for doing unspeakable things, somehow, then miraculously the problems disappeared in later versions.

However, it can be due to a model weight problem. And certainly due to math errors - or both, when a model contains overflows but still somehow expects to work.

However, I would rather try generating static IQ3* quants - Q2_K_S and lower IQ2 quants are really bad without an imatrix.

As an experiment, of course, feel free. Maybe this model works with un-imatrixed Q2_K_S.

We can do such static quants by editing the job (or submitting it differently - llmc add does not allow that currently).

@nicoboss when you paste llmc why output, be aware that it can contain private data, such as hf access tokens (either in the job, or when something dumps env variables in the log).

Good to know. I usually double check before posting any logs, so I likely would have caught it. I never expected llmc why to actually print tokens. At least when I write software, I usually make sure that secrets never appear in console outputs or logs, because logs are often not treated with a lot of confidentiality. Why would anything just randomly dump all environment variables? They almost always contain some sensitive data.

llama.cpp claims all kinds of shit, blaming the model, us, etc.

I checked some historic NaN ticket and they kept blaming bartowski's imatrix for NaN's. But even if the imatrix is at fault, it is their faulty code that generates one containing NaNs, so it's not bartowski's fault. In that same conversation the llama.cpp developer posted their own imatrix, only to realize it contained NaNs as well.

However, it can be due to a model weight problem. And certainly due to math errors - or both, when a model contains overflows but still somehow expects to work.

It doesn't seem to be a model issue, as generating static Q2_K_S quants worked without any issues and without generating any NaN's. For me the most likely cause is the math where they apply the imatrix, because that code is known to be flawed.

We can do such static quants by editing the job (or submitting it differently - llmc add does not allow that currently).

I recommend we skip i1-Q2_K_S for now and see if any other quants fail. Please edit the task so it continues generating the remaining quants.

However, I would rather try generating static IQ3* quants - Q2_K_S and lower IQ2 quants are really bad without an imatrix.

So bad, in fact, that I had to patch out multiple fatal warnings to have llama.cpp even do it. I think it makes sense to make those warnings non-fatal on our fork as we know what we are doing, but it would still be nice not to miss them, which we would, as we never check logs unless there is an error. I guess the proper approach would be to add another environment variable flag similar to DRYRUN to make them non-fatal so we only skip them when we explicitly tell llama.cpp to do so.
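
Roughly this is the shape I have in mind - only a sketch with an invented placeholder name (LLAMA_ALLOW_MISSING_IMATRIX), not actual llama.cpp code:

#include <cstdio>
#include <cstdlib>

// placeholder name, analogous to our DRYRUN environment flag
static bool allow_missing_imatrix() {
    const char *v = std::getenv("LLAMA_ALLOW_MISSING_IMATRIX");
    return v && *v && *v != '0';
}

// where the quantizer currently bails out, it would instead do:
static void warn_missing_imatrix(const char *tensor_name) {
    std::fprintf(stderr,
        "Missing importance matrix for tensor %s in a very low-bit quantization\n",
        tensor_name);
    if (!allow_missing_imatrix()) {
        std::fprintf(stderr, "The result will be garbage, so bailing out\n");
        std::exit(1);   // today's behaviour: fatal
    }
    // flag set: keep the warning but continue quantizing
}

int main() {
    warn_missing_imatrix("blk.0.ffn_down.weight");   // demo tensor name
    std::puts("continued");
}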

As an experiment, of course, feel free. Maybe this model works with un-imatrixed Q2_K_S.

It works surprisingly well. Kind of a shame they don't let users generate such quants. The model loses some quality but it is super-fast to run and still one of the best models despite being a static Q2_K_S quant. The model is in fact so good that I'm thinking that for models where we get "Missing importance matrix for tensor XXX in a very low-bit quantization The result will be garbage, so bailing out" errors for some tensors, it is absolutely safe to just continue. In this model this warning appeared for every single tensor, yet it is perfectly usable and one would have a hard time even noticing that it is an un-imatrixed Q2_K_S. It definitely is not garbage as the llama.cpp developer claims it to be. Here is an example so you can form your own opinion:

> What is the meaning of life?
<think>
Okay, so I need to figure out the meaning of life. Where do I even start? The user is asking a big question, and I have to provide a thoughtful answer. Let me break it down.

First, I know that the meaning of life is a philosophical question. Different people have different views. Maybe I should consider various perspectives: religious, scientific, personal. But how do I organize this?

Wait, the user might be looking for a comprehensive answer that covers different angles. Let me think about the main theories. There's the religious perspective, like in Christianity, Buddhism, etc., where the meaning might be serving a deity or achieving enlightenment. Then there's the scientific angle, like evolution, the universe, and the idea that life has no inherent meaning, so individuals create their own. Then there are philosophical views like existentialism, where meaning is created through our choices and actions.

Also, some thinkers like Viktor Frankl emphasize finding meaning through purpose, even in suffering. And there's the nihilistic view that life has no objective meaning, which might be a bit too negative. But how do I present these without bias?

I should make sure to explain each perspective clearly, maybe list them with examples. The user might want to know that there's no single answer, so highlighting the diversity of answers is important. Also, include some personal reflection on how people find their own meaning through relationships, goals, etc.

Wait, but the user might be struggling with this question personally. They might need reassurance that it's a complex question without a single answer, and that exploring different viewpoints can help. Including a bit of encouragement to find their own meaning could be useful.

Let me check if there are other perspectives. Oh, right, some philosophical theories like existentialism, absurdism (Camus), and others. Also, the idea that meaning comes from connection with others, contributing to society, or leaving a legacy.

I should structure the answer by outlining major perspectives, mention key thinkers, and then perhaps a conclusion that it's a personal journey. Make sure to note that there's no one answer, and it's up to each person to find what works for them.

Is there anything I'm missing? Maybe cultural perspectives beyond Western ones. For example, in some Indigenous cultures, the meaning of life might be tied to community, nature, and spiritual practices. Including that adds depth.

Also, the question might be influenced by the current state of the world, like existential risks (climate change, AI), which might impact how people think about meaning. But that's more tangential.

Wait, but the original question is straightforward. So the answer should focus on explaining the different viewpoints without too much extra fluff. Need to be concise but thorough.

Let me outline the sections:

1. Introduction: Acknowledge the complexity and that there's no single answer.
2. Religious Perspectives: Christianity, Buddhism, others.
3. Scientific Perspective: Evolutionary, universe's indifference.
4. Existentialist View: Creating one's own meaning.
5. Nihilism and responses to it.
6. Psychological and personal meaning (Frank1, finding purpose).
7. Cultural variations.
8. Conclusion: Encourage exploration and personal reflection.

Make sure each section is clear and examples are given. Also, avoid any bias towards one perspective, present them as possibilities.

Now, write the answer in a way that's accessible, not too academic, but informative. Use examples like Camus, Viktor Frankl's work. Mention that people might find meaning in different ways, like through helping others, creating art, relationships, etc.

Check for any possible mistakes: Is existentialism correctly attributed? Yes. Nihilism vs. existentialism. Differentiate between creating meaning vs. absence of it.

I think that's a solid approach. Now, time to put it all together into a coherent answer.
</think>

The meaning of life is a profound and multifaceted question that has inspired countless perspectives. Here’s a synthesis of the major viewpoints to guide your exploration:

### 1. **Religious and Spiritual Perspectives**  
   - **Abrahamic Traditions (Christianity, Judaism, Islam):** The purpose of life often centers on serving a divine will, cultivating faith, and living ethically to attain an afterlife or divine union.  
   - **Eastern Philosophies:** In Buddhism, the goal is to achieve enlightenment (nirvana) by overcoming suffering through mindfulness and detachment. Hinduism emphasizes fulfilling one’s dharma (duty) to progress spiritually across lifetimes.  
   - **Indigenous Cultures:** Many traditions tie meaning to harmony with nature, community, and spiritual practices, emphasizing interconnectedness and ancestral legacy.

### 2. **Scientific and Existential Perspectives**  
   - **Cosmic Indifference:** From a scientific standpoint (e.g., Carl Sagan, Neil deGrasse Tyson), the universe is neutral and life inherently has no predefined meaning. This aligns with the **existentialist** view (Jean-Paul Sartre, Albert Camus) that we are "condemned to be free"—we create our purpose through choices.  
   - **Nihilism:** Suggests life has no objective meaning, a perspective countered by **existential resilience**, where individuals craft meaning through love, art, or activism despite existential voids.

### 3. **Psychological and Personal Approaches**  
   - **Viktor Frankl’s Logotherapy:** In *Man’s Search for Meaning*, Frankl argues that finding purpose (even in suffering) is key to resilience. His work highlights how meaning can emerge from relationships, work, or facing challenges.  
   - **Modern Psychology:** Suggests meaning arises from self-actualization, relationships, purposeful goals, and contributing to societal well-being (e.g., Martin Seligman’s PERMA model).

### 4. **Cultural and Philosophical Variations**  
   - **Absurdism (Camus):** Life’s lack of inherent meaning necessitates embracing the "absurd" and finding joy in the struggle.  
   - **Stoicism:** Emphasizes living in accordance with nature and reason, finding contentment in one’s role within the cosmos.  
   - **Postmodernism:** Argues that meaning is fluid, shaped by individual experiences and societal narratives.

### 5. **Personal Reflection and Pathways**  
   - **Relationships:** Many find purpose in family, friendships, and community.  
   - **Legacy and Contribution:** Acts like mentoring, creating art, or addressing global challenges (e.g., climate action) can provide meaning.  
   - **Self-Discovery:** Journeys of personal growth, mindfulness practices, or embracing curiosity may also illuminate purpose.

### Conclusion  
The answer to life’s meaning is deeply personal. It may emerge from blending cultural, spiritual, and existential insights with your own values and experiences. As Camus wrote, “The struggle itself … is enough to fill a man’s heart.” The pursuit of meaning, rather than a fixed answer, becomes the journey itself—a continuous act of creation and reflection. What resonates with you?

uuuh, we have actual besteffort models in the queue, exciting :-)

Why would anything just randomly dump all environment variables?

For debugging, to check parameters, or so you can see how the model was queued when you look at the log.

It's not so much the log that concerns me, but the JOB: line, which contains all the job parameters.

They almost always contain some sensitive data.

As do most logs... :)

I checked some historic NaN ticket and they kept blaming bartowski's imatrix for NaN's.

Yeah. But it's a recurring pattern.

I think it makes sense to make those warnings non-fatal on our fork as we know

I think these warnings are mostly warranted. The quality really does get useless. But I am not opposed to making them optional.

I guess the proper approach would be to add another environment variable flag similar to DRYRUN to make them non-fatal so we only skip them when we explicitly tell llama.cpp to do so.

That would be harmless, indeed, and likely will not generate many conflicts.

The model loses some quality but it is super-fast to run and still one of the best models despite being a static Q2_K_S quant

I suspect Q2_K_S is the best of the lot, but it's still surprisingly good. Extremely surprisingly good, really.

It definitely is not garbage as the llama.cpp developer claims it to be. Here is an example so you can form your own opinion:

Definitely not the garbage I experienced with this when the errors were still warnings in the past. Maybe it was other bugs that made them so bad. Or our model is hardly typical...

It definitely is not garbage as the llama.cpp developer claims it to be.

We should generate the Q2_K_S (I'd just upload yours, but it's missing the metadata, making me feel bad) and also test out IQ3*. Maybe even IQ2*. Maybe even IQ1* just for the lulz.

I have my suspicion that the static IQ3* might have been "fixed", too. And I think we should also test it with some other models. Maybe we could have a "likely static-only" model, where we generate some more quants. Maybe for bigger models.

As for quants, I have a list of quants for when this happens:

     "iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S",

Chances are almost 1 (but not exactly 1, exceptions have been seen :) that if IQ* or Q2_K_S fails like this, the other low-bit quants will also fail. So I will simply skip those and just generate the above list.

Interestingly enough, my "all squants" list is this:

     squants "f16 bf16 Q8_0 IQ3_S Q4_K_S IQ3_M Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ3_XS IQ4_XS Q2_K Q4_0 Q4_1 Q5_0 Q5_1 IQ4_NL Q2_K_S Q4_0_4_4 Q4_0_4_8 Q4_0_8_8" \

so Q2_K_S refusal seems to be newer than IQ* refusal.

Llama-3_1-Nemotron-Ultra-253B-v1 support finally got merged after over 3 weeks so time to update llama.cpp and finally do Llama-3_1-Nemotron-Ultra-253B-v1.

I really want to update the Nvidia GPU drivers soon. https://github.com/axolotl-ai-cloud/axolotl/issues/2409 is driving me crazy, as I now have to reboot the entire server every time I want to launch axolotl. I assume NVidia has fixed this garbage in their latest driver, or many more axolotl users would be affected by this. I for sure had to reboot like 5 times today because of this. Would upgrading to Nvidia driver 570.144 work for you?

On some very bright news: I managed to fix the Intel GPU kernel crash that caused the entire server to crash every time I rebooted or shut down my Windows VM. All that was required to fix this garbage was clearing Intel’s broken reset method before starting a VM to which I want to assign those GPUs, using a pre-start hook containing:

echo > /sys/bus/pci/devices/0000:04:00.0/reset_method
echo > /sys/bus/pci/devices/0000:05:00.0/reset_method
echo > /sys/bus/pci/devices/0000:c3:00.0/reset_method
echo > /sys/bus/pci/devices/0000:c4:00.0/reset_method

Thanks to this all the Intel GPU related crashes should be gone.

Would upgrading to Nvidia driver 570.144 work for you?

Hmm, nvidia has no debian packages for this version (yet). It has 575.51.03-1, 570.133.20-1, 570.124.06-1, 570.86.15-1...

So that would be rather inconvenient. I wouldn't even know where I would get matching cuda packages.

nvidia's list of packages is here: https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/

Thanks to this all the Intel GPU related crashes should be gone.

So obvious, why would you even complain :*)

Llama-3_1-Nemotron-Ultra-253B-v1 support finally got merged

Holy shit, and the whole directory structure has changed. This will not be fun. I'll think about how I can do this upgrade without freezing everything.

Hmm, nvidia has no debian packages for this version (yet). It has 575.51.03-1, 570.133.20-1, 570.124.06-1, 570.86.15-1...

Let's go with 575.51.03 as it is the first driver supporting CUDA 12.9. All older drivers only support CUDA 12.8 and older. Fun fact: NVidia doesn't even list this driver anywhere on their website, but I can get it using the CUDA Toolkit 12.9 installer. What time would work best for you for the driver update?

So obvious, why would you even complain :*)

So obvious that it took me 1.5 months to figure out why it causes a kernel crash. The stack trace and any logs were totally useless. The fun thing was also that the kernel crash happened even when no intel driver was using the GPUs, as I modprobe blacklisted all of them. This itself was a massive pain as, unlike any normal driver, Intel bypasses modprobe blacklists in their driver, so you have to instead write this into modprobe, which was nowhere documented:

install i915 /bin/false 
install xe /bin/false

Holy shit, and the whole directory structure has changed. This will not be fun. I'll think about how I can do this upgrade without freezing everything.

That explains why I had to manually resolve merge conflicts for the second time ever.

Let's go with 575.51.03 What time would work best for you for the driver update?

Fine with me, although it's probably not a stable release. 19:00+ is probably best, because the activity is lowest. Otherwise, mornings or night. Even right now is fine.

Llama-3_1-Nemotron-Ultra-253B-v1 support finally got merged

But has anybody tried it before merging? Anyways, it failed with status 135 (probably a SIGBUS, nee OOM) - are you running something else on nico1? If not, it's not working well.

I forgot to mention that my latest version can now also perform the mmproj extraction for Qwen2/2.5VL using convert_hf_to_gguf.py thanks to https://github.com/ggml-org/llama.cpp/pull/13209, so examples/llava/qwen2_vl_surgery.py no longer exists.

This itself was a massive pain as, unlike any normal driver, Intel bypasses modprobe blacklists in their driver, so you have to instead write this into modprobe, which was nowhere documented:

Are you sure this is not based on the wrong assumption that blacklist is supposed to prevent module loading? It does no such thing (and imho, makes blacklist a bad option name). I would be surprised if intel did anything differently than other kernel drivers w.r.t. module loading.

I forgot to mention that my latest version can now also perform the mmproj extraction for Qwen2/2.5VL using convert_hf_to_gguf.py

That's so useful to know :)

Oh, before you update the driver, simply disable nico1 imatrix gen. That is probably what you have done, but I mention it just to be sure :=) That way, a few hours' delay of me picking it up when I am busy is not an issue.

Are you sure this is not based on the wrong assumption that blacklist is supposed to prevent module loading? It does no such thing (and imho, makes blacklist a bad option name). I would be surprised if intel did anything differently than other kernel drivers w.r.t. module loading.

I just looked at https://wiki.debian.org/KernelModuleBlacklisting

Disable automatic loading of Linux kernel driver modules. Note that this procedure does not prevent another process from requesting a module addition during boot or runtime. 
As mentioned in the comment headers in /etc/modprobe.d/blacklist.conf adding modules there ...does not affect autoloading of modules by the Linux kernel.; this means that to blacklist a driver such as ipv6 you must do a fake install

I see. So modprobe blacklisting only disables automatic kernel module loading, but if the kernel itself autoloads a module then it gets bypassed. This explains why blacklisting works for nvidia and all other kernel modules I installed manually, but not for the Intel GPU drivers, as the kernel part of their drivers is already pre-included in the Linux kernel. They even mentioned in the documentation that I have to do that fake install thing, which then was the only thing that finally worked. I actually don't care whether the module gets loaded, but whether the driver is controlling the GPU. When I do lspci -vv I did not want to see them listed under Kernel driver in use for the intel GPUs. But the thing crashed despite this as intel’s default PCIe reset method is broken so not that it really mattered. But interesting to know why things behave the way they did.

Llama-3_1-Nemotron-Ultra-253B-v1 support finally got merged

Or maybe it was because a tensor didn't fit into the memory limits? Trying with 64GB

But the thing crashed despite this as intel’s default PCIe reset method is broken so not that it really mattered.

Obviously it should never ever crash the kernel... But yeah, I am surprised intel still makes linux drivers. Maybe we have to thank the AI craze for that :() Or valve? Both? Nvidia? uh-oh.

Oh, before you update the driver, simply disable nico1 imatrix gen. That is probably what you have done, but I mention it just to be sure :=) That way, a few hours' delay of me picking it up when I am busy is not an issue.

No worries, I will have to reboot the host anyways to install the new GPU driver. I have now started pausing nico1 and will update the driver and reboot as soon as it is done pausing. You will need to update the user-level driver on your side before unpausing imatrix.nico1 tasks after the reboot.

Or maybe it was because a tensor didn't fit into the memory limits? Trying with 64GB

It worked for me a few weeks ago, so no idea why it failed. I even created nicoboss/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF back then. Try with more RAM and if it still doesn't work I will investigate.

It errored again and the error is quite strange:

CONVERT hf
INFO:hf-to-gguf:Loading model: Llama-3_1-Nemotron-Ultra-253B-v1
WARNING:hf-to-gguf:Failed to load model config from Llama-3_1-Nemotron-Ultra-253B-v1: Loading Llama-3_1-Nemotron-Ultra-253B-v1 requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
WARNING:hf-to-gguf:Trying to load config.json instead
INFO:hf-to-gguf:Model architecture: DeciLMForCausalLM
WARNING:hf-to-gguf:Failed to load model config from Llama-3_1-Nemotron-Ultra-253B-v1: Loading Llama-3_1-Nemotron-Ultra-253B-v1 requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
WARNING:hf-to-gguf:Trying to load config.json instead
(...)
INFO:hf-to-gguf:gguf: loading model part 'model-00021-of-00049.safetensors'
INFO:hf-to-gguf:blk.68.ffn_down.weight,       torch.bfloat16 --> F16, shape = {53248, 16384}
INFO:hf-to-gguf:blk.69.ffn_down.weight,       torch.bfloat16 --> F16, shape = {53248, 16384}
yes: standard output: Broken pipe
job finished, status 135
job-done<0 Llama-3_1-Nemotron-Ultra-253B-v1 noquant 135>

It errored again and the error is quite strange:

Still SIGBUS. Does it try to load every tensor into memory? I'll temporarily bump it up to 400G (systemd units, whatever that is)

After some struggling I successfully installed the NVidia 575.51.03 driver and am currently rebooting StormPeak.

@mradermacher StormPeak successfully rebooted and loaded the new Nvidia driver. The GPUs are attached to your LXC container and should be usable as soon as you install the userland drivers. Please unpause it once you are done.

The NVidia CUDA Toolkit installer was total garbage and lacked the option to sign the NVidia driver using my secure boot keys, so it obviously simply failed to load. Luckily this installer had a single useful command line option, ./cuda_12.9.0_575.51.03_linux.run --extract=/root/cuda_12.9.0_575.51.03_linux, which extracts all the files contained inside, including the actual driver, into a folder. After that I was then able to manually install the driver using NVIDIA-Linux-x86_64-575.51.03.run, which is a driver not yet available on the NVidia driver page.

@mradermacher WTF! Looks like a file system issue to me.

root@AI:/mradermacher/tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1# sha256sum model-00021-of-00049.safetensors
sha256sum: model-00021-of-00049.safetensors: Input/output error

It was a painfully slow download, but an uneventful apt-get upgrade for me. The question is, will my llama build work with cuda 12.9 out of the box.

@nicoboss btrfs scrub start /mradermacher and see what it outputs (or the kernel log). But yes, that would be a reason for a bus error as well. A rare one.

If it's a checksum error, wow, I've never seen a checksum error (well, I've seen millions, but always explainable - nothing we did on nico1 should cause these).

In any case, just delete the unreadable files (and in general, the .hfd-success file to redo the download). I deleted the .hfd-success file already and put an override flag.

Also, apologies for suspecting poor llama.cpp. This time, they are off the hook. What an exceptional case :-)

Seems cuda 12.9 is compatible with 12.5 binaries (as one would naively expect). I will continue to build with cuda 12.5 for the time being.

root@StormPeak:~# btrfs scrub start /spool/images/108/subvol-108-disk-0.subvol
scrub started on /spool/images/108/subvol-108-disk-0.subvol, fsid 15ba0231-4a65-44c2-a84d-1b8040b9e6d3 (pid=1761413)

It's still going but I already see some errors:

root@StormPeak:~# btrfs scrub status /spool/images/108/subvol-108-disk-0.subvol
UUID:             15ba0231-4a65-44c2-a84d-1b8040b9e6d3
Scrub started:    Sun May  4 00:25:01 2025
Status:           running
Duration:         0:02:21
Time left:        0:03:42
ETA:              Sun May  4 00:31:06 2025
Total to scrub:   2.06TiB
Bytes scrubbed:   821.10GiB  (38.84%)
Rate:             5.82GiB/s
Error summary:    read=256
  Corrected:      225
  Uncorrectable:  31
  Unverified:     0

The btrfs scrub is done now. It warned with ERROR: there are uncorrectable errors. Here is the full output.

(base) root@StormPeak:~# btrfs scrub status /spool/images/108/subvol-108-disk-0.subvol
UUID:             15ba0231-4a65-44c2-a84d-1b8040b9e6d3
Scrub started:    Sun May  4 00:25:01 2025
Status:           finished
Duration:         0:06:56
Total to scrub:   1.94TiB
Rate:             5.13GiB/s
Error summary:    read=256
  Corrected:      225
  Uncorrectable:  31
  Unverified:     0

Is there any way to see which files are affected by those uncorrectable errors? How can it even correct any errors on a RAID 0?

Wow, that was painless (but 575 was the series where they fixed the debian dependencies)

Oh, dmesg shows quite a lot of useful information:

grafik.png

Is there any way to see which files are affected by those uncorrectable errors?

dmesg will show some. Unfortunately, btrfs rate-limits these messages. I once reported this to the btrfs mailing list as not optimal, but was ignored :) Last time I looked there was no configurable knob for it, and when you want to patch it out of the kernel, it's actually like 5+ places which rate limit such messages.

But you have relatively few errors, so it should be fine.

Correctable is probably all metadata, while hopefully all uncorrectable ones are in file data only.

That's quite troubling. Do you still trust your hardware? Because I wouldn't now. All joking aside, this should not happen, no matter how crappy you think the samsung disks are.

Unless... samsung still lies about FUA and what you see is the result of a disk claiming it has flushed data to stable storage and then later reverted it. You'd probably get transaction id or so mismatches. And this would then be the result of raid'ing those two disks - I have no experience with samsung disks in raid.

Well, I trust the rest of your hardware again - the fault is clearly inside nvme4 and not elsewhere. Obviously, this should never happen, unless the flash itself is faulty.

Anyway, delete the files mentioned, and run a scrub again. With luck, all will be well.

Eh, delete the files if they are inside nemotron. If you find more important files, such as OS-files, delete and make a list please, of course :->

Here are the dmesg results of the entire scrub:

[ 3822.087279] BTRFS info (device nvme4n1): scrub: started on devid 1
[ 3822.087397] BTRFS info (device nvme4n1): scrub: started on devid 2
[ 3839.769983] nvme4n1: I/O Cmd(0x2) @ LBA 182204160, 256 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3839.779680] critical medium error, dev nvme4n1, sector 182204160 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 3
[ 3839.792684] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.804292] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.815843] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.827381] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.838928] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.850465] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.861997] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.873525] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.885058] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.896599] BTRFS error (device nvme4n1): fixed up error at logical 248132519657472 on dev /dev/nvme4n1 physical 93288595456
[ 3839.941784] nvme4n1: I/O Cmd(0x2) @ LBA 182210304, 256 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3839.951418] critical medium error, dev nvme4n1, sector 182210304 op 0x0:(READ) flags 0x0 phys_seg 28 prio class 3
[ 3840.126607] nvme4n1: I/O Cmd(0x2) @ LBA 182216448, 256 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3840.136300] critical medium error, dev nvme4n1, sector 182216448 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 3
[ 3840.307372] nvme4n1: I/O Cmd(0x2) @ LBA 182204248, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3840.316903] critical medium error, dev nvme4n1, sector 182204248 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3840.488400] nvme4n1: I/O Cmd(0x2) @ LBA 182210392, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3840.497928] critical medium error, dev nvme4n1, sector 182210392 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3840.668850] nvme4n1: I/O Cmd(0x2) @ LBA 182216536, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3840.678382] critical medium error, dev nvme4n1, sector 182216536 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3840.853455] nvme4n1: I/O Cmd(0x2) @ LBA 182204256, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3840.862967] critical medium error, dev nvme4n1, sector 182204256 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3841.035300] nvme4n1: I/O Cmd(0x2) @ LBA 182210400, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3841.044832] critical medium error, dev nvme4n1, sector 182210400 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3841.216589] nvme4n1: I/O Cmd(0x2) @ LBA 182216544, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3841.226098] critical medium error, dev nvme4n1, sector 182216544 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3841.398039] nvme4n1: I/O Cmd(0x2) @ LBA 182204264, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3841.407568] critical medium error, dev nvme4n1, sector 182204264 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3841.945708] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132519526400 on dev /dev/nvme4n1 physical 93288529920
[ 3841.959241] BTRFS warning (device nvme4n1): i/o error at logical 248132519526400 on dev /dev/nvme4n1, physical 93288529920, root 256, inode 10959203, offset 2855403520, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3841.984429] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132519526400 on dev /dev/nvme4n1 physical 93288529920
[ 3841.998867] BTRFS warning (device nvme4n1): i/o error at logical 248132519526400 on dev /dev/nvme4n1, physical 93288529920, root 256, inode 10959203, offset 2855403520, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3842.024931] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132519526400 on dev /dev/nvme4n1 physical 93288529920
[ 3842.039484] BTRFS warning (device nvme4n1): i/o error at logical 248132519526400 on dev /dev/nvme4n1, physical 93288529920, root 256, inode 10959203, offset 2855403520, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3842.064839] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132519526400 on dev /dev/nvme4n1 physical 93288529920
[ 3842.079191] BTRFS warning (device nvme4n1): i/o error at logical 248132519526400 on dev /dev/nvme4n1, physical 93288529920, root 256, inode 10959203, offset 2855403520, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3842.124828] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132525817856 on dev /dev/nvme4n1 physical 93291675648
[ 3842.139213] BTRFS warning (device nvme4n1): i/o error at logical 248132525817856 on dev /dev/nvme4n1, physical 93291675648, root 256, inode 10959203, offset 2863398912, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3842.164682] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132525817856 on dev /dev/nvme4n1 physical 93291675648
[ 3842.178877] BTRFS warning (device nvme4n1): i/o error at logical 248132525817856 on dev /dev/nvme4n1, physical 93291675648, root 256, inode 10959203, offset 2863398912, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3842.204552] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132525817856 on dev /dev/nvme4n1 physical 93291675648
[ 3842.219125] BTRFS warning (device nvme4n1): i/o error at logical 248132525817856 on dev /dev/nvme4n1, physical 93291675648, root 256, inode 10959203, offset 2863398912, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3842.244748] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132525817856 on dev /dev/nvme4n1 physical 93291675648
[ 3842.259106] BTRFS warning (device nvme4n1): i/o error at logical 248132525817856 on dev /dev/nvme4n1, physical 93291675648, root 256, inode 10959203, offset 2863398912, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3842.308018] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132532109312 on dev /dev/nvme4n1 physical 93294821376
[ 3842.321762] BTRFS warning (device nvme4n1): i/o error at logical 248132532109312 on dev /dev/nvme4n1, physical 93294821376, root 256, inode 10959203, offset 2871525376, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3842.347383] BTRFS error (device nvme4n1): unable to fixup (regular) error at logical 248132532109312 on dev /dev/nvme4n1 physical 93294821376
[ 3842.362222] BTRFS warning (device nvme4n1): i/o error at logical 248132532109312 on dev /dev/nvme4n1, physical 93294821376, root 256, inode 10959203, offset 2871525376, length 4096, links 1 (path: tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors)
[ 3844.909526] nvme_log_error: 17 callbacks suppressed
[ 3844.909531] nvme4n1: I/O Cmd(0x2) @ LBA 182234968, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3844.924676] blk_print_req_error: 17 callbacks suppressed
[ 3844.924678] critical medium error, dev nvme4n1, sector 182234968 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3845.096261] nvme4n1: I/O Cmd(0x2) @ LBA 182241112, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3845.105910] critical medium error, dev nvme4n1, sector 182241112 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3845.276924] nvme4n1: I/O Cmd(0x2) @ LBA 182247256, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3845.286582] critical medium error, dev nvme4n1, sector 182247256 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3845.438387] nvme4n1: I/O Cmd(0x2) @ LBA 182234976, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3845.448027] critical medium error, dev nvme4n1, sector 182234976 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3845.596155] nvme4n1: I/O Cmd(0x2) @ LBA 182241120, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3845.605806] critical medium error, dev nvme4n1, sector 182241120 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3845.753134] nvme4n1: I/O Cmd(0x2) @ LBA 182247264, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3845.762791] critical medium error, dev nvme4n1, sector 182247264 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3845.910226] nvme4n1: I/O Cmd(0x2) @ LBA 182234984, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3845.919858] critical medium error, dev nvme4n1, sector 182234984 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3846.067922] nvme4n1: I/O Cmd(0x2) @ LBA 182241128, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3846.077562] critical medium error, dev nvme4n1, sector 182241128 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3846.224498] nvme4n1: I/O Cmd(0x2) @ LBA 182247272, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3846.234080] critical medium error, dev nvme4n1, sector 182247272 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3846.382230] nvme4n1: I/O Cmd(0x2) @ LBA 182234992, 8 blocks, I/O Error (sct 0x2 / sc 0x81) MORE DNR 
[ 3846.391736] critical medium error, dev nvme4n1, sector 182234992 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 3846.402861] scrub_stripe_report_errors: 179 callbacks suppressed
[ 3846.402888] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 3846.421068] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 3846.432725] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 3846.444365] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 3846.456007] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 3846.467622] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 3846.479882] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 3846.491575] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 3846.503203] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 3846.514815] BTRFS error (device nvme4n1): fixed up error at logical 248132550983680 on dev /dev/nvme4n1 physical 93304258560
[ 4238.136797] BTRFS info (device nvme4n1): scrub: finished on devid 2 with status: 0
[ 4238.180171] BTRFS info (device nvme4n1): scrub: finished on devid 1 with status: 0

Our btrfs layout:

root@StormPeak:~# btrfs filesystem show
Label: 'spool'  uuid: 15ba0231-4a65-44c2-a84d-1b8040b9e6d3
        Total devices 2 FS bytes used 1.99TiB
        devid    1 size 1.86TiB used 1.26TiB path /dev/nvme4n1
        devid    2 size 1.86TiB used 1.26TiB path /dev/nvme7n1

/tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1/model-00021-of-00049.safetensors seems to be the only file affected.

S.M.A.R.T of /dev/nvme4n1 (the one that caused our current issues):

grafik.png

S.M.A.R.T of /dev/nvme7n1 (the "good" one):

grafik.png

Here are the dmesg results of the entire scrub:

Sans actual warnings with the paths.

Anyway, if you delete the file and then the scrub works, we are fine to go. You can reset the error counters with btrfs dev stats -z.

Holy shit, they are already using spare? After <40% usage? Both??? :)

To be fair, both have many "Media and Data Integrity Errors", which is the main reason we are using those SSDs for nico1, and I would almost be happy about them eventually running out of write cycles. SAMSUNG MZVL22T0HBLB-00B00 disks suffer from severe bit rot which causes data to get corrupted if it is not read for a while. I'm actually quite surprised how long we went without encountering bit rot. The only reason we even did is because we just kept Llama-3_1-Nemotron-Ultra-253B-v1 sitting around for an entire month without reading it, which heavily increases the likelihood of bit rot getting severe enough to no longer be correctable using the SSD-internal automatic error correction. This is not a BTRFS but a bit rot issue. What a relief to know. I’m quite happy with how well BTRFS deals with bit rot.

SAMSUNG MZVL22T0HBLB-00B00 disks suffer from severe bit rot which causes data to get corrupted

I have a hard time accepting that - if that were true, we would have heard about it, and it would certainly affect a lot of users quickly (most data is not touched often). Wait, I have a deja vu, maybe we talked about it already. Still, wouldn't that be a simple case of a warranty defect and replacement? Clearly the disks are broken.

The only reason we even did is because we just kept Llama-3_1-Nemotron-Ultra-253B-v1 sitting around for an entire month without reading it, which heavily increases the likelihood of bit rot getting severe enough to no longer be correctable

But we have lots of OS files that are not used for many months. And metadata. And log file data. And none of those are affected, as the scrub shows. Anyway, a weekly scrub would also help (btrfs scrub start -c3), because that would read all data. Don't you already do that? I vaguely remember you added a cron job.

Anyway, I restarted nemotron.

I deleted model-00021-of-00049.safetensors and all our problems went away:

(base) root@StormPeak:~# btrfs scrub status /spool/images/108/subvol-108-disk-0.subvol
UUID:             15ba0231-4a65-44c2-a84d-1b8040b9e6d3
Scrub started:    Sun May  4 01:14:50 2025
Status:           finished
Duration:         0:06:03
Total to scrub:   2.01TiB
Rate:             6.15GiB/s
Error summary:    no errors found
[ 6810.736022] BTRFS info (device nvme4n1): scrub: started on devid 2
[ 6810.736053] BTRFS info (device nvme4n1): scrub: started on devid 1
[ 7173.897906] BTRFS info (device nvme4n1): scrub: finished on devid 2 with status: 0
[ 7174.187275] BTRFS info (device nvme4n1): scrub: finished on devid 1 with status: 0

BTW, most of your queued models that fail at the imatrix gen stage fail because of the assertion that is not in llama-server, and does not seem to be an actual issue for the model.

Basically, I just go back in my history and hit return on this command (yes, I am still trying to nudge you into action in a form that you can safely ignore :)

for i in *.slog; do grep -q GGML_ASSERT.*llama_vocab_get_add_eos "$i" && l nukeall "$i";done
-2000  506 Llama-3_1-Nemotron-Ultra-253B-v1              blocked/budget

It should be safe to run this on nico1 anyways, despite it thinking there is not enough budget, after you pause the quantisation tasks. There is currently nothing besides your container running on StormPeak.

let's do so then

You think there is enough memory for a quant job?

MemFree: 7003500 kB
MemAvailable: 454645028 kB
Buffers: 24 kB
Cached: 449808756 kB
SwapCached: 0 kB
Active: 107158420 kB
Inactive: 345717012 kB
Active(anon): 2056404 kB
Inactive(anon): 1158136 kB
Active(file): 105102016 kB
Inactive(file): 344558876 kB

There would have been enough memory, but -ngl 6 unfortunately exceeded the GPU memory, so computation is taking twice as long and CUDA unified memory is taking away 24 GiB of RAM for GPU memory swapping that we would have needed. But it’s all fine as it is already over halfway done. It’s just going slower than if we had chosen to offload fewer layers. Our layer offload formula is perfect for the vast majority of models, but Llama-3_1-Nemotron-Ultra-253B-v1 is special.

Oh shit, seems like we lost one of the RTX 4090 GPUs. I'd better hope this is not because of the new GPU drivers...

So we lost it due to whatever this is:

[83494.573576] NVRM: GPU 0000:c5:00.0: RmInitAdapter failed! (0x62:0x51:2584)
[83494.591242] NVRM: GPU 0000:c5:00.0: rm_init_adapter failed, device minor number 0
[83494.889076] NVRM: GPU 0000:c5:00.0: RmInitAdapter failed! (0x62:0x51:2584)
[83494.907695] NVRM: GPU 0000:c5:00.0: rm_init_adapter failed, device minor number 0
[83624.555379] NVRM: GPU 0000:c5:00.0: RmInitAdapter failed! (0x62:0x51:2584)
[83624.571245] NVRM: GPU 0000:c5:00.0: rm_init_adapter failed, device minor number 0
[83700.215891] NVRM: GPU 0000:c5:00.0: RmInitAdapter failed! (0x22:0x51:884)
[83700.232341] NVRM: GPU 0000:c5:00.0: rm_init_adapter failed, device minor number 0

Wow, it is now back again, but the imatrix task already started, so it is too late. I guess I will reboot the host and let's hope that this was a one-time thing.

At least we didn't lose the nemotron one. OTOH, what the fuck's going on with Mellum, it shouldn't take more than a few minutes, let alone three hours.

Hmmm

7]11.1033,[108]11.1341,[109]11.2641,[110]11.3499,[111]11.4552,[112]11.6055,[113]11.7437,[114]11.7471,[115]11.7921,[116]11.6353,[117]11.6344,[118]11.6147,[119]11.5795,[120]11.5593,[121]11.6234,[122]11.6877,[123]11.6848,[124]11.7064,[125]11.7269,Connection to 10.28.1.6 closed by remote host.

At least we didn't lose the nemotron one. OTOH, what the fuck's going on with Mellum, it shouldn't take more than a few minutes, let alone three hours.

Exactly what I mentioned above. It started during the few minutes when the GPU disappeared. I have just rebooted the host, hopefully fixing any future GPU issues.

7]11.1033,[108]11.1341,[109]11.2641,[110]11.3499,[111]11.4552,[112]11.6055,[113]11.7437,[114]11.7471,[115]11.7921,[116]11.6353,[117]11.6344,[118]11.6147,[119]11.5795,[120]11.5593,[121]11.6234,[122]11.6877,[123]11.6848,[124]11.7064,[125]11.7269,Connection to 10.28.1.6 closed by remote host.

Please just restart it. I got tired of waiting 3.5 hours for it to complete on CPU when it only takes like 2 minutes using a GPU, so I just proceeded with disabling nico1 and rebooting. I still wish there would some day be a command for me to restart imatrix tasks myself.

Ahh... yeah, interrupting it was definitely the way to go.

The tiny-random models are typically just random weights for testing. Maybe you should filter them out.

As you have seen, I besteffort queued a lot of models today as I saw many workers sitting idle. I again went through all the trending and medical models, but this time I also checked the most downloaded and most liked models.

The tiny-random models are typically just random weights for testing. Maybe you should filter them out.

Great idea, I will do so in the future. For https://huggingface.co/katuni4ka/tiny-random-baichuan2 I just saw it had 95,845 downloads last month, which is quite an insane number.
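
A simple name-based filter for the queueing step could look roughly like this Python sketch (the marker substrings are just my guess at what to match, not an exhaustive list, and the real queue tooling is not Python):

# Hedged sketch of such a filter: skip obviously synthetic test repos before queueing.
SKIP_MARKERS = ("tiny-random", "tiny-dummy", "random-weights")

def should_queue(model_id: str) -> bool:
    name = model_id.lower()
    return not any(marker in name for marker in SKIP_MARKERS)

print(should_queue("katuni4ka/tiny-random-baichuan2"))  # -> False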

It's used for automatic testing, most likely

Nice, so the status page froze for over an hour, and executing llmc restart-llmstatusd && llmc push twice fixed it. It was the first time I used restart-llmstatusd, and it was a huge success.

twice. uh.

well, now for something completely different - i switched off nico2, and disabled the wake-up cronjob. I think it is nice to have a fallback, but I don't see an urgent need for it very soon, so it would make sense to repurpose it for whatever you want again, without being able to switch it on or off.

@mradermacher The scheduler has already been softlocked for 4 hours. I tried restarting the status page but it didn't work, as this time it's a true softlock.

I had a chance to look at it - it's the poor network connection between rich1 and kaos, and the timeouts for llmc are very high by default. I decreased the timeout. Additionally, the local scheduler grabs the lock, but then does llmc push, needlessly holding the lock at that point. I changed both, and hope this will work better. At least only the job itself will then hang.

help! rich1 disk is full, because 1.5TB of disk space has gone missing unexplained. no quanting possible

@nicoboss I also notice that in the last few days, I get a lot of connection failures to nico1, i.e. TCP connections get closed

help! rich1 disk is full, because 1.5TB of disk space has gone missing unexplained. no quanting possible

@RichardErkhov will fix the issue as soon as he is home. He already acknowledged it over Discord.

@nicoboss I also notice that in the last few days, I get a lot of connection failures to nico1, i.e. TCP connections get closed

That's strange. I haven't changed anything with my internet and have not rebooted StormPeak in the past 3 days. I don't really experience any internet issues myself. Should the issues persist, please let me know and I will reboot my router. It has already uploaded 800 TB and downloaded 250 TB since I last rebooted it 68 days ago.

PyTorch finally prevents arbitrary code execution with default settings when unpickling. Great for security, but annoying if you want to resume finetuning from an intermediate checkpoint:

[rank1]:   File "/apool/axolotl/venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 3067, in _get_all_zero_checkpoint_state_dicts
[rank1]:     _state = self.checkpoint_engine.load(
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apool/axolotl/venv/lib/python3.11/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 28, in load
[rank1]:     partition = torch.load(path, map_location=map_location)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apool/axolotl/venv/lib/python3.11/site-packages/torch/serialization.py", line 1524, in load
[rank1]:     raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
[rank1]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
[rank1]:        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
[rank1]:        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
[rank1]:        WeightsUnpickler error: Unsupported global: GLOBAL deepspeed.runtime.fp16.loss_scaler.LossScaler was not an allowed global by default. Please use `torch.serialization.add_safe_globals([deepspeed.runtime.fp16.loss_scaler.LossScaler])` or the `torch.serialization.safe_globals([deepspeed.runtime.fp16.loss_scaler.LossScaler])` context manager to allowlist this global if you trust this class/function.

Luckily I could just nano +28 /apool/axolotl/venv/lib/python3.11/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py and make it load anyway.
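
For reference, the allowlist route the error message itself suggests would avoid patching the installed file; something like this before resuming should work for checkpoints you trust (the checkpoint may reference more classes than LossScaler, in which case the list has to grow):

# Hedged alternative to editing torch_checkpoint_engine.py: allowlist the class
# the checkpoint pickles. Only do this for checkpoints from a trusted source.
import torch
from deepspeed.runtime.fp16.loss_scaler import LossScaler

torch.serialization.add_safe_globals([LossScaler])
# torch.load(..., weights_only=True) - the PyTorch 2.6 default - can now
# unpickle this class; any other blocked global needs to be added the same way.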

Any idea why I see so many idle and ready/hfd and ready/noquant on https://hf.tst.eu/status.html? There is plenty of work from the math and coding models I queued earlier today, but the workers seem to simply refuse to work.

Because they don't like working without rich1 lmao. When rich1 goes down, everything goes down

@mradermacher @nicoboss drive fixed. For some reason the container decided to grow to the entire 1TB limit, and I have no idea why. I did have a limit, which is why the container grew only to 1TB and not the whole space, but for some reason inside the container it showed up as 200GB, not 1TB, which is weird. I recreated the disk, so now it should be fine.


Somehow the token went missing, too. Will fix. Unfortunately, somehow the failure to create the repo didn't cause a job failure, so this is actual data loss.

Somehow the token went missing, too. Will fix. Unfortunately, somehow the failure to create the repo didn't cause a job failure, so this is actual data loss.

Do we know which models are affected by this so we can retry them?

I did check a few models manually and fortunately found no model without a README. The regular check will tell us in a few days (I "just" have to remember to look at the log). With luck, I was confused - there shouldn't be actual quant uploads without a repo having been created, but clearly, there are. So I'm not sure what exactly happened. But maybe the upload code will magically "fix" it.
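
If affected models do turn up, a quick scripted check along these lines could flag them (Python sketch with placeholder repo names; the actual regular check works differently):

# Hedged sketch: flag uploaded repos that are missing a README.md (model card).
from huggingface_hub import HfApi

api = HfApi()
for repo in ["mradermacher/SomeModel-GGUF", "mradermacher/SomeModel-i1-GGUF"]:
    if not api.repo_exists(repo):
        print(f"{repo}: repo was never created")
    elif not api.file_exists(repo, "README.md"):
        print(f"{repo}: quants uploaded but no README.md")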

The jobs being ready but not running is a different issue. No clue what it would be - obviously the scheduler didn't run. You can "llmc push" in such a case and see if that helps.

The only changes were to enable tighter 600s timeouts when llmc push is done after a job, and not hold the slave lock. Obviously, neither should cause this, unless the network failed for all jobs and the push got lost. But then, there is also a once-per-hour cron-based push.

Ah, right, the "holding the lock temporarily" change is the culprit, of course, because it stores the lock in a local variable (so it can go out of scope), but the job scheduler insists that a lock is held globally, otherwise it refuses, for "safety" reasons. Well, better to not clobber the json file than to run jobs, but it clobbers the state instead. Well, time for a good old "local"ised variable, that will do the job as well and satisfy my safety checks.
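
In Python terms (the real scheduler code is not Python), the idea is roughly this: the safety check only accepts a lock that is visible through a global, so bind it there for the duration and restore the previous value afterwards, much like a dynamically scoped variable:

# Rough illustration only, not the actual scheduler code.
import contextlib

CURRENT_LOCK = None  # the global that the "safety" check inspects

@contextlib.contextmanager
def hold_lock_globally(lock):
    """Expose the lock via the global while work runs, then restore the old value."""
    global CURRENT_LOCK
    prev, CURRENT_LOCK = CURRENT_LOCK, lock
    try:
        yield lock
    finally:
        CURRENT_LOCK = prev  # the lock can be released afterwards without tripping the check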

The jobs being ready but not running is a different issue. No clue what it would be - obviously the scheduler didn't run. You can "llmc push" in such a case and see if that helps.

That is exactly what I thought as well and did try llmc push many times but it didn't make any difference.

Then we have yet another unidentified issue - the problem I introduced was simply that jobs did not successfully do llmc push at the end. But that seems to have been it.

Please update to the latest llama.cpp so we can do some now-supported/fixed models and finally try MLA. Here are the most important changes:

Features:

  • Support for InternVL 3 38B and 78B mmproj extraction
  • Support for InternVL 2.5 and 3 including mmproj extraction
  • Support for ByteDance-Seed/Seed-Coder
  • qwen2/3moe/bailingmoe: set yarn metadata if present when creating GGUF from SafeTensors

Bug fixes:

  • Fix of crash with partial offloading of MoE
  • Fix of race conditions in FlashAttention kernels

imatrix changes:

  • --parse-special for enabling parsing of special tokens in imatrix calculation

special tokens? Like... ? That would sound interesting, but would it make a difference (I could create a new imatrix dataset with eos tokens between fragments, but it would likely cost a lot of tokens). I know you don't know either if that would make sense :=)

Any idea on how we could create a list of models to redo?

Update: updated

special tokens? Like... ? That would sound interesting, but would it make a difference (I could create a new imatrix dataset with eos tokens between fragments, but it would likely cost a lot of tokens). I know you don't know either if that would make sense :=)

Such tokens are also architecture-specific. Let's wait for now and see what bartowski does with it. If he starts using special tokens, I will do some tests and see if it makes any real-world difference.
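
If it ever looks worth testing, building such a dataset could be as simple as this sketch (assuming fragments are separated by blank lines, with "</s>" only a placeholder for the model's actual EOS string, to be fed to llama-imatrix together with --parse-special):

# Hedged sketch: join imatrix training fragments with an explicit EOS token string.
EOS = "</s>"  # placeholder; the real token text is model-specific

with open("imatrix-fragments.txt") as src, open("imatrix-with-eos.txt", "w") as dst:
    fragments = [f.strip() for f in src.read().split("\n\n") if f.strip()]
    dst.write(EOS.join(fragments))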

Any idea on how we could create a list of models to redo?

  • Once you add mmproj extraction we should do all of the InternVL models. I don't think we have done any of them yet, so there are no models to redo.
  • The Seed-Coder models I will mass queue now, as they are finally supported. They weren't before, so no models to redo. Edit: Done. They are queued!
  • Regarding the qwen2/3moe/bailingmoe yarn metadata, I don't think it's worth redoing them, as llama.cpp offers a command line argument to force yarn, which as far as I'm aware works even if the yarn metadata is missing. In https://github.com/ggml-org/llama.cpp/pull/13312 the author mentions that ./llama-cli -m Ling-Coder-lite.gguf -c 16384 --rope-scaling yarn --rope-scale 4 can be used for old GGUFs. There also apparently is gguf_editor_gui.py hosted under https://huggingface.co/spaces/CISCai/gguf-editor to update old GGUFs.

Update: updated

Awesome! Thanks a lot.

@mradermacher Can you please enable mmproj extraction for InternVL? It is natively integrated in convert_hf_to_gguf.py as documented in https://github.com/ngxson/llama.cpp/blob/master/tools/mtmd/README.md, so adding support for it should be relatively simple. I just did a test run and can confirm that, other than mmproj extraction, InternVL works perfectly fine: https://huggingface.co/mradermacher/InternVL3-1B-GGUF

In theory, I could parse the output of convert_hf... --print and always use all, as the usage should be arch-agnostic (probably by providing a separate helper tool that could also be used when queueing. Funny how just writing it down can make you come up with a reasonable design). Right now, it's simply a glob in quantize. But I do like the fact that it forces me to think about the supported models... :)

Anyway, it's added, I will requeue that model to test.

Hmm, convert_hf_to_gguf.py supports InternVisionModel, while InternVL3-1B has an arch of InternVLChatModel, so it's apparently not supported by llama.cpp at the moment. I'm too afraid to wonder if that was even tested upstream :) Clicking through a few InternVL models I couldn't even find one with InternVisionModel. Feels like a bug.
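
A quick way to see what a repo actually declares, and hence why it gets rejected, is to look at the architectures field in its config.json (Python sketch; the repo id is just an example):

# Hedged sketch: print the architecture names a HF repo declares in config.json,
# which is what convert_hf_to_gguf.py dispatches on.
import json
from huggingface_hub import hf_hub_download

def declared_architectures(repo_id: str) -> list[str]:
    path = hf_hub_download(repo_id, "config.json")
    with open(path) as f:
        return json.load(f).get("architectures", [])

print(declared_architectures("OpenGVLab/InternVL3-1B"))  # e.g. ['InternVLChatModel']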
