Join the force

#426
by RichardErkhov - opened

Hello @mradermacher, as you've noticed, we have been competing on the number of models for quite a while. So instead of competing, want to join forces? I talked to @nicoboss, he is up for it, and I have my quant server ready for you with 2 big bananas (E5-2697A v4), 64 GB of RAM, and a 10 Gbps line!

Well, "take what I have" and "join forces" are not exactly the same thing. When we talked last about it, I realised we were doing very different things and thought diversity is good, especially when I actually saw what models you quantize and how :) BTW, I am far from beating your amount of models (remember, I have roughly two repos per model, so you have twice the amount), and wasn't in the business of competing, as it was clear I couldn't :)

But of course, I won't say no to such an offer, especially not at this moment (if you have seen my queue recently...).

So how do we go about it? Nico runs some virtualisation solution, and we decided on a Linux container to be able to access his graphics cards, but since direct hardware access is not a concern here, a more traditional VM would probably be the simplest option. I could give you an image, or you could create a VM with Debian 12/bookworm and my SSH key on it (Nico can just copy the authorized_keys file).

Or, if you have any other ideas, let's talk.

Oh, and how much disk space are you willing to give me? :)

Otherwise, welcome to team mradermacher. Really should have called it something else in the beginning.

Ah, and as for network access, I only need some port to reach SSH, and to be able to get a tunnel out (WireGuard, UDP). Having a random port go to the VM's SSH port and forwarding UDP port 7103 to the same VM port would be ideal. I can help with all that, and am open to alternative arrangements, but I have total trust in you that you can figure everything out :)
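
(Purely for illustration - if the VM ends up NATed behind the host, the forwarding boils down to something like this; the external port 22022, the interface name and the VM address are made-up placeholders:)

```
# Forward a random high TCP port (here 22022) on the host to the VM's SSH port:
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 22022 \
  -j DNAT --to-destination 192.168.1.50:22
# Forward UDP 7103 (WireGuard) straight through to the same port on the VM:
iptables -t nat -A PREROUTING -i eth0 -p udp --dport 7103 \
  -j DNAT --to-destination 192.168.1.50:7103
# Let the forwarded traffic pass and NAT the replies on the way back out:
iptables -A FORWARD -d 192.168.1.50 -j ACCEPT
iptables -t nat -A POSTROUTING -s 192.168.1.50 -o eth0 -j MASQUERADE
```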

No worries, I will help him set up everything infrastructure-wise. He has already successfully created a Debian 12 LXC container. While a VM might be easier, those few percent of lost performance bother me, but if you prefer a VM I can also help him with that.

LXC sits perfectly well with me.

this brings me joy

@mradermacher Your new server "richard1" is ready. Make sure to abuse the internet as hard as you can. Details were provided by email by @nicoboss, so please check it as soon as you can.

Oh, and how much disk space are you willing to give me? :)

2 TB of SSD, as this is all he has. Some resources are currently still in use by his own quantize tasks but should be freed by tomorrow once the models that are currently being processed are done, but just start your own tasks as soon as the container is ready. He is also running a satellite imagery data processing project for me for the next few weeks, but its resource usage will be minimal. Just go all in and try to use as many resources as you can on this server. For his quantization tasks he usually runs 10 models in parallel and uses an increased number of connections to download them in order to optimally make use of all resources available.

I'm on it. Wow, load average of 700 :)

I'll probably allocate something like 600GB + 600GB temp space and see how that works.

The lack of zero-copy support in ZFS will unfortunately compound space issues :)

Awesome to hear that it all worked and you already started with it!

I'll probably allocate something like 600GB + 600GB temp space and see how that works.

For now 600+600 sounds great. Once his models are done tomorrow you should be able to use even more storage.

The lack of zero-copy support in ZFS will unfortunately compound space issues :)

We can consider reformatting the SSD using btrfs. I will discuss this with him tomorrow. He was using gguf-split for his models, so ZFS was a great fit for his use case, but now btrfs would be a better choice. If we decide to reformat, we likely have to let the queue on his node run dry, or there will not be enough storage to temporarily move the LXC container to the relatively small boot disk.

An option would be to loop-mount a partition image (which is probably supported on zfs). I am wary of asking for a full reformat just for this :) It would also result in a hard quota, which might or might not be good.

I also will finally have to sort out the nico1-has-no-access issue - right now, nico1 cheats and directly takes the imatrix from local storage, but that won't work with rich1 :)

An option would be to loop-mount a partition image (which is probably supported on zfs). I am wary of asking for a full reformat just for this :) It would also result in a hard quota, which might or might not be good.

rich1 is a privileged LXC container, so you can mount things yourself if there is an easy workaround to get it working on ZFS. If not, a reformat would likely not require that much effort. Richard reformatted and switched to ZFS for compression around a month ago without much trouble. There is already no disk quota in place, so that is not something we would miss on btrfs. Because this is a privileged container, you could likely even use the btrfs functions to specify which files to compress, if not blocked by AppArmor/SELinux/capabilities.

I also will finally have to sort out the nico1-has-no-access issue - right now, nico1 cheats and directly takes the imatrix from local storage, but that won't work with rich1 :)

Oh no. If it's not easy to solve, nico1 could just push them all to rich1 to continue the cheating, as Richard is fine with me having access to rich1.

If not, a reformat would likely not require that much effort

Well, that's then probably the path of least resistance. The idea behind "quota" is more to not disturb other processes if things go wrong - space utilisation will be much better when shared.

Although, a file-backed partition image will shrink when fstrim'ed in current Linux kernels (not sure the 5.15 on rich1 qualifies as such, though), so space could be recovered even then.
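
(Roughly what I have in mind, as an untested sketch - the dataset path, mountpoint and size are placeholders, and it assumes the zfs there supports hole punching:)

```
# Create a sparse 600G image on the existing zfs dataset, format it,
# and loop-mount it as my work area.
truncate -s 600G /tank/mradermacher.img
mkfs.ext4 -F /tank/mradermacher.img
mkdir -p /mnt/quant
mount -o loop,discard /tank/mradermacher.img /mnt/quant
# With discard enabled, trimming punches holes back into the image file,
# returning freed space to the host filesystem:
fstrim -v /mnt/quant
```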

If not easy to solve nico1 could just push them all to rich1

Not easy, but certainly solvable. It was an easy design to just share filesystems between all nodes, but it sucks for this kind of application :) The problem is simply that the part that has the job (the local node) is the one that will miss the imatrix file and goes hunting in various (networked) places. I'll have to have it report back to the scheduler, but I have wondrously overdesigned solutions for that already, so it's just a matter of code refactoring (easy), glue code (easy), and doing all that while the machinery is running (I hate it, but it's what I've been doing since February).

Just trying to talk myself into it.

The last remaining issue of this sort is the distribution of the job scheduler itself, which is done via rsync for nico1 (and soon rich1), and costs a few milliseconds on each start.

And then it's kind of nicely distributed, with no need to access over administrative borders.

And once that is done, I want to go back to local scheduling, where local nodes can start the next job autonomously. The only reason they can't anymore is that certain events (such as patching the model card, or communicating imatrix urgency) are not "queued" events and would get lost. Probably easy to fix.

Anyway, rich1 has statically quanted its first quant now, so I am kind of forced to fix the imatrix transfer :)

Ugh, and I'll have to fix the Thanks section, to possibly move it out of every single model card. Too many people involved :)

I wonder if I should somehow generate the model card dynamically on load from some metadata... I wonder if huggingface allows that. I am really annoyed at generating thousands of notifications/uploads for every change.

Hmm, I get a record-breaking 60MBps from rich to nico. And a116162.example.com (Richard's host name) is refreshingly adventurous :-)

Just go all in and try to use as many resources as you can on this server.

Hmm, right now, the performance is... not good, and the quant jobs hardly get any CPU time. But then, at a load avg. of 700, good performance wouldn't necessarily be expected. (Right now, the load avg is 1200 and system time is 40% - what the heck is this poor machine forced to do to spend 40% of its 32 cores in the kernel?)

For his quantization tasks he usually runs 10 models in parallel

Hmm, I would expect them to mostly fight for the cache then.

At the moment, I am wary of running anything larger than a 7B. It feels wrong to make performance even worse :)

On the good news side, it seems to work, imatrix transfers have been "fixed", and rich1 (had to rename it to something shorter to type) is doing its first imatrix quant (https://huggingface.co/mradermacher/Thespis-Mistral-7b-Alpha-v0.7-i1-GGUF).

@RichardErkhov you need to do something. I am staring at stats on your box for a while now. I cannot come up with a reasonable explanation for what I see, other than that poor server is abused to the point of being almost useless. If you don't immediately have a good idea on why it spends 40-50% of the whole time in the kernel, I would suggest it is completely overloaded with many hundreds of processes fighting for cache and cpu.

I get 800% cpu time sometimes for my quants, but the overall progress is lower than my slowest box (which is a 4 core i7-3770, or a 4-core E3-1275 - they are both dog slow).

I bet your server could do almost twice as much work per time if it wasn't this senselessly overloaded. I cannot imagine why anyone would run 10 quants concurrently, making things even worse. I am scared to start quants on this poor thing.

Your box... yes, it is just a lifeless piece of electronics. But it is suffering. You gotta listen to its pain.

Update: 8 million context switches per second. That explains the system time. Clearly, that box doesn't actually do work, it just spins.

Well, that's then probably the path of least resistance.

Richard is fine with switching to BTRFS and I already sent him the steps to do so. We will likely switch tomorrow, so maybe let rich1 run out of models, as he only has like 80 GB of storage on his boot disk. I recommend backing up all your crucial files on rich1 in case something goes wrong.

The idea behind "quota" is more to not disturb other processes if things go wrong - space utilisation will be much better when shared.

Nothing else should run on this SSD in the future, so things should be fine.

Ugh, and I'll have to fix the Thanks section, to possibly move it out of every single model card. Too many people involved :)

People who helped with a specific model should be mentioned in the model card. Just list anyone who contributed to that specific model. On a static model, just mention whoever's node the quantization ran on, and if weighted, mention whoever computed the imatrix and whoever ran the quantization.

I wonder if I should somehow generate the model card dynamically on load from some metadata... I wonder if huggingface allows that. I am really annoyed at generating thousands of notifications/uploads for every change.

I unfortunately don't think something like this is possible on HuggingFace.

Hmm, I get a record-breaking 60MBps from rich to nico. And a116162.example.com (Richard's host name) is refreshingly adventurous :-)

I'm quite impressed, as with HuggingFace he sometimes experienced really slow per-connection speeds.

Hmm, right now, the performance is... not good, and the quant jobs hardly get any CPU time. But then, at a load avg. of 700, good performance wouldn't necessarily be expected. (Right now, the load avg is 1200 and system time is 40% - what the heck is this poor machine forced to do to spend 40% of its 32 cores in the kernel?)
@RichardErkhov you need to do something. I am staring at stats on your box for a while now. I cannot come up with a reasonable explanation for what I see, other than that poor server is abused to the point of being almost useless. If you don't immediately have a good idea on why it spends 40-50% of the whole time in the kernel, I would suggest it is completely overloaded with many hundreds of processes fighting for cache and cpu.
I bet your server could do almost twice as much work per time if it wasn't this senselessly overloaded. I cannot imagine why anyone would run 10 quants concurrently, making things even worse. I am scared to start quants on this poor thing.
Update: 8 million context switches per second. That explains the system time. Clearly, that box doesn't actually do work, it just spins.

We found and fixed the root cause of this. Turns out processing satellite images using 192 processes was a bad idea. Quantizing 10 models in parallel might not be that bad after all, but fewer would likely perform better. This issue also made his own quantization tasks much slower than anticipated, which is why they are currently still running despite no new ones getting queued for over a day.

People who helped with a specific model should be mentioned in the model card. Just list anyone who contributed to that specific model.

I thought about it, but that's going to be too hard, because it means I would have to track every single upload separately in metadata, as it might come from a different box. I already didn't do this (because I forgot) for your quant jobs, and for each individual model guilherme ungated. It's a similar problem for imatrix quants, except we have "luckily" only had one for some time now. And lastly, just because it was quantized on my box doesn't mean I contributed to that model, I think. But I certainly contributed to the project as a whole.

Just think of yourself as part of team mradermacher (I renamed the hf account to reflect reality better :)

I unfortunately don't think something like this is possible on HuggingFace.

Yeah, I hoped they would allow JavaScript - maybe they do, I haven't checked, but of course it would be a security issue. Not that I wouldn't put it past them. Not sure there wouldn't be a way around it, either :/

Alternatively, maybe we could have a very simple model card and a link to an external, more interactive quant chooser that we can update and improve.

I'm quite impressed, as with HuggingFace he sometimes experienced really slow per-connection speeds.

I think that slow speed was because of the server overload. I sometimes get 40MB/s from any server, and usually never more than ~100MB/s per model.

Quantizing 10 models in parallel might not be that bad after all, but fewer would likely perform better.

Not to mention the space required for all ten :)

We found and fixed the root cause of this.

That poor, poor server. I was really reluctant to queue anything on it :)

As for something different, would (both of) you like to queue models as well? I didn't want to push more work on nico (the person), but being the single bottleneck in queueing also doesn't feel so good. And maybe richard still wants to follow his dream(?) of quantizing everything, which might not be impossible if we limit ourselves to static quants. Well, maybe it's just a bit out of reach, but still, one could try.

Regarding the thanks section, maybe just list all the team members including yourself on the model card on a single line, so everyone's contribution to the project as a whole gets the recognition it deserves, while also linking to a page containing a more detailed breakdown of who contributed what resources. Especially for Richard, getting attribution is important to get the support required to continue working on AI-related projects and fund his expensive server, as he is still relatively young. Big thanks for updating the name to "Team mradermacher" - I really appreciate it. Please make it clear in the contribution breakdown that you are still the one contributing by far the most, both time- and resource-wise.

Yeah, I hoped they would allow JavaScript - maybe they do, I haven't checked, but of course it would be a security issue. Not that I wouldn't put it past them. Not sure there wouldn't be a way around it, either :/

No way they allow arbitrary JavaScript in the model card, as it would be intentional stored XSS. I wouldn't be surprised if such a security vulnerability can be found, but as soon as anyone uses it, they will patch it. Even just the CSS injection found in GitHub earlier this year caused massive chaos, so I can only imagine how disastrous arbitrary JavaScript would be.

Alternatively, maybe we could have a very simple model card and a link to an external, more interactive quant chooser that we can update and improve.

I still don't really see the issue with updating all the models. It is not even possible to follow a model on HuggingFace, so nobody gets notified if you update all of them. The only thing that happens is that the "Following" feed gets temporarily flooded, but with the amount of commits you push you can already only see 40 minutes back in time, making it useless anyway. The only really unfortunate thing is that if you go on mradermacher's profile the models are sorted by "Recently updated" by default, which is a pain if someone wants to browse through all our models, but nobody does that with 11687 models anyway. Even worse, if you search for a model on either the HuggingFace global search or the mradermacher-specific model search, it shows the "Updated" date. It ironically does so even if you sort by "Recently created". If we update them once, we might as well update them as often as we like without making things any worse.

An external quant chooser would for sure be great, but the model card still needs to be good, as most will use HuggingFace to search for models. We can link an external model card with additional information, but I'm not sure how many would look at it. I also don't like the idea of decoupling the model card from HuggingFace so much - if we ever stop hosting them, all this information would get lost.

I think that slow speed was because of the server overload. I sometimes get 40MB/s from any server, and usually never more than ~100MB/s per model.

rich1's internet speed looks very decent, especially considering that he is currently also processing and uploading some of his own models. We are currently fully CPU bottlenecked, and the max upload speed in the past few minutes was 1.56 Gbit/s while uploading 2 quants. By the way, so cool that you can now see which quants are getting uploaded on the status page. I also noticed you can now click the models on the status page to get to their HuggingFace page. Thanks a lot for continually improving the status page. It has become a very useful resource for me.

Not to mention the space required for all ten :)

I'm still surprised he manages to run 10 in parallel on a 2 TB SSD without major storage issues.

As for something different, would (both of) you like to queue models as well? I didn't want to push more work on nico (the person), but being the single bottleneck in queueing also doesn't feel so good. And maybe richard still wants to follow his dream(?) of quantizing everything, which might not be impossible if we limit ourselves to static quants. Well, maybe it's just a bit out of reach, but still, one could try.

I would love the ability to queue my own models, and I'm sure Richard would really like being able to do so as well. With the rate you are currently queueing models, I don't think we will have a lack of models anytime soon. There is no way I could get anywhere close to the rate at which you find new models, but I always have some I'm interested in that are not important enough to bother you about. There are also old historic models I only have in GPTQ or static quants that I would queue. I'm really impressed with how well and at what rate you select models. You somehow have gained the ability to determine if a model is any good in a fraction of a second.

As mentioned before, we want to switch the rich1 server from ZFS to BTRFS so it supports zero-copy. For this we need to get the LXC container down to less than 80 GB. Can you please make the scheduler stop scheduling new models to it, so it will slowly empty? I really see no other way to do this, as this server only has the boot disk and this SSD, so the only way to switch without any data loss is if the LXC container fits into the remaining space on the boot disk. I recommend backing up all your crucial files on rich1 in case something goes wrong.

I really see no other way to do this, as this server only has the boot disk and this SSD

I can imagine lots of ways (mostly with help from the network), all of them exciting and complicated :)

Anyway, not an issue at all, no more jobs will be scheduled until you give the ok. Unfortunately, more jobs than expected have been scheduled (it's a bug). Let's see how fast it clears.

I actually have a (non-automatic) backup of nico1 and rich1, so in theory, you could flatten it (but that's likely not helping).

I could move most of the remaining models to e.g. nico1 or another box though. Although I am not in a hurry.

Anyway, not an issue at all, no more jobs will be scheduled until you give the ok. Unfortunately, more jobs than expected have been scheduled (it's a bug). Let's see how fast it clears.

Thanks a lot! I already informed Richard that he can start following my BTRFS migration guide as soon as the queue is empty. I will let you know once the migration is completed.

I actually have a (non-automatic) backup of nico1 and rich1, so in theory, you could flatten it (but that's likely not helping).

Great. You might need it in case anything goes wrong. I don't think anything should, as my guide involves using lxc-clone to clone the container from the SSD to the boot disk before formatting it and cloning it back afterwards, but it is not something I have ever tested myself.
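
(In very rough strokes, and not the literal guide, it boils down to something like the following - device names and paths are placeholders, and I'm writing lxc-copy here, the newer replacement for lxc-clone:)

```
lxc-stop -n rich1
# Copy the container onto the (small) boot disk:
lxc-copy -n rich1 -N rich1-tmp -p /root/lxc-backup
# Reformat the SSD with btrfs (after destroying the old pool) and mount it
# where the containers live:
mkfs.btrfs -f /dev/nvme0n1
mount /dev/nvme0n1 /var/lib/lxc
# Copy the container back and start it again:
lxc-copy -P /root/lxc-backup -n rich1-tmp -N rich1 -p /var/lib/lxc
lxc-start -n rich1
```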

I could move most of the remaining models to e.g. nico1 or another box though. Although I am not in a hurry.

There is no rush so it definitely makes sense to let the current queue empty naturally.

Ah, if I can set compression properties, then it would be best not to enable it with mount options.

Also, I am I/O bound a lot of the time. Fascinating.

I'm still surprised he manages to run 10 in parallel on a 2 TB SSD without major storage issues.

That's called magic =)

Hmm, I get a record-breaking 60MBps from rich to nico. And a116162.example.com (Richard's host name) is refreshingly adventurous :-)

And that's a really good speed, I usually get something like 6-12 Mbps per thread to huggingface, idk what's the issue. We managed to get it to approximately 60 Mbit with Nico, but still not as good as you would expect with a 10 Gbps connection. I would say we are going supersonic speed here, if we think of 10 Gbps as light speed.

few minutes was 1.56 Gbit/s while uploading 2 quants.

just in case you are interested, that's the total traffic for my server: 2024-11-16 3.22 TiB | 1.14 Gbit/s

As for something different, would (both of) you like to queue models as well?

Sure, I have a script which queries all the models by their file size and then sorts them and gives them to you as a list - do you want the list if I manage to make it again? It's just that with the HuggingFace issues it doesn't work (not sure if they fixed it).

Also, I am I/O bound a lot of the time. Fascinating.

that's interesting, this hard drive usually gets over 3 GB/s total speed, and I don't think we work with small files here, so what black magic are you doing there?

and @RichardErkhov first gave us the wondrous FATLLAMA-1.7T, followed by access to his server to quant more models, likely to atone for his sins.

I am dying from laughter

Anyways, I don't know when my script will die, but I guess when your queue is finished I will force it to stop and then reformat the drive.

Below 48GB, and the last two models will be through soon; after that, my container will be quiescent (in case I never told you, you can see the status of "rich1" at http://hf.tst.eu/status.html).

That's called magic =)

Of the close-your-eyes-and-just-walk-through variety?

And that's a really good speed, I usually get something like 6-12 Mbps per thread to huggingface, idk what's the issue.

Actually, so far, I can't really complain about speeds. It's not worse than my 1GBps boxes, and sometimes, far better.

Sure, I have a script which queries all the models by their file size and then sorts them and gives them to you as a list,

I think we should give this serious thought. For example, if we limit ourselves to a few select static quant types, possibly chosen "intelligently" based on model size, I think we should be able to quantize everything. We might want to give them a different branding, i.e. the "erkhov seal of approval" rather than "the average mradermacher", i.e. upload them to a different account (e.g. yours, possibly with a similar naming scheme - your naming scheme is better suited for automatic quantisation :). And likely add some delay, so I have a chance to pick the choice models first to avoid duplicating work.

with the HuggingFace issues it doesn't work

I get 504 timeouts for everything all the time, and yes, it gets worse, then better, then even worse etc. It wasn't a significant enough issue to cause problems.

that's interesting, this hard drive usually gets over 3 GB/s total speed, and I don't think we work with small files here, so what black magic are you doing there?

Just running two llama-quantize processes in parallel. When I cache one source gguf in RAM, I can even get to 100% CPU, so it's clearly I/O limited. For models up to 8B or so, that might actually be an option to do regularly. (Running one quantize doesn't work well with this many cores - due to llama's design, a single quantize will essentially I/O-bottleneck itself.)
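
(Nothing fancy behind the caching, by the way - one way to do it is simply this, with the path made up:)

```
# Pre-warm the page cache so llama-quantize reads the source gguf
# from RAM instead of stalling on the disk; the path is a placeholder.
cat /tmp/quant/SomeModel-7B.gguf > /dev/null
# vmtouch, if installed, can do the same and also report cache residency:
# vmtouch -t /tmp/quant/SomeModel-7B.gguf
```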

I never saw more than 1GBps total. What model is it?

And hey, that's nothing compared to the torture suite you were running :)

I do notice rather high system times when doing I/O (40% when doing 1GB/s), maybe zfs plays a role here. To be honest, I only know zfs from all the cool features and promises they didn't implement and keep. Why anybody would use it for anything in production... sigh young kids nowadays :)

I am dying from laughter

You forgot the smiley - but in any case, if you want it gone, I can change it, just tell me what you want to see, especially in the future when the joke gets too lame :)

just tell me what you want to see, especially in the future when the joke gets too lame :)

The funnier the better =)

Of the close-your-eyes-and-just-walk-through variety?

Spray and pray approach and just general approximation, I got rather fast in math with parameters and stuff

Actually, so far, I can't really complain about speeds. It's not worse than my 1GBps boxes, and sometimes, far better.

idk, it usually doesn't go higher than 10 Mbps for me, I guess issues with the provider and Ubuntu or something

And likely do some delay, so I have a chance to pick the choice models first to avoid duplicating work

I mean, for the models that don't need an imatrix, you can send them to me. This will be better for both of us, because we don't spend time sending the model back and forth for quants, and the server is not sleeping overnight.

When I cache one source gguf in RAM

lol that's why the usage is so high lol. I was like "why is there a chainsaw on my RAM graph? I never get that"

I do notice rather high system times when doing I/O (40% when doing 1GB/s), maybe zfs plays a role here

ZFS is just random, sometimes 40%, sometimes 2%, idk, I guess I'm going to reformat it and we will see

I never saw more than 1GBps total. What model is it?

before ZFS, with my torture suite I managed to get 3GB/s (not gigabit), but it's in my "try to get to mradermacher's model count as fast as you can" mode, meaning I utilize every millisecond of CPU time. My usual rate is like 1-1.2 gbps. This mode goes: 2024-10-07 17.70 TiB | 1.80 Gbit/s

your naming scheme is better suited for automatic quantisation

And much better for automated parsing and searching

sigh young kids nowadays

very young, didn't even graduate yet

idk, it usually doesn't go higher than 10 Mbps for me, I guess issues with the provider and Ubuntu or something

Hmm, I get much better speeds on your box most of the time. Same kernel and same provider...

I mean, for the models that don't need an imatrix, you can send them to me.

I could, in theory, send everything I ignore every day to you, automatically. But it would be more efficient to use the same queuing system - that way, it would take advantage of other nodes. But I slowly start to see that this is not how things are done around here... Anyway, I can start sending lists to you soon as well; I could just make lists of URLs that I didn't quant. I even have a history, so I could send you 30k URLs at once. Or at least, once I am through my second run through "old" models (February till now; I'm in March now and nico has already panicked twice).

"why is there a chainsaw on my RAM graph? I never get that"

Yup, that was likely me.

My usual rate is like 1-1.2 gbps.

It's what I am seeing right now, and that's probably what the disk can do under the current usage pattern.

And much better for automated parsing and searching

Maybe, in this one aspect (name), but a pain in the ass to look at. But I feel both systems make sense: I do not tend to quant models that have non-unique names, and typically, the user name is not stated when people talk about llms on the 'net. While you pretty much need a collision-free system for your goals.

"try to get to mradermacher's model count as fast as you can"

You are stressing yourself out way too much. You lead comfortably - even if mradermacher has more repos, most of them are duplicates. On the other hand, your teensy beginner quant types are not even apples to my tasty oranges, so we can comfortably both win :^]

very young, didn't even graduate yet

Yeah, you haven't seen the wondrous times of 50 people logged in on a 16MB 80386...

BTW, my vm is idle now. And now we can see who is responsible for the high I/O waits (hint, not me :)

Right now, your disk is the bottleneck and it does ~300MBps.

Oh, and linux 5.15 is very very old - ubuntu wouldn't have something slightly newer, such as 6.1 or so?

Anyway, good night, and good luck :)

Sorry that the migration to BTRFS is still not completed. There are some coordination difficulties. Richard seems to be extremely busy, and I need him to perform the steps as only he has access to the server hosting rich1. I'm confident we can perform the migration while you are sleeping.

BTW, my vm is idle now. And now we can see who is responsible for the high I/O waits (hint, not me :)

That was so expected.

Right now, your disk is the bottleneck and it does ~300MBps.

I'm not surprised that quantizing 10 models at once makes the disk reach its IOPS limit and so it gets much slower than its rated sequential speed. You can easily max out the CPU on nico1 using 2 parallel quantization tasks, so I'm sure there is no point in doing 10 at once. In the worst case we might need 4 at once, 2 for each CPU, if llama.cpp has trouble spreading across multiple CPUs.

Oh, and linux 5.15 is very very old - ubuntu wouldn't have something slightly newer, such as 6.1 or so?

He is on Ubuntu 22.04 LTS and wants to stay on that OS so there is unfortunately nothing we can do about having such an outdated kernel version.

Anyway, good night, and good luck :)

Good night!

Sorry that the migration to BTRFS is still not completed.

No issue at all :)

if llama.cpp has trouble spreading across multiple CPUs.

The problem is the lack of readahead, i.e. it quickly quantises a tensor, then needs to wait for the disk. Running two spreads it out much better, but it would of course be better if llama.cpp were more optimized. But it's not trivial to do, so...

He is on Ubuntu 22.04 LTS

Ubuntu 22.04 LTS officially supports up to linux 6.8, but his box, his choice, of course.

Now good night for real :=)

We successfully completed the migration to BTRFS and upgraded to kernel 6.8.0-48. We mounted the BTRFS volume so it defaults to no compression, but because it is a privileged LXC container, compression can be enabled for specific files/folders using the usual BTRFS commands - I recommend using zstd compression for all source models due to the limited space on his SSD. Because you will be asleep for the next 6 hours or so, Richard decided to quantize some more static models in the meantime - this time using only 4 parallel quantization tasks.
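
For reference, enabling it per directory would look roughly like this (the directory path is just an example):

```
# New files written below this directory get zstd-compressed:
btrfs property set /var/quant/source-gguf compression zstd
# Compress data that is already there:
btrfs filesystem defragment -r -czstd /var/quant/source-gguf
# Check the savings (needs the compsize package):
compsize /var/quant/source-gguf
```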

limited space on his SSD

If you are not going into "try to get to mradermacher's model count as fast as you can" mode, you won't need 2TB (unless you download some 405B model)

only 4 parallel quantization tasks.

well as I said, 2 failed immediately

If you are not going into "try to get to mradermacher's model count as fast as you can" mode, you won't need 2TB (unless you download some 405B model)

I am not sure what you are actually pointing out, but let me use this opportunity to break down disk space requirements :)

A single 70B is 140GB for the download and needs +280GB for the conversion. To not wait for the upload, we need to store 2 quants, for roughly 280GB, plus, if an upload goes wrong, we need up to another 140G (or multiples). If we ever use an unpatched gguf-split, we need another 140G of space to split these files. Some 70B models are twice as large.

That means a 70B can already take up to 700GB (under rare, worst-case conditions that we can't foresee when assigning the job).

To try to keep costs down for nico, and also generate, cough, green AI, we try to not do imatrix calculations at night (no solar power), which means we need to queue some models at night and store them until their imatrix is available.

The alternative would be considerable downtime or scheduling restrictions (such as limiting your box to smaller models, which would be a waste, because your box is a real treat otherwise :). Or a scheduler that would account for every byte, by looking into the future.

Arguably, 2TB is indeed on the more comfortable side (my three fastest nodes have 1TB of disk space exclusively, and this is a major issue), and compression will not work miracles (it often achieves 20-30% on the source model and source gguf, though, which helps with I/O efficiency).

Plus, it's always good to leave some extra space. Who knows, you might want to use your box, too, and having some headroom always gives a good feeling :-)

In summary, 2TB is comfortable for almost all model sizes, but does require some limitations. Right now, I allow for 600GB of budget (the budget is used for the source models) plus 600GB of "extra" budget reserved for running jobs. In practice, we should stay under 1TB unless we get larger models in the queue (which will happen soon).

Other numbers are indeed possible, and we can flexibly adjust these should the need arise.

Because you will be asleep for the next 6 hours

How can I be asleep at these exciting times (yeah, I should be).

Richard decided to quantize some more static

Richard can, of course, always queue as many tasks as he wishes in whatever way he wants :-)= Also, I need to give him some way to stop our queuing if he needs to - right now, it looks as if he could, but he can't.

@RichardErkhov I will provide two scripts that you can use to tell the scheduler to stop scheduling jobs.

@RichardErkhov should you ever want to gracefully pause the quanting on rich1, you can (inside my container) run /root/rich1-pause and /root/rich1-resume

They are simple shell scripts that a) tell the scheduler to not add/start new jobs (this will be reflected after the next status page update, but takes effect immediately) and b) interrupt any current quant jobs - the current quant will still be created, but then the job will be interrupted and restarted another time.

uploads will also continue (but should also be visible on the status display).

That's in case you want to reboot, for example. It will not move jobs off of your box, though, so if some poor guy is waiting for their quant (generally, negative nice levels), they will have to wait.

Compression saves 20% currently on source ggufs (and is disabled for quants) on rich1, after a night of queuing (I resumed at a suboptimal time).

```
Processed 38 files, 3499846 regular extents (3499846 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       80%      393G         487G         487G
none       100%       61G          61G          61G
zstd        77%      331G         426G         426G
```

And bandwidth is really not so bad (maybe time related), but once I have 8 quants concurrently uploading, I get >>200MBps - even close to 500 for a while. In fact, that's really good :)

In fact, that's not much different from what I get to and from HF elsewhere.

@mradermacher waiting for the 30k models file

Richard paused his own quantization tasks and discovered that with only 2 parallel tasks there sometimes are short periods of time where there are some idle CPU resources. Can you increase rich1 to 3 parallel tasks, as Richard hates seeing any idle resources on his server? Now would also be a good opportunity to check what the bottleneck on his server is. I believe it should be fully CPU bottlenecked despite the NVMe response time being at 2 ms - or is the CPU just busy-waiting for the SSD?

I have increased it to three, but it will likely just reduce throughput due to cache-thrashing. But trying it is cheap. I can ramp it up to ten, too, if he feels good about it.

2 parallel tasks there sometimes are short periods of time where there are some idle CPU resources.

Even with three it will happen, because the jobs sometimes have to wait for uploads, downloads, repository creation, or, quite often, for disk (for example, CPU load goes down when converting to gguf). Also, his CPUs use (Intel) hyperthreading, so they are essentially busy at 50% of total load (and Linux understands hyperthreading, and I think llama does, too).

In short, I think Richard is driven too much by feelings and too little by metrics and understanding :) You hear that, Richard? :)

I believe it should be fully CPU bottlenecked

I think it is easily able to saturate disk bandwidth for Q8_0 and other _0 quants, and running more than one quant will just make it worse at those times. When two llama-quantize processes work on some IQ quants, they usually keep the CPU around the 99% busy mark (including hyperthreading cores), which is also not good, since they are then fighting for memory bandwidth, but that's probably not a big deal with two quants.

The only reason it can't be busy 100% of the time llama-quantize runs is the disk. That's true even for your server, just less so. I have waited multiple minutes for a sync to finish even on your box :)

As for efficiency, if llama interleaved loading the next tensor with quantizing the current one, it could saturate the CPU with just one job (for the high CPU-usage quants). I did think about giving two jobs different priorities, but with the current state of Linux scheduling, this has essentially no effect. (I can run two llama-quantize at nice 0 and 20, and they both get ~50% of the CPU on my AMD hexacores.)

However, Richard's general strategy of "it has idle time, start more jobs" might actually reduce idle time to some extent, but it will do no good, especially when I/O bottlenecked.

waiting for the 30k models file

I am at ~20k of 30k at the moment, so that will take a month or so.

But let's decide on an API. I mkdir'ed /skipped_models on rich1, and will copy files with one hf-model-url per line in there. The pattern should be "*.txt", and other files should be ignored. That way you can automate things, I hope?
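
(On your side, a consumer could be as dumb as this sketch - the queue-model command and the .done marker are made up; only the *.txt / one-URL-per-line part is the agreed interface:)

```
#!/bin/sh
# Illustrative consumer for /skipped_models: process each *.txt once.
for f in /skipped_models/*.txt; do
    [ -e "$f" ] || continue        # glob matched nothing yet
    [ -e "$f.done" ] && continue   # already handled
    while IFS= read -r url; do
        [ -n "$url" ] && queue-model "$url"   # queue-model is hypothetical
    done < "$f"
    touch "$f.done"
done
```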

But I can look into automating the daily ones - yesterday I started marking all models that I have looked at. It's not totally trivial to produce a list, because the box that does that does not have access to the queue and will have to ask individual nodes for their jobs, so my current plan is to only export models that are older than n days (e.g. 7) to ensure they have gone through the queue. But if it is latency-sensitive, I can give it some thought and export the model list regularly somehow and use that to find models I have chosen.

Thinking about it, I can let it run against the submit log, which is easier to access via NFS.

Update: nope, the submit log of course didn't start in February, so I need a different strategy, do a lot of work, or wait till we are through with the current queue.
