Good outputs! Even at 2.0bpw!

#1 by phakio - opened

Took me a while to wrestle together a working causal_conv1d install, but with the latest dev branch 2.0bpw is working well! It all fits onto my single 4090; I haven't tried spreading a larger quant across all 4 of my cards yet...

I noticed the prompt processing speeds vary greatly, and the generation speed seems a little lower than what I'd expect from a 4090, but nonetheless this model is working on exllama3 v0.7.0 (0.6.0 dev).

Great work on the quants!


Prompt processing is a little inconsistent because it's a recurrent model, and I'm still working out the best way to checkpoint the recurrent state between requests. It has to interact with paged attention for the softmax attention layers, and that gets pretty complicated. It also uses the JIT kernels from flash-linear-attention, which often pause everything for a few seconds to run autotune when a new input shape is encountered. There's a PR underway to address that. Finally, the implementation hasn't been fully optimized yet. I haven't even profiled it to see if there are any obvious bottlenecks.
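Roughly, the idea is something like this (a hypothetical sketch, not the actual exllamav3 code): cache the recurrent state at the end of each processed prefix, keyed by the token prefix, so a follow-up request that shares that prefix can resume from the checkpoint instead of reprocessing the whole prompt.

```python
# Hypothetical sketch of prefix-keyed recurrent-state checkpointing.
# None of these names come from exllamav3; torch is only used for a stand-in state.
import hashlib

import torch


class RecurrentStateCache:
    def __init__(self):
        self._store = {}  # prefix key -> (num_cached_tokens, recurrent state)

    @staticmethod
    def _key(token_ids):
        return hashlib.sha1(str(token_ids).encode()).hexdigest()

    def longest_prefix(self, token_ids):
        """Longest cached prefix of token_ids, as (num_cached_tokens, state)."""
        for end in range(len(token_ids), 0, -1):
            hit = self._store.get(self._key(token_ids[:end]))
            if hit is not None:
                return hit
        return 0, None

    def save(self, token_ids, state):
        self._store[self._key(token_ids)] = (len(token_ids), state)


# Usage: reuse the longest cached prefix, run the recurrent layers only over the
# remaining tokens, then checkpoint the new state for the next request.
cache = RecurrentStateCache()
prompt = [11, 24, 31, 47]
done, state = cache.longest_prefix(prompt)
if state is None:
    state = torch.zeros(1, 16)  # stand-in for the model's real recurrent state
# ... run the model over prompt[done:], updating `state` as it goes ...
cache.save(prompt, state)
```

A real implementation would track this per paged-attention block rather than rehashing every possible prefix, and that's the part that gets complicated.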

I see, this is my first experience with exllama; I'm normally a gguf chud. I'll keep an eye on this and future updates and report back anything interesting I find.

I think it's crazy how you guys can comprehend the backend of how all this works, and I'm very grateful. Keep up the good work, @turboderp!

Can you help me compile causal_conv1d under Torch 2.8 + CUDA 12.8 + CPython 3.10?

I'm only getting 12 tokens/s on my 4090 with 48GB VRAM; I expected a generation speed of 40–50 tokens/s.

causal-conv1d should just compile. If you have the build tools installed, it's just pip install . from the repo directory.

However, you don't actually need it. It probably improves performance a little bit, but it was always supposed to be optional. The most recent commit should fix the import error if it isn't installed.
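For reference, the optional-dependency handling looks roughly like this (a minimal sketch; the wrapper name is made up, and the fast-path call is only approximately what the causal-conv1d package exposes):

```python
# Illustrative optional-import pattern with a pure-PyTorch fallback.
import torch
import torch.nn.functional as F

try:
    from causal_conv1d import causal_conv1d_fn  # optional fused CUDA kernel
    HAS_CAUSAL_CONV1D = True
except ImportError:
    causal_conv1d_fn = None
    HAS_CAUSAL_CONV1D = False


def causal_conv1d(x, weight, bias=None):
    """x: (batch, dim, seqlen); weight: (dim, kernel_size) depthwise filter."""
    if HAS_CAUSAL_CONV1D:
        return causal_conv1d_fn(x, weight, bias)
    # Fallback: left-pad so each output position only sees past inputs, run a
    # depthwise (groups=dim) convolution, then trim the right-hand overhang.
    dim, k = weight.shape
    out = F.conv1d(x, weight.unsqueeze(1), bias, padding=k - 1, groups=dim)
    return out[..., : x.shape[-1]]
```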

As for performance, I'm working on it. You should be getting at least 40 t/s on a 4090, but I'm not sure about Triton on Windows. I'm pushing some commits in a bit which are giving me about a 50% boost here, but there's still room for an extra 75% or so. It all comes down to Triton having ridiculous kernel launch overhead (it's amazingly bad), and I'm not sure how to address it yet. I also have no idea whether it's better or worse on the Windows port.
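If you want to see that overhead for yourself, a microbenchmark along these lines (nothing here is from exllamav3) compares a trivial Triton kernel against the equivalent torch op; the gap is almost entirely per-launch host overhead:

```python
# Per-launch overhead of a trivial Triton kernel vs. the equivalent torch op.
# Requires a CUDA GPU and triton installed.
import time

import torch
import triton
import triton.language as tl


@triton.jit
def copy_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)


def bench(fn, iters=1000):
    fn()  # warm-up (includes the Triton JIT compile on the first call)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e6  # microseconds per launch


x = torch.randn(4096, device="cuda")
y = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)

print("triton:", bench(lambda: copy_kernel[grid](x, y, x.numel(), BLOCK=1024)), "us")
print("torch :", bench(lambda: y.copy_(x)), "us")
```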

You'll definitely also want to upgrade to the latest commit of flash-linear-attention from today, since they just pushed an update to cache kernel autotune results. This prevents a lot of the sudden pauses during inference that you might be reading as low token throughput.

If you're using chat.py to measure with long outputs, also be aware that the console markdown rendering is not very performant. It becomes a major bottleneck after a few thousand characters, so if you're just benchmarking you can get a clearer picture by disabling it (run the script with -basic).

Anyway, pushing some big updates in just a moment, stay tuned.

On my system with 28GB of VRAM (4060Ti + 3060), I now manage to get 35 t/s text generation with the latest git + latest fla. Prompt processing is still inconsistent; the kernel overhead should explain the observed behavior. A test a day earlier was a bit slower (25 t/s cold, 30 t/s warm), so at least there's progress.
I'm testing the 2.27bpw model, btw.

@turboderp
I'm running a 3.5bpw model with the latest exllamav3 installed, and it now achieves a generation speed of 16 tokens per second.

Well, that at least is a proportional speedup, so you're definitely running into a CPU bottleneck. It's possible some of it is down to using Python 3.10, but more likely something is up with the Windows NVIDIA driver or just Windows in general. Do you have any way to profile it in Nsight Systems?
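If Nsight Systems is a pain to set up on Windows, even a quick torch.profiler trace around the generation loop would tell us whether the GPU is sitting idle waiting on the host. Something like this, with the stand-in workload replaced by your actual generation call:

```python
# Quick CPU-vs-GPU bottleneck check with torch.profiler.
# The matmul loop is a stand-in; wrap your real generation loop instead.
import torch
from torch.profiler import ProfilerActivity, profile

a = torch.randn(2048, 2048, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(50):
        a = a @ a  # stand-in for generating tokens
    torch.cuda.synchronize()

# If the top rows are dominated by CPU time (Python, kernel launches) while the
# CUDA time stays small, the GPU is being starved by the host side.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))
```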

I just realized my GPU utilization is only around 20%. I'm using the latest version of Tabby API.
