transformers-community (Transformers Community)

As an ML practitioner, you have probably implemented the three-loop matrix multiplication many times, but the naive implementation performs terribly on GPUs. Modern GPUs reach peak performance through careful memory access patterns and minimal scheduling overhead.

In a naive tiled matmul (M×K · K×N), the computation happens in tiles, both for the output matrix and for the chunks read from the input matrices. Each thread-block computes one output tile: it loads the corresponding input tiles (for the sum-reduction across the K dimension), performs the computation, and then terminates. The GPU launches many thread-blocks and schedules them across the available streaming multiprocessors (SMs). When an SM finishes a tile, it is assigned a new thread-block for the next uncomputed tile. This way, multiple output tiles are computed in parallel across the SMs, but we pay the thread-block launch cost every time a new tile is computed.
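To make the tile-per-thread-block idea concrete, here is a minimal pure-Python sketch (not GPU code). The function names `matmul_reference` and `matmul_tiled`, the tile sizes, and the `launches` counter are all illustrative assumptions; each iteration over an output tile stands in for one thread-block launch.

```python
def matmul_reference(a, b):
    """Classic three-loop matmul on nested lists (M x K times K x N)."""
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile_m=2, tile_n=2, tile_k=2):
    """Tiled matmul; each (tm, tn) output tile plays the role of one thread-block.

    Hypothetical sketch: returns the result and the number of simulated launches.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    launches = 0
    for tm in range(0, m, tile_m):
        for tn in range(0, n, tile_n):
            launches += 1  # one simulated thread-block launch per output tile
            # Walk the K dimension in chunks, accumulating partial sums.
            for tk in range(0, k, tile_k):
                for i in range(tm, min(tm + tile_m, m)):
                    for j in range(tn, min(tn + tile_n, n)):
                        acc = 0.0
                        for p in range(tk, min(tk + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc
    return c, launches
```

In this sketch the number of launches equals the number of output tiles, which is the cost the persistent approach tries to avoid.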

Persistent matmul changes this approach. Instead of launching one thread-block per output tile and scheduling them in waves until every tile is computed, you launch only as many thread-blocks as there are SMs available (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, with each one looping over multiple tiles sequentially.
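The persistent loop can be sketched in plain Python as well. This is only a simulation under stated assumptions: the strided tile-to-block assignment, the function names `persistent_schedule` and `matmul_persistent`, and the default `num_sms=4` are hypothetical choices for illustration, and real kernels may order tiles differently (for example to improve cache reuse).

```python
def persistent_schedule(num_tiles, num_sms):
    """Tile ids handled by each persistent block, assuming a strided assignment."""
    num_blocks = min(num_sms, num_tiles)
    return [list(range(block, num_tiles, num_blocks)) for block in range(num_blocks)]


def matmul_persistent(a, b, tile_m=2, tile_n=2, num_sms=4):
    """Simulated persistent matmul on nested lists.

    Returns the result and the number of simulated block launches,
    which is at most num_sms regardless of how many tiles exist.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    tiles_m = -(-m // tile_m)  # ceiling division
    tiles_n = -(-n // tile_n)
    c = [[0.0] * n for _ in range(m)]
    schedule = persistent_schedule(tiles_m * tiles_n, num_sms)
    for tile_ids in schedule:  # one launch per persistent block
        for tile_id in tile_ids:  # the block loops over its tiles sequentially
            tm = (tile_id // tiles_n) * tile_m
            tn = (tile_id % tiles_n) * tile_n
            for i in range(tm, min(tm + tile_m, m)):
                for j in range(tn, min(tn + tile_n, n)):
                    c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c, len(schedule)
```

As a rough illustration, a 4096×4096 output with hypothetical 128×128 tiles has 1024 tiles; `persistent_schedule(1024, 132)` launches 132 blocks that each process 7 or 8 tiles, instead of 1024 separate launches.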

The key benefit is the reduced thread-block launch latency. This persistence strategy, combined with other optimizations like coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, ping-pong scheduling and other tricks, helps achieve peak performance. More on this in the future!

Code snippet for testing: https://gist.github.com/a-r-r-o-w/28339b442d164084506c0967029968a8

(Bonus: Since I've wanted to learn Manim for a while, this was a great opportunity to make a visualization for Naive VS Persistent matmul. Enjoy ✨)


Gausson

updated a model about 1 month ago

transformers-community/sep_cache

8B • Updated Aug 4 • 1.13k • 8

joaogante

in transformers-community/support about 1 month ago

Custom `generate` methods discussion

🔥 🚀 3

6

#10 opened 4 months ago by

joaogante

updated a collection about 1 month ago

Custom generation methods - Community

Collection

Custom generation methods created and maintained by the community, and highlighted by our team • 1 item • Updated Jul 29 • 3

Gausson

in transformers-community/support about 1 month ago

Custom `generate` methods discussion

🚀 🔥 3

6

#10 opened 4 months ago by

joaogante

in transformers-community/README about 1 month ago

How to integrate our method into the `transformers-community`?

❤️ 1

1

#4 opened about 2 months ago by

Gausson

pcuenq

in transformers-community/support about 1 month ago

Custom `generate` methods discussion

🚀 🔥 3

6

#10 opened 4 months ago by

joaogante

a-r-r-o-w

posted an update 2 months ago

Post

3346

Caching is an essential technique in diffusion inference serving for speeding up image/video generation. Diffusers just added support for another caching method: First Block Cache, a technique developed by @chengzeyi that builds on the ideas of TeaCache.

The idea in short: if the model predictions do not vary much over successive inference steps, we can skip the steps where the prediction difference is small. To decide whether an inference step will significantly change the overall velocity/noise prediction, we compute the relative difference between the output of the first transformer block at timestep $t$ and at timestep $t-1$, and compare it against a chosen threshold. If the difference is lower than the threshold, we skip the step. A higher threshold leads to more steps being skipped. However, skipping too many steps can throw off the model predictions, so the threshold has to be tested and selected per model, based on the quality-speed tradeoff you are willing to accept.
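The skip decision described above can be sketched in a few lines of plain Python. This is not the Diffusers or ParaAttention implementation: the function names `relative_difference` and `plan_skips`, the use of flat lists instead of tensors, the mean-absolute-difference metric, and comparing against the immediately preceding step are all assumptions made for illustration.

```python
def relative_difference(current, previous):
    """Sum of absolute changes divided by the sum of absolute previous values."""
    numerator = sum(abs(c - p) for c, p in zip(current, previous))
    denominator = sum(abs(p) for p in previous)
    return numerator / denominator if denominator > 0 else float("inf")


def plan_skips(first_block_outputs, threshold):
    """Return one bool per step: True where the step could be skipped.

    Hypothetical sketch: first_block_outputs holds the first transformer
    block's output (as a flat list of floats) for each inference step.
    """
    skips = []
    previous = None
    for output in first_block_outputs:
        # The first step has nothing to compare against, so it always runs.
        skip = previous is not None and relative_difference(output, previous) < threshold
        skips.append(skip)
        previous = output
    return skips
```

Running `plan_skips` over recorded first-block outputs with a few candidate thresholds gives a quick feel for how many steps each threshold would skip before committing to a full quality comparison.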

Diffusers usage with CogView4:

```python
import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

# Load the pipeline in bfloat16 and move it to the GPU.
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Enable First Block Cache on the transformer; a higher threshold skips more steps.
apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
# Fix the seed so runs with different thresholds are comparable.
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")
```

Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.

Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180

References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache


a-r-r-o-w

posted an update 2 months ago

Post

2856

As you might have already heard, FLUX.1-Kontext-dev has been released and has taken the generative community by storm!

In case you haven't come across it, you can get started with Kontext using 🤗 diffusers. See the official [model](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev) and [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux#flux).

Want to know how inference companies like Fal & Replicate run the model fast enough to generate an image in under 2 seconds? Check out this [gist](https://gist.github.com/a-r-r-o-w/d08c37e8bd3e9c26b4ce80360be148c6) for some details!


a-r-r-o-w

posted an update 3 months ago

Post

2308

New diffusion model for text-to-image and video-to-world generation: Cosmos Predict-2 👽

Model collection: nvidia/cosmos-predict2-68028efc052239369a0f2959
Diffusers support: https://github.com/huggingface/diffusers/pull/11695
Documentation: https://huggingface.co/docs/diffusers/main/en/api/pipelines/cosmos

These are results with the 2B param model. Imagine what you could do with the 14B version! Go check it out now!


AI & ML interests

Recent Activity

transformers-community/constrained-beam-search

transformers-community/group-beam-search

transformers-community/contrastive-search

Custom generation methods - Tutorials

transformers-community/dola

transformers-community/sep_cache

Custom `generate` methods discussion

Custom generation methods - Community

How to integrate our method into the `transformers-community`?

Team members 16
