Sayak Paul

sayakpaul

https://sayak.dev

AI & ML interests

Diffusion models, representation learning

posted an update 1 day ago

Post

280

Fast LoRA inference for Flux with Diffusers and PEFT 🚨

There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an inseparable part of their adoption.

In our latest post, @BenjaminB and I show different techniques to optimize LoRA inference for the Flux family of models for image generation. Our recipe includes the use of:

1. `torch.compile`
2. Flash Attention 3 (when compatible)
3. Dynamic FP8 weight quantization (when compatible)
4. Hotswapping, to avoid recompilation when swapping in new LoRAs 🤯

We have tested our recipe with Flux.1-Dev on both H100 and RTX 4090, and we achieve at least a *2x speedup* on each GPU. We believe our recipe is grounded in the reality of how LoRA-based use cases are generally served, so we hope it will be beneficial to the community 🤗

Even though our recipe was tested primarily with NVIDIA GPUs, it should also work with AMD GPUs.

Learn the details and the full code here:
https://huggingface.co/blog/lora-fast
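
As a rough illustration of items 1 and 4 above, here is a minimal sketch of compiling once and then hotswapping LoRAs. It is not the code from the linked post: the function name, the `lora_ids` and `max_rank` parameters, and the `"default_0"` adapter name are assumptions made for illustration. It also assumes a recent Diffusers and PEFT install and a CUDA GPU, and it leaves out Flash Attention 3 and FP8 quantization.

```python
from typing import Sequence


def run_with_hotswapped_loras(
    lora_ids: Sequence[str],
    prompt: str,
    max_rank: int,
    model_id: str = "black-forest-labs/FLUX.1-dev",
    adapter_name: str = "default_0",  # assumed adapter name, reused for every swap
):
    """Generate one image per LoRA while compiling the transformer only once."""
    if not lora_ids:
        raise ValueError("lora_ids must contain at least one LoRA")

    # Imported lazily so this module can be imported without a GPU stack.
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

    # Call this before loading the first LoRA. max_rank is assumed to be the
    # largest rank among all LoRAs that will be swapped in later.
    pipe.enable_lora_hotswap(target_rank=max_rank)
    pipe.load_lora_weights(lora_ids[0], adapter_name=adapter_name)

    # Compile in place, once, after the first LoRA is loaded.
    pipe.transformer.compile()

    images = [pipe(prompt).images[0]]
    for lora_id in lora_ids[1:]:
        # hotswap=True overwrites the existing adapter's weights in place,
        # which is what lets the compiled graph be reused.
        pipe.load_lora_weights(lora_id, hotswap=True, adapter_name=adapter_name)
        images.append(pipe(prompt).images[0])
    return images
```

A call would look like `run_with_hotswapped_loras(["<lora-repo-1>", "<lora-repo-2>"], "a photo of a cat", max_rank=128)`, where the repository ids and the rank are placeholders to be replaced with real values.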

commented on Fast LoRA inference for Flux with Diffusers and PEFT 1 day ago

PyTorch nightly.

New activity in black-forest-labs/FLUX.1-Kontext-dev 2 days ago

pipe.to("cuda") runs super slow, is it expected?

#60 opened 2 days ago by

KnightOnLlama

upvoted an article 2 days ago

Article

State of open video generation models in Diffusers

and 2 others •

Jan 27

• 57

updated a dataset 3 days ago

sayakpaul/sample-datasets

Viewer • Updated 3 days ago • 6 • 18.3k • 1

liked a dataset 3 days ago

Dragonjinny/FiFA-pickapic-v2

Viewer • Updated 15 days ago • 179k • 272 • 1

published an article 3 days ago

Article

Fast LoRA inference for Flux with Diffusers and PEFT

and 1 other •

3 days ago

• 17

liked a Space 10 days ago

1.81k

Wan2.1

💻

Wan: Open and Advanced Large-Scale Video Generative Models

updated a Space 11 days ago

Benchmark Analyzer

🌖

Analyze Diffusers benchmarks

updated a dataset 11 days ago

diffusers/benchmarks

Viewer • Updated 11 days ago • 13 • 314 • 14

upvoted 2 articles 15 days ago

Article

Building the Hugging Face MCP Server

and 3 others •

16 days ago

• 50

Article

Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders

and 1 other •

17 days ago

• 608

updated a Space 18 days ago

Serialize Flux Aot

🐢

Space to serialize AoT compiled artifacts of Flux.

published a Space 18 days ago

Serialize Flux Aot

🐢

Space to serialize AoT compiled artifacts of Flux.

New activity in black-forest-labs/FLUX.1-Kontext-dev 19 days ago

Wouldn't fit on 4090 so I made it use a 4bit quant

#44 opened 19 days ago by

Fancellu

liked 2 Spaces 19 days ago

9.38k

Kolors Virtual Try-On

👕

Overlay garment on person image

1.07k

FLUX.1 Kontext

⚡

Kontext image editing on FLUX[dev]

commented on a paper 24 days ago

Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

Paper • 2506.19852 • Published Jun 24 • 38

upvoted a paper 24 days ago

Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

Paper • 2506.19852 • Published Jun 24 • 38

reacted to burtenshaw's post with ❤️ 24 days ago

Post

2804

Inference for generative AI models looks like a minefield, but there’s a simple protocol for picking the best inference option:

🌍 95% of users >> If you’re using open (large) models and need fast online inference, then use Inference Providers in auto mode, and let it choose the best provider for the model. https://huggingface.co/docs/inference-providers/index

👷 Fine-tuners / bespoke >> If you’ve got custom setups, use Inference Endpoints to define a configuration from AWS, Azure, or GCP. https://endpoints.huggingface.co/

🦫 Locals >> If you’re trying to squeeze everything you can out of a server or local machine, use llama.cpp, Jan, LM Studio, or vLLM. https://huggingface.co/settings/local-apps#local-apps

🪟 Browsers >> If you need open models running right here in the browser, use transformers.js. https://github.com/huggingface/transformers.js

Let me know what you’re using, and if you think it’s more complex than this.
