Nishith Jain

KingNish

AI & ML interests

AI is fun actually.

Recent Activity

updated a Space 29 minutes ago

KingNish/Realtime-FLUX

reacted to a-r-r-o-w's post with 🧠 33 minutes ago

Caching is an essential technique used in diffusion inference serving for speeding up image/video generation. Diffusers just added support for another caching method: First Block Cache, a technique developed by @chengzeyi that builds on the ideas of TeaCache.

The idea in short: if the model predictions do not vary much over successive inference steps, we can skip the steps where the prediction difference is small. To figure out whether an inference step will make a significant improvement to the overall velocity/noise prediction, we calculate the relative difference between the output of the first transformer block at timestep $t$ and at timestep $t-1$, and compare it against a selected threshold. If the difference is lower than the threshold, we skip the step. A higher threshold leads to more steps being skipped. However, skipping too many steps is bad because it can throw off the model predictions, so we need to test and select the threshold based on the desired quality-speed tradeoff for every model we use it with.

Diffusers usage with CogView4:

```python
import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")
```

Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.

Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180

References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache
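The skip decision described in the post can be illustrated with a small, dependency-free sketch. This is not the Diffusers implementation: the function names `relative_difference` and `should_skip_step` are hypothetical, the mean-absolute-change metric is one plausible reading of "relative difference", and plain Python lists stand in for the first transformer block's output tensors.

```python
from typing import Optional, Sequence


def relative_difference(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Total absolute change between two outputs, relative to the previous one.

    Assumed metric for illustration; the library may use a different norm.
    """
    if len(prev) != len(curr):
        raise ValueError("outputs must have the same length")
    abs_change = sum(abs(c - p) for p, c in zip(prev, curr))
    abs_prev = sum(abs(p) for p in prev)
    # Guard against division by zero when the previous output is all zeros.
    return abs_change / max(abs_prev, 1e-12)


def should_skip_step(
    prev: Optional[Sequence[float]], curr: Sequence[float], threshold: float
) -> bool:
    """Return True when the first-block output changed less than the threshold."""
    if prev is None:  # first step: nothing to compare against, so never skip
        return False
    return relative_difference(prev, curr) < threshold


# Toy first-block outputs for three successive steps (made-up numbers).
outputs = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [2.0, 3.5, 5.0]]

prev = None
decisions = []
for out in outputs:
    decisions.append(should_skip_step(prev, out, threshold=0.2))
    prev = out

print(decisions)  # [False, True, False]
```

With a threshold of 0.2, the second step changes the output by about 5% and would be skipped, while the third changes it by roughly 67% and would be computed in full. Raising the threshold makes more steps fall under it, which is the quality-speed tradeoff the post describes.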

reacted to a-r-r-o-w's post with 🔥 34 minutes ago

upvoted an article about 22 hours ago

SmolLM3: smol, multilingual, long-context reasoner

Article • By … and 22 others • 1 day ago • 364

upvoted a collection about 22 hours ago

🧠 SmolLM3

Smol, multilingual, long-context reasoner • 9 items • Updated about 6 hours ago • 39

upvoted a collection 3 days ago

Dhanishtha model series

Our Reasoning models • 3 items • Updated 3 days ago • 3

upvoted a collection 9 days ago

ERNIE 4.5

collection of ERNIE 4.5 models. "-Paddle" models use PaddlePaddle weights, while "-PT" models use Transformer-style PyTorch weights. • 23 items • Updated 6 days ago • 144

upvoted a collection 20 days ago

Essential-Web v1.0

10 items • Updated 22 days ago • 6

upvoted a paper 24 days ago

Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Paper • 2506.09250 • Published 29 days ago • 28

upvoted a collection about 1 month ago

dots.llm1

2 items • Updated 29 days ago • 15

upvoted a paper about 1 month ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 42

upvoted 2 articles about 1 month ago

Announcing the Common Pile and Comma v0.1

Article • Jun 6 • 15

What if Your AI Conversations Become Public?

Article • Jun 6 • 12

upvoted a collection about 1 month ago

MTP

4 items • Updated May 29 • 1

upvoted a changelog about 1 month ago

New Inference Providers Dashboard

Changelog • Jun 5 • 58

upvoted a collection about 1 month ago

One-RL-to-See-Them-All

One RL to See Them All: Visual Triple Unified Reinforcement Learning. GitHub: https://github.com/MiniMax-AI/One-RL-to-See-Them-All • 5 items • Updated 29 days ago • 27

upvoted 2 papers about 1 month ago

Distilling LLM Agent into Small Models with Retrieval and Code Tools

Paper • 2505.17612 • Published May 23 • 79

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Paper • 2505.22954 • Published May 29 • 12

upvoted an article about 1 month ago

🌙 Introducing Moon: Storytelling Generator Model

Article • By … and 1 other • May 30 • 6

upvoted a collection about 1 month ago

Skywork-OR1

Skywork Open Reasoner 1 • 11 items • Updated May 29 • 30

upvoted 2 papers about 1 month ago

Exploring the Latent Capacity of LLMs for One-Step Text Generation

Paper • 2505.21189 • Published May 27 • 62

Alchemist: Turning Public Text-to-Image Data into Generative Gold

Paper • 2505.19297 • Published May 25 • 81

upvoted an article about 1 month ago

Bigger isn't always better: how to choose the most efficient model for context-specific tasks 🌱🧑🏼‍💻

Article • May 28 • 21