Nishith Jain (KingNish)
1223 followers · 106 following
kingnish24 · KingNish24
AI & ML interests: AI is fun actually.
Recent Activity

- updated a Space, KingNish/Realtime-FLUX (2 minutes ago)
- reacted to a-r-r-o-w's post with 🧠 and 🔥 (6 minutes ago):
Caching is an essential technique used in diffusion inference serving to speed up image/video generation. Diffusers just added support for another caching method: First Block Cache, a technique developed by @chengzeyi building upon the ideas of TeaCache.

The idea, in short: if the model's predictions do not vary much over successive inference steps, we can skip certain steps where the prediction difference is small. To figure out whether an inference step will make a significant improvement to the overall velocity/noise prediction, we calculate the relative difference between the output of the first transformer block at timestep $t$ and at $t-1$, and compare it against a selected threshold. If the difference is lower than the threshold, we skip the step. A higher threshold leads to more steps being skipped. However, skipping too many steps can throw off the model's predictions, so the threshold needs to be tested and selected per model for the desired quality-speed tradeoff.

Diffusers usage with CogView4:

```python
import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Enable First Block Cache on the transformer; steps whose first-block output
# changes by less than the 0.2 relative threshold are skipped.
apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")
```

Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.

Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180

References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache
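To make the mechanism concrete, here is a minimal, self-contained sketch of the skip rule described above. It is illustrative only, not the actual diffusers implementation; `ToyDiT`, `first_block`, `rest_blocks`, and `cached_residual` are hypothetical names. The first block always runs, and when its output barely changes between timesteps, the cached residual of the remaining blocks is reused instead of recomputing them.

```python
import torch
import torch.nn as nn

class ToyDiT(nn.Module):
    """Hypothetical stand-in for a DiT: one 'first block' plus the rest."""
    def __init__(self, dim: int = 64, num_blocks: int = 4):
        super().__init__()
        blocks = [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_blocks)]
        self.first_block = blocks[0]
        self.rest_blocks = nn.Sequential(*blocks[1:])

def relative_difference(curr: torch.Tensor, prev: torch.Tensor) -> float:
    # Mean absolute change of the first block's output, normalized by the
    # previous output's magnitude.
    return ((curr - prev).abs().mean() / prev.abs().mean()).item()

@torch.no_grad()
def run_with_first_block_cache(model: ToyDiT, x: torch.Tensor, steps: int, threshold: float):
    prev_first_out, cached_residual, skipped = None, None, 0
    for t in range(steps):
        first_out = model.first_block(x)  # the first block always runs
        if prev_first_out is not None and relative_difference(first_out, prev_first_out) < threshold:
            # Small change in the first block's output: assume the remaining
            # blocks would change little too, and reuse their cached residual.
            out = first_out + cached_residual
            skipped += 1
        else:
            out = model.rest_blocks(first_out)  # full forward pass
            cached_residual = out - first_out   # cache for future skipped steps
        prev_first_out = first_out
        x = out  # toy stand-in for the scheduler update between timesteps
    print(f"skipped {skipped}/{steps} steps")
    return x

model = ToyDiT()
run_with_first_block_cache(model, torch.randn(1, 64), steps=10, threshold=0.2)
```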
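Because the right threshold is model-dependent, one practical approach is to sweep a few values on a fixed seed and compare wall-clock time against the saved images. A hypothetical tuning loop along those lines (it assumes re-applying `apply_first_block_cache` replaces the earlier config; if your diffusers version objects, reload the pipeline per threshold):

```python
import time
import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = "A photo of an astronaut riding a horse on mars"

for threshold in (0.05, 0.1, 0.2, 0.4):
    # Assumption: re-applying overwrites the earlier hook config; otherwise,
    # reload the pipeline for each threshold.
    apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=threshold))
    start = time.perf_counter()
    image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
    print(f"threshold={threshold}: {time.perf_counter() - start:.1f}s")
    image.save(f"output_t{threshold}.png")  # inspect visually for quality drift
```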
KingNish's models (18, sorted by recently updated):
| Model | Task | Size | Updated | Downloads | Likes |
|---|---|---|---|---|---|
| KingNish/whisper-small-en | Automatic Speech Recognition | 0.2B | about 20 hours ago | | |
| KingNish/moonshine-tiny-svarah | | | 17 days ago | | |
| KingNish/tiny-talker | | | Jun 1 | 7 | |
| KingNish/Qwen2.5-0.5b-Test-ft | Text Generation | 0.5B | Apr 28 | 2k | 11 |
| KingNish/Qwen2.5-0.5b-RBase | Text Generation | 0.5B | Apr 28 | 120 | 1 |
| KingNish/Reasoning-0.5b | Text Generation | 0.5B | Apr 28 | 26 | 30 |
| KingNish/Smollm-135M-audio | Text Generation | | Apr 22 | 11 | |
| KingNish/qwen-1b-continued-v2.2 | Text Generation | 1B | Mar 9 | 18 | |
| KingNish/qwen-1b-continued-v2.1 | Text Generation | 1B | Mar 8 | 7 | |
| KingNish/qwen-1b-continued-v2 | Text Generation | 1B | Mar 7 | 11 | |
| KingNish/qwen-1b-continued | Text Generation | 1B | Mar 7 | 14 | |
| KingNish/modernbert | Fill-Mask | 0.1B | Feb 14 | 5 | |
| KingNish/Reasoning-Llama-3b-v0.2 | Text Generation | 3B | Oct 27, 2024 | 11 | 4 |
| KingNish/Reasoning-Llama-3b-v0.1 | Text Generation | 3B | Oct 12, 2024 | 19 | 9 |
| KingNish/Reasoning-Llama-1b-v0.1 | Text Generation | 1B | Oct 10, 2024 | 25 | 26 |
| KingNish/Llama-3.2-1B-Instruct | Text Generation | 1B | Oct 6, 2024 | 19 | 1 |
| KingNish/Qwen2.5-0.5b-Test-ft-Q4_K_M-GGUF | | 0.5B | Sep 27, 2024 | 6 | 1 |
| KingNish/Better-SDXL-Lora | Text-to-Image | | Jul 12, 2024 | 57 | 15 |