AI & ML interests

None defined yet.

Recent Activity

Wauplin
posted an update about 10 hours ago
Say hello to hf: a faster, friendlier Hugging Face CLI ✨

We are glad to announce a long-awaited quality-of-life improvement: the Hugging Face CLI has been officially renamed from huggingface-cli to hf!

So... why this change?

Typing huggingface-cli constantly gets old fast. More importantly, the CLI’s command structure became messy as new features were added over time (upload, download, cache management, repo management, etc.). Renaming the CLI is a chance to reorganize commands into a clearer, more consistent format.

We decided not to reinvent the wheel and instead follow a well-known CLI pattern: hf <resource> <action>. Isn't hf auth login easier to type and remember?

The full rationale, implementation details, and migration notes are in the blog post: https://huggingface.co/blog/hf-cli

Xenova
posted an update 1 day ago
Introducing Voxtral WebGPU: State-of-the-art audio transcription directly in your browser! 🤯
🗣️ Transcribe videos, meeting notes, songs and more
🔒 Runs on-device, meaning no data is sent to a server
🌎 Multilingual (8 languages)
🤗 Completely free (forever) & open source

That's right, we're running Mistral's new Voxtral-Mini-3B model 100% locally in-browser on WebGPU, powered by Transformers.js and ONNX Runtime Web! 🔥

Try it out yourself! 👇
webml-community/Voxtral-WebGPU
sayakpaul
posted an update 1 day ago
Fast LoRA inference for Flux with Diffusers and PEFT 🚨

There are great materials that demonstrate how to optimize inference for popular image generation models such as Flux. However, very few cover how to serve LoRAs fast, even though LoRAs are an inseparable part of how these models are adopted.

In our latest post, @BenjaminB and I show different techniques to optimize LoRA inference for the Flux family of models for image generation. Our recipe includes the use of:

1. torch.compile
2. Flash Attention 3 (when compatible)
3. Dynamic FP8 weight quantization (when compatible)
4. Hotswapping to avoid recompilation when swapping in new LoRAs 🤯

We have tested our recipe with Flux.1-Dev on both an H100 and an RTX 4090, achieving at least a *2x speedup* on each GPU. We believe our recipe is grounded in the reality of how LoRA-based use cases are generally served, so we hope it will be beneficial to the community 🤗

Even though our recipe was tested primarily with NVIDIA GPUs, it should also work with AMD GPUs.
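
Here is a rough sketch of what the recipe looks like in code, assuming a recent diffusers build with LoRA hotswapping support; the LoRA repo ids are placeholders, and the blog post has the exact, tested version:

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Allow LoRAs to be swapped later without triggering recompilation.
pipe.enable_lora_hotswap(target_rank=128)
pipe.load_lora_weights("<first-lora-repo>", adapter_name="lora")

# Compile the transformer once; subsequent LoRA swaps reuse the compiled graph.
pipe.transformer.compile(fullgraph=True)
image = pipe("a photo of a corgi astronaut", num_inference_steps=28).images[0]

# Hotswap a different LoRA into the same slot, recompilation-free.
pipe.load_lora_weights("<second-lora-repo>", hotswap=True, adapter_name="lora")
image = pipe("a photo of a corgi astronaut", num_inference_steps=28).images[0]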

Learn the details and the full code here:
https://huggingface.co/blog/lora-fast
albertvillanova
posted an update 15 days ago
🚀 New in smolagents v1.20.0: Remote Python Execution via WebAssembly (Wasm)

We've just merged a major new capability into the smolagents framework: the CodeAgent can now execute Python code remotely in a secure, sandboxed WebAssembly environment!

🔧 Powered by Pyodide and Deno, the new WasmExecutor lets your agent-generated Python code run safely, without relying on Docker or local execution.

Why this matters:
✅ Isolated execution = no host access
✅ No need for Python on the user's machine
✅ Safer evaluation of arbitrary code
✅ Compatible with serverless / edge agent workloads
✅ Ideal for constrained or untrusted environments

This is just the beginning: a focused initial implementation with known limitations. A solid MVP designed for secure, sandboxed use cases. 💡

💡 We're inviting the open-source community to help evolve this executor:
• Tackle more advanced Python features
• Expand compatibility
• Add test coverage
• Shape the next-gen secure agent runtime

🔗 Check out the PR: https://github.com/huggingface/smolagents/pull/1261
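
As a rough sketch, using the new executor from a CodeAgent could look like this (the executor_type value is an assumption on my part; check the PR and the release notes for the exact API):

from smolagents import CodeAgent, InferenceClientModel

model = InferenceClientModel()  # any supported model wrapper works here
agent = CodeAgent(
    tools=[],
    model=model,
    executor_type="wasm",  # assumption: selects the sandboxed WasmExecutor
)
agent.run("Compute the 20th Fibonacci number.")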

Let's reimagine what agent-driven Python execution can look like: remote-first, wasm-secure, and community-built.

This feature is live in smolagents v1.20.0!
Try it out.
Break things. Extend it. Give us feedback.
Let's build safer, smarter agents, together 🧠⚙️

πŸ‘‰ https://github.com/huggingface/smolagents/releases/tag/v1.20.0

#smolagents #WebAssembly #Python #AIagents #Pyodide #Deno #OpenSource #HuggingFace #AgenticAI
a-r-r-o-w
posted an update 17 days ago
Caching is an essential technique used in diffusion inference serving for speeding up image/video generations. Diffusers just added support for another caching method: First Block Cache - a technique developed by @chengzeyi building upon the ideas of TeaCache.

The idea, in short: if the model's predictions do not vary much over successive inference steps, we can skip the steps where the prediction difference is small. To figure out whether an inference step will make a significant improvement to the overall velocity/noise prediction, we calculate the relative difference between the output of the first transformer block at timestep $t$ and at $t-1$, and compare it against a selected threshold. If the difference is lower than the threshold, we skip the step. A higher threshold leads to more steps being skipped. However, skipping too many steps can throw off the model's predictions, so the threshold needs to be tested and selected per model, based on the quality-speed tradeoff we are willing to accept.
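
Schematically, the skip decision looks something like this (an illustrative sketch, not the actual Diffusers internals):

def should_skip_step(first_block_out, prev_first_block_out, threshold):
    # Relative change of the first transformer block's output between step t and t-1.
    rel_diff = (first_block_out - prev_first_block_out).abs().mean() / prev_first_block_out.abs().mean()
    # If the change is below the threshold, reuse the cached result instead of running the full step.
    return rel_diff < threshold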

Diffusers usage with CogView4:

import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")


Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.

Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180

References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache
tomaarsen
posted an update 24 days ago
‼️ Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode method improvements, a Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance! Details:

1️⃣ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% are non-zero:

- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co, @opensearch-project, @NAVER LABS Europe, @qdrant, @IBM, etc.
- Decode interpretable embeddings to understand token importance
- Hybrid search integration to get the best of both worlds

2️⃣ Enhanced Encode Methods & Multi-Processing
- New encode_query & encode_document methods that automatically use predefined prompts
- No more manual pool management - just pass device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach

3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents
- Custom learning rates for different parameter groups
- Composite loss logging - see individual loss components
- Perfect for two-tower architectures

4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models

Read the comprehensive blogpost about training sparse embedding models: https://huggingface.co/blog/train-sparse-encoder
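
As a quick illustration, a sparse retrieval snippet might look like this (the checkpoint is just an example of a SPLADE model; see the blogpost and release notes for the exact APIs and recommended models):

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# encode_query / encode_document apply the appropriate prompts/processing paths automatically.
query_emb = model.encode_query(["how do I make pizza dough?"])
doc_emb = model.encode_document(["Mix flour, water, yeast and salt, then knead and let it rise."])

# Sparse embeddings are high-dimensional with very few non-zero entries.
print(model.similarity(query_emb, doc_emb))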

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v5.0.0

What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!
a-r-r-o-w
posted an update 29 days ago
As you might have already heard, FLUX.1-Kontext-dev has been released and has taken the generative community by storm!

In case you haven't come across it, you can get started with Kontext using 🤗 diffusers. See the official [model](black-forest-labs/FLUX.1-Kontext-dev) and [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux#flux).

Want to know how inference companies like Fal & Replicate are able to run the model so fast and in under 2 seconds per image? Check out this [gist](https://gist.github.com/a-r-r-o-w/d08c37e8bd3e9c26b4ce80360be148c6) for some details!
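
For reference, a minimal editing call with diffusers looks roughly like this (prompt and guidance values are illustrative; the docs linked above have the full example):

import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

input_image = load_image("your_image.png")  # any local path or URL
edited = pipe(image=input_image, prompt="Make the cat wear a tiny wizard hat", guidance_scale=2.5).images[0]
edited.save("kontext_edit.png")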
albertvillanova
posted an update about 1 month ago
🚀 SmolAgents v1.19.0 is live!
This release brings major improvements to agent flexibility, UI usability, streaming architecture, and developer experience, making it easier than ever to build smart, interactive AI agents. Here's what's new:

🔧 Agent Upgrades
- Support for managed agents in ToolCallingAgent
- Context manager support for cleaner agent lifecycle handling
- Output formatting now uses XML tags for consistency

🖥️ UI Enhancements
- GradioUI now supports reset_agent_memory: perfect for fresh starts in dev & demos (sketch below)

🔄 Streaming Refactor
- Streaming event aggregation moved off the Model class
- ➡️ Better architecture & maintainability

📦 Output Tracking
- CodeAgent outputs are now stored in ActionStep
- ✅ More visibility and structure for agent decisions

🐛 Bug Fixes
- Smarter planning logic
- Cleaner Docker logs
- Better prompt formatting for additional_args
- Safer internal functions and final answer matching

📚 Docs Improvements
- Added quickstart examples with tool usage
- One-click Colab launch buttons
- Expanded reference docs (AgentMemory, GradioUI docstrings)
- Fixed broken links and migrated to .md format

🔗 Full release notes:
https://github.com/huggingface/smolagents/releases/tag/v1.19.0
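
Here's a rough sketch of the context-manager and GradioUI changes mentioned above (argument names are taken from the release notes and may differ slightly in practice):

from smolagents import CodeAgent, GradioUI, InferenceClientModel

model = InferenceClientModel()

# Context-manager support: the agent's lifecycle is scoped to the block.
with CodeAgent(tools=[], model=model) as agent:
    agent.run("What is 7 * 6?")

# GradioUI's new reset_agent_memory option gives each session a fresh start.
GradioUI(CodeAgent(tools=[], model=model), reset_agent_memory=True).launch()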

💬 Try it out, explore the new features, and let us know what you build!

#smolagents #opensource #AIagents #LLM #HuggingFace
a-r-r-o-w
posted an update about 1 month ago
New diffusion model for text-to-image and video-to-world generation: Cosmos Predict-2 👽

Model collection: nvidia/cosmos-predict2-68028efc052239369a0f2959
Diffusers support: https://github.com/huggingface/diffusers/pull/11695
Documentation: https://huggingface.co/docs/diffusers/main/en/api/pipelines/cosmos

These are results with the 2B param model. Imagine what you could do with the 14B version! Go check it out now!
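
A rough text-to-image sketch with diffusers (repo id and parameters are illustrative; the linked docs list the exact pipeline classes and arguments):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "nvidia/Cosmos-Predict2-2B-Text2Image", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(prompt="A robot exploring a neon-lit alien marketplace").images[0]
image.save("cosmos_t2i.png")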
a-r-r-o-w
posted an update about 1 month ago
Did you know how simple it was to get started with your own custom compiler backend with torch.compile? What's stopping you from writing your own compiler?

import torch
from torch._functorch.partitioners import draw_graph

def compiler(fx_module: torch.fx.GraphModule, _):
    # Dump the FX graph captured by Dynamo to a Graphviz .dot file,
    # then simply return the unmodified forward as the "compiled" callable.
    draw_graph(fx_module, "compile.dot")
    return fx_module.forward

def capture(model, *inputs):
    # torch.compile hands every graph it captures to our custom backend.
    compiled_model = torch.compile(model, backend=compiler)
    y = compiled_model(*inputs)
    y.sum().backward()

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.linear_1 = torch.nn.Linear(16, 32)
        self.linear_2 = torch.nn.Linear(32, 16)

    def forward(self, x):
        x = self.linear_1(x)
        x = torch.nn.functional.silu(x)
        x = self.linear_2(x)
        return x

if __name__ == '__main__':
    model = MLP()
    model.to("mps")  # Apple Silicon here; swap for "cuda" or "cpu" as needed
    x = torch.randn(4, 16, device="mps", dtype=torch.float32)

    capture(model, x)


--------------

Part of https://huggingface.co/posts/a-r-r-o-w/231008365980283
a-r-r-o-w
posted an update about 1 month ago
Recently, I've been focusing my learning on the following topics:
- PyTorch internals, specifically the inductor system (roughly ~1 month of experience)
- Triton internals (~8 months of experience)
- CUDA (~3 months of experience)
- Understanding fusion patterns in compilers and how to improve them (~1 month of experience)
- Parallelism strategies for large-scale inference optimization (~6-7 months of experience)

I thought it would be nice to document it somewhere for no particular reason. Maybe someone will find it useful? It's also because I want to get into the habit of writing, but had no motivation to do so. Maybe writing short informal posts will help build the habit.

Since I don't have a personal site, and don't plan to create one in the near future, I think HF posts are best suited for short and informal documentation to share my little discoveries and learnings. If you're interested, strap in!

The first post in this series will be a basic study of PyTorch's float32 matmuls and their Triton implementation (nothing much, just the tutorial available on the website), plus a short dive into TF32 and a TFLOPS comparison on an A100 machine.
Narsil
posted an update about 1 month ago
Me: This function is too slow. Find a faster algorithm.
Cursor: Hold my beer.

Me: *Slacking off with colleagues*
Cursor: Ping.

Me: 🤯

Xenova
posted an update about 2 months ago
NEW: Real-time conversational AI models can now run 100% locally in your browser! 🤯

🔒 Privacy by design (no data leaves your device)
💰 Completely free... forever
📦 Zero installation required, just visit a website
⚡️ Blazingly-fast WebGPU-accelerated inference

Try it out: webml-community/conversational-webgpu

For those interested, here's how it works:
- Silero VAD for voice activity detection
- Whisper for speech recognition
- SmolLM2-1.7B for text generation
- Kokoro for text to speech

Powered by Transformers.js and ONNX Runtime Web! 🤗 I hope you like it!
danaaubakirova
posted an update about 2 months ago
albertvillanova
posted an update about 2 months ago
sayakpaul
posted an update 2 months ago
Diffusers supports a good variety of quantization backends. Navigating them can be challenging, given the complex nature of diffusion pipelines in general.

So, @derekl35 set out to write a comprehensive guide that puts users in the front seat. Explore the different backends we support, learn the trade-offs they offer, and finally, check out the cool space we built that lets you compare quantization results.

Give it a go here:
https://lnkd.in/gf8Pi4-2
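
To give a flavor, here is a rough sketch of one such backend (4-bit bitsandbytes applied to the Flux transformer); the settings are illustrative, and the guide covers the other backends and their trade-offs:

import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
image = pipe("a watercolor painting of a lighthouse at dusk").images[0]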
joaogante
posted an update 2 months ago
Let's go! Custom generation code has landed in transformers 🚀

Have you designed a cool new KV cache? Maybe you're comparing new test-time compute ideas you've been researching? Have you found a way to do diffusion with existing models? You can now easily share your findings with the community through custom generation code, all while keeping the well-known generate interface 🤓

In a nutshell, we have expanded support for custom modeling code on the Hub with *model-agnostic* custom generation code. Write for one model, reuse with any model -- hopefully, this will democratize access to new generation ideas 🫡

As a creator, you gain the ability to get your ideas into transformers with minimal effort. You'll also have access to all Hub features: a landing page for your creation, discussions, usage metrics, ... 🤓
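
In practice, using a custom decoding method from the Hub is a one-line change to a standard generate call; here is a rough sketch using the minimal example repo below (the base model is just a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
# custom_generate pulls the generation loop from the given Hub repo (model-agnostic).
outputs = model.generate(
    **inputs,
    custom_generate="transformers-community/custom_generate_example",
    trust_remote_code=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))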

💎 Resources 💎
- docs: https://huggingface.co/docs/transformers/generation_strategies#custom-decoding-methods
- minimal example: transformers-community/custom_generate_example
- discussion: transformers-community/support#10