Inference Endpoints Images

AI & ML interests

The Hugging Face Inference Endpoints Images repository lets AI builders collaborate on creating awesome inference deployments.

Recent Activity

reach-vb 
posted an update 2 days ago
Excited to onboard FeatherlessAI on Hugging Face as an Inference Provider - they bring a fleet of 6,700+ LLMs on-demand to the Hugging Face Hub 🤯

Starting today, you'll be able to access all those LLMs (OpenAI compatible) on HF model pages and via OpenAI client libraries too! 💥

Go play with it today: https://huggingface.co/blog/inference-providers-featherless

P.S. They're also bringing on more GPUs to support all your concurrent requests!
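
To try it from code, here's a minimal sketch using the standard OpenAI client pointed at the Hugging Face Inference Providers router; the base URL, the example model id, and the ":featherless-ai" provider suffix are assumptions to illustrate the pattern, so check the blog post for exact values:

import os
from openai import OpenAI

# Standard OpenAI client pointed at the HF Inference Providers
# router (assumed base URL); authenticate with a Hugging Face token.
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

# The ":featherless-ai" suffix (assumed) pins the request to the
# FeatherlessAI provider; the model id is just an example.
completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3:featherless-ai",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)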
a-r-r-o-w 
posted an update 2 days ago
Did you know how simple it is to get started with your own custom compiler backend for torch.compile? What's stopping you from writing your own compiler?

import torch
from torch._functorch.partitioners import draw_graph

def compiler(fx_module: torch.fx.GraphModule, _):
    # Dump the captured FX graph to a Graphviz .dot file, then run
    # the module unchanged: the simplest possible "compiler".
    draw_graph(fx_module, "compile.dot")
    return fx_module.forward

def capture(model, *inputs):
    # torch.compile routes the traced forward (and, once backward()
    # runs, the backward graph) through our custom backend.
    compiled_model = torch.compile(model, backend=compiler)
    y = compiled_model(*inputs)
    y.sum().backward()

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.linear_1 = torch.nn.Linear(16, 32)
        self.linear_2 = torch.nn.Linear(32, 16)

    def forward(self, x):
        x = self.linear_1(x)
        x = torch.nn.functional.silu(x)
        x = self.linear_2(x)
        return x

if __name__ == '__main__':
    model = MLP()
    model.to("mps")  # Apple Silicon GPU; use "cuda" or "cpu" elsewhere
    x = torch.randn(4, 16, device="mps", dtype=torch.float32)

    capture(model, x)
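
If you just want to eyeball what torch.compile hands your backend without generating a .dot file, a variant of the same idea (assuming only that the backend receives a torch.fx.GraphModule, as above) is to print the graph as a table:

import torch

def printing_compiler(fx_module: torch.fx.GraphModule, example_inputs):
    # Print the captured ops as a table (requires the `tabulate`
    # package) instead of writing a .dot file.
    fx_module.graph.print_tabular()
    return fx_module.forward

model = torch.nn.Linear(16, 32)
compiled = torch.compile(model, backend=printing_compiler)
compiled(torch.randn(4, 16))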


--------------

Part of https://huggingface.co/posts/a-r-r-o-w/231008365980283
a-r-r-o-w 
posted an update 3 days ago
Recently, I've been focusing my learning on the following topics:
- PyTorch internals, specifically the inductor system (roughly ~1 month of experience)
- Triton internals (~8 months)
- CUDA (~3 months)
- Understanding fusion patterns in compilers and how to improve them (~1 month)
- Parallelism strategies for large-scale inference optimization (~6-7 months)

I thought it would be nice to document it somewhere. Maybe someone will find it useful? It's also that I want to get into the habit of writing but have had no motivation to do so; maybe writing short informal posts will help build the habit.

Since I don't have a personal site, and don't plan to create one in the near future, I think HF posts are best suited for short and informal documentation to share my little discoveries and learnings. If you're interested, strap in!

The first post in this series will be a basic study of PyTorch's float32 matmuls and their Triton implementation (nothing much, just the tutorial available on the website), with a short dive into TF32 and a TFLOPS comparison on an A100 machine.
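
For anyone who wants to poke at the TF32 part ahead of that post, here's a rough sketch of the standard PyTorch switch plus a naive CUDA-event timing; the matrix sizes and the timing harness are illustrative, not a rigorous benchmark:

import torch

# "highest" (the default) uses full float32 matmuls; "high" allows
# TF32 (10-bit mantissa) on Ampere+ GPUs such as the A100.
torch.set_float32_matmul_precision("high")

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

a @ b  # warmup: kernel selection / cache effects

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
c = a @ b
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end)
flops = 2 * 4096**3  # 2*M*N*K FLOPs for an M x K @ K x N matmul
print(f"{flops / (ms * 1e-3) / 1e12:.2f} TFLOPS")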
AdinaY 
posted an update 6 days ago
RoboBrain 2.0 🔥 an OPEN embodied brain model by BAAI (Beijing Academy of Artificial Intelligence)

BAAI/RoboBrain2.0-7B

✨ 7B - Apache 2.0 / 32B coming soon
✨ Supports multiple images, long videos, and high-resolution visuals
✨ Spatial + temporal reasoning
✨ Real-time memory & scene graphs
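
A minimal loading sketch for the 7B checkpoint; the Auto classes and the trust_remote_code flag used here are assumptions, so treat the model card as the source of truth:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "BAAI/RoboBrain2.0-7B"

# Assumption: the repo ships custom modeling/processing code,
# hence trust_remote_code=True; check the model card to be sure.
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)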
AdinaY 
posted an update 8 days ago
RedNote 小红书 just released their first LLM 🔥

dots.llm1.base 🪐 a 142B MoE model with only 14B active params.

rednote-hilab/dotsllm1-68246aaaaba3363374a8aa7c
✨ Base & Instruct - MIT license
✨ Trained on 11.2T tokens of high-quality, non-synthetic data
✨ Competitive with Qwen2.5/3 on reasoning, code, alignment
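
A quick loading sketch; the exact repo id is an assumption inferred from the collection above, so verify it on the Hub before running:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for the instruct variant; all 142B MoE params
# load even though only 14B are active per token, so this needs
# serious multi-GPU memory.
repo = "rednote-hilab/dots.llm1.inst"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Write a haiku about MoE models.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))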
AdinaY 
posted an update 8 days ago
MiniCPM4 🔥 efficient LLMs built for edge devices, by OpenBMB

openbmb/minicpm4-6841ab29d180257e940baa9b

✨ Apache 2.0
✨ 5–7× Faster Inference (Jetson Orin & RTX 4090)
✨ 8B trained on 8T clean, non-synthetic tokens
✨ 32K Native Context -> 128K+ with InfLLM v2 + LongRoPE
✨ Runs on 🤗 Transformers, CPM.cu, vLLM, and SGLang
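
Since vLLM is listed among the supported runtimes, here's a minimal offline-inference sketch with vLLM's Python API; the repo id for the 8B checkpoint is an assumption, so check the collection above for the exact name:

from vllm import LLM, SamplingParams

# Assumed repo id for the 8B checkpoint; the model may ship custom
# code, hence trust_remote_code=True.
llm = LLM(model="openbmb/MiniCPM4-8B", trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain why sparse attention helps long-context inference."], params)
print(outputs[0].outputs[0].text)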
jbilcke-hf 
posted an update 8 days ago
Hi everyone,

I've seen some unsuccessful attempts at running Wan2GP inside a Hugging Face Space, which is a shame as it is a great Gradio app!

So here is a fork that you can use, with some instructions on how to do this:

jbilcke-hf/Wan2GP_you_must_clone_this_space_to_use_it#1

Note: some things like persistent models/storage/custom LoRAs might not fully work out of the box. If you need those, you might have to dig into the Wan2GP codebase and see how to tweak the storage folder. Happy hacking!
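
If you'd rather clone the Space programmatically than through the UI, huggingface_hub ships a helper for exactly this (a sketch, assuming you're logged in with a write token):

from huggingface_hub import duplicate_space

# Clones the Space into your own namespace. Assumes you are logged
# in (huggingface-cli login) with a token that has write access.
repo = duplicate_space(
    "jbilcke-hf/Wan2GP_you_must_clone_this_space_to_use_it",
    private=True,
)
print(repo)  # URL of your new Space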
