We're open-sourcing our infra, along with a dataset of 10M+ frames!
We're releasing Stera, open-source infra that turns an off-the-shelf device in your pocket into a high-fidelity multimodal data pipeline. It's built around four layers: Capture → Process → Evaluate → Export.
Stera Capture removes the need for bespoke/gated hardware and runs on an off-the-shelf iPhone. Out of the box, it fuses synchronized RGB, IMU, LiDAR-guided depth, and 6-DoF pose from ARKit and exports them to a raw MCAP file.
@retrain-pipelines v0.2.0 is out! I'm at Station F, at my booth with GOSIM Paris 2026, today & tomorrow. Come meet me for a live in-person demo and a chat!
Today's content moderation systems give you a label: safe or unsafe. They don't tell you what triggered the decision, who is involved, or where in the image it happens. That opacity hurts auditing, breaks adaptation across platforms, and frustrates the human review that responsible deployment demands.
We built SenBen to fix this. It is the first large-scale scene graph benchmark designed specifically for sensitive content moderation:
- 13,999 annotated frames from 157 movies
- Visual Genome style scene graphs with bounding boxes, attributes, and predicates
- Affective state attributes (pain, fear, aggression, distress) so the model captures not just what is in the frame, but what it means
- 16 safety tags across 5 categories, the broadest taxonomy of any dataset of this kind
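To make the annotation layers above concrete, here is a rough sketch of what one frame's record could look like. The class names, field names, bounding box convention, and the example tag are all assumptions made up for illustration; this is not SenBen's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    object_id: int
    label: str
    # Hypothetical convention: (x, y, width, height) in pixels.
    bbox: tuple[float, float, float, float]
    # Plain and affective attributes, e.g. "fear" or "distress".
    attributes: list[str] = field(default_factory=list)


@dataclass
class Relation:
    subject_id: int
    predicate: str
    object_id: int


@dataclass
class FrameAnnotation:
    frame_id: str
    objects: list[SceneObject]
    relations: list[Relation]
    # Hypothetical tag names; the real taxonomy has 16 tags in 5 categories.
    safety_tags: list[str] = field(default_factory=list)

    def triples(self) -> list[tuple[str, str, str]]:
        """Return (subject label, predicate, object label) for each relation."""
        by_id = {obj.object_id: obj for obj in self.objects}
        return [
            (by_id[r.subject_id].label, r.predicate, by_id[r.object_id].label)
            for r in self.relations
        ]


# Illustrative frame: who is involved, what happens, and where in the image.
frame = FrameAnnotation(
    frame_id="example_frame_0001",
    objects=[
        SceneObject(0, "person", (40.0, 60.0, 120.0, 300.0), ["aggression"]),
        SceneObject(1, "person", (200.0, 80.0, 110.0, 280.0), ["fear"]),
    ],
    relations=[Relation(0, "pushing", 1)],
    safety_tags=["example_violence_tag"],
)

frame_triples = frame.triples()
```

A record like this answers the three questions a bare safe/unsafe label cannot: what triggered the decision (the predicate and tags), who is involved (the objects and their attributes), and where it happens (the boxes).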
A small model that beats much bigger ones:
We distilled a frontier VLM into a compact 241M parameter student built on Florence-2.
On grounded scene graph metrics, the 241M student beats every evaluated VLM except Gemini, and every commercial safety API. It also wins on object detection and captioning across the entire model zoo. It runs at 733 ms per frame on 1.2 GB VRAM, which is 7.6 times faster than the next-best local VLM at zero per-frame cost. The whole benchmark, from dataset creation through all baseline evaluations, is reproducible for under $350.
We should really have a release date range slider on the /models page. Tired of "trending/most downloaded" being the best way to sort and still seeing models from 2023 on the first page just because they're embedded in enterprise pipelines and get downloaded repeatedly. "Recently Created/Recently Updated" don't solve the discovery problem considering the amount of noise to sift through.
Slight caveat: Trending actually does have some recency bias, but it's not strong/precise enough.
The hidden gem of open-source embedding models: LCO-Embedding for text, image AND audio!
I found this model after reading the recent Massive Audio Embedding Benchmark (MAEB) paper, as it blew the other models out of the water on day zero. I've been using it personally for about a week, and searching my files by describing music, sound effects or images is both practical and entertaining. Really underrated model, would highly recommend checking it out: LCO-Embedding/LCO-Embedding-Omni-7B
Translating benchmarks is a painful process, requiring a lot of manual inspection and adjustments. You start from setting up the whole pipeline and adapting to every format type, including task specifics. There already exist some massive benchmarks, but they still have some simple (and sometimes silly) bugs, which can hurt the evaluations :( We present a novel automated translation framework to help with that!
Eastern and Southern European languages have richer linguistic structures than English, and for benchmarks that rely heavily on grammatical coherence, machine translation risks harming evaluations. We found cases where the grammatical structure of a question can leak the answer or be misleading. Some benchmarks are also just outdated and need to be retranslated with newer and better models.
We present a framework with novel test-time scaling methods that let you control time and cost investments while also reducing the need for human-in-the-loop verification. While working on the Ukrainian-focused MamayLM models, we had to translate 10+ benchmarks in a short span of time. Finding human evaluators is costly and time-consuming, and the same goes for using professional translators. With our pipeline we were able to do it in 3 days 🏎️
We hope our findings will help enable stronger multilingual evaluations and developments. We release all produced benchmarks on Hugging Face together with the source code and arXiv paper 🤗
🤔 Many cultures penalize or look down upon self-celebratory behavior. One such example is liking your own post. So why do I do it? Two reasons: 1. I disagree that self-celebratory behavior is inherently bad. 2. On the Hugging Face Hub, if your post has 0 reactions, it takes TWO whole clicks to react instead of one. So it is actually a UI hack that lowers the bar to engage.
So if you see me reacting to my own post and think 'Ugh, this guy is so full of himself' you are only half correct 😆
Now behold as I perform this magic trick called "Exhausting all reaction options for increased visual engagement" so you don't have to click twice to react. You're welcome! Follow this aspiring 🤗 HF Hub influencer for more half-serious bloat in your feed 😜
Did you know that Qwen3 TTS actually utilizes voice embedding? Your voice is turned into a vector of 1024 (or 2048) dimensions, and based on this vector alone you can get your custom voice.
But the coolest part is that this means you can use math to modify and average voices. You can swap gender, shift pitch, mix and match voices, and even create an emotion space! This also enables semantic voice search!
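Here is a minimal sketch of what that vector math can look like. It uses random NumPy vectors as stand-ins for real 1024-dimensional voice embeddings, and the function names (`mix_voices`, `attribute_direction`, `search`) are made up for illustration; none of this is the Qwen3 TTS API. Using cosine similarity for search is also an assumption, and whether a mixed or shifted vector yields a good-sounding voice depends on the model.

```python
import numpy as np

DIM = 1024  # the post mentions 1024 (or 2048) dimensions
rng = np.random.default_rng(0)


def normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def mix_voices(embeddings, weights=None):
    """Weighted average of voice embeddings (plain mean if weights is None)."""
    return np.average(np.asarray(embeddings), axis=0, weights=weights)


def attribute_direction(group_a, group_b):
    """Direction from the mean of group A to the mean of group B.

    With labeled groups (say, two sets of voices differing in pitch),
    adding a multiple of this vector nudges a voice toward group B.
    """
    return np.mean(group_b, axis=0) - np.mean(group_a, axis=0)


def search(query, library, top_k=3):
    """Return indices and cosine similarities of the closest voices."""
    scores = normalize(library) @ normalize(query)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Stand-in data: 6 random "voices" instead of real encoder outputs.
library = rng.standard_normal((6, DIM))

mixed = mix_voices([library[0], library[1]])           # 50/50 blend
mostly_first = mix_voices([library[0], library[1]], weights=[0.8, 0.2])
direction = attribute_direction(library[:3], library[3:])
shifted = library[0] + 0.5 * direction                 # nudge toward group B
best_idx, best_scores = search(library[2], library, top_k=1)
```

With real embeddings, `library` would hold one encoder output per reference clip, and `mixed` or `shifted` would be passed to the TTS model in place of a single speaker's vector.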
The voice embedding model is actually just a tiny encoder with just a few million parameters. I've ripped it out of the voice embedding model so you can use the embedding model standalone. Check out my collection! :D