You can now easily fine-tune, quantize, and play with the SOTA vision LM InternVL3 🔥 We recently merged InternVL3 into Hugging Face transformers and released converted checkpoints 🤗
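For a quick start, here's a minimal sketch using the transformers image-text-to-text pipeline; the repo id below is an assumption, so swap in whichever converted InternVL3 checkpoint you want from the Hub.

```python
# Minimal sketch: chatting with an InternVL3 checkpoint through the
# transformers "image-text-to-text" pipeline. The repo id is an assumption;
# use any of the converted InternVL3 checkpoints on the Hub.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="OpenGVLab/InternVL3-1B-hf",  # assumed repo id
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

result = pipe(text=messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])
```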
DeepSeek, Alibaba, Skywork, Xiaomi, ByteDance... and those are just some of the companies from the Chinese community that released open models in April 🤯
🎬 Video
> MAGI-1 by SandAI
> SkyReels-A2 & SkyReels-V2 by Skywork
> Wan2.1-FLF2V by Alibaba-Wan

🎨 Image
> HiDream-I1 by Vivago AI
> Kimi-VL by Moonshot AI
> InstantCharacter by InstantX & Tencent-Hunyuan
> Step1X-Edit by StepFun
> EasyControl by Shanghai Jiao Tong University

🧠 Reasoning
> MiMo by Xiaomi
> Skywork-R1V 2.0 by Skywork
> ChatTS by ByteDance
> Kimina by Moonshot AI & Numina
> GLM-Z1 by Zhipu AI
> Skywork OR1 by Skywork
> Kimi-VL-Thinking by Moonshot AI

🔊 Audio
> Kimi-Audio by Moonshot AI
> IndexTTS by BiliBili
> MegaTTS3 by ByteDance
> Dolphin by DataOceanAI

🔢 Math
> DeepSeek Prover V2 by DeepSeek

🌍 LLM
> Qwen by Alibaba-Qwen
> InternVL3 by Shanghai AI Lab
> Ernie4.5 (demo) by Baidu

📊 Dataset
> PHYBench by Eureka-Lab
> ChildMandarin & SeniorTalk by BAAI
Meta released Llama Guard 4 and new Prompt Guard 2 models 🔥
Llama Guard 4 is a new model for filtering model inputs/outputs, both text-only and with images 🛡️ Use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B
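As a sketch of the "use it before" half, here is a minimal text-only gate in transformers. The generic Auto classes and the convention of the model answering "safe" or "unsafe" plus violated category codes are carried over from earlier Llama Guard releases and are assumptions here; check the model card for canonical usage.

```python
# Minimal sketch: screening a user prompt with Llama Guard 4 before it
# reaches your main LLM. Assumes the generic Auto classes work for this
# checkpoint and that it replies "safe" or "unsafe" (plus category codes),
# as in earlier Llama Guard releases.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "meta-llama/Llama-Guard-4-12B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def is_safe(user_prompt: str) -> bool:
    messages = [{"role": "user", "content": [{"type": "text", "text": user_prompt}]}]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    verdict = processor.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    return verdict.strip().lower().startswith("safe")

print(is_safe("How do I bake sourdough bread?"))
```

The same gate can be run again on the assistant's reply before it is shown to the user.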
Introducing the ONNX model explorer: Browse, search, and visualize neural networks directly in your browser. 🤯 A great tool for anyone studying Machine Learning! We're also releasing the entire dataset of graphs so you can use them in your own projects! 🤗
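If you'd like a graph of your own to poke at, here is a minimal sketch that exports a tiny PyTorch model to a standard .onnx file, the same graph format these visualizations are built from; the module, names, and shapes are arbitrary.

```python
# Minimal sketch: exporting a small PyTorch model to ONNX, the same
# graph format visualized by the explorer. Names and shapes are arbitrary.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, 10),
        )

    def forward(self, x):
        return self.net(x)

model = TinyNet().eval()
dummy = torch.randn(1, 3, 32, 32)
torch.onnx.export(
    model, dummy, "tinynet.onnx",
    input_names=["pixels"], output_names=["logits"],
    dynamic_axes={"pixels": {0: "batch"}, "logits": {0: "batch"}},
)
print("wrote tinynet.onnx")
```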
Kimi-Audio 🚀🎧 an OPEN audio foundation model released by Moonshot AI moonshotai/Kimi-Audio-7B-Instruct
✨ 7B
✨ 13M+ hours of pretraining data
✨ Novel hybrid input architecture
✨ Universal audio capabilities (ASR, AQA, AAC, SER, SEC/ASC, end-to-end conversation)
Meta dropped Swiss army knives for vision with an Apache 2.0 license 👏
> image/video encoders for vision-language modelling and spatial understanding (object detection etc.) 👏
> the vision LM outperforms InternVL3 and Qwen2.5VL 👏
> they also release gigantic video and image datasets
The authors set out to build a single versatile vision encoder that can be aligned to a diverse set of tasks.
They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. On zero-shot image tasks it outperforms the latest SOTA, SigLIP2 👏 (a zero-shot sketch follows after the dataset list below)
> Among the fine-tuned ones, the first is PE-Spatial, a model for bounding-box detection, segmentation, and depth estimation; it outperforms all other models 😮
> The second is PLM, the Perception Language Model, which combines PE Core with the Qwen2.5 7B LM. It outperforms all other models (including InternVL3, which was also trained with a Qwen2.5 LM!)
The authors release checkpoints in base, large, and giant sizes.
The authors also release the following datasets 📑
> PE Video: gigantic video dataset of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: human- and auto-annotated image and video datasets for region-based tasks
> PLM-VideoBench: new video benchmark on MCQA
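To make the zero-shot claim above concrete, here is a generic zero-shot classification sketch for a CLIP-style dual encoder. It uses the plain openai/clip-vit-base-patch32 checkpoint as a stand-in, since PE Core ships with its own loading code; this only illustrates the task, not the PE API.

```python
# Generic zero-shot image classification with a CLIP-style dual encoder.
# openai/clip-vit-base-patch32 is a stand-in here; PE Core is loaded through
# Meta's own code, so this only illustrates the task itself.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a bee", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1)[0]
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```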
Most vision LMs treat the image as a whole, lacking localized references in their captions and not taking visual prompts (points, boxes, drawings around objects).
DAM (Describe Anything Model) addresses this on two levels: a new vision backbone that takes in both focal crops and the full image, and a large-scale dataset 👀
They build the dataset by extending existing segmentation and referring-expression datasets like RefCOCO: they pass the images and class labels to VLMs and have them generate captions.
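A rough sketch of that recipe (not the authors' exact pipeline): take a labeled box from an existing dataset, crop the focal region, and ask a VLM to describe that region in the context of the full image. The checkpoint id and prompt below are placeholders.

```python
# Rough sketch of the caption-generation idea: crop a labeled region and
# ask a VLM to describe it in context. The checkpoint id and prompt are
# placeholders, not the authors' actual pipeline.
from PIL import Image
from transformers import pipeline

captioner = pipeline(
    "image-text-to-text",
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # any capable VLM works here
    device_map="auto",
)

def caption_region(image_path: str, box: tuple[int, int, int, int], label: str) -> str:
    image = Image.open(image_path)
    focal_crop = image.crop(box)  # (left, top, right, bottom), e.g. from RefCOCO
    messages = [{
        "role": "user",
        "content": [
            # PIL images passed via the "image" key; recent transformers accept this
            {"type": "image", "image": focal_crop},
            {"type": "image", "image": image},
            {"type": "text", "text": f"The first image is a crop of the second. "
                                     f"Describe the {label} in the crop in detail."},
        ],
    }]
    out = captioner(text=messages, max_new_tokens=128)
    return out[0]["generated_text"][-1]["content"]
```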
Lastly, they also release a new benchmark, again without human annotation: an LLM evaluates the detailed captions with a focus on localization 👏
🤗 Just published: "Consent by Design" - exploring how we're building better consent mechanisms across the HF ecosystem!
Our research shows open AI development enables:
- Community-driven ethical standards
- Transparent accountability
- Context-specific implementations
- Privacy as core infrastructure
Check out our Space Privacy Analyzer tool that automatically generates privacy summaries of applications!
Effective consent isn't about perfect policies; it's about architectures that empower users while enabling innovation. 🚀
Reasoning models like o3 and o4-mini are advancing faster than ever, but imagine what will be possible when they can run locally in your browser! 🤯
Well, with 🤗 Transformers.js, you can do just that! Here's Zyphra's new ZR1 model running at over 100 tokens/second on WebGPU! ⚡️
Giving models access to browser APIs (like File System, Screen Capture, and more) could unlock an entirely new class of web experiences that are personalized, interactive, and run locally in a secure, sandboxed environment.