OmniVision-968M: a new local VLM for edge devices, fast & small but performant 💨 a new vision language model with 9x fewer image tokens, super efficient 📖 aligned with DPO for reducing hallucinations ⚡️ Apache 2.0 license 🔥
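To make the "9x fewer image tokens" figure concrete, here is a minimal sketch of one way such a reduction can work: concatenating every 9 consecutive token embeddings into a single wider token. The 729-token input, the 4-dimensional toy embeddings, and the grouping scheme are assumptions for illustration, not details stated in this post, and `merge_tokens` is a hypothetical helper.

```python
def merge_tokens(tokens, group=9):
    """Concatenate every `group` consecutive token vectors into one wider vector.

    Hypothetical helper: it illustrates a 9x token-count reduction and is
    not necessarily the method OmniVision uses.
    """
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by group size")
    merged = []
    for start in range(0, len(tokens), group):
        wide = []
        for vec in tokens[start:start + group]:
            wide.extend(vec)  # keep all values, just fewer, wider tokens
        merged.append(wide)
    return merged


# Assumed example: 729 image tokens (a 27x27 patch grid), 4 dims each.
image_tokens = [[float(i)] * 4 for i in range(729)]
reduced = merge_tokens(image_tokens)
print(len(image_tokens), "->", len(reduced))  # 729 -> 81
print(len(reduced[0]))  # 36
```

The language model then attends over 81 tokens instead of 729, which is where the speed-up on edge devices would come from in a scheme like this.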
Models 💻 Coding: Qwen team released two Qwen2.5-Coder checkpoints of 32B and 7B. Infly released OpenCoder: 1.5B and 8B coding models with instruction SFT'd versions and their datasets! 💗
🖼️ Image/Video Gen: Alibaba vision lab released In-context LoRA -- 10 LoRA models on different themes based on Flux. Also Mochi, the SOTA video generation model with Apache 2.0 license, now comes natively supported in diffusers 👏
🖼️ VLMs/Multimodal: NexaAIDev released OmniVision-968M, a new vision language model aligned with DPO for reducing hallucinations, also comes with GGUF checkpoints 👏 Microsoft released LLM2CLIP, a new CLIP-like model with longer context window allowing complex text inputs and better search
🎮 AGI?: Etched released Oasis 500M, a diffusion based open world model that takes keyboard input and outputs gameplay 🤯
Datasets Common Corpus: A text dataset with 2T tokens with permissive license for EN/FR on various sources: code, science, finance, culture 📖
Microsoft released LLM2CLIP: a CLIP model with longer context window for complex text inputs 🤯 All models with Apache 2.0 license here microsoft/llm2clip-672323a266173cfa40b32d4c
TL;DR: they replaced CLIP's text encoder with various LLMs fine-tuned on captioning, which gives better top-k accuracy on retrieval. This will enable better image-text retrieval, better zero-shot image classification, better vision language models 🔥 Read the paper to learn more: LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation (2411.04997)
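For readers unfamiliar with the "top-k accuracy on retrieval" metric, here is a minimal sketch of how it is commonly computed for text-to-image retrieval: rank all images by cosine similarity to each text embedding and count how often the paired image lands in the top k. The embeddings below are made-up toy values, and `cosine` and `recall_at_k` are hypothetical helper names, not part of LLM2CLIP.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def recall_at_k(text_embs, image_embs, k):
    """Fraction of text queries whose paired image (same index) ranks in the top k."""
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, image) for image in image_embs]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy, made-up embeddings: text i is paired with image i.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.8, 0.6]]

print(recall_at_k(text_embs, image_embs, k=1))  # 2 of 3 queries hit at k=1
print(recall_at_k(text_embs, image_embs, k=2))  # 1.0
```

In a real evaluation the embeddings would come from the model's text and image encoders, and a stronger text encoder should push the paired image higher in the ranking.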
Another great week in open ML! Here's a small recap 🫰🏻
Model releases ⏯️ Video Language Models AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long video LM based on DINOv2, SigLIP, Qwen2 and Llama 3.2
💬 Small language models Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models with Apache 2.0 license that come in sizes 135M, 360M and 1.7B, along with datasets. Meta released facebook/MobileLLM-1B, a new family of on-device LLMs of sizes 125M, 350M and 600M
Hello, researchers! I've tried to make reading HF Daily Papers easier and built a tool that does reviews with LLMs like Claude 3.5, GPT-4o and sometimes FLUX.
📚 Classification by topics 📅 Sorting by publication date and HF addition date 🔄 Syncing every 2 hours 💻 Hosted on GitHub 🌏 English, Russian, and Chinese 📈 Top by week/month (in progress)
Microsoft released a groundbreaking model that can be used for web automation, with MIT license 🔥 microsoft/OmniParser
Interesting highlight for me was Mind2Web (a benchmark for web navigation) capabilities of the model, which unlocks agentic behavior for RPA agents.
No need for hefty web automation pipelines that break when the website/app design changes! Amazing work.
Lastly, the authors also fine-tune this model on open-set detection for interactable regions to see if it can be used as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO. 👏
OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT-4V in parsing.
Lotus 🪷 is a new foundation model for monocular depth estimation ✨ Compared to previous diffusion-based MDE models, Lotus is modified for dense prediction tasks. The authors also released a model for normal prediction 🤗 Find everything in this collection: merve/lotus-6718fb957dc1c85a47ca1210
It's raining depth estimation models ☔️ DepthPro is a zero-shot depth estimation model by Apple: it's fast, sharp and accurate 🔥 Demo: akhaliq/depth-pro Model: apple/DepthPro Paper page: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (2410.02073) The model consists of two encoders: an encoder for patches and an image encoder 🖼️ The outputs of both are merged to decode to depth maps and get the focal length. The model outperforms the previous state-of-the-art models on average across various benchmarks 📑