Merve Noyan

merve

AI & ML interests

VLMs, vision & co

merve's activity

posted an update about 16 hours ago
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
💨 a new vision language model with 9x fewer image tokens, super efficient
📖 aligned with DPO to reduce hallucinations
⚡️ Apache 2.0 license 🔥

Demo: hf.co/spaces/NexaAIDev/omnivlm-dpo-demo
Model: NexaAIDev/omnivision-968M
reacted to AdinaY's post with 👀🔥 2 days ago
Let’s dive into the exciting releases from the Chinese community last week 🔥🚀
More details 👉 https://huggingface.co/zh-ai-community

Code model:
✨Qwen2.5-Coder by Alibaba Qwen
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f
✨OpenCoder by InflyAI - Fully open code model🙌
infly/opencoder-672cec44bbb86c39910fb55e

Image model:
✨Hunyuan3D-1.0 by Tencent
tencent/Hunyuan3D-1

MLLM:
✨JanusFlow by DeepSeek
deepseek-ai/JanusFlow-1.3B
✨Mono-InternVL-2B by OpenGVLab
OpenGVLab/Mono-InternVL-2B

Video model:
✨CogVideoX 1.5 by ChatGLM
THUDM/CogVideoX1.5-5B-SAT

Audio model:
✨Fish Agent by FishAudio
fishaudio/fish-agent-v0.1-3b

Dataset:
✨OPI dataset by BAAIBeijing
BAAI/OPI
posted an update 2 days ago
Amazing past few days in open ML: it's raining coding models, let's have a recap 🌧️ Find all models and datasets here: merve/nov-15-releases-67372d0ebdc354756a52ecd0

Models
💻 Coding: the Qwen team released two new Qwen2.5-Coder checkpoints, 32B and 7B. Infly released OpenCoder: 1.5B and 8B coding models with instruction-SFT'd versions and their datasets! 💗 (see the quick-start sketch at the end of this recap)

🖼️ Image/Video Gen: Alibaba's vision lab released In-Context LoRA: 10 LoRA models on different themes based on Flux. Also, Mochi, the state-of-the-art video generation model with Apache 2.0 license, is now natively supported in diffusers 👏

🖼️ VLMs/Multimodal: NexaAIDev released OmniVision-968M, a new vision language model aligned with DPO to reduce hallucinations; it also comes with GGUF checkpoints 👏 Microsoft released LLM2CLIP, a new CLIP-like model with a longer context window that allows complex text inputs and better search

🎮 AGI?: Etched released Oasis 500M, a diffusion-based open-world model that takes keyboard input and outputs gameplay 🤯

Datasets
Common Corpus: a permissively licensed text dataset with 2T tokens in EN/FR drawn from various sources: code, science, finance, culture 📖
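
As a quick way to try the coder releases above, here's a minimal sketch with transformers; the instruct-model ID follows the release naming but is an assumption, so check the Qwen2.5-Coder collection for the exact checkpoint names:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name based on the release naming; verify on the Hub.
model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
# Build the prompt with the model's chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```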
reacted to maxiw's post with 👍🚀🔥🤗❤️ 3 days ago
I was curious to see what people post here on HF, so I created a dataset with all HF Posts: maxiw/hf-posts

Some interesting stats:

Top 5 Authors by Total Impressions:
-----------------------------------
@merve : 171,783 impressions (68 posts)
@fdaudens : 135,253 impressions (81 posts)
@singhsidhukuldeep : 122,591 impressions (81 posts)
@akhaliq : 119,526 impressions (78 posts)
@MonsterMMORPG : 112,500 impressions (45 posts)

Top 5 Users by Number of Reactions Given:
----------------------------------------
@osanseviero : 1278 reactions
@clem : 910 reactions
@John6666 : 899 reactions
@victor : 674 reactions
@samusenps : 655 reactions

Top 5 Most Used Reactions:
-------------------------
❤️: 7048 times
🔥: 5921 times
👍: 4856 times
🚀: 2549 times
🤗: 2065 times
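
For anyone who wants to slice the data differently, here's a rough sketch of recomputing stats like these with the datasets library; the column names ("author", "impressions") are assumptions, so check the dataset card for the actual schema:

```python
from collections import Counter
from datasets import load_dataset

# Load the posts dataset; split name assumed to be "train".
posts = load_dataset("maxiw/hf-posts", split="train")

impressions, post_counts = Counter(), Counter()
for post in posts:
    # "author" and "impressions" are assumed column names.
    impressions[post["author"]] += post["impressions"]
    post_counts[post["author"]] += 1

print("Top 5 authors by total impressions:")
for author, total in impressions.most_common(5):
    print(f"@{author}: {total:,} impressions ({post_counts[author]} posts)")
```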
posted an update 3 days ago
Microsoft released LLM2CLIP: a CLIP model with a longer context window for complex text inputs 🤯
All models with Apache 2.0 license are here: microsoft/llm2clip-672323a266173cfa40b32d4c

TL;DR: they replaced CLIP's text encoder with various LLMs fine-tuned on captioning, yielding better top-k accuracy on retrieval.
This will enable better image-text retrieval, better zero-shot image classification, better vision language models 🔥
Read the paper to learn more: LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation (2411.04997)
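
For context, the retrieval pattern the post refers to looks like this with vanilla CLIP in transformers; the LLM2CLIP checkpoints ship their own loading code, so treat this as an illustration of the scoring step rather than LLM2CLIP's exact API:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Standard CLIP as a stand-in for any CLIP-like dual encoder.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder image path
texts = ["a photo of a cat", "a photo of a dog", "a diagram of a transformer"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each candidate text.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```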
posted an update 17 days ago
Another great week in open ML!
Here's a small recap 🫰🏻

Model releases
⏯️ Video Language Models
AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long-video language model based on DINOv2, SigLIP, Qwen2 and Llama 3.2

💬 Small language models
Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a new family of smol language models under Apache 2.0 that come in sizes 135M, 360M and 1.7B, along with datasets (see the sketch after this recap).
Meta released facebook/MobileLLM-1B, a new family of on-device LLMs in sizes 125M, 350M, 600M and 1B

🖼️ Image Generation
Stability AI released stabilityai/stable-diffusion-3.5-medium, a 2B model with commercially permissive license

🖼️💬Any-to-Any
gpt-omni/mini-omni2, the closest reproduction of GPT-4o yet, has been released: an LLM that takes image, text and audio input and outputs speech!

Dataset releases
🖼️ Spawning/PD12M, a new image-captioning dataset of 12.4 million examples with captions generated using Florence-2
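
Here's a minimal sketch of trying one of the SmolLM2 checkpoints mentioned above; the instruct-checkpoint ID is an assumption based on the release naming, so check the SmolLM2 collection for the exact names:

```python
from transformers import pipeline

# Assumed instruct checkpoint; chat-style input needs a recent transformers.
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")

messages = [{"role": "user", "content": "Explain what a tokenizer does in one sentence."}]
result = generator(messages, max_new_tokens=64)
# The pipeline returns the full chat; the last message is the model's reply.
print(result[0]["generated_text"][-1]["content"])
```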
reacted to averoo's post with 🔥👀 19 days ago
Hello, researchers! I've tried to make reading HF Daily Papers easier and made a tool that does reviews with LLMs like Claude 3.5, GPT-4o, and sometimes FLUX.

📚 Classification by topics
📅 Sorting by publication date and HF addition date
🔄 Syncing every 2 hours
💻 Hosted on GitHub
🌏 English, Russian, and Chinese
📈 Top by week/month (in progress)

👉 https://hfday.ru

Let me know what you think of it.
posted an update 20 days ago
Hugging Face Hub Python library now comes with easy inference for vision language models! ✨

$ pip install huggingface_hub 🤗
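
Here's a minimal sketch of what that looks like with InferenceClient; the model ID and image URL are placeholders, and any chat-capable VLM served by the Inference API should work the same way:

```python
from huggingface_hub import InferenceClient

# Placeholder model ID; swap in any hosted chat-capable VLM.
client = InferenceClient("meta-llama/Llama-3.2-11B-Vision-Instruct")

# OpenAI-style chat format: image and text content parts in one user turn.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

response = client.chat_completion(messages=messages, max_tokens=100)
print(response.choices[0].message.content)
```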
posted an update 23 days ago
Microsoft released a groundbreaking model that can be used for web automation, with MIT license 🔥 microsoft/OmniParser

An interesting highlight for me was the model's Mind2Web (a benchmark for web navigation) capabilities, which unlock agentic behavior for RPA agents.

No need for hefty web automation pipelines that break when the website/app design changes! Amazing work.

Lastly, the authors fine-tune this model on open-set detection of interactable regions to see whether it can serve as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO. 👏

OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT-4V in parsing.
posted an update 25 days ago
Lotus 🪷 is a new foundation model for monocular depth estimation ✨
Compared to previous diffusion-based MDE models, Lotus is modified for dense prediction tasks
Authors also released a model for normal prediction 🤗
Find everything in this collection merve/lotus-6718fb957dc1c85a47ca1210
posted an update about 1 month ago
It's raining depth estimation models ☔️
DepthPro is a zero-shot depth estimation model by Apple; it's fast, sharp and accurate 🔥
Demo: akhaliq/depth-pro
Model: apple/DepthPro
Paper page: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (2410.02073)

The model consists of two encoders: a patch encoder and an image encoder 🖼️ Their outputs are merged and decoded into depth maps, and the focal length is estimated.
The model outperforms previous state-of-the-art models on average across various benchmarks 📑
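
A minimal sketch of running it, assuming the checkpoint loads through transformers' standard depth-estimation pipeline; if it ships custom code instead, follow the snippet on the model card:

```python
from PIL import Image
from transformers import pipeline

# Assumes the checkpoint is compatible with the depth-estimation pipeline.
depth_estimator = pipeline("depth-estimation", model="apple/DepthPro")

result = depth_estimator(Image.open("photo.jpg"))  # placeholder image path
result["depth"].save("depth_map.png")  # "depth" is a PIL image of the prediction
```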