Learn how to search a video dataset and generate answers using Tevatron/OmniEmbed-v0.1-multivent, an all-modality retriever, and Qwen/Qwen2.5-Omni-7B, an any-to-any model, in this notebook 🤝 merve/smol-vision
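A minimal sketch of the retrieve-then-generate pattern the notebook walks through, assuming the query and video clips have already been embedded (in practice with Tevatron/OmniEmbed-v0.1-multivent); the function names here are placeholders, not the notebook's code:

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, video_vecs: np.ndarray, k: int = 3):
    """Rank precomputed video embeddings against a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    v = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    scores = v @ q                       # cosine similarity of each video to the query
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# The top-k retrieved clips would then be passed to Qwen/Qwen2.5-Omni-7B to generate an answer.
```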

So… who are they, and why does it matter?
Had a lot of fun co-writing this blog post with @xianbao, with key insights translated from Chinese, to unpack how this startup built a model that outperforms GPT-4.1, Claude Opus, and DeepSeek V3 on several major benchmarks.
🧵 A few standout facts:
1. From zero to $3.3B in 18 months:
Founded in March 2023, Moonshot is now backed by Alibaba, Tencent, Meituan, and HongShan.
2. A CEO who thinks from the end:
Yang Zhilin (31) previously worked at Meta AI, Google Brain, and Carnegie Mellon. His vision? Nothing less than AGI — still a rare ambition among Chinese AI labs.
3. A trillion-parameter model that’s surprisingly efficient:
Kimi K2 uses a mixture-of-experts architecture (32B active params per inference) and dominates on coding/math benchmarks.
4. The secret weapon: Muon optimizer:
A new training method that doubles efficiency, cuts memory use in half, and trained on 15.5T tokens with zero failures. Big implications (see the sketch below for the core idea).
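For the curious, here is a minimal sketch of Muon's central trick: orthogonalizing each 2D weight update with a Newton-Schulz iteration. The coefficients and step count follow the public reference implementation and are assumptions, not Moonshot's production training code:

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix (the core operation in Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315    # quintic coefficients from the public reference implementation
    x = g / (g.norm() + eps)             # normalize so the iteration converges
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x
```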
Most importantly, their move from closed to open source signals a broader shift in China’s AI scene — following Baidu’s pivot. But as Yang puts it: “Users are the only real leaderboard.”
👇 Check out the full post to explore what Kimi K2 can do, how to try it, and why it matters for the future of open-source LLMs:
https://huggingface.co/blog/fdaudens/moonshot-ai-kimi-k2-explained
No, the Pangu Model License Agreement Version 1.0 is not a free software license. It imposes significant restrictions, such as prohibiting use within the European Union (Section 3) and requiring attribution (Section 4.2), which conflict with the principles of free software licenses such as the GNU GPL and with the Open Source Definition. The non-transferable clause (Section 2) and indemnity requirement (Section 7) further deviate from standard free software terms.
🔥 "Open Model"? More Like "Openly Restrictive"! 🔥
Huawei calls Pangu Pro MoE an "open model"? That’s like calling a locked door an "open invitation." Let’s break down the brilliant "openness" here:
- "No EU Allowed!" (Section 3) – Because nothing says "open" like banning entire continents. GDPR too scary for you, Huawei?
- "Powered by Pangu" or GTFO (Section 4.2) – Mandatory branding? Real open-source models don’t force you to be a walking billboard.
- Non-transferable license (Section 2) – Can’t pass it on? So much for community sharing.
- Indemnify Huawei for your use (Section 7) – If anything goes wrong, you pay, not them. How generous!
This isn’t an "open model"—it’s a marketing stunt wrapped in proprietary chains. True open-source (Apache, MIT, GPL) doesn’t come with geographic bans, forced attribution, and legal traps.
Huawei, either commit to real openness or stop insulting the FOSS community with this pretend-free nonsense. 🚮
"not commercial" license isn't "Open Source", so please be accurate to users.
Reference:
The Open Source Definition – Open Source Initiative:
https://opensource.org/osd
Gemma License (danger) is not Free Software and is not Open Source:
https://gnu.support/gnu-emacs/emacs-lisp/Gemma-License-danger-is-not-Free-Software-and-is-not-Open-Source.html
So Google's goal is simply monopoly and user dependence. I suggest using fully free LLMs, free as in freedom.

Model:
THU-KEG/LongWriter-Zero-32B
Dataset:
THU-KEG/LongWriter-Zero-RLData
Paper:
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning (2506.18841)
✨ 32B
✨ Multi-reward GRPO: length, fluency, structure, non-redundancy (see the sketch after this list)
✨ Enforces <think><answer> format via Format RM
✨ Built on Qwen2.5-32B-base
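A minimal sketch of how several reward signals can be combined and turned into group-relative advantages in GRPO; the reward names, weights, and format penalty below are placeholders, not the paper's actual reward models:

```python
import numpy as np

def combined_reward(length_r, fluency_r, structure_r, non_redundancy_r, format_ok,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of reward components; the format check gates the whole reward (an assumption)."""
    base = sum(w * r for w, r in zip(weights, (length_r, fluency_r, structure_r, non_redundancy_r)))
    return base if format_ok else 0.0    # e.g. zero reward when the <think><answer> format is violated

def group_relative_advantages(rewards):
    """GRPO's core step: normalize rewards within a group of completions for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

advantages = group_relative_advantages([
    combined_reward(0.8, 0.9, 0.7, 0.6, True),
    combined_reward(0.5, 0.7, 0.9, 0.8, True),
    combined_reward(0.9, 0.4, 0.6, 0.7, False),
])
```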
Okay, please keep researching so that you can get more tools for Uganda, Kenya and Tanzania.

Every language carries its own cultural values and worldviews. So, when we build AI systems, we're not just deciding how they speak but also whose perspectives they represent.
Even choosing which dialect to train on in Norway becomes a question of inclusion and power. In Kenya, will AI speak Swahili from Nairobi or coastal regions? What about indigenous languages with rich oral traditions but limited written text, like Quechua in Peru or Cherokee in North America?
The path forward? Building WITH communities, not just FOR them. Working with local partners (libraries, universities, civil society), testing for cultural alignment, and asking hard questions about representation.
Just published some thoughts on this after my keynote in Norway a few weeks ago: https://huggingface.co/blog/giadap/when-ai-speaks

Thank you, that is interesting, but where is the link? Is it going to work on 24 GB VRAM?

Ever felt your AI agent is "shooting from the hip"? It latches onto a single line of thought and fails to produce a robust, well-rounded plan. This is a common struggle I've called the "AI Reasoning Paradox."
To tackle this, I developed Trinity-Synthesis, a multi-agent architecture designed to force reflection and synthesis before delivering a final answer. The philosophy is simple: constructive conflict between different perspectives leads to better solutions.
Here’s the core idea:
Instead of one agent, it uses four agents running on the same base model but with different "personalities" defined by their system prompts and temperature settings:
🧠 The Visionary: Thinks outside the box (high temp: 1.0).
📊 The Analyst: Focuses on logic, data, and structure (low temp: 0.3).
🛠️ The Pragmatist: Evaluates feasibility, costs, and risks (mid temp: 0.5).
These three "thinkers" work in parallel on the same problem. Then, a final Synthesizer agent critically analyzes their outputs, rejects flawed arguments, and integrates the best points into a single, coherent, and often superior strategy.
The result is a more robust reasoning process that balances creativity with analytical rigor, making it ideal for solving complex, strategic problems where answer quality is critical.
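As an illustration only (not the source code from the article), here is a minimal sketch of the pattern, assuming an OpenAI-compatible endpoint; the model name, prompts, and function names are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key in the environment

PERSONAS = {
    "Visionary":  ("You think outside the box and propose bold, unconventional ideas.", 1.0),
    "Analyst":    ("You focus strictly on logic, data, and structure.", 0.3),
    "Pragmatist": ("You evaluate feasibility, costs, and risks.", 0.5),
}

def ask(system_prompt: str, problem: str, temperature: float, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": problem}],
    )
    return resp.choices[0].message.content

def trinity_synthesis(problem: str) -> str:
    # The three "thinkers" answer the same problem (run sequentially here for simplicity)
    drafts = {name: ask(sys_prompt, problem, temp) for name, (sys_prompt, temp) in PERSONAS.items()}
    synthesis_request = (
        "Critically compare the three answers below, reject flawed arguments, and merge the "
        "strongest points into one coherent plan.\n\n"
        + "\n\n".join(f"## {name}\n{draft}" for name, draft in drafts.items())
    )
    return ask("You are a rigorous synthesizer.", synthesis_request, 0.4)
```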
I've written a deep dive on how it works, including a detailed case study ("The Helios Initiative") and the Python source code for you to experiment with.
Read the full article on Medium:
https://medium.com/@brainhome9/trinity-synthesis-how-i-built-an-ai-agent-that-thinks-before-it-speaks-d45d45c2827c
I'd love to hear your feedback and see what you build with it!
#AI #AIAgents #LLM #Reasoning #MultiAgent

OpenEvolve is an evolutionary coding agent that uses LLMs to discover and optimize algorithms. I successfully replicated DeepMind's results on circle packing (99.97% match!) and evolved a random search into a simulated annealing algorithm.
✨ Key features:
- Evolves entire codebases (not just single functions)
- Works with any OpenAI-compatible API
- LLM ensemble approach for better results
- Multi-objective optimization
👉 Check it out:
GitHub: https://github.com/codelion/openevolve
Blog post: https://huggingface.co/blog/codelion/openevolve
Would love to hear your thoughts or answer any questions about it!
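Since the headline result was evolving random search into simulated annealing, here is a minimal, generic sketch of that target algorithm (not code from the OpenEvolve repository); the toy energy function and parameter values are placeholders:

```python
import math
import random

def simulated_annealing(initial, neighbor, energy, t0=1.0, t_min=1e-3, alpha=0.995):
    """Generic simulated annealing loop."""
    current, current_e = initial, energy(initial)
    best, best_e = current, current_e
    t = t0
    while t > t_min:
        candidate = neighbor(current)
        candidate_e = energy(candidate)
        # Always accept improvements; accept worse moves with a temperature-dependent probability
        if candidate_e < current_e or random.random() < math.exp((current_e - candidate_e) / t):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = current, current_e
        t *= alpha                        # cool down
    return best, best_e

# Toy usage: minimize (x - 3)^2 starting from 0
best_x, best_e = simulated_annealing(
    initial=0.0,
    neighbor=lambda x: x + random.uniform(-0.5, 0.5),
    energy=lambda x: (x - 3.0) ** 2,
)
```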
Gemini's proprietary license is a deal-breaker. It's not just about performance—it's about freedom. Google's terms actively restrict libre use, while models like QwQ 32B and DeepSeek v3 (when properly licensed) respect user rights. Never conflate ethically-licensed AI with corporate traps that forbid modification, redistribution, or independent use.

That's why today I'm excited to introduce 𝐫𝐞𝐚𝐝𝐞𝐫𝐬, the new feature of PdfItDown v1.4.0!🎉
With 𝘳𝘦𝘢𝘥𝘦𝘳𝘴, you can choose among three (for now👀) flavors of text extraction and conversion to PDF:
- 𝗗𝗼𝗰𝗹𝗶𝗻𝗴, which does a fantastic job with presentations, spreadsheets and word documents🦆
- 𝗟𝗹𝗮𝗺𝗮𝗣𝗮𝗿𝘀𝗲 by LlamaIndex, suitable for more complex and articulated documents, with mixture of texts, images and tables🦙
- 𝗠𝗮𝗿𝗸𝗜𝘁𝗗𝗼𝘄𝗻 by Microsoft, not the best at handling highly structured documents, but extremely flexible in terms of input file format (it can even convert XML, JSON and ZIP files!)✒️
You can use this new feature in your Python scripts (check the attached code snippet!😉) and in the command line interface as well!🐍
Have fun and don't forget to star the repo on GitHub ➡️ https://github.com/AstraBert/PdfItDown
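A hypothetical usage sketch of what picking a reader in a script might look like; the import path, class name, and "reader" argument below are assumptions, so check the repository README for the actual v1.4.0 API:

```python
# Hypothetical sketch -- the actual PdfItDown v1.4.0 API may differ.
from pdfitdown.pdfconversion import Converter   # import path is an assumption

converter = Converter(reader="docling")          # or "llamaparse" / "markitdown" (assumed values)
converter.convert(file_path="report.xlsx", output_path="report.pdf")
```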

Just tested it with Steve Jobs' Stanford speech and was speechless (pun intended). The video isn’t sped up.
3 things that floored me:
- Transcription took just 10 seconds for a 15-min file
- Got a CSV with perfect timestamps, punctuation & capitalization
- Stunning accuracy (correctly captured "Reed College" and other specifics)
NVIDIA also released a demo where you can click any transcribed segment to play it instantly.
The improvement is significant: #1 on the Open ASR Leaderboard, ~6% word error rate (best in class), with complete commercial freedom (CC-BY-4.0 license).
Time to update those Whisper pipelines! H/t @Steveeeeeeen for the finding!
Model: nvidia/parakeet-tdt-0.6b-v2
Demo: nvidia/parakeet-tdt-0.6b-v2
ASR Leaderboard: hf-audio/open_asr_leaderboard
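If you want to swap it into an existing pipeline, a minimal sketch using NeMo's high-level API might look like the following; exact arguments and options (e.g. for timestamps) can vary between NeMo versions, so treat this as an outline rather than the official recipe:

```python
# pip install "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# Load the released checkpoint from the Hugging Face Hub via NeMo
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Transcribe a local 16 kHz mono WAV file (the filename here is a placeholder)
transcripts = asr_model.transcribe(["steve_jobs_stanford.wav"])
print(transcripts[0])
```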

And that’s just some of the companies from the Chinese community that released open models in April 🤯
zh-ai-community/april-2025-open-releases-from-the-chinese-community-67ea699965f6e4c135cab10f
🎬 Video
> MAGI-1 by SandAI
> SkyReels-A2 & SkyReels-V2 by Skywork
> Wan2.1-FLF2V by Alibaba-Wan
🎨 Image
> HiDream-I1 by Vivago AI
> Kimi-VL by Moonshot AI
> InstantCharacter by InstantX & Tencent-Hunyuan
> Step1X-Edit by StepFun
> EasyControl by Shanghai Jiaotong University
🧠 Reasoning
> MiMo by Xiaomi
> Skywork-R1V 2.0 by Skywork
> ChatTS by ByteDance
> Kimina by Moonshot AI & Numina
> GLM-Z1 by Zhipu AI
> Skywork OR1 by Skywork
> Kimi-VL-Thinking by Moonshot AI
🔊 Audio
> Kimi-Audio by Moonshot AI
> IndexTTS by BiliBili
> MegaTTS3 by ByteDance
> Dolphin by DataOceanAI
🔢 Math
> DeepSeek Prover V2 by Deepseek
🌍 LLM
> Qwen by Alibaba-Qwen
> InternVL3 by Shanghai AI lab
> Ernie4.5 (demo) by Baidu
📊 Dataset
> PHYBench by Eureka-Lab
> ChildMandarin & Seniortalk by BAAI
Please feel free to add if I missed anything!

The dataset simulates the discovery phase of a fictitious VC firm called Reasoned Capital and, once expanded, can be used to create models which are able to make complex, subjective financial decisions based on different criteria.
The generation process uses increasingly complex prompts to drive recursive problem-solving, pushing models to assess and reevaluate the conclusions and opinions generated by upstream models. Pretty neat stuff, and I'm not aware of this architecture being used in a reasoning context anywhere else.
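A minimal sketch of that upstream/downstream pattern, assuming an OpenAI-compatible endpoint; the prompts, model name, and function names are placeholders, not the actual generation pipeline:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint

def generate(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def recursive_review(question: str, rounds: int = 3) -> str:
    answer = generate(question)
    for i in range(rounds):
        # Each downstream pass reassesses the upstream model's conclusions on a harder framing
        answer = generate(
            f"Round {i + 1}: critique the reasoning below, fix weak points, and produce an "
            f"improved investment recommendation.\n\nQuestion: {question}\n\nPrevious answer:\n{answer}"
        )
    return answer
```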
Check it out: ZennyKenny/synthetic_vc_financial_decisions_reasoning_dataset

moonshotai/Kimi-Audio-7B-Instruct
✨ 7B
✨ 13M+ hours of pretraining data
✨ Novel hybrid input architecture
✨ Universal audio capabilities (ASR, AQA, AAC, SER, SEC/ASC, end-to-end conversation)

Today, we're releasing an expanded version: 32K images annotated with 3.7M responses from over 300K individuals, collected in under two weeks using the Rapidata Python API.
Rapidata/text-2-image-Rich-Human-Feedback-32k
A few months ago, we published one of our most-liked datasets, with 13K images based on the @data-is-better-together dataset, following Google's research on "Rich Human Feedback for Text-to-Image Generation" (https://arxiv.org/abs/2312.10240). It collected over 1.5M responses from 150K+ participants.
Rapidata/text-2-image-Rich-Human-Feedback
In the examples below, users highlighted words from prompts that were not correctly depicted in the generated images. Higher word scores indicate more frequent issues. If an image captured the prompt accurately, users could select [No_mistakes].
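As an illustration of how such word scores could be aggregated from the raw highlights, here is a minimal sketch; the field name "highlighted_words" and the example records are assumptions, not the dataset's actual schema:

```python
from collections import Counter

def word_scores(annotations, prompt_words):
    """Fraction of annotators who flagged each prompt word as incorrectly depicted."""
    counts = Counter()
    for ann in annotations:                      # one annotator's selection for one image
        for word in ann["highlighted_words"]:
            counts[word] += 1
    n = max(len(annotations), 1)
    return {word: counts[word] / n for word in prompt_words}

scores = word_scores(
    [{"highlighted_words": ["astronaut"]}, {"highlighted_words": []}],  # empty selection ≈ [No_mistakes]
    prompt_words=["an", "astronaut", "riding", "a", "horse"],
)
# -> "astronaut" gets 0.5, every other word 0.0
```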
We're continuing to work on large-scale human feedback and model evaluation. If you're working on related research and need large, high-quality annotations, feel free to get in touch: [email protected].