๐ Multimodal - MiniCPM-o 2.6 is a new sota any-to-any model by OpenBMB (vision, speech and text!) - VideoChat-Flash-Qwen2.5-2B is new video multimodal models by OpenGVLab that come in sizes 2B & 7B in resolutions 224 & 448 - ByteDance released larger SA2VA that comes in 26B parameters - Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance
๐ฌ LLMs - MiniMax-Text-01 is a new huge language model (456B passive 45.9B active params) by MiniMaxAI with context length of 4M tokens ๐คฏ - Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B - kyutai released Helium-1-Preview-2B is a new small multilingual LM - Wayfarer-12B is a new LLM able to write D&D ๐ง๐ปโโ๏ธ - ReaderLM-v2 is a new HTML parsing model by Jina AI - Dria released, Dria-Agent-a-3B, new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder - Unsloth released Phi-4, faster and memory efficient Llama 3.3
๐ผ๏ธ Vision - MatchAnything is a new foundation model for matching - FitDit is a high-fidelity VTON model based on DiT architecture
๐ฃ๏ธ Audio - OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities
๐ Retrieval - lightblue released a new reranker based on Qwen2.5 LB-reranker-0.5B-v1.0 that can handle 95+ languages - cde-small-v2 is a new sota small retrieval model by @jxm
๐ช๐ฒ'๐๐ฒ ๐ท๐๐๐ ๐ฟ๐ฒ๐น๐ฒ๐ฎ๐๐ฒ๐ฑ ๐๐บ๐ผ๐น๐ฎ๐ด๐ฒ๐ป๐๐ ๐๐ญ.๐ฏ.๐ฌ ๐, and it comes with a major feature: you can now log agent runs using OpenTelemetry to inspect them afterwards! ๐
This interactive format is IMO much easier to inspect big multi-step runs than endless console logs.
Multimodal ๐ผ๏ธ > ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts > moondream2 is out with new capabilities like outputting structured data and gaze detection! > Dataset: Alibaba DAMO lab released multimodal textbook โ 22k hours worth of samples from instruction videos ๐คฏ > Dataset: SciCap captioning on scientific documents benchmark dataset is released along with the challenge!
LLMs ๐ฌ > Microsoft released Phi-4, sota open-source 14B language model ๐ฅ > Dolphin is back with Dolphin 3.0 Llama 3.1 8B ๐ฌ๐ฌ > Prime-RL released Eurus-2-7B-PRIME a new language model trained using PRIME alignment > SmallThinker-3B is a new small reasoning LM based on Owen2.5-3B-Instruct ๐ญ > Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview ๐ > Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs ๐ > Dataset: Qwen team is on the roll, they just released CodeElo, a dataset of code preferences ๐ฉ๐ปโ๐ป
Embeddings ๐ > @MoritzLaurer released zero-shot version of ModernBERT large ๐ > KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B
Image/Video Generation โฏ๏ธ > NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts ๐ฅ > Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!) > Dataset: fal released cosmos-openvid-1m Cosmos-tokenized OpenVid-1M with samples from OpenVid-1M
Others > Prior Labs released TabPFNv2, the best tabular transformer is out for classification and regression > Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
reacted to andrewrreed's
post with ๐๐ค๐ค๐ฅ9 days ago
๐ Supercharge your LLM apps with Langfuse on Hugging Face Spaces!
Langfuse brings end-to-end observability and tooling to accelerate your dev workflow from experiments through production
Now available as a Docker Space directly on the HF Hub! ๐ค
๐ Trace everything: monitor LLM calls, retrieval, and agent actions with popular frameworks 1โฃ One-click deployment: on Spaces with persistent storage and integrated OAuth ๐ Simple Prompt Management: Version, edit, and update without redeployment โ Intuitive Evals: Collect user feedback, run model/prompt evaluations, and improve quality ๐ Dataset Creation: Build datasets directly from production data to enhance future performance
Kudos to the Langfuse team for this collab and the awesome, open-first product theyโre building! ๐ @marcklingen@Clemo@MJannik
> The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) both for images and videos โฏ๏ธ
> The models come in 1B, 4B and 8B and are based on InternVL2.5 for base architecture and Qwen2, Qwen2.5 and InternLM2 for language model part (depending on the checkpoint)
> The model is very interesting, it has different encoders for different modalities each (visual prompt, text prompt, image and video) then it concatenates these to feed into LLM ๐ฌ
the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks โคต๏ธ
> Their annotation pipeline is also interesting, they seems to use two open large vision LMs to refine the annotations, and have different levels of descriptions to provide consistency.
QvQ-72B-Preview๐ an open weight model for visual reasoning just released by Alibaba_Qwen team Qwen/qvq-676448c820912236342b9888 โจ Combines visual understanding & language reasoning. โจ Scores 70.3 on MMMU โจ Outperforms Qwen2-VL-72B-Instruct in complex problem-solving
The paper has a lot of experiments (they trained 84 models!) about what makes the video LMs work โฏ๏ธ
Try the demo for best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B they evaluate sampling strategies, scaling laws for models and datasets, video representation and more! > The authors find out that whatever design decision was applied to small models also scale properly when the model and dataset are scaled ๐ scaling dataset has diminishing returns for smaller models > They evaluate frame sampling strategies, and find that FPS sampling is better than uniform sampling, and they find 8-32 tokens per frame optimal > They also compare image encoders, they try a variation of models from shape optimized SigLIP to DINOv2 they find google/siglip-so400m-patch14-384 to be most powerful ๐ฅ > they also compare freezing different parts of models, training all stages with some frozen parts give the best yield
They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo 7B outperforms 30B models ๐ฅ
Multimodal ๐ผ๏ธ > Google shipped a PaliGemma 2, new iteration of PaliGemma with more sizes: 3B, 10B and 28B, with pre-trained and captioning variants ๐ > OpenGVLab released InternVL2, seven new vision LMs in different sizes, with sota checkpoint with MIT license โจ > Qwen team at Alibaba released the base models of Qwen2VL models with 2B, 7B and 72B ckpts
LLMs ๐ฌ > Meta released a new iteration of Llama 70B, Llama3.2-70B trained further > EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license ๐ฅ > Dataset: CohereForAI released GlobalMMLU, multilingual version of MMLU with 42 languages with Apache 2.0 license > Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models > Dataset: FineWeb2 just landed with multilinguality update! ๐ฅ nearly 8TB pretraining data in many languages!
Image/Video Generation ๐ผ๏ธ > Tencent released HunyuanVideo, a new photorealistic video generation model > OminiControl is a new editing/control framework for image generation models like Flux
Audio ๐ > Indic-Parler-TTS is a new text2speech model made by community
New InternVL drop with a state-of-the-art 78B vision language model with MIT license ๐ฅ https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c The release comes with seven new vision LMs based on InternViT 300M/6B and Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2 (8B, 7B, 20B) in different sizes 78B model is of InternViT 6B and Qwen2.5-72B Instruct, can accomplish variety of tasks ๐ Try here OpenGVLab/InternVL