Multimodal - MiniCPM-o 2.6 is a new SOTA any-to-any model by OpenBMB (vision, speech and text!) - VideoChat-Flash-Qwen2.5-2B is one of the new video multimodal models by OpenGVLab, which come in 2B & 7B sizes and 224 & 448 resolutions - ByteDance released a larger SA2VA with 26B parameters - Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance
💬 LLMs - MiniMax-Text-01 is a huge new language model (456B total params, 45.9B active) by MiniMaxAI with a context length of 4M tokens 🤯 - Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B - kyutai released Helium-1-Preview-2B, a new small multilingual LM - Wayfarer-12B is a new LLM that can write D&D-style adventures 🧙🏻‍♂️ - ReaderLM-v2 is a new HTML parsing model by Jina AI - Dria released Dria-Agent-α-3B, a new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder - Unsloth released faster, more memory-efficient versions of Phi-4 and Llama 3.3
🖼️ Vision - MatchAnything is a new foundation model for image matching - FitDiT is a high-fidelity virtual try-on (VTON) model based on the DiT architecture
🗣️ Audio - OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities
Retrieval - lightblue released LB-reranker-0.5B-v1.0, a new reranker based on Qwen2.5 that can handle 95+ languages - cde-small-v2 is a new SOTA small retrieval model by @jxm
🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models, with training scripts, datasets, and metrics.
We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. for classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
- my training scripts, using the Sentence Transformers library
- my Weights & Biases reports with losses & metrics
- my list of 30 training and 13 evaluation datasets
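For a sense of what such a training script looks like, here is a minimal sketch using the public Sentence Transformers APIs; the tokenizer, dataset, and Matryoshka dimensions below are illustrative assumptions, not the exact released recipe:

```python
# Minimal sketch of training a static embedding model with Sentence Transformers.
# The tokenizer, dataset and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from tokenizers import Tokenizer
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.models import StaticEmbedding

# A StaticEmbedding module is a token-embedding lookup + mean pooling: no attention.
static = StaticEmbedding(
    Tokenizer.from_pretrained("google-bert/bert-base-uncased"), embedding_dim=1024
)
model = SentenceTransformer(modules=[static])

# Any (anchor, positive) pair dataset works; this one is only an example.
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")

# In-batch negatives loss, wrapped in MatryoshkaLoss so truncated embeddings remain useful.
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[1024, 512, 256, 128, 64, 32],
)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save("static-embedding-sketch")
```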
The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107,500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: no Transformer blocks, no attention, not even a matrix multiplication. Super speed!
- No maximum sequence length! Embed texts of any length (note: longer texts may embed worse)
- Linear instead of quadratic complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
🪆 Matryoshka support: lets you truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)
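A minimal usage sketch (the model id below is assumed to be the English retrieval model from this release; swap it for the actual repository if it differs), with Matryoshka truncation to 256 dimensions via truncate_dim:

```python
from sentence_transformers import SentenceTransformer

# Assumed model id for the released English retrieval model; adjust if needed.
model = SentenceTransformer(
    "sentence-transformers/static-retrieval-mrl-en-v1", device="cpu", truncate_dim=256
)
embeddings = model.encode([
    "How do static embedding models work?",
    "They are a token-embedding lookup plus mean pooling, with no attention.",
])
print(embeddings.shape)  # (2, 256) thanks to Matryoshka truncation
```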
Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings
The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.
A new benchmark (DPAB-α) has been released that evaluates LLM function calling using both Pythonic and JSON approaches.
It shows that Pythonic function calling often outperforms traditional JSON-based methods, especially for complex multi-step tasks.
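To make the difference concrete, here is an illustrative comparison of the two output styles for the same two-step task (the tool names get_weather and send_message are hypothetical, not taken from DPAB-α):

```python
# Pythonic function calling: the model emits executable code, so step 2 can
# consume the result of step 1 directly, within a single model turn.
pythonic_call = """
temp = get_weather(city="Berlin")["temperature_c"]
send_message(to="alice", body=f"It is {temp}°C in Berlin right now.")
"""

# JSON function calling: each call is a standalone structured object, so the
# model typically needs an extra turn to see the first tool's result before
# it can fill in the arguments of the second call.
json_calls = [
    {"name": "get_weather", "arguments": {"city": "Berlin"}},
    # ...tool result returned to the model here...
    {"name": "send_message",
     "arguments": {"to": "alice", "body": "It is 7°C in Berlin right now."}},
]
```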
Key findings from benchmarks:
- Claude 3.5 Sonnet leads with 87% on Pythonic vs 45% on JSON
- Smaller models show impressive results (Dria-Agent-α-3B: 72% Pythonic)
- Even larger models like DeepSeek V3 (685B) show significant gaps (63% Pythonic vs 33% JSON)
If you're building or using LLM agents, these results suggest that how you implement function calling could impact performance - it might be worth reconsidering JSON-only approaches.
✨ MiniMax-Text-01: - 456B parameters with 45.9B activated per token - Combines Lightning Attention, Softmax Attention, and MoE for optimal performance - Training context up to 1M tokens, inference handles 4M tokens
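Lightning Attention belongs to the linear-attention family; the sketch below is not MiniMax's kernel, just a generic, non-causal illustration (with an assumed elu+1 feature map) of why linear attention can scale to million-token contexts while standard softmax attention has quadratic cost:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # Linear attention: compute phi(Q) @ (phi(K)^T V); memory and compute grow
    # linearly with sequence length. Non-causal, illustrative only.
    q, k = F.elu(q) + 1, F.elu(k) + 1             # simple non-negative feature map
    kv = torch.einsum("bnd,bne->bde", k, v)        # (dim x dim) summary, O(n)
    z = 1 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

def softmax_attention(q, k, v):
    # Standard attention: the (seq_len x seq_len) score matrix makes this O(n^2).
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

q = k = v = torch.randn(1, 1024, 64)
print(linear_attention(q, k, v).shape)   # torch.Size([1, 1024, 64])
print(softmax_attention(q, k, v).shape)  # torch.Size([1, 1024, 64])
```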
✨ MiniMax-VL-01: - ViT-MLP-LLM framework (non-transformer) - Handles image inputs from 336×336 to 2016×2016 - 694M image-caption pairs + 512B tokens processed across 4 stages