University of Sydney

community
Activity Feed

AI & ML interests

None defined yet.

Recent Activity

nielsr updated a Space 17 days ago
usyd-community/README
nielsr updated a collection 17 days ago
ViTPose
nielsr updated a collection 17 days ago
ViTPose

usyd-community's activity

merve
posted an update 5 days ago
Oof, what a week! 🥵 So many things have happened, let's recap! merve/jan-24-releases-6793d610774073328eac67a9

Multimodal 💬
- We have released SmolVLM -- tiniest VLMs that come in 256M and 500M, with its retrieval models ColSmol for multimodal RAG 💗
- UI-TARS are new models by ByteDance to unlock agentic GUI control 🤯 in 2B, 7B and 72B
- Alibaba DAMO lab released VideoLlama3, new video LMs that come in 2B and 7B
- MiniMaxAI released Minimax-VL-01, whose decoder is based on the MiniMax-Text-01 456B MoE model with long context
- Dataset: Yale released a new benchmark called MMVU
- Dataset: CAIS released Humanity's Last Exam (HLE), a new challenging MM benchmark

LLMs 📖
- DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 671B reasoning models by DeepSeek, and six distilled dense models, on par with o1 with MIT license! 🤯
- Qwen2.5-Math-PRM: new math models by Qwen in 7B and 72B
- NVIDIA released AceMath and AceInstruct, a new family of models and their datasets (SFT and reward ones too!)

Audio 🗣️
- Llasa is a new speech synthesis model based on Llama that comes in 1B, 3B, and 8B
- TangoFlux is a new audio generation model trained from scratch and aligned with CRPO

Image/Video/3D Generation ⏯️
- Flex.1-alpha is a new 8B pre-trained diffusion model by ostris similar to Flux
- Tencent released Hunyuan3D-2, new 3D asset generation from images
merve
posted an update 5 days ago
smolagents can see 🔥
we just shipped vision support to smolagents 🤗 agentic computers FTW

you can now:
💻 let the agent get images dynamically (e.g. agentic web browser)
📑 pass images at the init of the agent (e.g. chatting with documents, filling forms automatically etc)
with just a few LoC changed! 🤯
you can use transformers models locally (like Qwen2VL) OR plug in your favorite multimodal inference provider (gpt-4o, anthropic & co) 🤠

read our blog http://hf.co/blog/smolagents-can-see
merve
posted an update 12 days ago
Everything that happened this week in open AI, a recap 🤠 merve/jan-17-releases-678a673a9de4a4675f215bf5

👀 Multimodal
- MiniCPM-o 2.6 is a new sota any-to-any model by OpenBMB (vision, speech and text!)
- VideoChat-Flash-Qwen2.5-2B is part of a new family of video multimodal models by OpenGVLab that come in sizes 2B & 7B and resolutions 224 & 448
- ByteDance released larger SA2VA that comes in 26B parameters
- Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance

💬 LLMs
- MiniMax-Text-01 is a new huge language model (456B total, 45.9B active params) by MiniMaxAI with context length of 4M tokens 🤯
- Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B
- kyutai released Helium-1-Preview-2B, a new small multilingual LM
- Wayfarer-12B is a new LLM able to write D&D 🧙🏻‍♂️
- ReaderLM-v2 is a new HTML parsing model by Jina AI
- Dria released Dria-Agent-a-3B, a new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder
- Unsloth released Phi-4, faster and memory-efficient Llama 3.3

🖼️ Vision
- MatchAnything is a new foundation model for matching
- FitDit is a high-fidelity VTON model based on DiT architecture

🗣️ Audio
- OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities

📖 Retrieval
- lightblue released LB-reranker-0.5B-v1.0, a new reranker based on Qwen2.5 that can handle 95+ languages
- cde-small-v2 is a new sota small retrieval model by @jxm
merve
posted an update 13 days ago
merve
posted an update 16 days ago
there's a new multimodal retrieval model in town 🤠
LlamaIndex released vdr-2b-multi-v1
> uses 70% fewer image tokens, yet outperforms other dse-qwen2 based models
> 3x faster inference with less VRAM 💨
> shrinkable with matryoshka 🪆
> can do cross-lingual retrieval!
Collection: llamaindex/visual-document-retrieval-678151d19d2758f78ce910e1 (with models and datasets)
Demo: llamaindex/multimodal_vdr_demo
Learn more from their blog post here https://huggingface.co/blog/vdr-2b-multilingual 📖
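
For readers unfamiliar with the "shrinkable with matryoshka" point: a minimal sketch of what shrinking usually means for Matryoshka-style embeddings, assuming the common recipe of keeping the leading components of a vector and re-normalizing. The helper names (`truncate_embedding`, `cosine`) and the toy 8-dimensional vectors are made up for illustration; they are not output from vdr-2b-multi-v1 and this is not the model's own API.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components of `vec` and L2-normalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to normalize
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [0.9, 0.1, 0.3, 0.0, 0.05, 0.02, 0.01, 0.03]
docs = {
    "doc_a": [0.8, 0.2, 0.4, 0.1, 0.01, 0.04, 0.02, 0.00],
    "doc_b": [0.1, 0.9, 0.0, 0.3, 0.06, 0.01, 0.05, 0.02],
}


def best_match(dim):
    """Return the name of the document closest to the query at size `dim`."""
    q = truncate_embedding(query, dim)
    scores = {
        name: cosine(q, truncate_embedding(vec, dim))
        for name, vec in docs.items()
    }
    return max(scores, key=scores.get)


for dim in (8, 4, 2):
    print(dim, best_match(dim))
```

The idea is that a Matryoshka-trained model packs most of the signal into the leading dimensions, so shorter vectors trade a little accuracy for less storage and faster search. How well ranking holds up at each size depends on the model and should be measured on real data.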
nielsr
updated a Space 17 days ago
merve
posted an update 19 days ago
What a beginning to this year in open ML 🤠
Let's unwrap! merve/jan-10-releases-677fe34177759de0edfc9714

Multimodal 🖼️
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released multimodal textbook: 22k hours worth of samples from instruction videos 🤯
> Dataset: SciCap, a benchmark dataset for captioning scientific documents, is released along with the challenge!

LLMs 💬
> Microsoft released Phi-4, sota open-source 14B language model 🔥
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B 🐬🐬
> Prime-RL released Eurus-2-7B-PRIME, a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Qwen2.5-3B-Instruct 💭
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview 📕
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs 📕
> Dataset: Qwen team is on a roll, they just released CodeElo, a dataset of code preferences 👩🏻‍💻

Embeddings 🔖
> @MoritzLaurer released zero-shot version of ModernBERT large 👏
> KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B

Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts 🔥
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m: samples from OpenVid-1M tokenized with Cosmos

Others
> Prior Labs released TabPFNv2: the best tabular transformer for classification and regression is out
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding