Zesen Cheng

ClownRat

AI & ML interests

Multi-modal foundation models; segmentation, detection, and tracking

ClownRat's activity

upvoted 9 papers 3 days ago

Llama-Nemotron: Efficient Reasoning Models

Paper • 2505.00949 • Published 10 days ago • 32

LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Paper • 2505.02625 • Published 7 days ago • 20

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

Paper • 2505.02735 • Published 7 days ago • 27

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Paper • 2505.02835 • Published 6 days ago • 22

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Paper • 2505.02707 • Published 7 days ago • 79

RM-R1: Reward Modeling as Reasoning

Paper • 2505.02387 • Published 7 days ago • 65

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Paper • 2505.03739 • Published 5 days ago • 8

Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Paper • 2505.03821 • Published 9 days ago • 22

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

Paper • 2505.05467 • Published 3 days ago • 13

upvoted a paper 6 days ago

SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations

Paper • 2505.02094 • Published 8 days ago • 16

upvoted 3 papers 19 days ago

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

Paper • 2503.11579 • Published Mar 14 • 20

Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning

Paper • 2503.11646 • Published Mar 14 • 36

API Agents vs. GUI Agents: Divergence and Convergence

Paper • 2503.11069 • Published Mar 14 • 37

upvoted a paper 25 days ago

Gemma 3 Technical Report

Paper • 2503.19786 • Published Mar 25 • 50

upvoted a paper about 2 months ago

MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

Paper • 2503.14428 • Published Mar 18 • 9

authored a paper about 2 months ago

MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

Paper • 2503.14428 • Published Mar 18 • 9

upvoted 3 papers about 2 months ago

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15, 2024 • 125

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Paper • 2410.18558 • Published Oct 24, 2024 • 20

Transformers without Normalization

Paper • 2503.10622 • Published Mar 13 • 162

upvoted a paper 2 months ago

LongRoPE2: Near-Lossless LLM Context Window Scaling

Paper • 2502.20082 • Published Feb 27 • 38