JueZhang's Collections
VisualLLM
How Far Are We from Intelligent Visual Deductive Reasoning?
Paper
•
2403.04732
•
Published
•
20
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
•
2403.07508
•
Published
•
75
DragAnything: Motion Control for Anything using Entity Representation
Paper
•
2403.07420
•
Published
•
14
Learning and Leveraging World Models in Visual Representation Learning
Paper
•
2403.00504
•
Published
•
32
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
Paper
•
2403.13248
•
Published
•
78
Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos
Paper
•
2403.13044
•
Published
•
15
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Paper
•
2403.12943
•
Published
•
15
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper
•
2403.11703
•
Published
•
17
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Paper
•
2403.11481
•
Published
•
13
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
Paper
•
2403.10301
•
Published
•
52
RAFT: Adapting Language Model to Domain Specific RAG
Paper
•
2403.10131
•
Published
•
68
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper
•
2403.10517
•
Published
•
33
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper
•
2403.18814
•
Published
•
46
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Paper
•
2403.17804
•
Published
•
17
Can large language models explore in-context?
Paper
•
2403.15371
•
Published
•
32
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Paper
•
2403.15042
•
Published
•
26
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Paper
•
2403.15377
•
Published
•
23
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects
Paper
•
2403.15382
•
Published
•
10
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Paper
•
2404.07972
•
Published
•
47
Rho-1: Not All Tokens Are What You Need
Paper
•
2404.07965
•
Published
•
89
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
Paper
•
2404.05902
•
Published
•
21
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Paper
•
2404.07973
•
Published
•
31
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper
•
2404.07503
•
Published
•
30
OmniFusion Technical Report
Paper
•
2404.06212
•
Published
•
75
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper
•
2404.05719
•
Published
•
83
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
Paper
•
2404.04860
•
Published
•
25
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
Paper
•
2404.03648
•
Published
•
25
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Paper
•
2404.03413
•
Published
•
26
Scaling Instructable Agents Across Many Simulated Worlds
Paper
•
2404.10179
•
Published
•
28
What matters when building vision-language models?
Paper
•
2405.02246
•
Published
•
102
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper
•
2405.15223
•
Published
•
13
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Paper
•
2406.14515
•
Published
•
33
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
Paper
•
2406.11896
•
Published
•
19
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper
•
2406.11816
•
Published
•
23
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
Paper
•
2407.01494
•
Published
•
13