Lavico's Collections
Vision Task
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 51
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper • 2406.09406 • Published • 14
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Paper • 2406.10227 • Published • 9
What If We Recaption Billions of Web Images with LLaMA-3?
Paper • 2406.08478 • Published • 40
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
Paper • 2407.16198 • Published • 13
VideoGameBunny: Towards vision assistants for video games
Paper • 2407.15295 • Published • 22
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Paper • 2407.15754 • Published • 20
Theia: Distilling Diverse Vision Foundation Models for Robot Learning
Paper • 2407.20179 • Published • 47
SHIC: Shape-Image Correspondences with no Keypoint Supervision
Paper • 2407.18907 • Published • 41
Paper • 2407.21017 • Published • 23
Improving 2D Feature Representations by 3D-Aware Fine-Tuning
Paper • 2407.20229 • Published • 7
Diffusion Models as Data Mining Tools
Paper • 2408.02752 • Published • 14
Segment Anything with Multiple Modalities
Paper • 2408.09085 • Published • 22
Sapiens: Foundation for Human Vision Models
Paper • 2408.12569 • Published • 90
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
Paper • 2408.10945 • Published • 10
Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos
Paper • 2410.16259 • Published • 5
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
Paper • 2411.04928 • Published • 49