mlfoundations-cua-dev/uitars_1000_steps_gbs_8_wd_0.1orm_1.0_add_synthetic_legacy_typing_data
mlfoundations-cua-dev/idm_tars_1.5_7b_frame_pairs_11emoved_editor_data_and_cua_in_house_data
mlfoundations-cua-dev/idm_tars_1.5_7b_frame_pairs_10x_grad_norm_1.0_tripadvisor_data_yt_data
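These repo IDs follow the standard Hugging Face Hub naming scheme, so any of the listed checkpoints can be fetched with the `huggingface_hub` client. A minimal sketch, assuming the repos are publicly readable (if they are private or gated, a token with access must be supplied):

```python
from huggingface_hub import snapshot_download

# One of the IDM checkpoints listed above; swap in any repo ID from the list.
repo_id = "mlfoundations-cua-dev/idm_tars_1.5_7b_frame_pairs_10x_grad_norm_1.0_tripadvisor_data_yt_data"

# Downloads the full repo snapshot into the local HF cache and returns its path.
# Pass token="hf_..." here if the repo is private or gated.
local_dir = snapshot_download(repo_id=repo_id)
print(f"Checkpoint files available under: {local_dir}")
```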
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions. arXiv:2411.07461 (Nov 12, 2024)
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations. arXiv:2408.12590 (Aug 22, 2024)
Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness. arXiv:2407.01942 (Jul 2, 2024)
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models. arXiv:2408.08872 (Aug 16, 2024)
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. arXiv:2406.11271 (Jun 17, 2024)
Getting it Right: Improving Spatial Consistency in Text-to-Image Models. arXiv:2404.01197 (Apr 1, 2024)
DataComp: In search of the next generation of multimodal datasets. arXiv:2304.14108 (Apr 27, 2023)
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets. arXiv:2312.10253 (Dec 15, 2023)
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use. arXiv:2308.06595 (Aug 12, 2023)
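Each entry above carries an arXiv identifier, so titles and abstracts can be pulled programmatically from the public arXiv export API. A minimal sketch using only the Python standard library; the ID list is copied from the entries above, and the endpoint and `id_list` query parameter are the documented arXiv Atom API:

```python
import urllib.request
import xml.etree.ElementTree as ET

# arXiv IDs copied from the paper list above.
ARXIV_IDS = [
    "2411.07461", "2408.12590", "2407.01942", "2408.08872", "2406.11271",
    "2404.01197", "2304.14108", "2312.10253", "2308.06595",
]

# The arXiv export API returns an Atom feed for a comma-separated id_list.
url = "http://export.arxiv.org/api/query?id_list=" + ",".join(ARXIV_IDS)
with urllib.request.urlopen(url) as resp:
    feed = ET.fromstring(resp.read())

# Atom entries carry <title> and <summary> elements under the Atom namespace.
ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in feed.findall("atom:entry", ns):
    title = " ".join(entry.find("atom:title", ns).text.split())
    print(title)
```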