arxiv:2410.00255

Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

Published on Sep 30 · Submitted by weitaikang on Oct 4

Abstract

Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two key types of instruction data: 1) Adversarial Instruction-following data, which mixes negative and positive samples to enhance the model's discriminative understanding, and 2) Diverse Instruction-following data, which contains various instruction styles to enhance the model's generalization. In total, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training-set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens its object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).
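As a concrete illustration of the Adversarial Instruction-following data described above, here is a minimal, hypothetical Python sketch (not the authors' released pipeline) of how one such sample could be assembled by mixing a positive referent with sampled negative candidates. The SceneObject fields, the <OBJi> token format, and the make_adversarial_sample helper are illustrative assumptions, not names from the paper.

```python
# Hypothetical sketch of adversarial instruction-sample construction
# (not the authors' released code): mix positive and negative object
# candidates so the model must pick out the referent that actually
# matches the query.
import random
from dataclasses import dataclass

@dataclass
class SceneObject:
    object_id: int     # unique ID of the object in the 3D scene
    category: str      # e.g. "chair", "table"
    description: str   # a short referring expression for the object

def make_adversarial_sample(target: SceneObject,
                            distractors: list[SceneObject],
                            num_negatives: int = 3) -> dict:
    """Build one adversarial instruction-following sample.

    The prompt lists the true target together with sampled negative
    candidates; the answer names only the true target, so the model is
    penalized for latching onto superficially similar distractors.
    """
    negatives = random.sample(distractors, k=min(num_negatives, len(distractors)))
    candidates = negatives + [target]
    random.shuffle(candidates)

    candidate_text = "; ".join(
        f"<OBJ{obj.object_id}> {obj.category}" for obj in candidates
    )
    prompt = (
        f"Candidate objects: {candidate_text}. "
        f"Which candidate matches the description: '{target.description}'?"
    )
    answer = f"<OBJ{target.object_id}>"
    return {"instruction": prompt, "answer": answer}

# Example usage with made-up scene objects:
target = SceneObject(7, "chair", "the black chair next to the window")
distractors = [SceneObject(2, "chair", "a red chair"),
               SceneObject(5, "table", "a wooden table"),
               SceneObject(9, "sofa", "a gray sofa")]
print(make_adversarial_sample(target, distractors))
```

The point the sketch tries to capture is that the answer names only the true target, so a model trained on such samples is rewarded for discriminating among similar candidates rather than pattern-matching on category alone.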

Community

Paper author · Paper submitter

Introduction: We tackle the problem of data scarcity (specifically, the lack of high-quality, robust instruction data for training 3D Large Language Models, or 3DLLMs) by introducing our novel data engine, Robust Instruction Generation (RIG). RIG generates two types of robust instruction data: Adversarial Instruction data and Diverse Instruction data. Using our dataset of 1 million samples, we present Robin3D, a new state-of-the-art 3DLLM that further enhances spatial understanding, referring, and grounding abilities to better handle these complex instructions.
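To make the referring and grounding side more tangible, below is a minimal PyTorch sketch of one plausible reading of the ID-Feature Bonding idea mentioned in the abstract: each object's identifier token embedding is placed directly next to its projected 3D feature in the sequence fed to the language model. This is my own assumption of how such a module could look, not the paper's actual implementation; the class name, shapes, and dimensions are illustrative.

```python
# Minimal sketch (an assumption of the general idea, not the paper's exact
# module): "bond" each object's ID token embedding to its 3D feature by
# placing them adjacently in the sequence fed to the language model.
import torch
import torch.nn as nn

class IDFeatureBonding(nn.Module):
    def __init__(self, num_object_ids: int, feat_dim: int, llm_dim: int):
        super().__init__()
        self.id_embedding = nn.Embedding(num_object_ids, llm_dim)  # learnable <OBJi> tokens
        self.feature_proj = nn.Linear(feat_dim, llm_dim)           # project 3D features to LLM width

    def forward(self, object_ids: torch.Tensor, object_feats: torch.Tensor) -> torch.Tensor:
        # object_ids:   (N,)            integer IDs of the scene objects
        # object_feats: (N, feat_dim)   per-object 3D features
        id_tokens = self.id_embedding(object_ids)       # (N, llm_dim)
        feat_tokens = self.feature_proj(object_feats)   # (N, llm_dim)
        # Interleave [ID_i, feat_i] pairs into one sequence of shape (2N, llm_dim).
        bonded = torch.stack([id_tokens, feat_tokens], dim=1).flatten(0, 1)
        return bonded

# Example: 4 objects with 256-d features, projected into a 4096-d LLM.
bonding = IDFeatureBonding(num_object_ids=100, feat_dim=256, llm_dim=4096)
seq = bonding(torch.tensor([3, 7, 12, 41]), torch.randn(4, 256))
print(seq.shape)  # torch.Size([8, 4096])
```

Keeping the ID token and its feature adjacent means the two are always consumed together, which is one simple way a model could learn to map a mention like <OBJ7> in the text back to a specific object in the scene.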

Performance: Robin3D surpasses previous methods across five widely used 3D multimodal learning benchmarks, without task-specific fine-tuning. Notably, we achieve +5.3% on ScanRefer, +7.8% on Multi3DRefer, and +6.9% on Scan2Cap.

