GroundingGPT: Language-Enhanced Multi-modal Grounding Model
Introduction
GroundingGPT is an end-to-end multimodal grounding model that accurately comprehends inputs and has robust grounding capabilities across multiple modalities, including images, audio, and video. To address the issue of limited data, we construct a diverse and high-quality multimodal training dataset. The dataset contains multimodal data annotated with spatial and temporal information, making it a valuable resource for further work in this field. Extensive experimental evaluations validate the effectiveness of GroundingGPT on understanding and grounding tasks across these modalities.
More details are available on our project page.
News
- [2024.4] Our model is available now!
- [2024.3] Our training dataset is available now!
- [2024.3] Our code is available now!
Dependencies and Installation
git clone https://github.com/lzw-lzw/GroundingGPT.git
cd GroundingGPT
conda create -n groundinggpt python=3.10 -y
conda activate groundinggpt
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
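After the installation steps above, a quick sanity check can confirm that the active environment matches what they set up. The following is a minimal sketch, not part of the repository: the helper name `check_environment` is hypothetical, and it only checks the Python version (3.10, as in the conda command) and whether the `flash_attn` module installed by the last pip command can be found.

```python
import importlib.util
import sys


def check_environment(expected_python=(3, 10), modules=("flash_attn",)):
    """Hypothetical helper: report whether the interpreter version matches
    and which of the given top-level modules cannot be found."""
    python_ok = tuple(sys.version_info[:2]) == tuple(expected_python)
    # find_spec returns None when a top-level module is not installed.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    return {"python_ok": python_ok, "missing_modules": missing}


if __name__ == "__main__":
    report = check_environment()
    print("Python 3.10 active:", report["python_ok"])
    print("Missing modules:", report["missing_modules"] or "none")
```

Run it inside the activated groundinggpt environment; a non-empty list of missing modules suggests that one of the pip steps did not complete.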
Training
Training model preparation
- Put the prepared checkpoints in the ./ckpt directory.
- Prepare the ImageBind checkpoint: download imagebind_huge.pth from link and put it under the ./ckpt/imagebind directory.
- Prepare the BLIP-2 checkpoint: download blip2_pretrained_flant5xxl.pth from link and put it under the ./ckpt directory.
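The checkpoint layout described above can be verified before starting a training run. This is a minimal sketch under stated assumptions: the helper `missing_checkpoints` is hypothetical and not part of the repository, and it checks only the two files named in this section at the locations given here.

```python
from pathlib import Path

# Paths relative to the checkpoint root, taken from the steps above.
EXPECTED_CHECKPOINTS = (
    Path("imagebind") / "imagebind_huge.pth",
    Path("blip2_pretrained_flant5xxl.pth"),
)


def missing_checkpoints(ckpt_root="./ckpt"):
    """Hypothetical helper: return the expected checkpoint files
    that are not present under ckpt_root."""
    root = Path(ckpt_root)
    return [str(root / rel) for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected checkpoint files are in place.")
```

Run it from the repository root so that the relative ./ckpt path resolves correctly.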
Training dataset preparation
- Put the prepared datasets in the dataset directory.
- Prepare LLaVA, COCO, GQA, OCR-VQA, TextVQA, VisualGenome datasets: follow LLaVA.
- Prepare Flickr30K-Entities datasets: follow Flickr30K-Entities.
- Prepare Valley datasets: follow Valley.
- Prepare DiDeMO datasets: follow DiDeMO.
- Prepare ActivityNet Captions datasets: follow ActivityNet Captions.
- Prepare Charades-STA datasets: follow Charades-STA.
- Prepare VGGSS datasets: follow VGGSS.
- Prepare WaveCaps datasets: follow WaveCaps.
- Prepare Clotho datasets: follow Clotho.
Training
Inference
Download GroundingGPT-7B and change model_path in GroundingGPT/lego/serve/cli.py to point to the downloaded model. Then run the script to perform inference:
python3 lego/serve/cli.py
Demo
Download GroundingGPT-7B and change model_path on line 141 of GroundingGPT/lego/serve/gradio_web_server.py to point to the downloaded model. Then run the script to launch a Gradio web demo:
python3 lego/serve/gradio_web_server.py
Acknowledgement
Citation
If you find GroundingGPT useful for your research and applications, please cite it using this BibTeX:
@article{li2024lego,
title={LEGO: Language Enhanced Multi-modal Grounding Model},
author={Li, Zhaowei and Xu, Qi and Zhang, Dong and Song, Hang and Cai, Yiqing and Qi, Qi and Zhou, Ran and Pan, Junting and Li, Zefeng and Vu, Van Tu and others},
journal={arXiv preprint arXiv:2401.06071},
year={2024}
}