Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Paper | Project Page

Checkpoints

Download checkpoints for stage1, stage2, and the final model.

mkdir kosmosg_checkpoints
cd kosmosg_checkpoints
wget -O ViT-L-14-sd.pt "https://conversationhub.blob.core.windows.net/beit-share-public/kosmosg/ViT-L-14-sd.pt?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"
wget -O checkpoint_stage1.pt "https://conversationhub.blob.core.windows.net/beit-share-public/kosmosg/checkpoint_stage1.pt?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"
wget -O checkpoint_stage2.pt "https://conversationhub.blob.core.windows.net/beit-share-public/kosmosg/checkpoint_stage2.pt?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"
wget -O checkpoint_final.pt "https://conversationhub.blob.core.windows.net/beit-share-public/kosmosg/checkpoint_final.pt?sv=2021-10-04&st=2023-06-08T11%3A16%3A02Z&se=2033-06-09T11%3A16%3A00Z&sr=c&sp=r&sig=N4pfCVmSeq4L4tS8QbrFVsX6f6q844eft8xSuXdxU48%3D"
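
Large downloads can be interrupted and leave an empty or missing file behind. The following is a minimal sketch, not part of the original repository, for confirming that the four files named above exist and are non-empty. The function name `check_checkpoints` and its default directory argument are illustrative assumptions; it does not verify file integrity, only presence and non-zero size.

```shell
# Sketch: report which expected checkpoint files are missing or empty.
# check_checkpoints is a hypothetical helper, not provided by the repository.
check_checkpoints() {
  ckpt_dir="${1:-kosmosg_checkpoints}"
  ckpt_missing=0
  for f in ViT-L-14-sd.pt checkpoint_stage1.pt checkpoint_stage2.pt checkpoint_final.pt; do
    # -s is true only if the file exists and has a size greater than zero
    if [ ! -s "$ckpt_dir/$f" ]; then
      echo "missing or empty: $f"
      ckpt_missing=1
    fi
  done
  return "$ckpt_missing"
}
```

After the downloads finish, running `check_checkpoints .` from inside `kosmosg_checkpoints` prints one line per problem file and returns a non-zero status if any file is missing or empty.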

Setup

Using Docker Image [Recommended]

You can use our prebuilt Docker image:

docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ xichenpan/kosmosg:v1 /bin/bash
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/

You can also start from the official NVIDIA Docker image and install all dependencies manually:

docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash
apt-get install -y libsm6 libxext6 libxrender-dev
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
bash vl_setup.sh

Using Base Environment

Make sure you have PyTorch 1.13.0 and nvcc 11.x installed.

git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
bash vl_setup.sh
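
The PyTorch 1.13.0 and nvcc 11.x requirement for the base environment can be checked before running the setup script. The helpers below are a sketch under stated assumptions: `nvcc_major` and `is_torch_1_13_0` are hypothetical names, and the parsing assumes `nvcc --version` prints a line containing `release <major>.<minor>`, as current CUDA toolkits do.

```shell
# Sketch: extract the CUDA major version from `nvcc --version` output.
# Reads the text on stdin, so it can be tried without a CUDA toolchain installed.
nvcc_major() {
  sed -n 's/.*release \([0-9][0-9]*\)\.[0-9][0-9]*.*/\1/p' | head -n 1
}

# Sketch: succeed only if a PyTorch version string is 1.13.0,
# allowing a local build suffix such as "+cu117".
is_torch_1_13_0() {
  case "$1" in
    1.13.0|1.13.0+*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `nvcc --version | nvcc_major` should print `11` on a matching toolchain, and `is_torch_1_13_0 "$(python -c 'import torch; print(torch.__version__)')"` returns success when the installed PyTorch matches.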

Demo

If you would like to host a local Gradio demo, run the following command after setup:

bash runapp.sh

If the default guidance scale produces over-saturated images, try lowering it.

Training

Preparing dataset

Refer to this guide to prepare the dataset.

Train script

After preparing the data, run the following commands to train the model. Be sure to change the directories in the scripts to your own.

For the image decoder aligning stage:

bash runalign.sh

For the instruction tuning stage:

bash runtrain.sh

Evaluation

FID score on COCO (2014) val set

Download and unzip the COCO (2014) val set:

mkdir coco
cd coco
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip val2014.zip
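
A quick way to confirm the archive extracted completely is to count the images. This is a sketch, not part of the repository; `count_jpgs` is a hypothetical helper, and it assumes `unzip` created a `val2014` directory with the `.jpg` files directly inside it.

```shell
# Sketch: count .jpg files directly inside a directory (no recursion).
# count_jpgs is a hypothetical helper name.
count_jpgs() {
  find "$1" -maxdepth 1 -type f -name '*.jpg' | wc -l | tr -d ' '
}
```

Run from the `coco` directory, `count_jpgs val2014` should print 40504 for the complete COCO 2014 validation set.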

Specify the cfg in sample_kosmosg_coco.py and run the script to evaluate:

bash runeval_coco.sh

DINO score, CLIP-I score and CLIP-T score on DreamBench

Download DreamBench:

mkdir dreambench
cd dreambench
git clone https://github.com/google/dreambooth.git

We keep only one image for each entity as described in our paper.

bash scripts/remove_dreambench_multiimg.sh /path/to/dreambench/dreambooth/dataset
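
To check the result of the step above, the sketch below lists any entity that still has more than one file. It does not reproduce the logic of `remove_dreambench_multiimg.sh`; the helper name `entities_with_multiple_images` is hypothetical, and it assumes the dataset root contains one subdirectory per entity with the images stored directly inside.

```shell
# Sketch: print each entity subdirectory that still holds more than one file.
# Assumes one subdirectory per entity directly under the dataset root.
entities_with_multiple_images() {
  for d in "$1"/*/; do
    [ -d "$d" ] || continue
    # Count regular files at the top level of the entity directory
    n=$(find "$d" -maxdepth 1 -type f | wc -l | tr -d ' ')
    if [ "$n" -gt 1 ]; then
      echo "$(basename "$d"): $n"
    fi
  done
}
```

Calling `entities_with_multiple_images /path/to/dreambench/dreambooth/dataset` prints nothing when every entity has been reduced to a single image.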

Specify the cfg in sample_kosmosg_dreambench.py and run the script to evaluate:

bash runeval_dreambench.sh

Citation

If you find this repository useful, please consider citing our work:

@article{kosmos-g,
  title={{Kosmos-G}: Generating Images in Context with Multimodal Large Language Models},
  author={Xichen Pan and Li Dong and Shaohan Huang and Zhiliang Peng and Wenhu Chen and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.02992}
}

Acknowledgement

This repository is built using torchscale, fairseq, and open_clip. We thank the authors of Nerfies, who kindly open-sourced the template of the project page.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using the models, please submit a GitHub issue.