Update README.md
README.md
---
license: apache-2.0
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
metrics:
- bleu
base_model:
- Qwen/Qwen2.5-7B-Instruct
---


[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

## Embodied Ability Evaluation: Performance in RoboVQA and OpenEQA

|  |  | MLCD <br> Embodied-7B | LLaVA <br> OneVision-7B | GPT-4v | RoboMamba |
| :-- | :-- | :-: | :-: | :-: | :-: |
| RoboVQA | BLEU1 | <span style="color:red">73.16</span> | 38.12 | - | 54.9 |
| | BLEU2 | <span style="color:red">66.39</span> | 33.56 | - | 44.2 |
| | BLEU3 | <span style="color:red">60.61</span> | 31.76 | - | 39.5 |
| | BLEU4 | <span style="color:red">56.56</span> | 30.97 | - | 36.3 |
| OpenEQA | Object State Recognition | <span style="color:red">71.83</span> | - | 63.2 | - |
| | Object Recognition | <span style="color:red">49.46</span> | - | 43.4 | - |
| | Functional Reasoning | 54.38 | - | <span style="color:red">57.4</span> | - |
| | Spatial Understanding | <span style="color:red">48.64</span> | - | 33.6 | - |
| | Attribute Recognition | <span style="color:red">67.08</span> | - | 57.2 | - |
| | World Knowledge | <span style="color:red">53.87</span> | - | 50.7 | - |
| | Object Localization | <span style="color:red">43.06</span> | - | 42.0 | - |

## General Ability Evaluation: Comparison with LLaVA OneVision-7B and GPT-4

| Dataset | Split | MLCD<br>Embodied-7B | LLaVA<br>OneVision-7B | GPT-4v | GPT-4o |
| :-- | :-: | :-: | :-: | :-: | :-: |
| AI2D | test | 79.9 | 81.4 | 78.2 | 94.2 |
| ChartQA | test | 83.0 | 80.0 | 78.5 | 85.7 |
| DocVQA | test | 91.6 | 87.5 | 88.4 | 92.8 |
| InfoVQA | val | 73.9 | 70.7 | - | - |
| InfoVQA | test | 70.0 | 68.8 | - | - |
| MMMU | val | 47.3 | 48.8 | 56.8 | 69.1 |
| MMStar | test | 58.5 | 61.7 | 57.1 | 63.9 |
| OCRBench | - | 749.0 | 697.0 | 656.0 | 805.0 |
| RealWorldQA | test | 68.9 | 66.3 | 61.4 | 58.6 |
| SeedBench | image | 74.9 | 75.4 | 49.9 | 76.2 |
| MMBench | en-dev | 81.1 | 83.2 | 81.3 | 83.4 |
| MMBench | en-test | 80.1 | 80.8 | 75.0 | - |
| MME | test | 578/1603 | 418/1580 | 517/1409 | - |

## Usage

### A. Installation

```bash
git clone https://github.com/deepglint/unicom
cd unicom/mlcd_vl

docker build -t train_mlcd_llava .

docker run --gpus all \
    -v /vlm:/vlm \
    -v /mnt:/mnt \
    -v $(pwd):/workspace \
    --rm \
    -w /workspace \
    --shm-size=64g -it train_mlcd_llava bash

pip install flash-attn==2.3.3 --no-build-isolation
```
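
To confirm the container is set up correctly before moving on, a quick sanity check such as the one below can help. This is a minimal sketch; it only assumes the image ships PyTorch and that the flash-attn install above succeeded.

```bash
# Optional sanity check inside the container: GPU visibility and the flash-attn install.
nvidia-smi
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import flash_attn; print(flash_attn.__version__)"
```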

### B. Inference

```bash
CUDA_VISIBLE_DEVICES=0 python infer_mlcd_emboided.py --model_dir DeepGlint-AI/MLCD-Embodied-7B

# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ../_static/images/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
```
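
If the inference machine has limited access to the Hugging Face Hub, the checkpoint can be fetched ahead of time and the script pointed at the local copy. This is an optional sketch; it assumes the `huggingface_hub` CLI is available and that `--model_dir` also accepts a local directory.

```bash
# Optional: pre-download the checkpoint, then run inference from the local snapshot.
pip install -U huggingface_hub
huggingface-cli download DeepGlint-AI/MLCD-Embodied-7B --local-dir ./MLCD-Embodied-7B
CUDA_VISIBLE_DEVICES=0 python infer_mlcd_emboided.py --model_dir ./MLCD-Embodied-7B
```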

### C. Evaluation for Embodied Ability

#### Step 1

Download the raw data following [OpenEQA](https://github.com/facebookresearch/open-eqa/tree/main/data) and [RoboVQA](https://console.cloud.google.com/storage/browser/gdm-robovqa) (val split only).
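
The RoboVQA bucket linked above can also be pulled with `gsutil`. The object layout inside `gs://gdm-robovqa` is not documented here, so list it first; the val sub-path and the destination folder below are placeholders, not verified names.

```bash
# Inspect the bucket, then copy the validation portion to a local folder.
gsutil ls gs://gdm-robovqa/
gsutil -m cp -r "gs://gdm-robovqa/<val-subdir>" /path/to/your/images/robovqa_val/
```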

#### Step 2

Convert the raw data into the format required for model evaluation:
```bash
# convert the OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py

# convert the RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py
```
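
To verify the conversion produced usable files, the resulting parquet files can be inspected quickly. This is a minimal check, assuming `pandas` with a parquet engine such as `pyarrow` is installed, and it uses the illustrative path from Step 3.

```bash
# Peek at one converted benchmark file: row count, column names, and a few rows.
python -c "import pandas as pd; df = pd.read_parquet('/path/to/your/benchmarks/RoboVQA/robovqa.parquet'); print(df.shape); print(df.columns.tolist()); print(df.head())"
```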

#### Step 3

Make sure your top-level directory structure looks like this:
```
|--/path/to/your/benchmarks
| |--OpenEQA
| | |--openeqa_scannet.parquet
| | |--openeqa_hm3d.parquet
| |--RoboVQA
| | |--robovqa.parquet
|--/path/to/your/images
| |--openeqa_val
| | |--scannet-v0
| | | |--002-scannet-scene0709_00
| | | |--xxx-scannet-scenexxxx_xx
| | |--hm3d-v0
| | | |--000-hm3d-BFRyYbPCCPE
| | | |--xxx-hm3d-xxxxxxxxxxx
| |--robovqa_val
| | |--robovqa_221911
| | |--robovqa_xxxxxx
```
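
Before launching the evaluation, a quick existence check on the converted files can save a failed run. The paths below are the illustrative ones from the tree above; substitute your own.

```bash
# Fail early if any converted benchmark file is missing.
for f in /path/to/your/benchmarks/OpenEQA/openeqa_scannet.parquet \
         /path/to/your/benchmarks/OpenEQA/openeqa_hm3d.parquet \
         /path/to/your/benchmarks/RoboVQA/robovqa.parquet; do
    [ -f "$f" ] && echo "found:   $f" || echo "MISSING: $f"
done
```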

#### Step 4

Run the evaluation script:
```bash
# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model
```

### D. Evaluation for General Ability

Install the evaluation tool and execute the evaluation script:
```bash
pip install lmms-eval==0.2.0

PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/
```
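
The command above scores MME only; other rows of the general-ability table can be reproduced by swapping the `--tasks` argument. Task names are registered by lmms-eval and can change between versions, so the identifiers mentioned below are an assumption to check against your install, not a verified mapping.

```bash
# List the tasks registered in this lmms-eval install (flag behavior may vary by version),
# then re-run the launch command above with a different task list, for example:
#   --tasks ai2d,chartqa,ocrbench,mmstar
python -m lmms_eval --tasks list
```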

We would like to express our gratitude to [Huajie Tan](https://huggingface.co/tanhuajie2001), [Yumeng Wang](https://huggingface.co/devymex), and [Yin Xie](https://huggingface.co/Yin-Xie) for their significant contributions to the experimental validation of MLLMs.