Update README.md
README.md
---
license: apache-2.0
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
metrics:
- bleu
base_model:
- Qwen/Qwen2.5-7B-Instruct
---


[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

## Embodied Ability Evaluation: Performance in RoboVQA and OpenEQA

|  |  | MLCD <br> Embodied-7B | LLaVA <br> OneVision-7B | GPT-4v | RoboMamba |
| :-- | :-- | :-: | :-: | :-: | :-: |
| RoboVQA | BLEU1 | <span style="color:red">73.16</span> | 38.12 | - | 54.9 |
| | BLEU2 | <span style="color:red">66.39</span> | 33.56 | - | 44.2 |
| | BLEU3 | <span style="color:red">60.61</span> | 31.76 | - | 39.5 |
| | BLEU4 | <span style="color:red">56.56</span> | 30.97 | - | 36.3 |
| OpenEQA | Object State Recognition | <span style="color:red">71.83</span> | - | 63.2 | - |
| | Object Recognition | <span style="color:red">49.46</span> | - | 43.4 | - |
| | Functional Reasoning | 54.38 | - | <span style="color:red">57.4</span> | - |
| | Spatial Understanding | <span style="color:red">48.64</span> | - | 33.6 | - |
| | Attribute Recognition | <span style="color:red">67.08</span> | - | 57.2 | - |
| | World Knowledge | <span style="color:red">53.87</span> | - | 50.7 | - |
| | Object Localization | <span style="color:red">43.06</span> | - | 42.0 | - |

## General Ability Evaluation: Comparison with LLaVA OneVision-7B and GPT-4

| Dataset | Split | MLCD<br>Embodied-7B | LLaVA<br>OneVision-7B | GPT-4v | GPT-4o |
| :-- | :-: | :-: | :-: | :-: | :-: |
| AI2D | test | 79.9 | 81.4 | 78.2 | 94.2 |
| ChartQA | test | 83.0 | 80.0 | 78.5 | 85.7 |
| DocVQA | test | 91.6 | 87.5 | 88.4 | 92.8 |
| InfoVQA | val | 73.9 | 70.7 | - | - |
| InfoVQA | test | 70.0 | 68.8 | - | - |
| MMMU | val | 47.3 | 48.8 | 56.8 | 69.1 |
| MMStar | test | 58.5 | 61.7 | 57.1 | 63.9 |
| OCRBench | - | 749.0 | 697.0 | 656.0 | 805.0 |
| RealWorldQA | test | 68.9 | 66.3 | 61.4 | 58.6 |
| SeedBench | image | 74.9 | 75.4 | 49.9 | 76.2 |
| MMBench | en-dev | 81.1 | 83.2 | 81.3 | 83.4 |
| MMBench | en-test | 80.1 | 80.8 | 75.0 | - |
| MME | test | 578/1603 | 418/1580 | 517/1409 | - |

## Usage

### A. Installation

```bash
git clone https://github.com/deepglint/unicom
cd unicom/mlcd_vl

docker build -t train_mlcd_llava .

docker run --gpus all \
    -v /vlm:/vlm \
    -v /mnt:/mnt \
    -v $(pwd):/workspace \
    --rm \
    -w /workspace \
    --shm-size=64g -it train_mlcd_llava bash

pip install flash-attn==2.3.3 --no-build-isolation
```
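
To confirm the container is set up correctly before moving on, a quick sanity check such as the one below can help. This is a minimal sketch; it only assumes the image ships PyTorch and that the flash-attn install above succeeded.

```bash
# Optional sanity check inside the container: GPU visibility and the flash-attn install.
nvidia-smi
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import flash_attn; print(flash_attn.__version__)"
```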

### B. Inference

```bash
CUDA_VISIBLE_DEVICES=0 python infer_mlcd_emboided.py --model_dir DeepGlint-AI/MLCD-Embodied-7B

# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ../_static/images/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
```
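
If the inference machine has limited access to the Hugging Face Hub, the checkpoint can be fetched ahead of time and the script pointed at the local copy. This is an optional sketch; it assumes the `huggingface_hub` CLI is available and that `--model_dir` also accepts a local directory.

```bash
# Optional: pre-download the checkpoint, then run inference from the local snapshot.
pip install -U huggingface_hub
huggingface-cli download DeepGlint-AI/MLCD-Embodied-7B --local-dir ./MLCD-Embodied-7B
CUDA_VISIBLE_DEVICES=0 python infer_mlcd_emboided.py --model_dir ./MLCD-Embodied-7B
```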

### C. Evaluation for Embodied Ability

#### Step 1

Download the raw data following [OpenEQA](https://github.com/facebookresearch/open-eqa/tree/main/data) and [RoboVQA](https://console.cloud.google.com/storage/browser/gdm-robovqa) (val split only).
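
The RoboVQA bucket linked above can also be pulled with `gsutil`. The object layout inside `gs://gdm-robovqa` is not documented here, so list it first; the val sub-path and the destination folder below are placeholders, not verified names.

```bash
# Inspect the bucket, then copy the validation portion to a local folder.
gsutil ls gs://gdm-robovqa/
gsutil -m cp -r "gs://gdm-robovqa/<val-subdir>" /path/to/your/images/robovqa_val/
```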

#### Step 2

Convert the raw data into the format required for model evaluation:
```bash
# convert the OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py

# convert the RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py
```
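
To verify the conversion produced usable files, the resulting parquet files can be inspected quickly. This is a minimal check, assuming `pandas` with a parquet engine such as `pyarrow` is installed, and it uses the illustrative path from Step 3.

```bash
# Peek at one converted benchmark file: row count, column names, and a few rows.
python -c "import pandas as pd; df = pd.read_parquet('/path/to/your/benchmarks/RoboVQA/robovqa.parquet'); print(df.shape); print(df.columns.tolist()); print(df.head())"
```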

#### Step 3

Make sure your top-level directory structure looks like this:
```
|--/path/to/your/benchmarks
| |--OpenEQA
| | |--openeqa_scannet.parquet
| | |--openeqa_hm3d.parquet
| |--RoboVQA
| | |--robovqa.parquet
|--/path/to/your/images
| |--openeqa_val
| | |--scannet-v0
| | | |--002-scannet-scene0709_00
| | | |--xxx-scannet-scenexxxx_xx
| | |--hm3d-v0
| | | |--000-hm3d-BFRyYbPCCPE
| | | |--xxx-hm3d-xxxxxxxxxxx
| |--robovqa_val
| | |--robovqa_221911
| | |--robovqa_xxxxxx
```
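
Before launching the evaluation, a quick existence check on the converted files can save a failed run. The paths below are the illustrative ones from the tree above; substitute your own.

```bash
# Fail early if any converted benchmark file is missing.
for f in /path/to/your/benchmarks/OpenEQA/openeqa_scannet.parquet \
         /path/to/your/benchmarks/OpenEQA/openeqa_hm3d.parquet \
         /path/to/your/benchmarks/RoboVQA/robovqa.parquet; do
    [ -f "$f" ] && echo "found:   $f" || echo "MISSING: $f"
done
```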

#### Step 4

Run the evaluation script:
```bash
# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model
```

### D. Evaluation for General Ability

Install the evaluation tool and execute the evaluation script:
```bash
pip install lmms-eval==0.2.0

PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/
```
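
The command above scores MME only; other rows of the general-ability table can be reproduced by swapping the `--tasks` argument. Task names are registered by lmms-eval and can change between versions, so the identifiers mentioned below are an assumption to check against your install, not a verified mapping.

```bash
# List the tasks registered in this lmms-eval install (flag behavior may vary by version),
# then re-run the launch command above with a different task list, for example:
#   --tasks ai2d,chartqa,ocrbench,mmstar
python -m lmms_eval --tasks list
```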

We would like to express our gratitude to [Huajie Tan](https://huggingface.co/tanhuajie2001), [Yumeng Wang](https://huggingface.co/devymex), and [Yin Xie](https://huggingface.co/Yin-Xie) for their significant contributions to the experimental validation of MLLMs.