---
language:
- ja
pipeline_tag: image-text-to-text
---
# LLM-jp-3 VILA 14B

This repository provides a large vision language model (VLM) developed by the [Research and Development Center for Large Language Models](https://llmc.nii.ac.jp/) at the [National Institute of Informatics](https://www.nii.ac.jp/en/), Japan.

## Usage

Python version: 3.10.12

1. Clone the repository and install the libraries.

    <details>

    ```bash
    git clone git@github.com:llm-jp/llm-jp-VILA.git
    cd llm-jp-VILA
    ```

    ```bash
    python3 -m venv venv
    source venv/bin/activate
    ```

    ```bash
    pip install --upgrade pip
    wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
    pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
    pip install -e .
    pip install -e ".[train]"
    ```

    ```bash
    pip install git+https://github.com/huggingface/[email protected]
    cp -rv ./llava/train/transformers_replace/* ./venv/lib/python3.10/site-packages/transformers/
    ```

    </details>
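
    Optionally, you can sanity-check the environment before moving on (a minimal check, assuming the packages above installed without errors):

    ```bash
    python -c "import flash_attn, transformers; print(flash_attn.__version__, transformers.__version__)"
    ```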

2. Run the Python script below. You can change `image_path` and `query` to your own image and prompt.

    <details>

    ```python
    from io import BytesIO

    import requests
    import torch
    from PIL import Image

    from llava.constants import IMAGE_TOKEN_INDEX
    from llava.conversation import conv_templates
    from llava.mm_utils import (get_model_name_from_path,
                                process_images, tokenizer_image_token)
    from llava.model.builder import load_pretrained_model
    from llava.utils import disable_torch_init


    def load_image(image_file):
        if image_file.startswith(("http://", "https://")):
            response = requests.get(image_file)
            image = Image.open(BytesIO(response.content)).convert("RGB")
        else:
            image = Image.open(image_file).convert("RGB")
        return image


    def load_images(image_files):
        out = []
        for image_file in image_files:
            image = load_image(image_file)
            out.append(image)
        return out


    disable_torch_init()

    model_checkpoint_path = "llm-jp/llm-jp-3-vila-14b"
    model_name = get_model_name_from_path(model_checkpoint_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_checkpoint_path, model_name)

    image_path = "path/to/image"
    image_files = [
        image_path
    ]
    images = load_images(image_files)

    query = "<image>\nこの画像について説明してください。"

    conv_mode = "llmjp_v3"
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], query)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    images_tensor = process_images(images, image_processor, model.config).to(model.device, dtype=torch.float16)
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=[
                images_tensor,
            ],
            do_sample=False,
            num_beams=1,
            max_new_tokens=256,
            use_cache=True,
        )

    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
    print(outputs)
    ```

    </details>
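
    Note that the script moves the inputs to CUDA and casts the image tensors to `torch.float16`, so it needs to run on a GPU machine with enough memory to hold the 14B-parameter model in half precision.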

## Model Details

|Model components|Model / Architecture|Parameters|
|:---:|:---:|:---:|
|Vision encoder|[siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)|428M|
|Projector|2-layer MLP|32M|
|LLM|[llm-jp-3-13b-instruct](https://huggingface.co/llm-jp/llm-jp-3-13b-instruct)|13B|
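
For intuition, the projector is a small MLP that maps vision-encoder features into the LLM embedding space. The sketch below is illustrative only: the layer sizes (1152 for the SigLIP features and 5120 for the LLM hidden size) and the module structure are assumptions, not the released configuration.

```python
import torch.nn as nn


class Projector(nn.Module):
    """Illustrative 2-layer MLP projector: vision features -> LLM embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 5120):
        super().__init__()
        # Dimensions are assumptions for illustration, not the released config.
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.net(image_features)  # (batch, num_patches, llm_dim)
```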

## Datasets

The model was trained in three stages.

### Step-0

We used the following datasets to tune the parameters of the projector.

| Language | Dataset | Images |
|:---|:---|---:|
|Japanese|[Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs)|558K|
|English|[LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)|558K|

### Step-1

We used the following datasets to tune the parameters of the projector and the LLM.
 
| Language | Dataset | Images |
|:---|:---|:---|
|Japanese|[Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs)| 6M |
|        |[Japanese interleaved data](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-interleaved-data)| 6M |
|English |[coyo](https://github.com/kakaobrain/coyo-dataset) (subset) | 6M | 
| |[mmc4-core](https://github.com/allenai/mmc4) (subset) | 6M | 

### Step-2

We used the following datasets to tune the parameters of the projector and the LLM.

| Language | Dataset | Images |
|:---|:---|:---|
|Japanese|[llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja)| 156K |
|        |[japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation)| 12K |
|        |[ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)| 99K |
|        |[synthdog-ja](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja) (subset)| 102K |
|English |[LLaVA](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | 158K | 
|        |[VQAv2](https://visualqa.org/) | 53K | 
|        |[GQA](https://cs.stanford.edu/people/dorarad/gqa/index.html) | 46K | 
|        |[OCRVQA](https://ocr-vqa.github.io/) | 80K | 
|        |[TextVQA](https://textvqa.org/dataset/) | 22K | 
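
In terms of which parameters are updated, Step-0 tunes only the projector, while Step-1 and Step-2 tune both the projector and the LLM. The sketch below is a schematic of such a freeze schedule; the attribute names (`projector`, `llm`) are assumptions and do not necessarily match the llm-jp-VILA implementation.

```python
def configure_trainable(model, step: int) -> None:
    """Illustrative freeze schedule for the three training steps (Step-0/1/2)."""
    # Freeze everything first.
    for p in model.parameters():
        p.requires_grad = False

    # Step-0 tunes the projector only; Step-1 and Step-2 also tune the LLM.
    trainable = [model.projector] if step == 0 else [model.projector, model.llm]
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True
```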

## Evaluations
We evaluated our model using [Heron Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench), [JA-VLM-Bench-In-the-Wild](https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild), and [JA-VG-VQA-500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500).
We used `gpt-4o-2024-05-13` as the LLM-as-a-judge model.
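
As an illustration of how such a judge score can be collected, the sketch below sends a question, a reference answer, and a model answer to `gpt-4o-2024-05-13` and asks for a numeric rating. The prompt and the 1-5 scale are placeholders for this sketch; each benchmark defines its own judging protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge(question: str, reference: str, prediction: str) -> str:
    """Ask gpt-4o-2024-05-13 to rate a model answer (illustrative prompt only)."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Rate the model answer from 1 to 5 and reply with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```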

### Heron Bench

| Models | LLM-as-a-judge score (%) | 
|---|:---:|
| [Japanese InstructBLIP Alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha) | 14.0 | 
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | 24.2 | 
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 39.3 | 
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 43.3 | 
| **llm-jp-3-vila-14b (Ours)** | 57.2 | 
| GPT-4o | 87.6 | 

### JA-VLM-Bench-In-the-Wild

| **Models** | ROUGE-L | LLM-as-a-judge score (/5.0) | 
|---|:---:|:---:|
| [Japanese InstructBLIP Alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha) | 20.8 | 2.42 | 
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | 23.3 | 2.47 | 
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 41.4 | 2.92 | 
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 47.2 | 3.15 | 
| **llm-jp-3-vila-14b (Ours)** | 52.3 | 3.69 | 
| GPT-4o | 37.6 | 3.85 | 

### JA-VG-VQA-500

| **Models** | ROUGE-L | LLM-as-a-judge score (/5.0) |
|---|:---:|:---:|
| [Japanese InstructBLIP Alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha) | -- | -- |
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | -- | -- |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 23.5 | 2.96 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 17.4 | 3.21 |
| **llm-jp-3-vila-14b (Ours)** | 16.2 | 3.62 |
| GPT-4o | 12.1 | 3.58 |

## Risks and Limitations

The model released in this repository is in the early stages of our research and development. It has not been tuned to ensure that its outputs are aligned with social norms, ethical standards, and the law.

## License

The weights of this model are released under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
In addition, users of this model must comply with [the OpenAI terms of use](https://openai.com/policies/terms-of-use) because the model was trained on synthetic data generated by OpenAI's GPT-4.

## Additional information

The [synthdog-ja](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja) dataset has no explicit license statement in its documentation. We attempted to contact the corresponding author of "OCR-free Document Understanding Transformer" for clarification but received no response.

Based on the following considerations:

1. the [donut-base](https://huggingface.co/naver-clova-ix/donut-base) model trained on this dataset is released under the MIT license, and
2. the Wikipedia articles used in the dataset are licensed under CC-BY-SA,

we have determined that the synthdog-ja dataset is most likely governed by the CC-BY-SA license and have proceeded with training under this assumption.