---
language:
- ja
pipeline_tag: image-text-to-text
---
# LLM-jp-3 VILA 14B
This repository provides a large vision language model (VLM) developed by the [Research and Development Center for Large Language Models](https://llmc.nii.ac.jp/) at the [National Institute of Informatics](https://www.nii.ac.jp/en/), Japan.
## Usage
Python version: 3.10.12
1. Clone the repository and install the libraries.
<details>
```bash
git clone git@github.com:llm-jp/llm-jp-VILA.git
cd llm-jp-VILA
```
```bash
python3 -m venv venv
source venv/bin/activate
```
```bash
pip install --upgrade pip
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e .
pip install -e ".[train]"
```
```bash
pip install git+https://github.com/huggingface/transformers@v4.36.2
cp -rv ./llava/train/transformers_replace/* ./venv/lib/python3.10/site-packages/transformers/
```
</details>
2. Run the Python script below. You can change `image_path` and `query` to your own values.
<details>
```python
from io import BytesIO

import requests
import torch
from PIL import Image

from llava.constants import IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import (get_model_name_from_path, process_images,
                            tokenizer_image_token)
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init


def load_image(image_file):
    """Load an image from a URL or a local path and convert it to RGB."""
    if image_file.startswith(("http://", "https://")):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    return image


def load_images(image_files):
    """Load a list of images."""
    return [load_image(image_file) for image_file in image_files]


disable_torch_init()

# Load the tokenizer, model, and image processor from the checkpoint.
model_checkpoint_path = "llm-jp/llm-jp-3-vila-14b"
model_name = get_model_name_from_path(model_checkpoint_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_checkpoint_path, model_name
)

# Replace `image_path` and `query` with your own image and question.
image_path = "path/to/image"
image_files = [image_path]
images = load_images(image_files)

query = "<image>\nこの画像について説明してください。"  # "Please describe this image."

# Build the prompt with the conversation template used by this model.
conv_mode = "llmjp_v3"
conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], query)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess the image(s) and tokenize the prompt; the <image> placeholder
# is replaced with IMAGE_TOKEN_INDEX.
images_tensor = process_images(images, image_processor, model.config).to(
    model.device, dtype=torch.float16
)
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .cuda()
)

# Greedy decoding with up to 256 new tokens.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[images_tensor],
        do_sample=False,
        num_beams=1,
        max_new_tokens=256,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(outputs)
```
</details>
## Model Details
|Model components|Model / Architecture|Parameters|
|:---:|:---:|:---:|
|Vision encoder|[siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)|428M|
|Projector|2-layer MLP|32M|
|LLM|[llm-jp-3-13b-instruct](https://huggingface.co/llm-jp/llm-jp-3-13b-instruct)|13B|
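The overall design follows the common VLM recipe used by VILA: patch features from the vision encoder are mapped into the LLM embedding space by the 2-layer MLP projector and consumed by the LLM together with the text tokens. The snippet below is a minimal, illustrative sketch of this data flow only; the module names (`vision_encoder`, `projector`, `llm`) and the hidden sizes are assumptions for illustration and do not correspond to the actual implementation in the `llava` codebase.
```python
import torch
import torch.nn as nn


class VILASketch(nn.Module):
    """Illustrative-only sketch of how the three components above fit together."""

    def __init__(self, vision_encoder, llm, vision_dim=1152, llm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. SigLIP (so400m, patch14, 384px)
        self.projector = nn.Sequential(       # 2-layer MLP projector
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                          # e.g. llm-jp-3-13b-instruct

    def forward(self, pixel_values, text_embeds):
        # Encode the image into patch features, project them into the LLM
        # embedding space, and feed them to the LLM together with the text.
        patch_feats = self.vision_encoder(pixel_values)   # (B, N_patches, vision_dim)
        image_embeds = self.projector(patch_feats)          # (B, N_patches, llm_dim)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```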
## Datasets
The model was trained in three stages.
### Step-0
We used the following datasets to tune the parameters of the projector.
| Language | Dataset | Images|
|:---|:---|---:|
|Japanese|[Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs)|558K|
|English|[LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)|558K|
### Step-1
We used the following datasets to tune the parameters of the projector and the LLM.
| Language | Dataset | Images |
|:---|:---|:---|
|Japanese|[Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs)| 6M |
| |[Japanese interleaved data](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-interleaved-data)| 6M |
|English |[coyo](https://github.com/kakaobrain/coyo-dataset) (subset) | 6M |
| |[mmc4-core](https://github.com/allenai/mmc4) (subset) | 6M |
### Step-2
We used the following datasets to tune the parameters of the projector and the LLM.
| Language | Dataset | Images |
|:---|:---|:---|
|Japanese|[llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja)| 156K |
| |[japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation)| 12K |
| |[ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)| 99K |
| |[synthdog-ja](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja) (subset)| 102K |
|English |[LLaVA](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | 158K |
| |[VQAv2](https://visualqa.org/) | 53K |
| |[GQA](https://cs.stanford.edu/people/dorarad/gqa/index.html) | 46K |
| |[OCRVQA](https://ocr-vqa.github.io/) | 80K |
| |[TextVQA](https://textvqa.org/dataset/) | 22K |
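Summarizing the three stages above, the main difference between them is which components are updated: step-0 tunes only the projector, while steps 1 and 2 tune both the projector and the LLM; the vision encoder is not listed as a tuned component in any stage. The following is a hedged sketch of that freezing logic, using the same illustrative module names as the architecture sketch above rather than the actual training scripts.
```python
def set_trainable(model, stage: int) -> None:
    """Freeze / unfreeze modules according to the training stage described above.

    Assumes the illustrative module names from the architecture sketch
    (vision_encoder, projector, llm); the real training code may differ.
    """
    # The vision encoder is not listed as a tuned component in any stage.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False

    # The projector is tuned in every stage (step-0, step-1, step-2).
    for p in model.projector.parameters():
        p.requires_grad = True

    # The LLM is tuned only from step-1 onwards.
    for p in model.llm.parameters():
        p.requires_grad = stage >= 1
```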
## Evaluations
We evaluated our model using [Heron Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench), [JA-VLM-Bench-In-the-Wild](https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild), and [JA-VG-VQA-500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500).
We used `gpt-4o-2024-05-13` as the judge model for the LLM-as-a-judge evaluations.
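For reference, an LLM-as-a-judge evaluation of this kind asks the judge model to grade each response, typically against a reference answer. The snippet below is only a hypothetical sketch using the OpenAI Python client with the judge model named above; the actual prompts and scoring scales are defined by each benchmark, not by this sketch.
```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment


def judge(question: str, reference: str, answer: str) -> str:
    """Ask the judge model to rate a model answer (illustrative prompt only)."""
    prompt = (
        "Rate the model answer against the reference answer on a 1-5 scale.\n"
        f"Question: {question}\nReference: {reference}\nModel answer: {answer}\nScore:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```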
### Heron Bench
| Models | LLM-as-a-judge score (%) |
|---|:---:|
| [Japanese InstructBLIP Alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha) | 14.0 |
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | 24.2 |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 39.3 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 43.3 |
| **llm-jp-3-vila-14b (Ours)** | 57.2 |
| GPT-4o | 87.6 |
### JA-VLM-Bench-In-the-Wild
| **Models** | ROUGE-L | LLM-as-a-judge score (/5.0) |
|---|:---:|:---:|
| [Japanese InstructBLIP Alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha) | 20.8 | 2.42 |
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | 23.3 | 2.47 |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 41.4 | 2.92 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 47.2 | 3.15 |
| **llm-jp-3-vila-14b (Ours)** | 52.3 | 3.69 |
| GPT-4o | 37.6 | 3.85 |
### JA-VG-VQA-500
| **Models** | ROUGE-L | LLM-as-a-judge score (/5.0) |
|---|:---:|:---:|
| [Japanese InstructBLIP Alpha](https://huggingface.co/stabilityai/japanese-instructblip-alpha) | -- | -- |
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | -- | -- |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 23.5 | 2.96 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 17.4 | 3.21 |
| **llm-jp-3-vila-14b (Ours)** | 16.2 | 3.62 |
| GPT-4o | 12.1 | 3.58 |
## Risks and Limitations
The model released in this repository is in the early stages of our research and development. It has not been tuned to ensure that its outputs are aligned with social norms, ethical standards, and laws.
## License
The weights of this model are released under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
In addition, users of this model must comply with [the OpenAI terms of use](https://openai.com/policies/terms-of-use) because the model was trained on synthetic data generated by OpenAI GPT-4.
## Additional information
Regarding the license of the [synthdog-ja](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja) dataset, there is no explicit license statement in the dataset documentation. We attempted to contact the corresponding author of "OCR-free Document Understanding Transformer" for clarification, but received no response.
We took the following into consideration:
1. The [donut-base](https://huggingface.co/naver-clova-ix/donut-base) model trained on this dataset is released under the MIT license.
2. The Wikipedia articles used in the dataset are licensed under CC-BY-SA.
Based on these considerations, we determined that the synthdog-ja dataset is most likely governed by the CC-BY-SA license, and we proceeded with training under this assumption.