|
---
language:
  - ja
tags:
  - vision-language
  - image-captioning
  - japanese-stable-vlm
pipeline_tag: image-to-text
license: other
extra_gated_prompt: >-
  By clicking "Agree", you agree to the [License
  Agreement](https://huggingface.co/stabilityai/japanese-stable-vlm/blob/main/LICENSE.md)
  and acknowledge Stability AI's [Privacy
  Policy](https://stability.ai/privacy-policy).
extra_gated_fields:
  Name: text
  Email: text
  Country: country
  Organization or Affiliation: text
  Receive email updates and promotions on Stability AI products, services, and research?:
    type: select
    options:
      - 'Yes'
      - 'No'
---
|
|
|
# Japanese Stable VLM |
|
|
|
Please note: for commercial use of this model, please see https://stability.ai/license

For Japanese-language inquiries about commercial use, please contact [email protected].
|
|
|
|
|
|
|
|
Japanese Stable VLM is a vision-language instruction-following model that generates Japanese descriptions for input images, optionally conditioned on input text such as questions.
|
|
|
|
|
## Usage |
|
|
|
<details> |
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForVision2Seq, AutoImageProcessor
from PIL import Image
import requests

# Instructions for each task (in Japanese):
#   "caption": "Please describe the image in detail."
#   "tag":     "Please describe the image in detail, using the given words."
#   "vqa":     "Please answer the question based on the given image."
TASK2INSTRUCTION = {
    "caption": "画像を詳細に述べてください。",
    "tag": "与えられた単語を使って、画像を詳細に述べてください。",
    "vqa": "与えられた画像を下に、質問に答えてください。",
}


# helper function to format input prompts
def build_prompt(task="caption", input=None, sep="\n\n### "):
    assert (
        task in TASK2INSTRUCTION
    ), f"Please choose from {list(TASK2INSTRUCTION.keys())}"
    if task in ["tag", "vqa"]:
        assert input is not None, "Please fill in `input`!"
        if task == "tag" and isinstance(input, list):
            input = "、".join(input)
    else:
        assert input is None, f"`{task}` mode doesn't support input questions"
    # System message (in Japanese): "Below is a combination of an instruction
    # that describes a task and contextual input. Write a response that
    # appropriately satisfies the request."
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]  # "instruction", "response"
    instruction = TASK2INSTRUCTION[task]
    msgs = [": \n" + instruction, ": \n"]
    if input:
        roles.insert(1, "入力")  # "input"
        msgs.insert(1, ": \n" + input)
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p


# load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForVision2Seq.from_pretrained("stabilityai/japanese-stable-vlm", trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained("stabilityai/japanese-stable-vlm")
tokenizer = AutoTokenizer.from_pretrained("stabilityai/japanese-stable-vlm")
model.to(device)

# prepare inputs
url = "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = build_prompt(task="caption")
# prompt = build_prompt(task="tag", input=["河津桜", "青空"])  # tags: "Kawazu cherry blossoms", "blue sky"
# prompt = build_prompt(task="vqa", input="季節はいつですか?")  # question: "What season is it?"

inputs = processor(images=image, return_tensors="pt")
text_encoding = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
inputs.update(text_encoding)

# generate (deterministic beam search; floating-point inputs such as pixel
# values are moved to the device and cast to the model's dtype)
outputs = model.generate(
    **inputs.to(device, dtype=model.dtype),
    do_sample=False,
    num_beams=5,
    max_new_tokens=128,
    min_length=1,
    repetition_penalty=1.5,
)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
# 桜越しの東京スカイツリー ("Tokyo Skytree seen through cherry blossoms")
```
|
|
|
</details> |
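For reference, `build_prompt` assembles an Alpaca-style Japanese instruction prompt, matching the format used by the underlying Japanese Stable LM Instruct model. Tracing the helper above for the VQA task produces the string below; the final response header is left open for the model to complete:

```python
prompt = build_prompt(task="vqa", input="季節はいつですか?")  # "What season is it?"
print(prompt)
# 以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。
#
# ### 指示: 
# 与えられた画像を下に、質問に答えてください。
#
# ### 入力: 
# 季節はいつですか?
#
# ### 応答: 
```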
|
|
|
|
|
## Model Details |
|
|
|
* **Developed by**: [Stability AI](https://stability.ai/) |
|
* **Model type**: Auto-regressive Vision Language Model |
|
* **Language(s)**: Japanese |
|
* **License**: [STABILITY AI COMMUNITY LICENSE](./LICENSE.md)
|
|
|
### Training |
|
|
|
This model is a vision-language instruction-following model built on the [LLaVA 1.5](https://arxiv.org/abs/2310.03744) architecture. It uses [stabilityai/japanese-stablelm-instruct-gamma-7b](https://huggingface.co/stabilityai/japanese-stablelm-instruct-gamma-7b) as the language model and [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) as the image encoder. Training proceeded in two stages: in the first stage, the MLP projection was trained from scratch; in the second stage, the language model and the MLP projection were trained further.
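To make the architecture concrete, here is a minimal sketch of such an MLP projection. It assumes the two-layer GELU MLP described in the LLaVA 1.5 paper and typical dimensions (1024 for CLIP ViT-L/14 patch features, 4096 for a 7B-class language model); it is illustrative, not the repository's actual implementation.

```python
# Illustrative sketch only -- not the actual code shipped with this model.
import torch
import torch.nn as nn


class MLPProjection(nn.Module):
    """Maps CLIP patch features into the language model's embedding space."""

    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        # LLaVA 1.5 uses a two-layer MLP with a GELU activation as the connector.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the image encoder
        return self.proj(image_features)  # (batch, num_patches, lm_dim)


# The projected patch embeddings are concatenated with the text token embeddings
# and consumed by the language model as a single sequence.
projected = MLPProjection()(torch.randn(1, 256, 1024))  # hypothetical patch count
print(projected.shape)  # torch.Size([1, 256, 4096])
```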
|
|
|
### Training Dataset |
|
|
|
The training dataset includes the following public datasets: |
|
|
|
- [CC12M](https://github.com/google-research-datasets/conceptual-12m) with captions translated into Japanese |
|
- [MS-COCO](https://cocodataset.org/#home) with [STAIR Captions](http://captions.stair.center/) |
|
- [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa) |
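Purely as a hypothetical sketch, an image-caption pair from these datasets could be assembled into a training example with the prompt format from the Usage section. This card does not document the actual preprocessing; `make_caption_example` and the sample caption below are invented for illustration.

```python
# Hypothetical helper; the real training pipeline is not documented in this card.
def make_caption_example(image, caption):
    """Pair an image with the caption-task prompt; the caption is the target text."""
    prompt = build_prompt(task="caption")  # instruction prompt, response left open
    return {"image": image, "prompt": prompt, "target": caption}


# example = make_caption_example(image, "桜の花と東京スカイツリー")  # invented caption
```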
|
|
|
## Use and Limitations |
|
|
|
### Intended Use |
|
|
|
This model is intended to be used by the open-source community in vision-language applications. |
|
|
|
|
|
### Limitations and Bias
|
|
|
|
|
Although we applied data filters, the training dataset may still contain offensive or inappropriate content.
We recommend that users exercise reasonable caution when using these models in production systems. Do not use the model for any applications that may cause harm or distress to individuals or groups.
|
|
|
|
|
## How to cite |
|
|
|
```bibtex |
|
@misc{JapaneseStableVLM,
    url = {https://huggingface.co/stabilityai/japanese-stable-vlm},
    title = {Japanese Stable VLM},
    author = {Shing, Makoto and Akiba, Takuya}
}
|
``` |
|
|
|
|
|
## Contact |
|
* For questions and comments about the model, please join [Stable Community Japan](https://discord.com/invite/StableJP). |
|
* For future announcements / information about Stability AI models, research, and events, please follow https://twitter.com/StabilityAI_JP. |
|
* For business and partnership inquiries, please contact [email protected].
|
|