|
---
language:
  - ja
tags:
  - vision-language
  - image-captioning
  - japanese-stable-vlm
pipeline_tag: image-to-text
license: other
extra_gated_prompt: >-
  By clicking "Agree", you agree to the [License
  Agreement](https://huggingface.co/stabilityai/japanese-stable-vlm/blob/main/LICENSE.md)
  and acknowledge Stability AI's [Privacy
  Policy](https://stability.ai/privacy-policy).
extra_gated_fields:
  Name: text
  Email: text
  Country: country
  Organization or Affiliation: text
  Receive email updates and promotions on Stability AI products, services, and research?:
    type: select
    options:
      - 'Yes'
      - 'No'
---
|
|
|
# Japanese Stable VLM |
|
|
|
Please note: for commercial use of this model, please see https://stability.ai/license

For Japanese-language inquiries about commercial use, please contact [email protected].
|
|
|
|
|
|
|
|
Japanese Stable VLM is a vision-language instruction-following model that generates Japanese descriptions for input images, optionally conditioned on input text such as questions.
|
|
|
|
|
## Usage |
|
|
|
<details> |
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForVision2Seq, AutoImageProcessor
from PIL import Image
import requests

# Instructions for each task (in Japanese):
#   "caption": "Please describe the image in detail."
#   "tag":     "Please describe the image in detail, using the given words."
#   "vqa":     "Please answer the question based on the given image."
TASK2INSTRUCTION = {
    "caption": "画像を詳細に述べてください。",
    "tag": "与えられた単語を使って、画像を詳細に述べてください。",
    "vqa": "与えられた画像を下に、質問に答えてください。",
}


# helper function to format input prompts
def build_prompt(task="caption", input=None, sep="\n\n### "):
    assert (
        task in TASK2INSTRUCTION
    ), f"Please choose from {list(TASK2INSTRUCTION.keys())}"
    if task in ["tag", "vqa"]:
        assert input is not None, "Please fill in `input`!"
        if task == "tag" and isinstance(input, list):
            input = "、".join(input)
    else:
        assert input is None, f"`{task}` mode doesn't support input questions"
    # System message (in Japanese): "Below is a combination of an instruction
    # that describes a task and contextual input. Write a response that
    # appropriately satisfies the request."
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]  # "instruction", "response"
    instruction = TASK2INSTRUCTION[task]
    msgs = [": \n" + instruction, ": \n"]
    if input:
        roles.insert(1, "入力")  # "input"
        msgs.insert(1, ": \n" + input)
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p


# load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForVision2Seq.from_pretrained("stabilityai/japanese-stable-vlm", trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained("stabilityai/japanese-stable-vlm")
tokenizer = AutoTokenizer.from_pretrained("stabilityai/japanese-stable-vlm")
model.to(device)

# prepare inputs
url = "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = build_prompt(task="caption")
# prompt = build_prompt(task="tag", input=["河津桜", "青空"])  # tags: "Kawazu cherry blossoms", "blue sky"
# prompt = build_prompt(task="vqa", input="季節はいつですか?")  # question: "What season is it?"

inputs = processor(images=image, return_tensors="pt")
text_encoding = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
inputs.update(text_encoding)

# generate (deterministic beam search; floating-point inputs such as pixel
# values are moved to the device and cast to the model's dtype)
outputs = model.generate(
    **inputs.to(device, dtype=model.dtype),
    do_sample=False,
    num_beams=5,
    max_new_tokens=128,
    min_length=1,
    repetition_penalty=1.5,
)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
# 桜越しの東京スカイツリー ("Tokyo Skytree seen through cherry blossoms")
```
|
|
|
</details> |
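For reference, `build_prompt` assembles an Alpaca-style Japanese instruction prompt, matching the format used by the underlying Japanese Stable LM Instruct model. Tracing the helper above for the VQA task produces the string below; the final response header is left open for the model to complete:

```python
prompt = build_prompt(task="vqa", input="季節はいつですか?")  # "What season is it?"
print(prompt)
# 以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。
#
# ### 指示: 
# 与えられた画像を下に、質問に答えてください。
#
# ### 入力: 
# 季節はいつですか?
#
# ### 応答: 
```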
|
|
|
|
|
## Model Details |
|
|
|
* **Developed by**: [Stability AI](https://stability.ai/) |
|
* **Model type**: Auto-regressive Vision Language Model |
|
* **Language(s)**: Japanese |
|
* **License**: [STABILITY AI COMMUNITY LICENSE](./LICENSE.md)
|
|
|
### Training |
|
|
|
This model is a vision-language instruction-following model built on the [LLaVA 1.5](https://arxiv.org/abs/2310.03744) architecture. It uses [stabilityai/japanese-stablelm-instruct-gamma-7b](https://huggingface.co/stabilityai/japanese-stablelm-instruct-gamma-7b) as the language model and [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) as the image encoder. Training proceeded in two stages: in the first stage, the MLP projection was trained from scratch; in the second stage, the language model and the MLP projection were trained further.
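To make the architecture concrete, here is a minimal sketch of such an MLP projection. It assumes the two-layer GELU MLP described in the LLaVA 1.5 paper and typical dimensions (1024 for CLIP ViT-L/14 patch features, 4096 for a 7B-class language model); it is illustrative, not the repository's actual implementation.

```python
# Illustrative sketch only -- not the actual code shipped with this model.
import torch
import torch.nn as nn


class MLPProjection(nn.Module):
    """Maps CLIP patch features into the language model's embedding space."""

    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        # LLaVA 1.5 uses a two-layer MLP with a GELU activation as the connector.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the image encoder
        return self.proj(image_features)  # (batch, num_patches, lm_dim)


# The projected patch embeddings are concatenated with the text token embeddings
# and consumed by the language model as a single sequence.
projected = MLPProjection()(torch.randn(1, 256, 1024))  # hypothetical patch count
print(projected.shape)  # torch.Size([1, 256, 4096])
```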
|
|
|
### Training Dataset |
|
|
|
The training dataset includes the following public datasets: |
|
|
|
- [CC12M](https://github.com/google-research-datasets/conceptual-12m) with captions translated into Japanese |
|
- [MS-COCO](https://cocodataset.org/#home) with [STAIR Captions](http://captions.stair.center/) |
|
- [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa) |
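Purely as a hypothetical sketch, an image-caption pair from these datasets could be assembled into a training example with the prompt format from the Usage section. This card does not document the actual preprocessing; `make_caption_example` and the sample caption below are invented for illustration.

```python
# Hypothetical helper; the real training pipeline is not documented in this card.
def make_caption_example(image, caption):
    """Pair an image with the caption-task prompt; the caption is the target text."""
    prompt = build_prompt(task="caption")  # instruction prompt, response left open
    return {"image": image, "prompt": prompt, "target": caption}


# example = make_caption_example(image, "桜の花と東京スカイツリー")  # invented caption
```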
|
|
|
## Use and Limitations |
|
|
|
### Intended Use |
|
|
|
This model is intended to be used by the open-source community in vision-language applications. |
|
|
|
|
|
### Limitations and Bias
|
|
|
|
|
Although we applied data filters, the training dataset may still contain offensive or inappropriate content.
We recommend that users exercise reasonable caution when using these models in production systems. Do not use the model for any applications that may cause harm or distress to individuals or groups.
|
|
|
|
|
## How to cite |
|
|
|
```bibtex |
|
@misc{JapaneseStableVLM,
    url = {https://huggingface.co/stabilityai/japanese-stable-vlm},
    title = {Japanese Stable VLM},
    author = {Shing, Makoto and Akiba, Takuya}
}
|
``` |
|
|
|
|
|
## Contact |
|
* For questions and comments about the model, please join [Stable Community Japan](https://discord.com/invite/StableJP). |
|
* For future announcements / information about Stability AI models, research, and events, please follow https://twitter.com/StabilityAI_JP. |
|
* For business and partnership inquiries, please contact [email protected].
|
|