Rohithsai1234567 committed 7f370b9 (verified) · Parent(s): 5dacc41

Upload folder using huggingface_hub

Files changed (1): README.md added (+144 lines)
---
license: apache-2.0
tags:
- vision
- image-text-to-text
- autoquant
- gguf
language:
- en
pipeline_tag: image-text-to-text
inference: true
---

# LLaVa-Next, leveraging [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) as LLM

The LLaVA-NeXT model was proposed in [LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/) by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVa-NeXT (also called LLaVa-1.6) improves upon [LLaVa-1.5](https://huggingface.co/transformers/main/model_doc/llava.html) by increasing the input image resolution and training on an improved visual instruction tuning dataset, which improves OCR and common-sense reasoning.

Disclaimer: The team releasing LLaVa-NeXT did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

LLaVa combines a pre-trained large language model with a pre-trained vision encoder for multimodal chatbot use cases. LLaVA 1.6 improves on LLaVA 1.5 by:
- Using [Mistral-7B](https://mistral.ai/news/announcing-mistral-7b/) (for this checkpoint) and [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B), which have better commercial licenses and bilingual support
- A more diverse and higher-quality data mixture
- Dynamic high resolution

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62441d1d9fdefb55a0b7d12c/FPshq08TKYD0e-qwPLDVO.png)

## Intended uses & limitations

You can use the raw model for tasks like image captioning, visual question answering, and multimodal chatbot use cases. See the [model hub](https://huggingface.co/models?search=llava-hf) to look for other versions on a task that interests you.

### How to use

Here's the prompt template for this model, but we recommend using the chat template to format the prompt with `processor.apply_chat_template()`.
That will apply the correct template for a given checkpoint for you.

```
"[INST] <image>\nWhat is shown in this image? [/INST]"
```
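
As a quick sanity check, the string above can also be produced from the checkpoint's own chat template instead of being hard-coded. This is a minimal sketch, assuming a recent `transformers` release in which the processor ships a chat template:

```python
from transformers import LlavaNextProcessor

# Load the processor for this checkpoint; it carries the chat template.
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

# Expected to print the "[INST] <image>\n... [/INST]" template shown above
# (exact whitespace may differ slightly between template versions).
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```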

To run the model with the `pipeline`, see the example below:

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-v1.6-mistral-7b-hf")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"},
            {"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
        ],
    },
]

out = pipe(text=messages, max_new_tokens=20)
print(out)
>>> [{'input_text': [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg'}, {'type': 'text', 'text': 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'}]}], 'generated_text': 'Lava'}]
```

You can also load and use the model as follows:

```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")

# prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# Define a chat history and use `apply_chat_template` to get the correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")

# autoregressively complete the prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
```

### Model optimization

#### 4-bit quantization through `bitsandbytes` library

First make sure to install `bitsandbytes` (`pip install bitsandbytes`) and that you have access to a CUDA-compatible GPU. Then simply change the snippet above as follows:

```diff
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
```
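
For a self-contained version of the 4-bit setup, here is a minimal sketch using `BitsAndBytesConfig`, which newer `transformers` releases recommend over passing `load_in_4bit=True` directly; the fp16 compute dtype is an assumption on top of the snippet above:

```python
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration

# 4-bit weights with fp16 compute; requires bitsandbytes and a CUDA GPU.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
)
# Note: skip the explicit `.to("cuda:0")` call here; device placement for
# quantized weights is handled at load time.
```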

#### Use Flash-Attention 2 to further speed-up generation

First make sure to install `flash-attn`; refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) for installation instructions. Then simply change the snippet above as follows:

```diff
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)
```
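
Equivalently, recent `transformers` versions prefer the `attn_implementation` argument over `use_flash_attention_2`; a minimal sketch, assuming `flash-attn` is installed and the model runs in fp16 or bf16 on a supported GPU:

```python
import torch
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
).to(0)
```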

### BibTeX entry and citation info

```bibtex
@misc{liu2023improved,
      title={Improved Baselines with Visual Instruction Tuning},
      author={Haotian Liu and Chunyuan Li and Yuheng Li and Yong Jae Lee},
      year={2023},
      eprint={2310.03744},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```