zcamz committed
Commit d0c2620 · verified · 1 Parent(s): e119655

Update README.md

Files changed (1): README.md (+71 -3)
README.md CHANGED

---
license: mit
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
base_model:
- apple/aimv2-large-patch14-224
- apple/OpenELM
pipeline_tag: image-text-to-text
tags:
- cpu
- nano
- small
- tiny
- llava
model_size: 0.6B parameters
---

**<center><span style="font-size:2em;">Tiny Llava 4 CPU 🐛</span></center>**

---

### 🚀 **Model Overview**
`tiny-llava-open-elm-aimv2` is a lightweight image-text-to-text model that combines **[OpenELM 270M - INSTRUCT](https://huggingface.co/apple/OpenELM-270M-Instruct)** as the LLM backbone with **[AIMv2-Large-Patch14-224-distilled (309M)](https://huggingface.co/apple/aimv2-large-patch14-224-distilled)** as the vision encoder. The model was fine-tuned with **LoRA (Low-Rank Adaptation)** for efficient training and built with the **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** codebase, a modular framework for lightweight multi-modal models.

The model is designed to run efficiently on **CPU**, making it well suited to resource-constrained environments. It was trained on the LLaVA-Pretrain and LLaVA-Instruct-150K datasets and evaluated on the **POPE** and **TextVQA** benchmarks. The total model size is **0.6B parameters**.
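
For intuition, below is a minimal PyTorch sketch of the LLaVA-style recipe this model follows: patch features from the vision encoder are projected by a small connector into the LLM's embedding space and concatenated with the text token embeddings. The two-layer MLP connector and the dimensions are illustrative assumptions, not the exact configuration of this checkpoint.

```python
import torch
import torch.nn as nn

# Illustrative placeholders, not the real OpenELM / AIMv2 widths of this checkpoint.
VISION_DIM = 1024
LLM_DIM = 1280

class TwoLayerConnector(nn.Module):
    """MLP that projects vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

connector = TwoLayerConnector(VISION_DIM, LLM_DIM)
patches = torch.randn(1, 256, VISION_DIM)   # stand-in for AIMv2 patch features
text_embeds = torch.randn(1, 16, LLM_DIM)   # stand-in for embedded text tokens
# The projected image tokens are concatenated with the text embeddings and fed to the LLM.
multimodal_input = torch.cat([connector(patches), text_embeds], dim=1)
print(multimodal_input.shape)  # torch.Size([1, 272, 1280])
```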

---

### Usage
Execute the following test code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'cpu4dream/llava-small-OpenELM-AIMv2-0.6B-auto'
# trust_remote_code is required: the chat interface ships with the repository.
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
# The model runs on CPU by default; call model.cuda() only if a GPU is available.

config = model.config
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False,
                                          model_max_length=config.tokenizer_model_max_length,
                                          padding_side=config.tokenizer_padding_side)

prompt = "What are these?"
image_url = "http://images.cocodataset.org/test-stuff2017/000000000001.jpg"
# `chat` returns the generated text and the generation time.
output_text, generation_time = model.chat(prompt=prompt, image=image_url, tokenizer=tokenizer)
print('model output:', output_text)
print('running time:', generation_time)
```
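
Because the model targets CPU inference, it can also be worth controlling how many threads PyTorch uses before running the example above. This is generic PyTorch tuning (`torch.set_num_threads`), not a requirement of this checkpoint, and the thread count below is only an example.

```python
import torch

# Generic PyTorch CPU tuning (not specific to this model): cap the number of
# intra-op threads. Pick a value that matches your machine's physical cores.
torch.set_num_threads(4)
print('threads in use:', torch.get_num_threads())
```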

---

### 📊 **Performance**

| Model Name | VQAv2 | GQA | SQA | TextVQA | MM-VET | POPE | MME | MMMU |
|:-----------------------------------------------------------:|:-----:|:-----:|:-----:|:-------:|:------:|:-----:|:------:|:-----:|
| [LLaVA-1.5-7B](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | 78.5 | 62.0 | 66.8 | 58.2 | 30.5 | 85.9 | 1510.7 | - |
| [bczhou/TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B) | 79.9 | 62.0 | 69.1 | 59.1 | 32.0 | 86.4 | 1464.9 | - |
| [tinyllava/TinyLLaVA-Gemma-SigLIP-2.4B](https://huggingface.co/tinyllava/TinyLLaVA-Gemma-SigLIP-2.4B) | 78.4 | 61.6 | 64.4 | 53.6 | 26.9 | 86.4 | 1339.0 | 31.7 |
| [tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B](https://huggingface.co/tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B) | 80.1 | 62.1 | 73.0 | 60.3 | 37.5 | 87.2 | 1466.4 | 38.4 |
| cpu4dream/llava-small-OpenELM-AIMv2-0.6B | - | - | - | 39.68 | - | 83.93 | - | - |

---

### 🔗 **References**
- [OpenELM](https://huggingface.co/apple/OpenELM)
- [AIMv2-Large-Patch14-224](https://huggingface.co/apple/aimv2-large-patch14-224)
- [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)
- [TinyLLaVA paper (arXiv:2402.14289)](https://arxiv.org/pdf/2402.14289)
- [LoRA paper (arXiv:2106.09685)](https://arxiv.org/abs/2106.09685)