
	---
	license: mit
	datasets:
	- liuhaotian/LLaVA-Pretrain
	- liuhaotian/LLaVA-Instruct-150K
	language:
	- en
	metrics:
	- accuracy
	- precision
	- recall
	- f1
	base_model:
	- apple/aimv2-large-patch14-224
	- apple/OpenELM
	pipeline_tag: image-text-to-text
	tags:
	- cpu
	- nano
	- small
	- tiny
	- llava
	model_size: 0.6B parameters
	---

	<center><span style="font-size:2em;">Tiny Llava 4 CPU 🐛</span></center>

	---

	### 🚀 Model Overview
	`tiny-llava-open-elm-aimv2` is a lightweight image-text-to-text model that combines [OpenELM-270M-Instruct](https://huggingface.co/apple/OpenELM-270M-Instruct) as the LLM backbone with [AIMv2-Large-Patch14-224-distilled (309M)](https://huggingface.co/apple/aimv2-large-patch14-224-distilled) as the vision encoder. It was fine-tuned with LoRA (Low-Rank Adaptation), which trains small low-rank update matrices instead of the full weight matrices, and was built with the [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) codebase, a modular framework for lightweight multimodal models.
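To illustrate why LoRA makes fine-tuning cheaper, the sketch below compares the number of trainable parameters in a full weight matrix with the number in its two low-rank factors. The layer dimensions and the rank are hypothetical values chosen for illustration; they are not taken from this model's training configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 1280, 1280, 16

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

With these assumed values, the low-rank factors hold 2.5% of the parameters of the full matrix, which is where the training savings come from.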

	The model is designed to run efficiently on CPU, which makes it suitable for resource-constrained environments. It is trained on the datasets listed in the metadata above (LLaVA-Pretrain and LLaVA-Instruct-150K) and evaluated on the POPE and TextVQA benchmarks. The total model size is 0.6B parameters.
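As a rough check of the 0.6B figure, the sketch below adds the nominal sizes of the two components named above and estimates the memory needed for the weights alone. It assumes the nominal "270M" and "309M" counts are exact and ignores the connector between the two components, so the result is an approximation.

```python
# Nominal component sizes taken from the model names above (rounded figures).
llm_params = 270_000_000      # OpenELM-270M-Instruct
vision_params = 309_000_000   # AIMv2-Large-Patch14-224-distilled

total_params = llm_params + vision_params
print(f"total: {total_params / 1e9:.3f}B parameters")

# Weight memory only; activations and runtime overhead are not included.
bytes_per_param = {"float32": 4, "bfloat16": 2}
for dtype, nbytes in bytes_per_param.items():
    gigabytes = total_params * nbytes / 1e9
    print(f"{dtype}: about {gigabytes:.2f} GB of weights")
```

The sum, 0.579B, rounds to the 0.6B stated in this card.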

	---

	### Usage
	Run the following snippet to test the model:
	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	hf_path = 'cpu4dream/llava-small-OpenELM-AIMv2-0.6B-auto'
	# Custom model code (including chat()) ships with the repo, hence trust_remote_code=True.
	model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
	# The model targets CPU; move it to a GPU only when one is available.
	if torch.cuda.is_available():
	    model.cuda()
	config = model.config
	tokenizer = AutoTokenizer.from_pretrained(
	    hf_path, use_fast=False,
	    model_max_length=config.tokenizer_model_max_length,
	    padding_side=config.tokenizer_padding_side,
	)
	prompt = "What are these?"
	image_url = "http://images.cocodataset.org/test-stuff2017/000000000001.jpg"
	# chat() returns the generated text and the generation time.
	output_text, generation_time = model.chat(prompt=prompt, image=image_url, tokenizer=tokenizer)
	print('model output:', output_text)
	print('running time:', generation_time)
	```
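Because `model.chat` returns both the text and a generation time, it is easy to average the timing over several prompts when judging CPU speed. The helper below is a minimal sketch: `average_generation_time` and `fake_chat` are hypothetical names introduced here, and the stand-in function only mimics the `(text, time)` return shape so the example runs without downloading the weights.

```python
from statistics import mean
from typing import Callable, Iterable, List, Tuple


def average_generation_time(
    chat_fn: Callable[[str], Tuple[str, float]],
    prompts: Iterable[str],
) -> Tuple[List[str], float]:
    """Call chat_fn on each prompt; return all outputs and the mean reported time."""
    outputs, times = [], []
    for prompt in prompts:
        text, elapsed = chat_fn(prompt)
        outputs.append(text)
        times.append(elapsed)
    if not times:
        raise ValueError("prompts must not be empty")
    return outputs, mean(times)


# Stand-in for the real model: same return shape, fixed timing value.
def fake_chat(prompt: str) -> Tuple[str, float]:
    return f"echo: {prompt}", 0.5


outputs, avg = average_generation_time(
    fake_chat, ["What are these?", "Describe the image."]
)
print(outputs, avg)
```

With the real model, `chat_fn` could be `lambda p: model.chat(prompt=p, image=image_url, tokenizer=tokenizer)`; the unit of the averaged value is whatever `chat` reports.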

	---

	### 📊 Performance

	| Model Name | VQAv2 | GQA | SQA | TextVQA | MM-VET | POPE | MME | MMMU |
	|:-----------------------------------------------------------:|:-----:|:-----:|:-----:|:-------:|:------:|:-----:|:------:|:-----:|
	| [LLaVA-1.5-7B](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | 78.5 | 62.0 | 66.8 | 58.2 | 30.5 | 85.9 | 1510.7 | - |
	| [bczhou/TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B) | 79.9 | 62.0 | 69.1 | 59.1 | 32.0 | 86.4 | 1464.9 | - |
	| [tinyllava/TinyLLaVA-Gemma-SigLIP-2.4B](https://huggingface.co/tinyllava/TinyLLaVA-Gemma-SigLIP-2.4B) | 78.4 | 61.6 | 64.4 | 53.6 | 26.9 | 86.4 | 1339.0 | 31.7 |
	| [tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B](https://huggingface.co/tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B) | 80.1 | 62.1 | 73.0 | 60.3 | 37.5 | 87.2 | 1466.4 | 38.4 |
	| cpu4dream/llava-small-OpenELM-AIMv2-0.6B | - | - | - | 39.68 | - | 83.93 | - | - |
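To put the two reported scores in context, the sketch below expresses them as a fraction of two larger baselines from the table. The numbers are copied from the table above; the `relative_scores` helper is a hypothetical name introduced for this example.

```python
# TextVQA and POPE scores copied from the table above.
baselines = {
    "LLaVA-1.5-7B": {"TextVQA": 58.2, "POPE": 85.9},
    "TinyLLaVA-Phi-2-SigLIP-3.1B": {"TextVQA": 60.3, "POPE": 87.2},
}
ours = {"TextVQA": 39.68, "POPE": 83.93}


def relative_scores(model: dict, baseline: dict) -> dict:
    """Return each of the model's scores as a fraction of the baseline's score."""
    return {bench: model[bench] / baseline[bench] for bench in model}


rel = {name: relative_scores(ours, scores) for name, scores in baselines.items()}
for name, fractions in rel.items():
    summary = ", ".join(f"{bench}: {frac:.1%}" for bench, frac in fractions.items())
    print(f"vs {name}: {summary}")
```

Against LLaVA-1.5-7B, the 0.6B model keeps roughly 98% of the POPE score but only about 68% of the TextVQA score, so the size reduction costs far more on text reading than on object hallucination checks.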

	---

	### 🔗 References
	- [OpenELM](https://huggingface.co/apple/OpenELM)
	- [AIMv2-Large-Patch14-224](https://huggingface.co/apple/aimv2-large-patch14-224)
	- [TinyLLaVA Factory GitHub](https://github.com/TinyLLaVA/TinyLLaVA_Factory)