Image-Text-to-Text · Transformers · Safetensors · ada_llava_llama · text-generation
zhuoyanxu committed (verified) · Commit 5b3e371 · 1 parent: 0493528

Update README.md

Files changed (1): README.md (+2 -3)
README.md CHANGED
@@ -8,12 +8,12 @@ base_model:
 - liuhaotian/llava-v1.5-7b
 ---
 
-```markdown
+
 # Ada-LLaVA Model Card
 
 <!-- Provide a quick summary of what the model is/does. -->
 
-Ada-LLaVA 7B is an open-source adaptive inference framework for multimodal Large Language Models (MLLMs) that dynamically adjusts its operations based on available computational resources and latency requirements.
+Ada-LLaVA-L-7B is an open-source adaptive inference framework for multimodal Large Language Models (MLLMs) that dynamically adjusts its operations based on available computational resources and latency requirements.
 
 See the paper for more details: [Learning to Inference Adaptively for Multimodal Large Language Models](https://huggingface.co/papers/2503.10905)
 
@@ -55,4 +55,3 @@ AdaLLaVA is based on LLaVA-1.5 and thus follows its license. Llama 2 is licensed
 ## Limitations
 
 While Ada-LLaVA is currently limited to processing one image at a time and only applies adaptive operations in its later half of layers, future work could explore multi-image input support and extend the adaptive mechanisms throughout the entire model architecture, including the vision encoder. These improvements would make the model more versatile and applicable to a broader range of real-world scenarios.
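
For orientation, below is a minimal sketch of how a LLaVA-1.5-derived checkpoint like this one might be loaded and queried through transformers. The repository id is a placeholder, the `trust_remote_code=True` flag is an assumption (the custom `ada_llava_llama` architecture suggests the repo ships its own modeling code), and the prompt template simply follows the LLaVA-1.5 convention of the base model; none of this is part of the commit above.

```python
# Hypothetical usage sketch for an Ada-LLaVA checkpoint.
# Assumptions: the repo id below is a placeholder, and the custom
# ada_llava_llama architecture requires trust_remote_code=True.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "zhuoyanxu/ada-llava-7b"  # placeholder: substitute the actual repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Prompt format borrowed from the LLaVA-1.5 base model.
image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Note that the latency budget Ada-LLaVA adapts to is not a standard transformers generation argument; how it is supplied would be defined by the model's own custom code and the accompanying paper and repository.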