alanzhuly committed
Commit 9fc3117
Parent: 0067dce

Update README.md

Files changed (1): README.md (+12, -10)
README.md CHANGED
@@ -10,20 +10,20 @@ tags:

## Introduction

- Omni-Vision is a sub-billion (968M) multimodal model capable of processing both visual and text inputs. Built upon LLaVA's architecture, it introduces a novel token compression technique to reduce image token sizes (from 729 to 81), optimizing efficiency without compromising visual understanding on edge devices. It has two key enhancements:
- - **9x Token Reduction through Token Compression**: Significant decrease in image token count, reducing latency and computational cost, ideal for on-device applications.
- - **Minimal-Edit DPO for Enhanced Response Quality**: Improves model responses by using targeted edits, maintaining core capabilities without significant behavior shifts.
+ Omnivision is a compact, sub-billion (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Built on LLaVA's architecture, it features:
+ - **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
+ - **Minimal-Edit DPO**: Enhances response quality with minimal edits, preserving core model behavior.

**Quick Links:**
- 1. Interact in our HuggingFace Space.
- 2. [Quickstart to run locally](#how-to-run-locally)
- 3. Learn more in [blogs](https://nexa.ai)
+ 1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).
+ 2. [Quickstart for local setup](#how-to-use-on-device)
+ 3. Learn more in our [Blogs](https://nexa.ai)

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

## Intended Use Cases
- OmniVision is intended for Visual Question Answering (answering questions about images) and Image Captioning (describing scenes in photos), optimized for edge devices.
+ Omnivision is designed for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.

**Example Demo:**
Omni-Vision generated captions for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro
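
The 9x token reduction described in the introduction corresponds to collapsing the vision encoder's 27×27 grid of image embeddings (729 tokens) into a 9×9 grid (81 tokens). As a purely illustrative sketch, not the released projector implementation, one way to get exactly that ratio is to group 3×3 patch neighborhoods and concatenate them along the channel dimension before projecting into the language model's embedding space:

```python
import torch

def compress_image_tokens(vision_tokens: torch.Tensor, group: int = 3) -> torch.Tensor:
    """Hypothetical 9x compression: [batch, 729, dim] -> [batch, 81, dim * 9]."""
    b, n, d = vision_tokens.shape
    side = int(n ** 0.5)                          # 27 for a 729-token encoder output
    x = vision_tokens.view(b, side, side, d)      # restore the 27x27 spatial grid
    x = x.view(b, side // group, group, side // group, group, d)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # gather each 3x3 neighborhood
    return x.view(b, (side // group) ** 2, group * group * d)

tokens = torch.randn(1, 729, 1152)                # 1152 is an assumed encoder width
print(compress_image_tokens(tokens).shape)        # torch.Size([1, 81, 10368])
```

However the compression is implemented, the language model attends over 9x fewer image tokens per image, which is where the latency and memory savings on edge devices come from.
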
@@ -33,7 +33,7 @@ Omni-Vision generated captions for a 1046×1568 pixel poster | **Processing time

## Benchmarks

- Below we demonstrate a figure to show how omnivision performs against nanollava. In all the tasks, omnivision outperforms the previous world's smallest vision-language model.
+ The figure below compares Omnivision with nanoLLaVA, the previous world's smallest vision-language model; Omnivision outperforms it on all tasks.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>
@@ -50,7 +50,7 @@ We have conducted a series of experiments on benchmark datasets, including MM-VE
| POPE | 89.4 | 84.1 | NA |

- ## How to Use - Quickstart
+ ## How to Use On Device
In the following, we demonstrate how to run omnivision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**
@@ -87,6 +87,8 @@ We enhance the model's contextual understanding using image-based question-answe
**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images using the base model. A teacher model then produces minimally edited corrections while maintaining high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. The fine-tuning targets essential output improvements without altering the model's core response characteristics.

+ ## What's next?
+ We are continually improving Omnivision for better on-device performance. Stay tuned.

### Learn more in our blogs
[Blogs](https://nexa.ai)
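
For reference, chosen-rejected pairs like those described in the DPO stage above are typically optimized with the standard DPO objective (Rafailov et al., 2023); the following PyTorch sketch shows that loss as an assumed baseline, not Nexa's actual training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over summed per-response token log-probabilities."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Prefer the teacher's minimally edited (chosen) response over the
    # original base-model (rejected) response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Here the chosen response is the teacher's correction, the rejected response is the original base-model output, the `ref_*` terms come from a frozen reference copy of the model, and `beta` limits how far the policy drifts from that reference, which matches the stated goal of improving accuracy without shifting the model's overall behavior.
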
 