alanzhuly committed
Commit 9fc3117
Parent: 0067dce

Update README.md

Files changed (1): README.md (+12, -10)
README.md CHANGED
@@ -10,20 +10,20 @@ tags:

## Introduction

- Omni-Vision is a sub-billion (968M) multimodal model capable of processing both visual and text inputs. Built upon LLaVA's architecture, it introduces a novel token compression technique to reduce image token sizes (from 729 to 81), optimizing efficiency without compromising visual understanding on edge devices. It has two key enhancements:
- - **9x Token Reduction through Token Compression**: Significant decrease in image token count, reducing latency and computational cost, ideal for on-device applications.
- - **Minimal-Edit DPO for Enhanced Response Quality**: Improves model responses by using targeted edits, maintaining core capabilities without significant behavior shifts.
+ Omnivision is a compact, sub-billion (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Built on LLaVA's architecture, it features:
+ - **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
+ - **Minimal-Edit DPO**: Enhances response quality with minimal edits, preserving core model behavior.

**Quick Links:**
- 1. Interact in our HuggingFace Space.
- 2. [Quickstart to run locally](#how-to-run-locally)
- 3. Learn more in [blogs](https://nexa.ai)
+ 1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).
+ 2. [Quickstart for local setup](#how-to-use-on-device)
+ 3. Learn more in our [Blogs](https://nexa.ai)

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

## Intended Use Cases
- OmniVision is intended for Visual Question Answering (answering questions about images) and Image Captioning (describing scenes in photos), optimized for edge devices.
+ Omnivision is designed for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.

**Example Demo:**
Omni-Vision generated captions for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro
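
The 9x token reduction described in the introduction corresponds to collapsing the vision encoder's 27×27 grid of image embeddings (729 tokens) into a 9×9 grid (81 tokens). As a purely illustrative sketch, not the released projector implementation, one way to get exactly that ratio is to group 3×3 patch neighborhoods and concatenate them along the channel dimension before projecting into the language model's embedding space:

```python
import torch

def compress_image_tokens(vision_tokens: torch.Tensor, group: int = 3) -> torch.Tensor:
    """Hypothetical 9x compression: [batch, 729, dim] -> [batch, 81, dim * 9]."""
    b, n, d = vision_tokens.shape
    side = int(n ** 0.5)                          # 27 for a 729-token encoder output
    x = vision_tokens.view(b, side, side, d)      # restore the 27x27 spatial grid
    x = x.view(b, side // group, group, side // group, group, d)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # gather each 3x3 neighborhood
    return x.view(b, (side // group) ** 2, group * group * d)

tokens = torch.randn(1, 729, 1152)                # 1152 is an assumed encoder width
print(compress_image_tokens(tokens).shape)        # torch.Size([1, 81, 10368])
```

However the compression is implemented, the language model attends over 9x fewer image tokens per image, which is where the latency and memory savings on edge devices come from.
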
@@ -33,7 +33,7 @@ Omni-Vision generated captions for a 1046×1568 pixel poster | **Processing time

## Benchmarks

- Below we demonstrate a figure to show how omnivision performs against nanollava. In all the tasks, omnivision outperforms the previous world's smallest vision-language model.
+ The figure below compares Omnivision with nanoLLaVA, the previous world's smallest vision-language model; Omnivision outperforms it on all tasks.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>
@@ -50,7 +50,7 @@ We have conducted a series of experiments on benchmark datasets, including MM-VE
| POPE | 89.4 | 84.1 | NA |

- ## How to Use - Quickstart
+ ## How to Use On Device
In the following, we demonstrate how to run omnivision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**
@@ -87,6 +87,8 @@ We enhance the model's contextual understanding using image-based question-answe
**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images using the base model. A teacher model then produces minimally edited corrections while maintaining high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. The fine-tuning targets essential output improvements without altering the model's core response characteristics.

+ ## What's next?
+ We are continually improving Omnivision for better on-device performance. Stay tuned.

### Learn more in our blogs
[Blogs](https://nexa.ai)
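
For reference, chosen-rejected pairs like those described in the DPO stage above are typically optimized with the standard DPO objective (Rafailov et al., 2023); the following PyTorch sketch shows that loss as an assumed baseline, not Nexa's actual training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over summed per-response token log-probabilities."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Prefer the teacher's minimally edited (chosen) response over the
    # original base-model (rejected) response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Here the chosen response is the teacher's correction, the rejected response is the original base-model output, the `ref_*` terms come from a frozen reference copy of the model, and `beta` limits how far the policy drifts from that reference, which matches the stated goal of improving accuracy without shifting the model's overall behavior.
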
 