My Journey Into Vision Models

Community Article Published April 12, 2025

Cover image

Computer vision has always been an intriguing field for me. The first time I dabbled in image recognition was back in high school - just a fun little project during a summer break 😂. Good times!

Then came the golden age of large language models (LLMs), and like many others, I was super excited to see the rapid progress. I started learning about the transformer architecture and how it revolutionized not just the field of NLP, but was starting to shake things up in computer vision too.

Luckily, I joined Hugging Face in August 2024, which meant I could finally dedicate more time to projects like llama.cpp. Back then, vision support was still in its early stages, but I was eager to contribute. Thanks to the amazing folks at Hugging Face, I learned a ton about the latest advancements in vision models.

In this article, I want to share my journey navigating the world of vision models and the adventure of integrating them into llama.cpp. I hope this might inspire others to explore the exciting realm of computer vision and maybe even contribute to the open-source community!

Yep, that was me in the cover image, with my trusty Fuji X-E1 📷
Check out my posts about photography here

Overview of Vision Models

Most vision models nowadays have two main parts: the vision encoder and the language decoder. Vision encoders are often based on the transformer architecture, hence the catchy name "Vision Transformer" (ViT).

To wrap your head around this, imagine two people working together:

  • One person can look at an image.
  • The other person can only read descriptions.

The first person looks at the image and describes what they see to the second person. Then, the second person uses that description to answer questions or generate text.


That's kind of how vision models work! The vision tower (or vision encoder) acts like the first person, "looking" at the image and compressing it into a set of intermediate representations. Then, the language model (or language decoder) acts like the second person, generating answers based on those representations.

These intermediate representations can take various forms (like KV vectors for cross-attention), but the most common form is a set of embedding vectors. We'll focus mostly on those in the next sections.

Components

Here's a simplified diagram of the whole pipeline:

Simplified diagram of the vision model pipeline

Preprocessing

Preprocessing is where the magic begins. It typically involves:

  1. Converting the image out of its comfy file format (e.g., JPEG, PNG) into a raw bitmap.
  2. Resizing the image to a fixed size (usually shrinking it down). This might require some padding or cropping if the aspect ratio doesn't match the target.
  3. For models that can handle different image sizes, we might need to slice the image up (more on this below).
  4. Converting the image into a tensor and normalizing its pixel values.
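
To make these steps concrete, here's a minimal sketch of steps 1, 2, and 4 using PIL and PyTorch (slicing from step 3 is covered below). The target size and normalization constants here are just placeholders; real models ship their own values in their preprocessor configuration.

import numpy as np
import torch
from PIL import Image

def preprocess_image(path, target_size=336):
    # 1. Decode the compressed file (JPEG, PNG, ...) into a raw RGB bitmap
    img = Image.open(path).convert("RGB")
    # 2. Resize to the fixed size the vision encoder expects
    #    (real preprocessors may pad or crop to preserve the aspect ratio)
    img = img.resize((target_size, target_size))
    # 4. Convert to a float tensor of shape (3, H, W) and normalize pixel values
    x = torch.from_numpy(np.array(img)).float().permute(2, 0, 1) / 255.0
    mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)  # illustrative constants;
    std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)   # real models define their own
    return (x - mean) / std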

The vision encoder usually expects a fixed, often quite small, input image size. This means precious details can get lost if the original image is too large. To tackle this:

  • Some models (like LLaVA-UHD, MiniCPM-V) get clever and split the large image into smaller slices, processing both the slices and the downscaled original.
  • Other models just bulk up and accept larger image inputs directly (e.g., Gemma 3).
  • Most notably, Qwen2-VL uses a special positional embedding technique called M-RoPE to keep track of where patches came from, even in different-sized images, without losing spatial context. Very cool!

One thing that made me scratch my head is that while slicing seems purely algorithmic (no fancy Machine Learning needed here), the specific slicing strategy can be surprisingly complex and vary wildly between models.
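
Just to illustrate the idea, here's a deliberately naive slicing sketch with a fixed grid - not any particular model's algorithm, which would typically pick the grid adaptively based on the image's resolution and aspect ratio:

from PIL import Image

def naive_slices(img: Image.Image, grid=(2, 2), slice_size=336):
    # Split an image into a fixed grid of slices plus a downscaled overview.
    w, h = img.size
    cols, rows = grid
    slices = []
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            slices.append(img.crop(box).resize((slice_size, slice_size)))
    overview = img.resize((slice_size, slice_size))  # downscaled original
    return overview, slices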

Example of LLaVA-UHD's adaptive slicing algorithm. Source: LLaVA-UHD GitHub repo

IMPORTANT: Each slice is then treated like a separate image when it goes into the encoder. So, when I talk about the encoder, I might use "slice" and "image" interchangeably.

Split into patches

Next up, we chop the image (or slice) into smaller, equal-sized patches. Think of it like cutting a photo into a grid of tiny squares. Each patch is then flattened into a single vector.

In many implementations, this chopping isn't done with scissors, but with a math operation called 2D convolution, using a kernel size that matches the desired patch size. This step also sneakily embeds some extra information into the patches thanks to the convolution's kernel and bias.

Positional embeddings are also added to these patch vectors. This is crucial because transformers, by themselves, have no inherent sense of space - these embeddings tell the model where each patch came from in the original picture.

Illustration of splitting an image or slice into patches. Source: ResearchGate

You might wonder, "Why not just feed the whole image in?" Good question! But doing that directly would make the input vector enormous, and the model's first projection layer would have to balloon along with it. Patching keeps things manageable.
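
In PyTorch terms, the chopping and flattening is typically a single nn.Conv2d whose kernel size and stride both equal the patch size, followed by adding learned positional embeddings. A rough sketch (the dimensions are illustrative, not taken from any specific model):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=336, patch_size=14, embed_dim=1024):
        super().__init__()
        # Each patch_size x patch_size x 3 patch is projected to one embed_dim
        # vector; the conv's kernel weights and bias are the "extra information"
        # baked into the patches.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (image_size // patch_size) ** 2
        # Learned positional embeddings, one per patch position
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, pixels):             # pixels: (batch, 3, H, W)
        x = self.proj(pixels)              # (batch, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)
        return x + self.pos_embed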

Vision Encoder

The vision encoder is typically a transformer-based model. It takes the patches (already infused with positional info) as input and outputs a set of embedding vectors.

This part is the core of the vision processing. It's often relatively straightforward to implement (phew!), partly because we don't need to worry about the KV cache complexity found in generative language models. The transformer processes all patches in a non-causal manner. This means it gets to peek at all the patches simultaneously, figuring out the relationships between them.

Illustration of causal vs. non-causal attention. Source: ResearchGate

The underlying transformer extracts features from the patches. If the image has a cat and a dog, the transformer churns out embedding vectors that somehow represent "cat-ness" and "dog-ness" derived from the patches.

Using non-causal attention is important because an object might span multiple patches. For instance, one patch might just contain the cat's majestic snoot, while another contains an ear. Non-causal attention helps the model piece the whole cat together from these different parts.
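
In code, "non-causal" simply means running self-attention over the patch sequence without any attention mask. A rough sketch built from PyTorch's stock encoder layers (the layer count and sizes are made up for illustration):

import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    def __init__(self, embed_dim=1024, num_heads=16, num_layers=24):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, patch_embeds):   # (batch, num_patches, embed_dim)
        # No attention mask: every patch attends to every other patch,
        # so features like "cat-ness" can be assembled across patches.
        return self.layers(patch_embeds)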

Of course, trying to directly interpret what's in these embedding vectors is like trying to understand abstract art - meaningful, but tricky! Below is a highly simplified idea:

Simplified example of what embedding vectors might represent

Projector

Okay, so we have a nice set of embedding vectors representing the image. But wait! In most cases, the language model expects input vectors of a different size (dimension) than what the vision encoder spits out. Uh oh.

To bridge this gap, we need a projector. The simplest method is often an MLP (multilayer perceptron) involving a couple of matrix multiplications with an activation function (like GELU) sandwiched in between. Here’s a basic idea in code:

import torch
import torch.nn as nn

class Projector(nn.Module):
    # input_dim:   dimension of the vision encoder's output vectors
    # hidden_size: dimension of the MLP's hidden layer
    # output_dim:  dimension the language model expects
    def __init__(self, input_dim, hidden_size, output_dim):
        super(Projector, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_size)
        self.activation = nn.GELU()
        self.fc2 = nn.Linear(hidden_size, output_dim)

    def forward(self, x):
        # input: n vectors of size input_dim
        x = self.fc1(x) # Project from input_dim to hidden_size
        x = self.activation(x)
        x = self.fc2(x) # Project from hidden_size to output_dim
        return x # output: n vectors of size output_dim

But life isn't always simple! Some models get fancier:

  • MiniCPM-V throws another transformer 😂 into the mix just for projection!
  • Models like Gemma 3, Phi-4-multimodal, and MobileVLM use a Pool2D layer to reduce the number of output vectors, kind of like "summarizing" the image embeddings, so fewer tokens get fed into the language model.

Yeah, because of these varying complexities, the projector is often one of the trickiest parts to reimplement accurately.
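
As an example of that "summarizing" trick, a pooling-based projector might average neighboring patches on the 2D grid before projecting them, so the language model sees fewer tokens. A rough sketch (the 2x2 pooling factor and dimensions are just examples, not any specific model's configuration):

import torch
import torch.nn as nn

class PoolingProjector(nn.Module):
    def __init__(self, input_dim, output_dim, grid_size=24, pool=2):
        super().__init__()
        self.grid_size = grid_size
        self.pool = nn.AvgPool2d(kernel_size=pool)   # merges pool x pool patches into one
        self.mlp = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.GELU(),
            nn.Linear(output_dim, output_dim),
        )

    def forward(self, x):                           # x: (batch, grid*grid, input_dim)
        b, n, d = x.shape
        g = self.grid_size
        x = x.transpose(1, 2).reshape(b, d, g, g)   # back to a 2D patch grid
        x = self.pool(x)                            # (batch, d, g/pool, g/pool)
        x = x.flatten(2).transpose(1, 2)            # fewer tokens than before
        return self.mlp(x)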

Language Decoder

"Hold on," you might be thinking, "language models work with tokens (which are basically just numbers), right? How can we feed them these continuous embedding vectors?"

Excellent question! Turns out, the text tokens also get converted into embedding vectors internally within the language model. Most (if not all) LLMs have a built-in lookup table (usually a tensor called embed_tokens.weight) that maps each token ID to its corresponding embedding vector.

So, what's actually happening looks more like this:

Diagram of image embeddings being concatenated with text token embeddings before entering the language model

See? From the language model's perspective, the image embeddings are just another set of input vectors, seamlessly concatenated with the text embeddings. The key difference is that image embeddings are dynamic (they change based on the input image), whereas text token embeddings are typically learned and fixed.

During training, the model learns to associate these image embeddings with the surrounding text context. If it sees the image embeddings corresponding to a cat, it learns that generating the word "cat" (or related concepts) is appropriate in that context.
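
Here's a highly simplified view of that splicing, assuming we already have the projected image embeddings and a prompt that marks the image position with a hypothetical placeholder token id:

import torch
import torch.nn as nn

# Hypothetical sizes and ids, just for illustration
vocab_size, hidden_dim = 32000, 4096
IMAGE_PLACEHOLDER_ID = 32001  # a made-up "insert image here" token id

embed_tokens = nn.Embedding(vocab_size + 8, hidden_dim)   # the LLM's lookup table

def build_input_embeds(token_ids, image_embeds):
    # token_ids: (seq_len,) containing one IMAGE_PLACEHOLDER_ID
    # image_embeds: (num_image_tokens, hidden_dim) from the projector
    pos = (token_ids == IMAGE_PLACEHOLDER_ID).nonzero(as_tuple=True)[0].item()
    text_embeds = embed_tokens(token_ids)   # token ids -> embedding vectors
    # Splice the image embeddings in where the placeholder token was
    return torch.cat([text_embeds[:pos], image_embeds, text_embeds[pos + 1:]], dim=0)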

Decoding Images with Multiple Slices

So far, we've mostly talked about single images or slices. But things get spicier when models handle multiple slices (like LLaVA-UHD, MiniCPM-V, Idefics). How does the language model know which embeddings belong to which slice, or where they fit spatially?

A common technique is to use special 'marker' tokens in the input sequence to delineate the embeddings from different slices or rows of slices. For example, with an image split into 4 slices plus the downscaled original:

[Downscaled Image] --> [Slice 1] --> [Slice 2]
                  |                |
                  +--> [Slice 3] --> [Slice 4]

The input embeddings fed to the LLM might be structured like this (using hypothetical special tokens):

<image>[Downscaled Image]</image>\n
<slice>[Slice 1]</slice><slice>[Slice 2]</slice>\n
<slice>[Slice 3]</slice><slice>[Slice 4]</slice>\n

Of course, the exact special tokens (<image>, <slice>, <row>, etc.) and structure vary significantly between models.
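
To make the idea concrete, here's a sketch that splices slice embeddings together with the embeddings of such marker tokens. The token names and layout are hypothetical, not taken from any real model:

import torch

def build_multi_slice_embeds(embed_tokens, markers, overview_embeds, slice_embeds, cols=2):
    # markers: dict mapping hypothetical special tokens ("<image>", "</image>",
    #          "<slice>", "</slice>", "\n") to their token ids
    # overview_embeds: (n, d) embeddings of the downscaled image
    # slice_embeds: list of (n, d) tensors, one per slice, in row-major order
    def tok(name):
        return embed_tokens(torch.tensor([markers[name]]))

    parts = [tok("<image>"), overview_embeds, tok("</image>"), tok("\n")]
    for i, s in enumerate(slice_embeds):
        parts += [tok("<slice>"), s, tok("</slice>")]
        if (i + 1) % cols == 0:            # end of a row of slices
            parts.append(tok("\n"))
    return torch.cat(parts, dim=0)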

Some models, like Qwen2-VL, take a different path, using the M-RoPE technique mentioned earlier to implicitly encode the 2D position of each patch's embeddings, avoiding the need for explicit slice tokens.

Illustration of M-RoPE. Source: Qwen2-VL technical report
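
A highly simplified way to picture M-RoPE is that each image token gets a (temporal, height, width) position triple derived from its patch coordinates instead of a single index - something like this sketch (not the actual Qwen2-VL implementation):

def mrope_position_ids(rows, cols, t=0):
    # Return a (temporal, height, width) triple for each patch of one image.
    # In the real model these three components drive separate sections of the
    # rotary embedding; text tokens just reuse the same index for all three.
    ids = []
    for r in range(rows):
        for c in range(cols):
            ids.append((t, r, c))
    return ids

# e.g. a 2x3 grid of patches:
# [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 1, 2)]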

This wide variety means figuring out the correct way to handle multi-slice embeddings in downstream projects like llama.cpp can feel like assembling IKEA furniture without instructions - definitely challenging!

What the Future Holds

We've journeyed through the ideas and inner workings of vision models.

As you've seen, the vision encoder is essentially a way to translate images into a language (embeddings) that the LLM can understand. The cool part is, this 'encoder-decoder' concept isn't just for pictures! We can apply the same fundamental idea to other modalities like audio or video. The main difference would be swapping out the vision encoder for an encoder suited to that specific modality.

For instance, to process video input, a model like Qwen2.5-Omni uses its vision encoder frame-by-frame for the visual stream and another transformer to encode the audio stream. The outputs from both encoders are then fed into the language model to understand the combined audio-visual input.

Illustration of Qwen2.5-Omni video processing. Source: Qwen2.5-Omni-7B model card

The possibilities for multi-modal AI are vast and incredibly exciting!

Conclusion

In this article, I've shared a bit about my journey wrestling with vision models and hopefully shed some light on how these fascinating beasts work under the hood. From the high-level concepts to the nitty-gritty implementation details (and occasional headaches!), it's been quite an adventure.

I hope this glimpse into the world of computer vision inspires you to explore further and perhaps even dive into contributing to the vibrant open-source AI community. There's always more to learn and build!
