Visual Salamandra: Pushing the Boundaries of Multimodal Understanding
The Language Technologies Lab takes a major step forward in multimodal artificial intelligence with the release of Visual Salamandra, extending the capabilities of the Salamandra large language model (LLM) to both images and video. Visual Salamandra is based on the 7-billion-parameter foundational model, maintaining its compactness and efficiency while extending it to multimodal tasks.
Designed with vision-language alignment at its core, Visual Salamandra builds on top of the Salamandra Instructed 7B model by integrating Google’s SigLIP encoder (SigLIP-So400m), a 2-layer MLP projector, and advanced late-fusion techniques to bridge the gap between visual and textual modalities.
The resulting architecture enables Visual Salamandra to comprehend and generate contextually accurate responses from diverse inputs, ranging from single and multiple images and videos to purely textual instructions. This development reflects a broader commitment by the Lab to support robust, multilingual, and multimodal AI systems—especially those that prioritize European linguistic diversity.
Training Visual Salamandra: A Deep Dive into Vision Experiments
To adapt Salamandra for visual inputs, the Lab implemented a four-phase training process centered on a late-fusion architecture. In this setup, a pre-trained image encoder (SigLIP, patch size 14 at 384x384 resolution) generates image embeddings, which are then aligned with the LLM via a custom-trained multilayer perceptron (MLP) projector.
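To make the setup concrete, the snippet below is a minimal sketch of such a late-fusion pipeline in PyTorch: a 2-layer MLP maps the encoder's patch embeddings into the LLM's hidden space, and the projected visual tokens are concatenated with the text token embeddings before being passed to the LLM. All class names and dimensions (e.g., 1152 for SigLIP-So400m features, 4096 for a 7B LLM) are illustrative assumptions, not the Lab's actual implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP that maps SigLIP patch embeddings into the LLM's hidden space.
    Dimensions are illustrative: 1152-d patch features (SigLIP-So400m) and a
    4096-d LLM hidden size are assumptions, not confirmed values."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_embeds)

def fuse_inputs(projected_patches: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
    """Late fusion: prepend projected visual tokens to the text token
    embeddings so the LLM attends over a single mixed sequence."""
    return torch.cat([projected_patches, text_embeds], dim=1)

# A 384x384 input with patch size 14 yields 27x27 = 729 patch tokens.
patches = torch.randn(1, 729, 1152)   # stand-in for SigLIP output
text = torch.randn(1, 32, 4096)       # stand-in for LLM token embeddings
fused = fuse_inputs(VisionProjector()(patches), text)
print(fused.shape)  # torch.Size([1, 761, 4096])
```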
The four training phases include:
• Phase 1: Projector Pre-training – Only the projector is trained to map image features into the LLM’s latent space (see the parameter-freezing sketch after this list).
• Phase 2: High-Quality Vision Pretraining – Using refined datasets (e.g., OCR and re-captioned images), the entire architecture (encoder, projector, and LLM) undergoes joint training.
• Phase 3: Instruction Tuning – The model learns to follow user instructions via visual question answering (VQA), OCR, and other grounded vision tasks.
• Phase 4: Full Multimodal Tuning – Incorporates single/multi-image and video data, along with text-only examples, to optimize the model’s generalization to real-world, multi-input scenarios.
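One common way to realize such a schedule is to control which components receive gradient updates in each phase. The sketch below is a rough illustration rather than the Lab's training code: it freezes everything except the projector in Phase 1 and unfreezes all components afterwards. Keeping the vision encoder trainable in Phases 3 and 4 is an assumption, since the text only states explicitly that Phase 2 trains the entire architecture.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_phase(phase: int, encoder: nn.Module,
                    projector: nn.Module, llm: nn.Module) -> None:
    """Hypothetical per-phase freezing schedule for a late-fusion VLM:
    Phase 1 trains only the projector; later phases train all components
    and differ mainly in the data mixture (an assumption for Phases 3-4)."""
    if phase == 1:
        set_trainable(encoder, False)
        set_trainable(llm, False)
        set_trainable(projector, True)
    else:  # Phases 2-4: joint training of encoder, projector, and LLM
        set_trainable(encoder, True)
        set_trainable(projector, True)
        set_trainable(llm, True)
```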
Data diversity played a crucial role throughout training. A total of 6.1 million instruction-tuning instances were used, including 842,000 text-only samples. The training corpus featured data from sources like AI2D, Cambrian, and LLaVA Next, chosen to enhance visual grounding, document understanding, mathematical reasoning, and OCR.
Figure 1. Data distribution during the Visual Salamandra 7B training process
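As a rough illustration of how such a mixture (summarized in Figure 1) can be interleaved during training, the snippet below samples a data source with probability proportional to its size. Only the 6.1 million total and the 842,000 text-only count come from the post; the two-bucket grouping and the function name are simplifying assumptions.

```python
import random

# Illustrative mixture: the totals come from the post; the per-source
# breakdown (AI2D, Cambrian, LLaVA Next, ...) is not reproduced here.
mixture = {
    "multimodal_instructions": 6_100_000 - 842_000,  # images, multi-image, video
    "text_only_instructions": 842_000,
}

def sample_source(mixture: dict[str, int]) -> str:
    """Pick a data source with probability proportional to its size, so
    text-only regularization examples are interleaved with visual data."""
    sources, sizes = zip(*mixture.items())
    return random.choices(sources, weights=sizes, k=1)[0]

print(sample_source(mixture))
```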
Multilingual Data and European Language Representation
As with previous models from the Language Technologies Lab, Visual Salamandra continues the commitment to multilingual inclusivity, with a strong focus on European languages.
This approach ensures that underrepresented languages benefit from instruction tuning and alignment with vision tasks, helping to close the resource gap in multimodal AI research. Visual Salamandra is one of the first models of its kind to integrate such linguistic plurality into a multimodal instruction-tuned framework.
Figure 2. Multilingual generation examples with the model trained with Text Regularization and merged with the original backbone LLM.
Applications and Future Directions
Visual Salamandra unlocks a wide range of applications at the intersection of language and vision, such as:
• Visual Question Answering (VQA): Ask questions about an image or video and receive context-aware, accurate responses (a usage sketch follows this list).
• Optical Character Recognition (OCR): Accurately read and transcribe text from documents, scenes, and charts.
• Document and Chart Understanding: Analyze complex visual documents or graphical content with embedded text.
• Mathematical Reasoning: Solve visually grounded math problems through multimodal reasoning.
• Instruction-based Image Interaction: Follow detailed instructions in visual contexts, including image captioning and localization tasks.
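For example, a VQA query could look like the sketch below using the Hugging Face transformers library. This is an illustration under assumptions, not official usage documentation: the model identifier is a placeholder, and the exact prompt or chat-template formatting depends on the released processor configuration, so consult the model card before use.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Placeholder identifier; replace with the actual repository id from the model card.
model_id = "org/visual-salamandra-7b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("chart.png")
prompt = "What is the highest value shown in this chart?"

# Exact prompt formatting is an assumption; the released model may require a chat template.
inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```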
The inclusion of video capabilities also opens the door for further developments in video summarization, event detection, and multimodal storytelling.
With Visual Salamandra, the Language Technologies Lab demonstrates its ongoing leadership in creating inclusive, high-performing foundational models. By harmonizing state-of-the-art vision encoders with strong multilingual LLMs, the team is setting the stage for next-generation AI systems that see, understand, and communicate—across modalities and languages.
Ethical concerns and limitations
While Visual Salamandra shows strong multimodal capabilities, it is important to note its limitations:
• It may hallucinate plausible but incorrect answers, especially when visual inputs are ambiguous.
• Performance on complex OCR and dense document layouts is still challenging.
• The model was trained with filtered and licensed datasets, but users should remain vigilant about potential biases or inaccuracies, particularly when deployed in sensitive applications.
We recommend using Visual Salamandra in contexts where human oversight is possible and avoiding high-stakes applications without proper evaluation.
Visual Salamandra is released under the Apache License, Version 2.0, a permissive license that allows both research and commercial use.
Stay tuned for future releases and tools built on Visual Salamandra, and explore the full model details in our paper.
The Language Technologies Lab team
Acknowledgment
This work has been supported and funded by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia – funded by the EU through NextGenerationEU, within the framework of the Modelos del Lenguaje project.