|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- xiaorui638/cc3m |
|
|
- liuhaotian/LLaVA-Instruct-150K |
|
|
metrics: |
|
|
- bleu |
|
|
- accuracy |
|
|
base_model: |
|
|
- LiquidAI/LFM2-350M |
|
|
--- |
|
|
# **VIPER-L1: A Family of Small Multimodal-LLMs**
|
|
|
|
|
<div align="center"> |
|
|
<a href="./"> |
|
|
<img src="viper-l1_represent.png" width="80%" alt="Viper-L1 Logo"/> |
|
|
</a> |
|
|
<br/> |
|
|
<i>"Fast. Compact. Vision-Language Intelligence."</i>
|
|
</div> |
|
|
|
|
|
***Note:*** This model is still being improved, so we recommend fine-tuning it for your specific use case.
|
|
|
|
|
--- |
|
|
|
|
|
## Overview
|
|
|
|
|
**Viper-L1** is an open-source **small multimodal large language model (Multimodal-LLM)** designed for efficient multimodal reasoning and deployment on consumer GPUs. |
|
|
It is built upon the [**Liquid Model**](https://huggingface.co/LiquidAI/LFM2-350M) architecture (~1.2B parameters in total), enabling a powerful yet lightweight foundation for **personal research, on-device applications, and internal experimentation**.
|
|
|
|
|
--- |
|
|
|
|
|
## Key Features
|
|
|
|
|
* **Efficient Training & Inference**
|
|
  Trained on **2× H100 GPUs** within **~2 days**, thanks to our lightweight multimodal fusion and liquid transformer design.
|
|
Inference runs smoothly even on **RTX 4070** GPUs. |
|
|
|
|
|
* **Multimodal Connector (Sense Integration Module)**
|
|
  Inspired by human perception, Viper-L1 introduces a *connector* that fuses signals from different sensory encoders (vision, audio, etc.), enabling deeper **cross-modal alignment** and improved reasoning; a minimal sketch of the idea follows this list.
|
|
|
|
|
* **Hybrid Architecture**
|
|
Combines the **semantic strength of Transformers** with the **efficiency of Liquid Neural Networks**, resulting in a compact yet expressive multimodal model. |
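
To make the connector idea more concrete, below is a minimal sketch of one common design: a small MLP that projects patch embeddings from the vision encoder into the language backbone's embedding space. The class name and dimensions are illustrative assumptions, not Viper-L1's actual implementation; see the repository for the real code.

```python
import torch
import torch.nn as nn


class MultimodalConnector(nn.Module):
    """Hypothetical connector sketch: projects vision-encoder patch
    embeddings into the language backbone's embedding space."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the vision encoder
        # returns:      (batch, num_patches, llm_dim), ready to be interleaved
        #               with the text embeddings at the <image> placeholder
        return self.proj(vision_feats)
```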
|
|
|
|
|
--- |
|
|
|
|
|
## Progress
|
|
|
|
|
* **Released** - Viper-L1 model checkpoint
|
* **Coming Soon** - Fully documented training and inference scripts
|
|
* **Coming Soon** - Fully documented post-training scripts (LoRA, DPO, GRPO)
|
|
|
|
|
Stay tuned for our next updates on model fine-tuning and multimodal reasoning enhancements. |
|
|
|
|
|
--- |
|
|
|
|
|
## Architecture
|
|
|
|
|
The overall architecture is shown below: |
|
|
|
|
|
<div align="center"> |
|
|
<a href="./"> |
|
|
<img src="viper-l1.png" width="80%" alt="Viper-L1 Architecture"/> |
|
|
</a> |
|
|
</div> |
|
|
|
|
|
**Main Components:** |
|
|
|
|
|
1. **Vision Encoder** - Extracts compact visual embeddings
|
|
2. **Multimodal Connector** - Fuses sensory inputs efficiently
|
|
3. **Language Backbone (LFM2-350M-based)** - Performs semantic reasoning and response generation (a sketch of how these components connect follows this list)
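
Putting the three components together, the sketch below shows one common way the connector's output can replace the `<image>` placeholder token inside the text embedding sequence before it reaches the language backbone. The function name, shapes, and single-placeholder assumption are illustrative only; refer to the repository for Viper-L1's actual forward pass.

```python
import torch


def splice_image_embeddings(
    input_ids: torch.Tensor,     # (1, seq_len) token ids containing one <image> token
    text_embeds: torch.Tensor,   # (1, seq_len, llm_dim) from the LLM embedding layer
    image_embeds: torch.Tensor,  # (1, num_patches, llm_dim) output of the connector
    image_token_id: int,
) -> torch.Tensor:
    """Hypothetical sketch: replace the single <image> placeholder with the
    projected patch embeddings, yielding the sequence the backbone consumes."""
    pos = (input_ids[0] == image_token_id).nonzero(as_tuple=True)[0].item()
    return torch.cat(
        [text_embeds[:, :pos], image_embeds, text_embeds[:, pos + 1:]],
        dim=1,
    )
```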
|
|
|
|
|
> *The current Viper-L1 (1.2B parameters) was trained on ~4 million images using 2× H100 GPUs for 2 days.*
|
|
|
|
|
--- |
|
|
|
|
|
## Benchmark Results
|
|
|
|
|
| Benchmark | Task | Split | Metric | Viper-L1 (CoT) | |
|
|
|-------------|------|-------|----------|----------------| |
|
|
| RealWorldQA | VQA | Test | Accuracy | **33.73%** | |
|
|
| Other benchmarks | VQA | Test | Accuracy | Ongoing |
|
|
|
|
|
**Notes.** CoT = Chain-of-Thought prompting enabled during inference. Exact settings (temperature/top-p/max tokens) can influence results; see the inference snippet below to replicate typical generation settings. |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage
|
|
|
|
|
To get started with **inference**, follow the setup in the main repository: |
|
|
|
|
|
[**Viper-VLM Repository**](https://github.com/huyquoctrinh/Viper-LM)
|
|
Example inference script: [`infer_viper.sh`](https://github.com/huyquoctrinh/Viper-LM/blob/feat/viper-vlm_cot/infer_viper.sh)
|
|
|
|
|
Alternatively, you can use the following function for inference:
|
|
|
|
|
```python |
|
|
import os |
|
|
import argparse |
|
|
import torch |
|
|
from PIL import Image |
|
|
from transformers import AutoTokenizer, AutoProcessor |
|
|
from model import ViperLMForCausalLM # your local class |
|
|
IMAGE_TOKEN_ID = 64400  # id of the <image> placeholder token (kept for reference; the id is re-resolved from the config/vocab below)
|
|
def build_messages(question: str, include_image: bool = True): |
|
|
# Mirror CCDataset._format_prompt() |
|
|
user_content = ("<image> " if include_image else "") + (question or "") |
|
|
return [ |
|
|
{"role": "user", "content": user_content}, |
|
|
# assistant turn is left empty; apply_chat_template(add_generation_prompt=True) will add assistant prefix |
|
|
] |
|
|
|
|
|
@torch.inference_mode() |
|
|
def generate_answer( |
|
|
ckpt_dir: str, |
|
|
tokenizer_path: str, |
|
|
processor_path: str, |
|
|
image_path: str, |
|
|
question: str, |
|
|
device: str = "cuda", |
|
|
dtype: str = "bf16", |
|
|
max_new_tokens: int = 128, |
|
|
temperature: float = 0.2, |
|
|
top_p: float = 0.9, |
|
|
repetition_penalty: float = 1.05, |
|
|
): |
|
|
# --- device / dtype --- |
|
|
device = torch.device(device if torch.cuda.is_available() else "cpu") |
|
|
use_bf16 = (dtype.lower() == "bf16") |
|
|
use_fp16 = (dtype.lower() == "fp16") |
|
|
amp_dtype = torch.bfloat16 if use_bf16 else (torch.float16 if use_fp16 else torch.float32) |
|
|
|
|
|
# --- tokenizer / processor --- |
|
|
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=True) |
|
|
if tokenizer.pad_token_id is None: |
|
|
tokenizer.pad_token = tokenizer.eos_token |
|
|
# optional but common for generation with left context |
|
|
if not hasattr(tokenizer, "padding_side") or tokenizer.padding_side != "left": |
|
|
tokenizer.padding_side = "left" |
|
|
|
|
|
processor = AutoProcessor.from_pretrained(processor_path) |
|
|
|
|
|
# --- model --- |
|
|
model = ViperLMForCausalLM.from_pretrained( |
|
|
ckpt_dir, |
|
|
torch_dtype=amp_dtype if device.type == "cuda" else torch.float32, |
|
|
).to(device) |
|
|
model.eval() |
|
|
if getattr(model.config, "pad_token_id", None) is None: |
|
|
model.config.pad_token_id = tokenizer.pad_token_id |
|
|
|
|
|
# expose image token id if your forward expects it; keep it consistent with training |
|
|
image_token_id = getattr(model.config, "image_token_id", None) |
|
|
if image_token_id is None and "<image>" in tokenizer.get_vocab(): |
|
|
image_token_id = tokenizer.convert_tokens_to_ids("<image>") |
|
|
|
|
|
# --- text input with the SAME chat template as training --- |
|
|
messages = build_messages(question=question, include_image=True) |
|
|
enc = tokenizer.apply_chat_template( |
|
|
messages, |
|
|
add_generation_prompt=True, # adds assistant header the model expects before generation |
|
|
tokenize=True, |
|
|
return_tensors="pt", |
|
|
) |
|
|
if isinstance(enc, torch.Tensor): |
|
|
input_ids = enc |
|
|
attention_mask = torch.ones_like(enc, dtype=torch.long) |
|
|
else: |
|
|
input_ids = enc["input_ids"] |
|
|
attention_mask = enc.get("attention_mask") |
|
|
if attention_mask is None: |
|
|
attention_mask = torch.ones_like(input_ids, dtype=torch.long) |
|
|
|
|
|
input_ids = input_ids.to(device) |
|
|
attention_mask = attention_mask.to(device) |
|
|
|
|
|
# --- image preprocessing (match training) --- |
|
|
img = Image.open(image_path).convert("RGB") |
|
|
proc = processor(images=[img], return_tensors="pt") # list, like training |
|
|
pixel_values = proc.get("pixel_values", None) |
|
|
if pixel_values is None: |
|
|
raise ValueError("Processor did not return 'pixel_values'. Check processor_path.") |
|
|
pixel_values = pixel_values.to(device) # (1, 3, H, W) |
|
|
|
|
|
# --- generate --- |
|
|
gen_kwargs = { |
|
|
"max_new_tokens": max_new_tokens, |
|
|
"do_sample": temperature > 0.0, |
|
|
"temperature": max(temperature, 1e-6), |
|
|
"top_p": top_p, |
|
|
"repetition_penalty": repetition_penalty, |
|
|
"eos_token_id": tokenizer.eos_token_id, |
|
|
"pad_token_id": tokenizer.pad_token_id, |
|
|
"image_inputs": pixel_values, |
|
|
# IMPORTANT: use the same argument names your model.forward saw in training # not "image_inputs" |
|
|
"image_token_id": image_token_id, # if your forward uses it |
|
|
"use_cache": False, |
|
|
} |
|
|
|
|
|
if device.type == "cuda" and (use_bf16 or use_fp16): |
|
|
with torch.autocast(device_type="cuda", dtype=amp_dtype): |
|
|
out = model.generate( |
|
|
input_ids=input_ids, |
|
|
attention_mask=attention_mask, |
|
|
**gen_kwargs |
|
|
) |
|
|
else: |
|
|
out = model.generate( |
|
|
input_ids=input_ids, |
|
|
attention_mask=attention_mask, |
|
|
**gen_kwargs |
|
|
) |
|
|
|
|
|
# --- decode only new tokens --- |
|
|
generated = out[0] |
|
|
prompt_len = input_ids.size(1) |
|
|
new_tokens = generated[prompt_len:] |
|
|
answer = tokenizer.decode(new_tokens, skip_special_tokens=True) |
|
|
return answer.strip() |
|
|
|
|
|
if __name__ == "__main__": |
|
|
    ckpt_dir = ""        # path or Hugging Face repo id of the Viper-L1 checkpoint
    tokenizer_path = ""  # tokenizer path (often the same as ckpt_dir)
    processor_path = ""  # image processor path (e.g. the SigLIP processor used in training)
    image_path = ""      # path to the input image
    question = ""        # question to ask about the image
    device = "cuda"      # generate_answer falls back to CPU if CUDA is unavailable
|
|
ans = generate_answer( |
|
|
ckpt_dir=ckpt_dir, |
|
|
tokenizer_path=tokenizer_path, |
|
|
processor_path=processor_path, |
|
|
image_path=image_path, |
|
|
question=question, |
|
|
device=device, |
|
|
        dtype="bf16",  # generate_answer expects "bf16" or "fp16"; anything else falls back to fp32
|
|
max_new_tokens=128, |
|
|
temperature=0.7, |
|
|
top_p=0.8, |
|
|
        repetition_penalty=1.0,
|
|
) |
|
|
    print("\n====== Answer ======\n")
|
|
print(ans) |
|
|
|
|
|
``` |
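
Before running the example, fill in `ckpt_dir`, `tokenizer_path`, and `processor_path` with the local path (or Hugging Face repo id) of the Viper-L1 checkpoint, point `image_path` at an image on disk, and set `question` to your prompt. Note that `generate_answer` decodes only the newly generated tokens, so the returned string contains just the model's answer.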
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgements
|
|
|
|
|
We gratefully thank the following foundational projects for inspiring and enabling our research: |
|
|
|
|
|
* [**Liquid Model**](https://huggingface.co/LiquidAI/LFM2-350M) - Base architecture for dynamic neural computation
|
|
* [**SigLIP**](https://huggingface.co/google/siglip2-base-patch16-naflex) - Vision encoder powering multimodal understanding
|
|
|
|
|
Their open-source contributions have made **Viper-L1** possible.
|
|
|
|
|
--- |
|
|
|
|
|
## Contact
|
|
|
|
|
If you're interested in collaboration or research discussions:
|
|
[**Contact us**](https://github.com/huyquoctrinh) or open an issue in the repository.
|
|
|
|
|
--- |