metadata

title: SmolVLM2 On Transformers
emoji: 🐠
colorFrom: blue
colorTo: pink
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
license: mit
short_description: ' SmolVLM2 on transformers'

🎥 Real‑Time Webcam Captioning with SmolVLM2

This project provides a Gradio‑based web app that captures your webcam feed and generates natural‑language captions in real time using the SmolVLM2 image‑to‑text model family. It supports CPU, CUDA GPUs, and Intel XPU accelerators, with configurable capture intervals and custom prompts.

🚀 Features

Real‑time captions from your webcam stream
Support for CPU, CUDA (NVIDIA GPU), and XPU (Intel accelerator)
Adjustable capture interval (default: 3000 ms)
Custom system and user prompts for fine‑tuning the caption style
Debug logging of preprocessing, tokenization, inference, and decoding times
Model caching to avoid repeated reloads on parameter changes

🛠️ Prerequisites

Python 3.9+ (tested on 3.12)
Git (to clone this repo)
Intel oneAPI and PyTorch XPU build (if you plan to use XPU)

Python packages

pip install torch transformers gradio opencv-python pillow

(If you have CUDA‑enabled hardware, install a CUDA‑compatible torch build; if you have XPU hardware, ensure you install the Intel XPU‑patched torch.)

📦 Installation

Clone this repository:

git clone https://github.com/<your‑org>/smolvml2‑webcam‑caption.git
cd smolvml2‑webcam‑caption

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```

▶️ Usage

Run the app:
```
python app.py
```
Open the URL printed by Gradio in your browser (usually http://127.0.0.1:7860).
Select:
- Model ID: choose between SmolVLM2-256M, SmolVLM2-500M, or SmolVLM2-2.2B.
- Device: cpu, cuda, or xpu (if available).
Adjust the Interval (ms) slider to set how often frames are captioned.
Edit the System Prompt and User Prompt to control the style and content of captions.
Allow webcam access and watch live captions appear along with debug logs.

⚙️ Configuration Options

Parameter	Description	Default
Model ID	HuggingFace repo for SmolVLM2 model	`SmolVLM2-256M-Video-Instruct`
Device	Compute device: `cpu`, `cuda`, or `xpu`	`cuda` if available
Interval (ms)	Delay between frame captures (milliseconds)	`3000`
System Prompt	Instruction guiding the caption style	`Describe the key action`
User Prompt	Question shown to the model alongside the image	`What is happening in this image?`

🐞 Troubleshooting

KeyError in Gradio event handling: ensure demo.queue() is called before demo.launch().
XPU BFloat16 errors: XPU currently handles float32 more reliably; the code defaults to float32 on XPU.
Model reload slowness: models are cached—changing only prompts or intervals won’t trigger a reload.

📜 License

This project is licensed under the MIT License. See LICENSE for details.

Developed by Luigi Liu (2025)