Spaces:
Running
on
Zero
A newer version of the Gradio SDK is available:
5.36.2
title: SmolVLM2 On Transformers
emoji: 🐠
colorFrom: blue
colorTo: pink
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
license: mit
short_description: ' SmolVLM2 on transformers'
🎥 Real‑Time Webcam Captioning with SmolVLM2
This project provides a Gradio‑based web app that captures your webcam feed and generates natural‑language captions in real time using the SmolVLM2 image‑to‑text model family. It supports CPU, CUDA GPUs, and Intel XPU accelerators, with configurable capture intervals and custom prompts.
🚀 Features
- Real‑time captions from your webcam stream
- Support for CPU, CUDA (NVIDIA GPU), and XPU (Intel accelerator)
- Adjustable capture interval (default: 3000 ms)
- Custom system and user prompts for fine‑tuning the caption style
- Debug logging of preprocessing, tokenization, inference, and decoding times
- Model caching to avoid repeated reloads on parameter changes
🛠️ Prerequisites
- Python 3.9+ (tested on 3.12)
- Git (to clone this repo)
- Intel oneAPI and PyTorch XPU build (if you plan to use XPU)
Python packages
pip install torch transformers gradio opencv-python pillow
(If you have CUDA‑enabled hardware, install a CUDA‑compatible torch build; if you have XPU hardware, ensure you install the Intel XPU‑patched torch.)
📦 Installation
Clone this repository:
git clone https://github.com/<your‑org>/smolvml2‑webcam‑caption.git cd smolvml2‑webcam‑caption
Create and activate a virtual environment:
python -m venv .venv source .venv/bin/activate
Install dependencies:
pip install -r requirements.txt
▶️ Usage
Run the app:
python app.py
Open the URL printed by Gradio in your browser (usually
http://127.0.0.1:7860
).Select:
- Model ID: choose between
SmolVLM2-256M
,SmolVLM2-500M
, orSmolVLM2-2.2B
. - Device:
cpu
,cuda
, orxpu
(if available).
- Model ID: choose between
Adjust the Interval (ms) slider to set how often frames are captioned.
Edit the System Prompt and User Prompt to control the style and content of captions.
Allow webcam access and watch live captions appear along with debug logs.
⚙️ Configuration Options
Parameter | Description | Default |
---|---|---|
Model ID | HuggingFace repo for SmolVLM2 model | SmolVLM2-256M-Video-Instruct |
Device | Compute device: cpu , cuda , or xpu |
cuda if available |
Interval (ms) | Delay between frame captures (milliseconds) | 3000 |
System Prompt | Instruction guiding the caption style | Describe the key action |
User Prompt | Question shown to the model alongside the image | What is happening in this image? |
🐞 Troubleshooting
- KeyError in Gradio event handling: ensure
demo.queue()
is called beforedemo.launch()
. - XPU BFloat16 errors: XPU currently handles float32 more reliably; the code defaults to float32 on XPU.
- Model reload slowness: models are cached—changing only prompts or intervals won’t trigger a reload.
📜 License
This project is licensed under the MIT License. See LICENSE for details.
Developed by Luigi Liu (2025)