Luigi's picture
initial commit
5dfbac5

A newer version of the Gradio SDK is available: 5.36.2

Upgrade
metadata
title: SmolVLM2 On Transformers
emoji: 🐠
colorFrom: blue
colorTo: pink
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
license: mit
short_description: ' SmolVLM2 on transformers'

🎥 Real‑Time Webcam Captioning with SmolVLM2

This project provides a Gradio‑based web app that captures your webcam feed and generates natural‑language captions in real time using the SmolVLM2 image‑to‑text model family. It supports CPU, CUDA GPUs, and Intel XPU accelerators, with configurable capture intervals and custom prompts.

🚀 Features

  • Real‑time captions from your webcam stream
  • Support for CPU, CUDA (NVIDIA GPU), and XPU (Intel accelerator)
  • Adjustable capture interval (default: 3000 ms)
  • Custom system and user prompts for fine‑tuning the caption style
  • Debug logging of preprocessing, tokenization, inference, and decoding times
  • Model caching to avoid repeated reloads on parameter changes

🛠️ Prerequisites

  • Python 3.9+ (tested on 3.12)
  • Git (to clone this repo)
  • Intel oneAPI and PyTorch XPU build (if you plan to use XPU)

Python packages

pip install torch transformers gradio opencv-python pillow

(If you have CUDA‑enabled hardware, install a CUDA‑compatible torch build; if you have XPU hardware, ensure you install the Intel XPU‑patched torch.)

📦 Installation

  1. Clone this repository:

    git clone https://github.com/<your‑org>/smolvml2‑webcam‑caption.git
    cd smolvml2‑webcam‑caption
    
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate
    
  3. Install dependencies:

    pip install -r requirements.txt
    

▶️ Usage

  1. Run the app:

    python app.py
    
  2. Open the URL printed by Gradio in your browser (usually http://127.0.0.1:7860).

  3. Select:

    • Model ID: choose between SmolVLM2-256M, SmolVLM2-500M, or SmolVLM2-2.2B.
    • Device: cpu, cuda, or xpu (if available).
  4. Adjust the Interval (ms) slider to set how often frames are captioned.

  5. Edit the System Prompt and User Prompt to control the style and content of captions.

  6. Allow webcam access and watch live captions appear along with debug logs.

⚙️ Configuration Options

Parameter Description Default
Model ID HuggingFace repo for SmolVLM2 model SmolVLM2-256M-Video-Instruct
Device Compute device: cpu, cuda, or xpu cuda if available
Interval (ms) Delay between frame captures (milliseconds) 3000
System Prompt Instruction guiding the caption style Describe the key action
User Prompt Question shown to the model alongside the image What is happening in this image?

🐞 Troubleshooting

  • KeyError in Gradio event handling: ensure demo.queue() is called before demo.launch().
  • XPU BFloat16 errors: XPU currently handles float32 more reliably; the code defaults to float32 on XPU.
  • Model reload slowness: models are cached—changing only prompts or intervals won’t trigger a reload.

📜 License

This project is licensed under the MIT License. See LICENSE for details.


Developed by Luigi Liu (2025)