---
title: SmolVLM2 On Transformers
emoji: 🐠
colorFrom: blue
colorTo: pink
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
license: mit
short_description: 'SmolVLM2 on transformers'
---

# 🎥 Real‑Time Webcam Captioning with SmolVLM2

This project provides a Gradio‑based web app that captures your webcam feed and generates natural‑language captions in real time using the SmolVLM2 image‑to‑text model family. It supports CPU, CUDA GPUs, and Intel XPU accelerators, with configurable capture intervals and custom prompts.

## 🚀 Features

* **Real‑time captions** from your webcam stream
* Support for **CPU**, **CUDA** (NVIDIA GPU), and **XPU** (Intel accelerator)
* Adjustable **capture interval** (default: 3000 ms)
* Custom **system** and **user** prompts for controlling the caption style
* **Debug logging** of preprocessing, tokenization, inference, and decoding times
* Model caching to avoid repeated reloads on parameter changes (see the sketch below)
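
The last feature above can be pictured as a small cache keyed only by the parameters that actually force a reload. A minimal sketch, assuming an `lru_cache`-style cache and a float32-on-XPU dtype policy (the helper name and details are illustrative, not the exact code in `app.py`):

```python
from functools import lru_cache

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor


@lru_cache(maxsize=2)
def load_model(model_id: str, device: str):
    """Load (or reuse) the model/processor pair for a given model_id and device."""
    # Assumption: float16 on CUDA, float32 elsewhere (XPU handles float32 more reliably).
    dtype = torch.float16 if device == "cuda" else torch.float32
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=dtype).to(device)
    return model, processor
```

Because prompts and the capture interval are not part of the cache key, changing them reuses the already-loaded model.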

## 🛠️ Prerequisites

* **Python** 3.9+ (tested on 3.12)
* **Git** (to clone this repo)
* **Intel oneAPI** and **PyTorch XPU** build (if you plan to use XPU)

### Python packages

```bash
pip install torch transformers gradio opencv-python pillow
```

*(If you have CUDA‑enabled hardware, install a CUDA‑compatible torch build; if you have XPU hardware, ensure you install the Intel XPU‑patched torch.)*
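
If you are unsure which accelerators your installed `torch` build exposes, a quick standalone check (not part of the app) can confirm this before you launch:

```python
import torch

# Report which compute devices this torch build can see.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
# torch.xpu only exists on Intel XPU-enabled builds, so guard the attribute lookup.
print("XPU available:", hasattr(torch, "xpu") and torch.xpu.is_available())
```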

## 📦 Installation

1. **Clone** this repository:

   ```bash
   git clone https://github.com/<your-org>/smolvlm2-webcam-caption.git
   cd smolvlm2-webcam-caption
   ```
2. **Create** and **activate** a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate
   ```
3. **Install** dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## ▶️ Usage

1. **Run** the app:

   ```bash
   python app.py
   ```
2. **Open** the URL printed by Gradio in your browser (usually `http://127.0.0.1:7860`).
3. **Select**:

   * **Model ID**: choose between `SmolVLM2-256M`, `SmolVLM2-500M`, or `SmolVLM2-2.2B`.
   * **Device**: `cpu`, `cuda`, or `xpu` (if available).
4. **Adjust** the **Interval (ms)** slider to set how often frames are captioned.
5. **Edit** the **System Prompt** and **User Prompt** to control the style and content of captions.
6. **Allow** webcam access and watch live captions appear along with debug logs.

## ⚙️ Configuration Options

| Parameter     | Description                                     | Default                            |
| ------------- | ----------------------------------------------- | ---------------------------------- |
| Model ID      | HuggingFace repo for SmolVLM2 model             | `SmolVLM2-256M-Video-Instruct`     |
| Device        | Compute device: `cpu`, `cuda`, or `xpu`         | `cuda` if available                |
| Interval (ms) | Delay between frame captures (milliseconds)     | `3000`                             |
| System Prompt | Instruction guiding the caption style           | `Describe the key action`          |
| User Prompt   | Question shown to the model alongside the image | `What is happening in this image?` |
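
These parameters feed a standard image-to-text generation call. Below is a minimal sketch of how one captured frame might be captioned with the defaults above, assuming the `HuggingFaceTB/` repo prefix and the usual SmolVLM chat-template pattern (not a copy of `app.py`):

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"  # assumed full repo ID
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).to(device)

frame = Image.open("frame.jpg")  # stand-in for one webcam frame
messages = [
    {"role": "system", "content": [{"type": "text", "text": "Describe the key action"}]},
    {"role": "user", "content": [
        {"type": "image"},  # placeholder; the actual image is passed to the processor below
        {"type": "text", "text": "What is happening in this image?"},
    ]},
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[frame], return_tensors="pt").to(device)
generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

In the app, `frame` comes from the webcam stream rather than a file, and the model/processor pair comes from the cache described in the Features section.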

## 🐞 Troubleshooting

* **KeyError** in Gradio event handling: ensure `demo.queue()` is called before `demo.launch()` (see the sketch after this list).
* **XPU BFloat16 errors**: XPU currently handles float32 more reliably; the code defaults to float32 on XPU.
* **Model reload slowness**: models are cached—changing only prompts or intervals won’t trigger a reload.
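
For the first item, the fix is purely a matter of call order. A minimal Gradio sketch with a placeholder interface (not the app's actual UI):

```python
import gradio as gr

def caption(image):
    return "caption goes here"  # placeholder handler

demo = gr.Interface(fn=caption, inputs=gr.Image(), outputs=gr.Textbox())

# Enable the event queue *before* launching; launching without it is what
# triggers the KeyError in event handling.
demo.queue()
demo.launch()
```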

## 📜 License

This project is licensed under the **MIT License**. See [LICENSE](LICENSE) for details.

---

*Developed by Luigi Liu (2025)*