initial commit
- README.md +86 -1
- app.py +143 -0
- packages.txt +1 -0
- requirements.txt +8 -0

README.md CHANGED

@@ -11,4 +11,89 @@ license: mit
short_description: ' SmolVLM2 on transformers'
---

# 🎥 Real‑Time Webcam Captioning with SmolVLM2

This project provides a Gradio‑based web app that captures your webcam feed and generates natural‑language captions in real time using the SmolVLM2 image‑to‑text model family. It supports CPU, CUDA GPUs, and Intel XPU accelerators, with configurable capture intervals and custom prompts.

## 🚀 Features

* **Real‑time captions** from your webcam stream
* Support for **CPU**, **CUDA** (NVIDIA GPU), and **XPU** (Intel accelerator)
* Adjustable **capture interval** (default: 3000 ms)
* Custom **system** and **user** prompts for fine‑tuning the caption style
* **Debug logging** of preprocessing, tokenization, inference, and decoding times
* Model caching to avoid repeated reloads on parameter changes (see the sketch below)
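
A condensed sketch of the caching logic in `update_model()` from `app.py` (dtype handling omitted): the cache is keyed on the (model ID, device) pair, so changing only prompts or the interval reuses the already-loaded model.

```python
# Condensed from update_model() in app.py: reload only when the model ID
# or the target device changes; other parameter changes reuse the cache.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_cache = {'model_id': None, 'processor': None, 'model': None, 'device': None}

def update_model(model_id, device):
    if model_cache['model_id'] != model_id or model_cache['device'] != device:
        processor = AutoProcessor.from_pretrained(model_id)
        model = AutoModelForImageTextToText.from_pretrained(model_id).to(device)
        model.eval()
        model_cache.update({'model_id': model_id, 'processor': processor,
                            'model': model, 'device': device})
```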

## 🛠️ Prerequisites

* **Python** 3.9+ (tested on 3.12)
* **Git** (to clone this repo)
* **Intel oneAPI** and a **PyTorch XPU** build (if you plan to use XPU)

### Python packages

```bash
pip install torch transformers gradio opencv-python pillow
```

*(If you have CUDA‑enabled hardware, install a CUDA‑compatible torch build; if you have XPU hardware, install an Intel XPU‑enabled torch build. See pytorch.org/get-started for the command matching your platform.)*
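
To confirm which accelerators your torch build can actually see (the same checks `app.py` uses to populate the device dropdown), you can run:

```python
# Mirrors the device detection in app.py's main()
import torch

print('cuda available:', torch.cuda.is_available())
print('xpu available:', hasattr(torch, 'xpu') and torch.xpu.is_available())
```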

## 📦 Installation

1. **Clone** this repository:

   ```bash
   git clone https://github.com/<your-org>/smolvlm2-webcam-caption.git
   cd smolvlm2-webcam-caption
   ```

2. **Create** and **activate** a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate
   ```

3. **Install** dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## ▶️ Usage

1. **Run** the app:

   ```bash
   python app.py
   ```

2. **Open** the URL printed by Gradio in your browser (usually `http://127.0.0.1:7860`).
3. **Select**:

   * **Model ID**: choose between `SmolVLM2-256M`, `SmolVLM2-500M`, or `SmolVLM2-2.2B`.
   * **Device**: `cpu`, `cuda`, or `xpu` (if available).

4. **Adjust** the **Interval (ms)** slider to set how often frames are captioned.
5. **Edit** the **System Prompt** and **User Prompt** to control the style and content of captions.
6. **Allow** webcam access and watch live captions appear along with debug logs.

## ⚙️ Configuration Options

| Parameter     | Description                                     | Default                            |
| ------------- | ----------------------------------------------- | ---------------------------------- |
| Model ID      | Hugging Face repo for the SmolVLM2 model        | `SmolVLM2-256M-Video-Instruct`     |
| Device        | Compute device: `cpu`, `cuda`, or `xpu`         | `cuda` if available                |
| Interval (ms) | Delay between frame captures (milliseconds)     | `3000`                             |
| System Prompt | Instruction guiding the caption style           | `Describe the key action`          |
| User Prompt   | Question shown to the model alongside the image | `What is happening in this image?` |
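
The two prompt fields feed directly into the chat template that `app.py` builds for each captured frame; with the defaults, the message structure looks like this:

```python
# Message structure assembled per frame in caption_frame() (app.py)
sys_prompt = 'Describe the key action'
usr_prompt = 'What is happening in this image?'
messages = [
    {'role': 'system', 'content': [{'type': 'text', 'text': sys_prompt}]},
    {'role': 'user', 'content': [
        {'type': 'image', 'url': 'frame.jpg'},  # JPEG written from the webcam frame
        {'type': 'text', 'text': usr_prompt},
    ]},
]
```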

## 🐞 Troubleshooting

* **KeyError** in Gradio event handling: ensure `demo.queue()` is called before `demo.launch()`.
* **XPU BFloat16 errors**: XPU currently handles float32 more reliably, so the code defaults to float32 on XPU (see the sketch below).
* **Model reload slowness**: models are cached; changing only prompts or intervals won't trigger a reload.
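
For context, this is the per-device precision policy `update_model()` applies, repackaged here in a hypothetical `load_for()` helper for illustration:

```python
# Excerpt of the per-device precision policy in update_model() (app.py),
# wrapped in a standalone helper for illustration.
import torch
from transformers import AutoModelForImageTextToText

model_id = 'HuggingFaceTB/SmolVLM2-256M-Video-Instruct'

def load_for(device):
    if device == 'cuda':
        # bfloat16 + FlashAttention 2 for throughput on NVIDIA GPUs
        return AutoModelForImageTextToText.from_pretrained(
            model_id, torch_dtype=torch.bfloat16,
            _attn_implementation='flash_attention_2').to('cuda')
    if device == 'xpu':
        # float32 sidesteps bfloat16 layernorm issues on current XPU stacks
        return AutoModelForImageTextToText.from_pretrained(
            model_id, torch_dtype=torch.float32).to('xpu')
    # CPU default: float32
    return AutoModelForImageTextToText.from_pretrained(model_id).to('cpu')
```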

## 📜 License

This project is licensed under the **MIT License**. See [LICENSE](LICENSE) for details.

---

*Developed by Luigi Liu (2025)*

app.py ADDED

@@ -0,0 +1,143 @@

```python
import time
import logging
import gradio as gr
import cv2
import os
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from PIL import Image

# Cache for loaded model and processor
default_cache = {'model_id': None, 'processor': None, 'model': None, 'device': None}
model_cache = default_cache.copy()

# Check for XPU availability
has_xpu = hasattr(torch, 'xpu') and torch.xpu.is_available()


def update_model(model_id, device):
    if model_cache['model_id'] != model_id or model_cache['device'] != device:
        logging.info(f'Loading model {model_id} on {device}')
        processor = AutoProcessor.from_pretrained(model_id)
        # Load model with appropriate precision for each device
        if device == 'cuda':
            # Use bfloat16 for CUDA for performance
            model = AutoModelForImageTextToText.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                _attn_implementation='flash_attention_2'
            ).to('cuda')
        elif device == 'xpu' and has_xpu:
            # Use float32 on XPU to avoid bfloat16 layernorm issues
            model = AutoModelForImageTextToText.from_pretrained(
                model_id,
                torch_dtype=torch.float32
            ).to('xpu')
        else:
            # Default to float32 on CPU
            model = AutoModelForImageTextToText.from_pretrained(model_id).to('cpu')
        model.eval()
        model_cache.update({'model_id': model_id, 'processor': processor, 'model': model, 'device': device})


def caption_frame(frame, model_id, interval_ms, sys_prompt, usr_prompt, device):
    debug_msgs = []
    update_model(model_id, device)
    processor = model_cache['processor']
    model = model_cache['model']

    # Control capture interval
    time.sleep(interval_ms / 1000)

    # Preprocess frame: BGR (OpenCV) -> RGB -> JPEG on disk
    t0 = time.time()
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    pil_img = Image.fromarray(rgb)
    temp_path = 'frame.jpg'
    pil_img.save(temp_path, format='JPEG', quality=50)
    debug_msgs.append(f'Preprocess: {int((time.time() - t0) * 1000)} ms')

    # Prepare multimodal chat messages
    messages = [
        {'role': 'system', 'content': [{'type': 'text', 'text': sys_prompt}]},
        {'role': 'user', 'content': [
            {'type': 'image', 'url': temp_path},
            {'type': 'text', 'text': usr_prompt}
        ]}
    ]

    # Tokenize and encode
    t1 = time.time()
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors='pt'
    ).to(model.device)
    debug_msgs.append(f'Tokenize: {int((time.time() - t1) * 1000)} ms')

    # Inference
    t2 = time.time()
    outputs = model.generate(**inputs, do_sample=False, max_new_tokens=128)
    debug_msgs.append(f'Inference: {int((time.time() - t2) * 1000)} ms')

    # Decode and strip the chat history from the output
    t3 = time.time()
    raw = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    debug_msgs.append(f'Decode: {int((time.time() - t3) * 1000)} ms')
    if 'Assistant:' in raw:
        caption = raw.split('Assistant:')[-1].strip()
    else:
        lines = raw.splitlines()
        caption = lines[-1].strip() if len(lines) > 1 else raw.strip()

    return caption, '\n'.join(debug_msgs)


def main():
    logging.basicConfig(level=logging.INFO)
    model_choices = [
        'HuggingFaceTB/SmolVLM2-256M-Video-Instruct',
        'HuggingFaceTB/SmolVLM2-500M-Video-Instruct',
        'HuggingFaceTB/SmolVLM2-2.2B-Instruct'
    ]
    # Determine available devices
    device_options = ['cpu']
    if torch.cuda.is_available():
        device_options.append('cuda')
    if has_xpu:
        device_options.append('xpu')

    default_device = 'cuda' if torch.cuda.is_available() else ('xpu' if has_xpu else 'cpu')

    with gr.Blocks() as demo:
        gr.Markdown('## 🎥 Real-Time Webcam Captioning with SmolVLM2 (Transformers)')

        with gr.Row():
            model_dd = gr.Dropdown(model_choices, value=model_choices[0], label='Model ID')
            device_dd = gr.Dropdown(device_options, value=default_device, label='Device')

        interval = gr.Slider(100, 20000, step=100, value=3000, label='Interval (ms)')
        sys_p = gr.Textbox(lines=2, value='Describe the key action', label='System Prompt')
        usr_p = gr.Textbox(lines=1, value='What is happening in this image?', label='User Prompt')

        cam = gr.Image(sources=['webcam'], streaming=True, label='Webcam Feed')
        caption_tb = gr.Textbox(interactive=False, label='Caption')
        log_tb = gr.Textbox(lines=4, interactive=False, label='Debug Log')

        cam.stream(
            fn=caption_frame,
            inputs=[cam, model_dd, interval, sys_p, usr_p, device_dd],
            outputs=[caption_tb, log_tb],
            time_limit=600
        )

    # Enable Gradio's async event queue to register callback IDs and prevent KeyErrors
    demo.queue()

    # Launch the app
    demo.launch()


if __name__ == '__main__':
    main()
```
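
For a quick sanity check without the UI, the stream handler can be called directly. A hypothetical smoke test, assuming a local `sample.jpg` (the filename is illustrative, not part of the repo):

```python
# Hypothetical smoke test: caption one image without launching Gradio.
import cv2
from app import caption_frame  # importing app.py does not launch the UI

frame = cv2.imread('sample.jpg')  # OpenCV yields BGR, as caption_frame expects
caption, log = caption_frame(
    frame,
    'HuggingFaceTB/SmolVLM2-256M-Video-Instruct',
    0,                                   # interval_ms: skip the streaming delay
    'Describe the key action',           # system prompt
    'What is happening in this image?',  # user prompt
    'cpu',
)
print(caption)
print(log)
```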

packages.txt ADDED

@@ -0,0 +1 @@

```text
python3-opencv
```

requirements.txt ADDED

@@ -0,0 +1,8 @@

```text
gradio>=5.0
opencv-python
pillow
huggingface-hub
termcolor
transformers
torch
num2words
```