Luigi committed
Commit 5dfbac5 · 1 Parent(s): f3fcd77

initial commit

Files changed (4)
  1. README.md +120 -1
  2. app.py +143 -0
  3. packages.txt +1 -0
  4. requirements.txt +8 -0
README.md CHANGED
@@ -11,4 +11,123 @@ license: mit
  short_description: ' SmolVLM2 on transformers'
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🎥 Real‑Time Webcam Captioning with SmolVLM2
+
+ This project provides a Gradio‑based web app that captures your webcam feed and generates natural‑language captions in real time using the SmolVLM2 image‑to‑text model family. It supports CPU, CUDA GPUs, and Intel XPU accelerators, with configurable capture intervals and custom prompts.
+
+ ## 🚀 Features
+
+ * **Real‑time captions** from your webcam stream
+ * Support for **CPU**, **CUDA** (NVIDIA GPU), and **XPU** (Intel accelerator)
+ * Adjustable **capture interval** (default: 3000 ms)
+ * Custom **system** and **user** prompts for fine‑tuning the caption style
+ * **Debug logging** of preprocessing, tokenization, inference, and decoding times
+ * **Model caching** to avoid repeated reloads on parameter changes (see the sketch below)
+
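+ A minimal sketch of the caching idea (mirroring `update_model` in `app.py`): the model is reloaded only when the model ID or device changes.
+
+ ```python
+ # Cache keyed on model ID and device; reload only when either changes
+ model_cache = {'model_id': None, 'processor': None, 'model': None, 'device': None}
+
+ def update_model(model_id, device):
+     if model_cache['model_id'] != model_id or model_cache['device'] != device:
+         ...  # load the processor and model here, then cache them:
+         model_cache.update({'model_id': model_id, 'device': device})
+ ```
+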
+ ## 🛠️ Prerequisites
+
+ * **Python** 3.9+ (tested on 3.12)
+ * **Git** (to clone this repo)
+ * **Intel oneAPI** and an XPU‑enabled **PyTorch** build (if you plan to use XPU)
+
+ ### Python packages
+
+ ```bash
+ pip install torch transformers gradio opencv-python pillow
+ ```
+
+ *(If you have CUDA‑enabled hardware, install a CUDA‑compatible torch build; if you have XPU hardware, ensure you install an Intel XPU‑enabled torch build. See the examples below.)*
+
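+ For example (the wheel index URLs below are assumptions; check pytorch.org for the current ones for your platform):
+
+ ```bash
+ # CUDA build (example for CUDA 12.1)
+ pip install torch --index-url https://download.pytorch.org/whl/cu121
+ # Intel XPU build
+ pip install torch --index-url https://download.pytorch.org/whl/xpu
+ ```
+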
+ ## 📦 Installation
+
+ 1. **Clone** this repository:
+
+ ```bash
+ git clone https://github.com/<your-org>/smolvlm2-webcam-caption.git
+ cd smolvlm2-webcam-caption
+ ```
+ 2. **Create** and **activate** a virtual environment:
+
+ ```bash
+ python -m venv .venv
+ source .venv/bin/activate
+ ```
+ 3. **Install** dependencies:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## ▶️ Usage
+
+ 1. **Run** the app:
+
+ ```bash
+ python app.py
+ ```
+ 2. **Open** the URL printed by Gradio in your browser (usually `http://127.0.0.1:7860`; see the port note below the list).
+ 3. **Select**:
+
+ * **Model ID**: choose `SmolVLM2-256M-Video-Instruct`, `SmolVLM2-500M-Video-Instruct`, or `SmolVLM2-2.2B-Instruct` (all under the `HuggingFaceTB` organization).
+ * **Device**: `cpu`, `cuda`, or `xpu` (if available).
+ 4. **Adjust** the **Interval (ms)** slider to set how often frames are captioned.
+ 5. **Edit** the **System Prompt** and **User Prompt** to control the style and content of captions.
+ 6. **Allow** webcam access and watch live captions appear along with debug logs.
+
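+ If the default port is busy, Gradio's standard `server_port` argument can be set in `demo.launch()` (a suggested one‑line edit to `app.py`, not part of the committed code):
+
+ ```python
+ demo.launch(server_port=7861)  # any free port works
+ ```
+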
+ ## ⚙️ Configuration Options
+
+ | Parameter | Description | Default |
+ | ------------- | ----------------------------------------------- | -------------------------------------------- |
+ | Model ID | Hugging Face repo for the SmolVLM2 model | `HuggingFaceTB/SmolVLM2-256M-Video-Instruct` |
+ | Device | Compute device: `cpu`, `cuda`, or `xpu` | `cuda` if available |
+ | Interval (ms) | Delay between frame captures (milliseconds) | `3000` |
+ | System Prompt | Instruction guiding the caption style | `Describe the key action` |
+ | User Prompt | Question posed to the model alongside the image | `What is happening in this image?` |
+
+ ## 🐞 Troubleshooting
+
+ * **KeyError** in Gradio event handling: ensure `demo.queue()` is called before `demo.launch()` (see the snippet below).
+ * **XPU BFloat16 errors**: XPU currently handles float32 more reliably; the code defaults to float32 on XPU.
+ * **Model reload slowness**: models are cached; changing only prompts or intervals won't trigger a reload.
+
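+ The ordering that avoids the KeyError, exactly as done in `app.py`:
+
+ ```python
+ demo.queue()   # register the event queue first
+ demo.launch()  # then start the server
+ ```
+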
+ ## 📜 License
+
+ This project is licensed under the **MIT License**. See [LICENSE](LICENSE) for details.
+
+ ---
+
+ *Developed by Luigi Liu (2025)*
app.py ADDED
@@ -0,0 +1,143 @@
+ import time
+ import logging
+ import gradio as gr
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+ import torch
+ from PIL import Image
+
+ # Cache for the loaded model and processor, keyed on model ID and device
+ default_cache = {'model_id': None, 'processor': None, 'model': None, 'device': None}
+ model_cache = default_cache.copy()
+
+ # Check for XPU availability
+ has_xpu = hasattr(torch, 'xpu') and torch.xpu.is_available()
+
+
+ def update_model(model_id, device):
+     # Reload only when the requested model or device differs from the cached one
+     if model_cache['model_id'] != model_id or model_cache['device'] != device:
+         logging.info(f'Loading model {model_id} on {device}')
+         processor = AutoProcessor.from_pretrained(model_id)
+         # Load the model with an appropriate precision for each device
+         if device == 'cuda':
+             # Use bfloat16 on CUDA for performance
+             model = AutoModelForImageTextToText.from_pretrained(
+                 model_id,
+                 torch_dtype=torch.bfloat16,
+                 _attn_implementation='flash_attention_2'
+             ).to('cuda')
+         elif device == 'xpu' and has_xpu:
+             # Use float32 on XPU to avoid bfloat16 layernorm issues
+             model = AutoModelForImageTextToText.from_pretrained(
+                 model_id,
+                 torch_dtype=torch.float32
+             ).to('xpu')
+         else:
+             # Default to float32 on CPU
+             model = AutoModelForImageTextToText.from_pretrained(model_id).to('cpu')
+         model.eval()
+         model_cache.update({'model_id': model_id, 'processor': processor, 'model': model, 'device': device})
+
+
+ def caption_frame(frame, model_id, interval_ms, sys_prompt, usr_prompt, device):
+     debug_msgs = []
+     update_model(model_id, device)
+     processor = model_cache['processor']
+     model = model_cache['model']
+
+     # Control the capture interval
+     time.sleep(interval_ms / 1000)
+
+     # Preprocess the frame
+     t0 = time.time()
+     # Gradio streams webcam frames as RGB numpy arrays, so no color conversion is needed
+     pil_img = Image.fromarray(frame)
+     temp_path = 'frame.jpg'
+     pil_img.save(temp_path, format='JPEG', quality=50)
+     debug_msgs.append(f'Preprocess: {int((time.time()-t0)*1000)} ms')
+
+     # Prepare multimodal chat messages
+     messages = [
+         {'role': 'system', 'content': [{'type': 'text', 'text': sys_prompt}]},
+         {'role': 'user', 'content': [
+             {'type': 'image', 'url': temp_path},
+             {'type': 'text', 'text': usr_prompt}
+         ]}
+     ]
+
+     # Tokenize and encode
+     t1 = time.time()
+     inputs = processor.apply_chat_template(
+         messages,
+         add_generation_prompt=True,
+         tokenize=True,
+         return_dict=True,
+         return_tensors='pt'
+     ).to(model.device)
+     debug_msgs.append(f'Tokenize: {int((time.time()-t1)*1000)} ms')
+
+     # Inference
+     t2 = time.time()
+     outputs = model.generate(**inputs, do_sample=False, max_new_tokens=128)
+     debug_msgs.append(f'Inference: {int((time.time()-t2)*1000)} ms')
+
+     # Decode and strip the chat history from the output
+     t3 = time.time()
+     raw = processor.batch_decode(outputs, skip_special_tokens=True)[0]
+     debug_msgs.append(f'Decode: {int((time.time()-t3)*1000)} ms')
+     if 'Assistant:' in raw:
+         caption = raw.split('Assistant:')[-1].strip()
+     else:
+         lines = raw.splitlines()
+         caption = lines[-1].strip() if len(lines) > 1 else raw.strip()
+
+     return caption, '\n'.join(debug_msgs)
+
+
+ def main():
+     logging.basicConfig(level=logging.INFO)
+     model_choices = [
+         'HuggingFaceTB/SmolVLM2-256M-Video-Instruct',
+         'HuggingFaceTB/SmolVLM2-500M-Video-Instruct',
+         'HuggingFaceTB/SmolVLM2-2.2B-Instruct'
+     ]
+     # Determine available devices
+     device_options = ['cpu']
+     if torch.cuda.is_available():
+         device_options.append('cuda')
+     if has_xpu:
+         device_options.append('xpu')
+
+     default_device = 'cuda' if torch.cuda.is_available() else ('xpu' if has_xpu else 'cpu')
+
+     with gr.Blocks() as demo:
+         gr.Markdown('## 🎥 Real-Time Webcam Captioning with SmolVLM2 (Transformers)')
+
+         with gr.Row():
+             model_dd = gr.Dropdown(model_choices, value=model_choices[0], label='Model ID')
+             device_dd = gr.Dropdown(device_options, value=default_device, label='Device')
+
+         interval = gr.Slider(100, 20000, step=100, value=3000, label='Interval (ms)')
+         sys_p = gr.Textbox(lines=2, value='Describe the key action', label='System Prompt')
+         usr_p = gr.Textbox(lines=1, value='What is happening in this image?', label='User Prompt')
+
+         cam = gr.Image(sources=['webcam'], streaming=True, label='Webcam Feed')
+         caption_tb = gr.Textbox(interactive=False, label='Caption')
+         log_tb = gr.Textbox(lines=4, interactive=False, label='Debug Log')
+
+         cam.stream(
+             fn=caption_frame,
+             inputs=[cam, model_dd, interval, sys_p, usr_p, device_dd],
+             outputs=[caption_tb, log_tb],
+             time_limit=600
+         )
+
+     # Enable Gradio's async event queue to register callback IDs and prevent KeyErrors
+     demo.queue()
+
+     # Launch the app
+     demo.launch()
+
+
+ if __name__ == '__main__':
+     main()
packages.txt ADDED
@@ -0,0 +1 @@
+ python3-opencv
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ gradio>=5.0
+ opencv-python
+ pillow
+ huggingface-hub
+ termcolor
+ transformers
+ torch
+ num2words