Sam Heutmaker committed · Commit 8b066f8 · Parent(s): b413f49
update readme

README.md (CHANGED)
---
language:
- en
license: apache-2.0
tags:
- VLM
- video-understanding
- image-captioning
- gemma
- json-mode
- structured-output
- video-analysis
base_model: google/gemma-12b
pipeline_tag: image-text-to-text
model-index:
- name: ClipTagger-12b
  results:
  - task:
      type: image-to-text
      name: Video Frame Captioning
    metrics:
    - name: Average Judge Score
      type: quality
      value: 3.53
    - name: ROUGE-1
      type: rouge-1
      value: 0.674
    - name: ROUGE-L
      type: rouge-l
      value: 0.520
    - name: BLEU
      type: bleu
      value: 0.267
---

# ClipTagger-12b
**ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.

**ClipTagger outperforms GPT-4.1 and Claude 4 Sonnet, while costing 15x less per generation.**

The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.

### Key Features
ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.

### Technical Specifications

- **Parameters**: 12 billion
- **Base Architecture**: Gemma-12B
- **Quantization**: FP8 (no quality loss vs bf16)

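The FP8 checkpoint needs a GPU with native FP8 tensor cores, which is why RTX 40-series (Ada Lovelace) and H100 (Hopper) cards are recommended and the A100 is listed as unsupported further down. A small pre-flight check along these lines can catch that before loading the model; the compute-capability threshold below reflects NVIDIA's published specs (Ada is 8.9, Hopper is 9.0) and is not something shipped with ClipTagger-12b.

```python
import torch

# Native FP8 tensor cores arrive with Ada Lovelace (sm_89, e.g. RTX 40-series)
# and Hopper (sm_90, e.g. H100). Ampere GPUs such as the A100 (sm_80) lack them.
MIN_FP8_CAPABILITY = (8, 9)

def has_native_fp8(device_index: int = 0) -> bool:
    """Return True if the given CUDA device can run the FP8 checkpoint natively."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability(device_index) >= MIN_FP8_CAPABILITY

if __name__ == "__main__":
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        print(f"{name}: native FP8 support = {has_native_fp8(0)}")
    else:
        print("No CUDA device visible.")
```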
The model was trained on 1 million carefully curated single-frame samples from publicly available video data. Training employed knowledge distillation from a high-quality teacher model to ensure consistent, accurate outputs while maintaining the ability to generalize across diverse video content types.

### Training Process

- **Dataset Size**: 1M video frames
- **Training Method**: Teacher-student distillation
- **Data Source**: Publicly available video content

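The card does not publish the distillation recipe, so the snippet below is only a generic illustration of teacher-student distillation in PyTorch: the student is pulled toward a temperature-softened copy of the teacher's token distribution while still fitting the reference captions. The temperature, loss weighting, and tensor shapes are assumptions, not ClipTagger's actual training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Generic teacher-student distillation loss (illustrative only).

    Combines a KL term that pulls the student toward the teacher's softened
    token distribution with the usual cross-entropy on the reference tokens.
    Logits are [batch, seq_len, vocab]; labels are [batch, seq_len].
    """
    # Soft targets from the frozen teacher, softened by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Standard next-token cross-entropy against the ground-truth captions.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    return alpha * kd + (1.0 - alpha) * ce
```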
Performance metrics on our internal evaluation set:

| Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
|-------|-----------------|---------|---------|---------|------|
| ClipTagger-12b | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
| Claude 4 Sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
| GPT-4.1 | 3.64 | 0.581 | 0.260 | 0.376 | 0.119 |

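The exact evaluation harness is not described in this card, but reference-based scores like ROUGE and BLEU are typically computed with standard libraries. A minimal sketch, assuming the `rouge-score` and `sacrebleu` packages and one reference caption per frame (the Avg Judge Score column comes from a separate judging step not shown here):

```python
from rouge_score import rouge_scorer
import sacrebleu

def caption_metrics(prediction: str, reference: str) -> dict:
    """Compute ROUGE-1/2/L and BLEU for one predicted caption against one reference."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, prediction)
    bleu = sacrebleu.sentence_bleu(prediction, [reference])
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bleu": bleu.score / 100.0,  # sacrebleu reports 0-100; the table appears to use 0-1
    }

print(caption_metrics("a wooden boardwalk crosses tall green grass",
                      "a wooden boardwalk winding through tall grass"))
```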
ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost savings** compared to Claude 4 Sonnet:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cost per 1M Generations | Cost per Generation |
|-------|-----------------------|------------------------|-------------------------|---------------------|
| GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
| Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |

*Cost calculations based on an average of 700 input tokens and 250 output tokens per generation.*

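These per-generation figures follow directly from the per-million-token prices and the 700-input / 250-output token averages in the footnote, as the quick check below shows. ClipTagger-12b's own per-token pricing is not listed in this excerpt, so only the two baselines are reproduced.

```python
# Reproduce the per-generation costs from per-million-token prices and the
# 700 input / 250 output token averages given in the footnote above.
AVG_INPUT_TOKENS = 700
AVG_OUTPUT_TOKENS = 250

def cost_per_generation(input_per_million: float, output_per_million: float) -> float:
    return (AVG_INPUT_TOKENS * input_per_million
            + AVG_OUTPUT_TOKENS * output_per_million) / 1_000_000

for name, inp, out in [("GPT-4.1", 3.00, 12.00), ("Claude 4 Sonnet", 3.00, 15.00)]:
    per_gen = cost_per_generation(inp, out)
    print(f"{name}: ${per_gen:.5f} per generation, "
          f"${per_gen * 1_000_000:,.0f} per 1M generations")
# GPT-4.1 -> $0.00510 and $5,100; Claude 4 Sonnet -> $0.00585 and $5,850
```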
## Usage

### API Access

For production deployments, we recommend using our managed API service, which includes advanced features like batch processing, webhooks, and automatic scaling:

**[Run ClipTagger-12b via Inference.net API →](https://inference.net/use-cases/video-understanding)**

### Required Prompts

The model requires specific system and user prompts for optimal performance. Use these prompts exactly as shown:

#### System Prompt

```
You are an image annotation API trained to analyze YouTube video keyframes. You will be given instructions on the output format, what to caption, and how to perform your job. Follow those instructions. For descriptions and summaries, provide them directly and do not lead them with 'This image shows' or 'This keyframe displays...', just get right into the details.
```

#### User Prompt

```
You are an image annotation API trained to analyze YouTube video keyframes. You must respond with a valid JSON object matching the exact structure below.
```
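As a concrete starting point, here is a minimal request sketch assuming an OpenAI-compatible chat-completions endpoint; the base URL and model id are placeholders rather than published values, so substitute the ones from your Inference.net account. It wires in the exact system and user prompts above, the low temperature, and the JSON mode recommended in the best practices further down.

```python
# A minimal sketch, assuming an OpenAI-compatible chat-completions endpoint.
import base64
import json
from openai import OpenAI

SYSTEM_PROMPT = "..."  # the exact System Prompt shown above
USER_PROMPT = "..."    # the exact User Prompt shown above

client = OpenAI(
    base_url="https://api.inference.net/v1",  # assumed endpoint; check your account docs
    api_key="YOUR_API_KEY",
)

def annotate_frame(image_path: str) -> dict:
    """Send one keyframe and return the parsed JSON annotation."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="inference-net/cliptagger-12b",      # placeholder model id
        temperature=0.1,                           # low temperature for consistent outputs
        response_format={"type": "json_object"},   # JSON mode so the reply is valid JSON
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": USER_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ]},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(annotate_frame("keyframe_0001.jpg"))
```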
## Use Cases

- **Video Search & Discovery** - Build searchable databases with structured metadata
- **Content Moderation** - Automated content analysis and categorization
- **Accessibility** - Generate consistent alt-text and scene descriptions
- **Ad Verification** - Track product visibility and brand appearances
- **Video Analytics** - Extract insights from large video collections
- **Content Management** - Automatic tagging and organization of video libraries
## Limitations

- Processes one video frame per request
- English-only descriptions (can identify text in other languages)
- Maximum image size: 1MB
- Requires specific prompts for optimal performance
- Not supported on A100 GPUs (no native FP8)

## Best Practices

1. **Use exact prompts** - The provided system and user prompts are optimized for best results
2. **Set low temperature** - Use temperature=0.1 for consistent outputs
3. **Enable JSON mode** - Always set response_format to ensure valid JSON
4. **Process systematically** - Maintain temporal order when processing video sequences
5. **Batch similar content** - Group frames from the same video for efficiency

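Put together, these practices amount to a small driver loop. The sketch below assumes keyframes have already been extracted to zero-padded image files and reuses the hypothetical `annotate_frame` helper from the API sketch earlier, adding the 1MB size check from the limitations list.

```python
# Illustrative driver loop: annotate one video's keyframes in temporal order,
# one frame per request, skipping frames over the 1MB limit noted above.
from pathlib import Path
from typing import Callable

MAX_IMAGE_BYTES = 1_000_000  # 1MB per-image limit

def annotate_video(frames_dir: str, annotate: Callable[[str], dict]) -> list[dict]:
    annotations = []
    # Sorting by filename preserves temporal order for zero-padded names
    # such as keyframe_0001.jpg, keyframe_0002.jpg, ...
    for frame in sorted(Path(frames_dir).glob("*.jpg")):
        if frame.stat().st_size > MAX_IMAGE_BYTES:
            print(f"Skipping {frame.name}: larger than 1MB")
            continue
        annotations.append(annotate(str(frame)))
    return annotations

# Usage with the hypothetical annotate_frame helper from the API sketch above:
# results = annotate_video("frames/my_video", annotate_frame)
```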
## Support

- **Documentation**: [docs.inference.net](https://inference.net/use-cases/video-understanding)
- **API Access**: [inference.net/use-cases/video-understanding](https://inference.net/use-cases/video-understanding)
- **Email**: [email protected]
- **Enterprise**: [Schedule a consultation](https://inference.net/sales)

## Interested in training your own model?

Contact us at [[email protected]](mailto:[email protected]) for a free consultation with our research team.

## License

This model is released under the Apache-2.0 license, allowing for commercial use and modification with proper attribution.