Sam Heutmaker committed · Commit 43a643a · Parent(s): 81d530d

update readme

README.md (CHANGED)

---
language:
- en
license: apache-2.0
tags:
- VLM
- video-understanding
- image-captioning
- gemma
- json-mode
- structured-output
- video-analysis
base_model: google/gemma-12b
pipeline_tag: image-text-to-text
model-index:
- name: ClipTagger-12b
  results:
  - task:
      type: image-to-text
      name: Video Frame Captioning
    metrics:
    - name: Average Judge Score
      type: quality
      value: 3.53
    - name: ROUGE-1
      type: rouge-1
      value: 0.674
    - name: ROUGE-L
      type: rouge-l
      value: 0.520
    - name: BLEU
      type: bleu
      value: 0.267
---

# ClipTagger-12b

## Model Description

**ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.

The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.

ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.

### Technical Specifications

- **Parameters**: 12 billion
- **Base Architecture**: Gemma-12B
- **Quantization**: FP8 (no quality loss vs bf16)

The model was trained on 1 million carefully curated single-frame samples from publicly available video data. Training employed knowledge distillation from a high-quality teacher model to ensure consistent, accurate outputs while maintaining the ability to generalize across diverse video content types.

### Training Process

- **Dataset Size**: 1M video frames
- **Training Method**: Teacher-student distillation
- **Data Source**: Publicly available video content

## Benchmarks

ClipTagger-12b matches or exceeds the leading closed-source models on most evaluation metrics. Despite being open-source and significantly more cost-effective, it **outperforms Claude 4 Sonnet on every metric** and delivers **comparable judge-rated quality to GPT-4.1** while leading it on ROUGE and BLEU.

Performance metrics on our internal evaluation set:

| Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
|-------|-----------------|---------|---------|---------|------|
| ClipTagger-12b | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
| Claude 4 Sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
| GPT-4.1 | 3.64 | 0.581 | 0.260 | 0.376 | 0.119 |
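
For context on the table, here is a minimal sketch of how ROUGE and BLEU figures of this kind can be computed with the rouge-score and sacrebleu packages; the caption pair is hypothetical, and the judge score comes from a separate LLM-judge rubric that this snippet does not cover.

```python
# pip install rouge-score sacrebleu
from rouge_score import rouge_scorer
import sacrebleu

# Hypothetical example pair: a reference caption and a model-generated caption.
reference = "A wooden boardwalk crosses tall green grass under a partly cloudy sky."
prediction = "A wooden boardwalk path runs through a grassy field beneath a cloudy sky."

# ROUGE-1/2/L F-measures, as reported in the table above.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: {score.fmeasure:.3f}")

# sacrebleu reports BLEU on a 0-100 scale; the table above appears to use 0-1.
bleu = sacrebleu.corpus_bleu([prediction], [[reference]])
print(f"BLEU: {bleu.score / 100:.3f}")
```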

### Benchmark Visualizations

*(Benchmark comparison charts are embedded here in the rendered model card.)*

ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost savings** compared to Claude 4 Sonnet:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cost per 1M Frames | Cost per Frame |
|-------|-----------------------|------------------------|--------------------|----------------|
| GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
| Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |

*Cost calculations based on an average of 700 input tokens and 250 output tokens per generation.*
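
As a quick sanity check on the footnote above, the per-frame figures follow directly from the per-million-token prices and the assumed 700 input / 250 output tokens per generation:

```python
# Reproduce the per-frame cost figures from per-million-token prices,
# assuming 700 input tokens and 250 output tokens per generation (per the footnote).
def cost_per_frame(input_price_per_m: float, output_price_per_m: float,
                   input_tokens: int = 700, output_tokens: int = 250) -> float:
    return input_tokens * input_price_per_m / 1e6 + output_tokens * output_price_per_m / 1e6

print(f"{cost_per_frame(3.00, 12.00):.5f}")  # GPT-4.1: 0.00510 per frame -> $5,100 per 1M frames
print(f"{cost_per_frame(3.00, 15.00):.5f}")  # Claude 4 Sonnet: 0.00585 per frame -> $5,850 per 1M frames
```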

## Usage

### API Access

For production deployments, we recommend using our managed API service, which includes advanced features like batch processing, webhooks, and automatic scaling:

**[Run ClipTagger-12b via Inference.net API →](https://inference.net/use-cases/video-understanding)**
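
A minimal setup sketch, assuming the managed service exposes an OpenAI-compatible chat-completions endpoint; the base URL, environment variable name, and model identifier below are placeholders rather than documented values, so take the real ones from your Inference.net dashboard and the docs linked above.

```python
# pip install openai
import os
from openai import OpenAI

# Placeholder values: confirm the real base URL and model slug in the Inference.net docs.
client = OpenAI(
    base_url="https://api.inference.net/v1",   # assumed OpenAI-compatible endpoint
    api_key=os.environ["INFERENCE_API_KEY"],   # assumed environment variable holding your key
)
MODEL_ID = "inference-net/cliptagger-12b"      # placeholder model identifier
```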

### Required Prompts

The model requires specific system and user prompts for optimal performance. Use these prompts exactly as shown:

#### System Prompt

```
You are an image annotation API trained to analyze YouTube video keyframes. You will be given instructions on the output format, what to caption, and how to perform your job. Follow those instructions. For descriptions and summaries, provide them directly and do not lead them with 'This image shows' or 'This keyframe displays...', just get right into the details.
```

#### User Prompt

```
You are an image annotation API trained to analyze YouTube video keyframes. You must respond with a valid JSON object matching the exact structure below.

[... full JSON schema and field instructions omitted in this diff ...]
```
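
Below is a sketch of a single-frame request that wires the two prompts together, assuming the OpenAI-compatible setup above; the base URL, model slug, and OpenAI-style vision payload are assumptions, and the temperature and JSON-mode settings follow the Best Practices section later in this card.

```python
# pip install openai
import base64
import json
import os
from openai import OpenAI

SYSTEM_PROMPT = "..."  # paste the System Prompt above verbatim
USER_PROMPT = "..."    # paste the full User Prompt (including the JSON schema) verbatim

# Assumed OpenAI-compatible endpoint and placeholder model slug; see your dashboard.
client = OpenAI(base_url="https://api.inference.net/v1",
                api_key=os.environ["INFERENCE_API_KEY"])

def caption_frame(jpeg_path: str) -> dict:
    """Send one keyframe (must stay under the 1MB limit) and return the parsed JSON annotation."""
    with open(jpeg_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="inference-net/cliptagger-12b",      # placeholder identifier
        temperature=0.1,                           # low temperature for consistent outputs
        response_format={"type": "json_object"},   # JSON mode, assuming the endpoint supports it
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": USER_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]},
        ],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    print(caption_frame("keyframe_0001.jpg"))
```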

## Use Cases

- **Video Search & Discovery** - Build searchable databases with structured metadata
- **Content Moderation** - Automated content analysis and categorization
- **Accessibility** - Generate consistent alt-text and scene descriptions
- **Ad Verification** - Track product visibility and brand appearances
- **Video Analytics** - Extract insights from large video collections
- **Content Management** - Automatic tagging and organization of video libraries

## Limitations

- Processes one video frame per request
- English-only descriptions (can identify text in other languages)
- Maximum image size: 1MB (one way to stay under this is sketched after this list)
- Requires specific prompts for optimal performance
- Not supported on A100 GPUs (no native FP8)
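
One way to respect the 1MB limit is to re-encode frames as JPEG before upload; a sketch using Pillow, where the dimension cap and quality ladder are illustrative choices rather than values taken from this card.

```python
# pip install pillow
import io
from PIL import Image

def shrink_to_limit(path: str, max_bytes: int = 1_000_000) -> bytes:
    """Re-encode a frame as JPEG, lowering quality until it fits under max_bytes."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((1280, 1280))  # illustrative cap on dimensions
    for quality in (90, 80, 70, 60, 50):
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        if buf.tell() <= max_bytes:
            return buf.getvalue()
    return buf.getvalue()  # smallest attempt if none fit the limit
```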

## Best Practices

1. **Use exact prompts** - The provided system and user prompts are optimized for best results
2. **Set low temperature** - Use temperature=0.1 for consistent outputs
3. **Enable JSON mode** - Always set response_format to ensure valid JSON
4. **Process systematically** - Maintain temporal order when processing video sequences
5. **Batch similar content** - Group frames from the same video for efficiency (a minimal ordering sketch follows this list)
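
The ordering and batching practices above can be as simple as grouping frames by source video and sorting by timestamp before submitting requests; a minimal sketch, with a stub in place of the real API call (see the earlier usage sketch) and hypothetical frame paths.

```python
from collections import defaultdict

def caption_frame(path: str) -> dict:
    """Stub for the API call sketched under Required Prompts; replace with the real request."""
    return {"frame": path}

# Hypothetical keyframe records: (video_id, timestamp in seconds, image path).
frames = [
    ("video_a", 12.0, "video_a/frame_012.jpg"),
    ("video_a", 4.0, "video_a/frame_004.jpg"),
    ("video_b", 1.5, "video_b/frame_001.jpg"),
]

# Group frames by source video, then caption each video's frames in temporal order.
by_video = defaultdict(list)
for video_id, timestamp, path in frames:
    by_video[video_id].append((timestamp, path))

for video_id, items in by_video.items():
    for timestamp, path in sorted(items):
        annotation = caption_frame(path)
        print(video_id, timestamp, annotation)
```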

## Support

- **Documentation**: [docs.inference.net](https://docs.inference.net)
- **API Access**: [inference.net/use-cases/video-understanding](https://inference.net/use-cases/video-understanding)
- **Email**: [email protected]
- **Enterprise**: [Schedule a consultation](https://inference.net/sales)

## License

This model is released under the Apache-2.0 license, allowing for commercial use and modification with proper attribution.