Sam Heutmaker commited on
Commit
8b066f8
·
1 Parent(s): b413f49

update readme

Browse files
Files changed (1) hide show
  1. README.md +43 -57
README.md CHANGED
@@ -1,36 +1,36 @@
1
  ---
2
  language:
3
- - en
4
  license: apache-2.0
5
  tags:
6
- - VLM
7
- - video-understanding
8
- - image-captioning
9
- - gemma
10
- - json-mode
11
- - structured-output
12
- - video-analysis
13
  base_model: google/gemma-12b
14
  pipeline_tag: image-text-to-text
15
  model-index:
16
- - name: ClipTagger-12b
17
- results:
18
- - task:
19
- type: image-to-text
20
- name: Video Frame Captioning
21
- metrics:
22
- - name: Average Judge Score
23
- type: quality
24
- value: 3.53
25
- - name: ROUGE-1
26
- type: rouge-1
27
- value: 0.674
28
- - name: ROUGE-L
29
- type: rouge-l
30
- value: 0.520
31
- - name: BLEU
32
- type: bleu
33
- value: 0.267
34
  ---
35
 
36
  # ClipTagger-12b
@@ -41,6 +41,8 @@ model-index:
41
 
42
  **ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.
43
 
 
 
44
  The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.
45
 
46
  ### Key Features
@@ -57,6 +59,7 @@ The model generates structured, schema-consistent JSON outputs for every video f
57
  ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.
58
 
59
  ### Technical Specifications
 
60
  - **Parameters**: 12 billion
61
  - **Base Architecture**: Gemma-12B
62
  - **Quantization**: FP8 (no quality loss vs bf16)
@@ -70,6 +73,7 @@ ClipTagger-12b is based on the Gemma-12B architecture and has been optimized wit
70
  The model was trained on 1 million carefully curated single-frame samples from publicly available video data. Training employed knowledge distillation from a high-quality teacher model to ensure consistent, accurate outputs while maintaining the ability to generalize across diverse video content types.
71
 
72
  ### Training Process
 
73
  - **Dataset Size**: 1M video frames
74
  - **Training Method**: Teacher-student distillation
75
  - **Data Source**: Publicly available video content
@@ -80,11 +84,9 @@ The model was trained on 1 million carefully curated single-frame samples from p
80
  Performance metrics on our internal evaluation set:
81
  | Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
82
  |-------|-----------------|---------|---------|---------|------|
83
- | cliptagger_12b | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
84
- | claude_4_sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
85
- | gpt_4.1 | 3.64 | 0.581 | 0.260 | 0.376 | 0.119 |
86
-
87
- ### Benchmark Visualizations
88
 
89
  <table>
90
  <tr>
@@ -113,26 +115,26 @@ ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost sa
113
  | GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
114
  | Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |
115
 
116
- *Cost calculations based on an average of 700 input tokens and 250 output tokens per generation.
117
-
118
  ## Usage
119
 
120
  ### API Access
121
 
122
  For production deployments, we recommend using our managed API service which includes advanced features like batch processing, webhooks, and automatic scaling:
123
 
124
- **[Run ClipTagger-12b via Inference.net API →](https://localhost:3000/use-cases/video-understanding)**
125
 
126
  ### Required Prompts
127
 
128
  The model requires specific system and user prompts for optimal performance. Use these prompts exactly as shown:
129
 
130
  #### System Prompt
 
131
  ```
132
  You are an image annotation API trained to analyze YouTube video keyframes. You will be given instructions on the output format, what to caption, and how to perform your job. Follow those instructions. For descriptions and summaries, provide them directly and do not lead them with 'This image shows' or 'This keyframe displays...', just get right into the details.
133
  ```
134
 
135
  #### User Prompt
 
136
  ```
137
  You are an image annotation API trained to analyze YouTube video keyframes. You must respond with a valid JSON object matching the exact structure below.
138
 
@@ -212,37 +214,21 @@ Given a nature scene with a wooden boardwalk through grassland:
212
  ```
213
 
214
  ## Use Cases
215
-
216
  - **Video Search & Discovery** - Build searchable databases with structured metadata
217
  - **Content Moderation** - Automated content analysis and categorization
218
  - **Accessibility** - Generate consistent alt-text and scene descriptions
219
  - **Ad Verification** - Track product visibility and brand appearances
220
  - **Video Analytics** - Extract insights from large video collections
221
  - **Content Management** - Automatic tagging and organization of video libraries
222
-
223
- ## Limitations
224
-
225
- - Processes one video frame per request
226
- - English-only descriptions (can identify text in other languages)
227
- - Maximum image size: 1MB
228
- - Requires specific prompts for optimal performance
229
- - Not supported on A100 GPUs (no native FP8)
230
-
231
- ## Best Practices
232
-
233
- 1. **Use exact prompts** - The provided system and user prompts are optimized for best results
234
- 2. **Set low temperature** - Use temperature=0.1 for consistent outputs
235
- 3. **Enable JSON mode** - Always set response_format to ensure valid JSON
236
- 4. **Process systematically** - Maintain temporal order when processing video sequences
237
- 5. **Batch similar content** - Group frames from the same video for efficiency
238
 
239
  ## Support
240
-
241
- - **Documentation**: [docs.inference.net](https://docs.inference.net)
242
- - **API Access**: [inference.net/use-cases/video-understanding](https://localhost:3000/use-cases/video-understanding)
243
  - **Email**: [email protected]
244
- - **Enterprise**: [Schedule a consultation](https://inference.net/sales)
245
 
246
- ## License
 
247
 
248
- This model is released under the Apache-2.0 license, allowing for commercial use and modification with proper attribution.
 
 
1
  ---
2
  language:
3
+ - en
4
  license: apache-2.0
5
  tags:
6
+ - VLM
7
+ - video-understanding
8
+ - image-captioning
9
+ - gemma
10
+ - json-mode
11
+ - structured-output
12
+ - video-analysis
13
  base_model: google/gemma-12b
14
  pipeline_tag: image-text-to-text
15
  model-index:
16
+ - name: ClipTagger-12b
17
+ results:
18
+ - task:
19
+ type: image-to-text
20
+ name: Video Frame Captioning
21
+ metrics:
22
+ - name: Average Judge Score
23
+ type: quality
24
+ value: 3.53
25
+ - name: ROUGE-1
26
+ type: rouge-1
27
+ value: 0.674
28
+ - name: ROUGE-L
29
+ type: rouge-l
30
+ value: 0.520
31
+ - name: BLEU
32
+ type: bleu
33
+ value: 0.267
34
  ---
35
 
36
  # ClipTagger-12b
 
41
 
42
  **ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.
43
 
44
+ **ClipTagger outperforms GPT-4.1 and Claude 4 Sonnet, while costing 15x less per generation.**
45
+
46
  The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.
47
 
48
  ### Key Features
 
59
  ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.
60
 
61
  ### Technical Specifications
62
+
63
  - **Parameters**: 12 billion
64
  - **Base Architecture**: Gemma-12B
65
  - **Quantization**: FP8 (no quality loss vs bf16)
 
73
  The model was trained on 1 million carefully curated single-frame samples from publicly available video data. Training employed knowledge distillation from a high-quality teacher model to ensure consistent, accurate outputs while maintaining the ability to generalize across diverse video content types.
74
 
75
  ### Training Process
76
+
77
  - **Dataset Size**: 1M video frames
78
  - **Training Method**: Teacher-student distillation
79
  - **Data Source**: Publicly available video content
 
84
  Performance metrics on our internal evaluation set:
85
  | Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
86
  |-------|-----------------|---------|---------|---------|------|
87
+ | ClipTagger-12b | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
88
+ | Claude 4 Sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
89
+ | GPT-4.1 | 3.64 | 0.581 | 0.260 | 0.376 | 0.119 |
 
 
90
 
91
  <table>
92
  <tr>
 
115
  | GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
116
  | Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |
117
 
 
 
118
  ## Usage
119
 
120
  ### API Access
121
 
122
  For production deployments, we recommend using our managed API service which includes advanced features like batch processing, webhooks, and automatic scaling:
123
 
124
+ **[Run ClipTagger-12b via Inference.net API →](https://inference.net/use-cases/video-understanding)**
125
 
126
  ### Required Prompts
127
 
128
  The model requires specific system and user prompts for optimal performance. Use these prompts exactly as shown:
129
 
130
  #### System Prompt
131
+
132
  ```
133
  You are an image annotation API trained to analyze YouTube video keyframes. You will be given instructions on the output format, what to caption, and how to perform your job. Follow those instructions. For descriptions and summaries, provide them directly and do not lead them with 'This image shows' or 'This keyframe displays...', just get right into the details.
134
  ```
135
 
136
  #### User Prompt
137
+
138
  ```
139
  You are an image annotation API trained to analyze YouTube video keyframes. You must respond with a valid JSON object matching the exact structure below.
140
 
 
214
  ```
215
 
216
  ## Use Cases
 
217
  - **Video Search & Discovery** - Build searchable databases with structured metadata
218
  - **Content Moderation** - Automated content analysis and categorization
219
  - **Accessibility** - Generate consistent alt-text and scene descriptions
220
  - **Ad Verification** - Track product visibility and brand appearances
221
  - **Video Analytics** - Extract insights from large video collections
222
  - **Content Management** - Automatic tagging and organization of video libraries
223
+ -
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
224
 
225
  ## Support
226
+ - **Documentation**: [docs.inference.net](https://inference.net/use-cases/video-understanding)
227
+ - **API Access**: [inference.net/use-cases/video-understanding](https://inference.net/use-cases/video-understanding)
 
228
  - **Email**: [email protected]
 
229
 
230
+ ## Interested in training your own model?
231
+ Contact us at [[email protected]](mailto:[email protected]) for a free consultation with our research team.
232
 
233
+ ## License
234
+ This model is released under the Apache-2.0 license, allowing for commercial use and modification with proper attribution.