Create README.md
- .gitattributes +1 -0
- 2025-08-14T02-36_export.csv +4 -0
- README.md +41 -10
- assets/bleu.png +3 -0
- assets/cliptagger-example.png +3 -0
- assets/cost.png +3 -0
- assets/grass-x-inference.png +3 -0
- assets/judge-score.png +3 -0
- assets/rouge-1.png +3 -0
- assets/rouge-L.png +3 -0
.gitattributes
CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/*.png filter=lfs diff=lfs merge=lfs -text
2025-08-14T02-36_export.csv
ADDED
@@ -0,0 +1,4 @@
+Model,Avg Judge Score,ROUGE-1,ROUGE-2,ROUGE-L,BLEU,Samples w/ Eval,Samples w/ Caption
+claude_4_sonnet,3.16,0.463,0.179,0.281,0.060,500,500
+cliptagger_12b,3.53,0.674,0.404,0.520,0.267,499,998
+gpt_4.1,3.64,0.581,0.260,0.376,0.119,494,500
README.md
CHANGED
@@ -1,8 +1,9 @@



## Model Information

**GrassData/ClipTagger-12b** is a 12-billion-parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.

The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.
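The model's actual output schema is not reproduced in this README. Purely as an illustration of what "schema-consistent JSON per frame" means (all field names below are hypothetical, not the model's real schema), a per-frame annotation could look like this:

```python
import json

# Hypothetical per-frame annotation; the real ClipTagger-12b schema may differ.
frame_annotation = {
    "description": "A person rides a bicycle along a beach at sunset.",
    "objects": ["person", "bicycle"],
    "actions": ["riding"],
    "setting": "beach",
    "time_of_day": "sunset",
}

# Schema consistency means every frame serializes with the same keys, so
# downstream indexing, search, and moderation pipelines can rely on the shape.
serialized = json.dumps(frame_annotation, sort_keys=True)
parsed = json.loads(serialized)
```

Because the keys are stable across frames, outputs can be bulk-loaded into a database or parsed with a fixed schema rather than free-text heuristics.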

@@ -17,7 +18,7 @@

## Architecture

GrassData/ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.

### Technical Specifications
- **Parameters**: 12 billion

@@ -42,21 +43,51 @@

Performance metrics on our internal evaluation set:

| Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
|-------|-----------------|---------|---------|---------|------|
| cliptagger_12b | 3.53 | **0.674** | **0.404** | **0.520** | **0.267** |
| claude_4_sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
| gpt_4.1 | **3.64** | 0.581 | 0.260 | 0.376 | 0.119 |

### Benchmark Visualizations

<div align="center">
  <img src="./assets/judge-score.png" alt="Average Judge Score Comparison" width="45%" />
  <img src="./assets/rouge-1.png" alt="ROUGE-1 Score Comparison" width="45%" />
  <br/>
  <img src="./assets/rouge-L.png" alt="ROUGE-L Score Comparison" width="45%" />
  <img src="./assets/bleu.png" alt="BLEU Score Comparison" width="45%" />
</div>

FP8 quantization showed no measurable quality degradation compared to bf16 precision.
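A rough sense of why FP8 matters at this scale comes from the weight footprint alone (a back-of-the-envelope estimate for weights only, ignoring activations and KV cache):

```python
PARAMS = 12e9  # 12 billion parameters

# Bytes per parameter: bf16 stores 2 bytes, FP8 stores 1.
bf16_gb = PARAMS * 2 / 1e9  # 24 GB of weights in bf16
fp8_gb = PARAMS * 1 / 1e9   # 12 GB of weights in FP8

print(f"bf16 weights: {bf16_gb:.0f} GB, fp8 weights: {fp8_gb:.0f} GB")
```

Halving the weight memory is what lets a 12B model fit comfortably on single consumer and datacenter GPUs with FP8 support.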

## Cost Comparison

GrassData/ClipTagger-12b delivers frontier-quality performance at a fraction of the cost of closed-source alternatives. Based on typical usage patterns (700 input tokens and 250 output tokens per generation), here's how the costs compare:

### Pricing Comparison

| Model | Input Cost/MTok | Output Cost/MTok | Cost per 1M Generations | Cost per Generation |
|-------|-----------------|------------------|-------------------------|---------------------|
| ClipTagger-12b | $0.30 | $0.50 | $335 | $0.000335 |
| GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
| Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |

*Cost calculations based on 700 input tokens and 250 output tokens per generation.*

<div align="center">
  <img src="./assets/cost.png" alt="Cost Comparison Per 1 Million Generations" width="80%" />
</div>

ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost savings** compared to Claude 4 Sonnet, while maintaining comparable quality metrics.
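The per-generation and savings figures above follow directly from the per-million-token prices; a quick sanity check:

```python
IN_TOK, OUT_TOK = 700, 250  # tokens per generation, as stated above

def cost_per_generation(in_per_mtok: float, out_per_mtok: float) -> float:
    """Dollar cost of one generation given per-million-token prices."""
    return (IN_TOK * in_per_mtok + OUT_TOK * out_per_mtok) / 1e6

cliptagger = cost_per_generation(0.30, 0.50)  # $0.000335
gpt41 = cost_per_generation(3.00, 12.00)      # $0.0051
claude = cost_per_generation(3.00, 15.00)     # $0.00585

print(round(gpt41 / cliptagger, 1), round(claude / cliptagger, 1))  # 15.2 17.5
```

The ratios round to roughly 15x and 17x, matching the headline savings.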

## Usage

### API Access

For production deployments, we recommend using our managed API service, which includes advanced features like batch processing, webhooks, and automatic scaling:

**[Run GrassData/ClipTagger-12b via Inference.net API →](https://localhost:3000/use-cases/video-understanding)**
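Assuming the API follows the common OpenAI-compatible chat format (an assumption for illustration; the model id and message layout below are hypothetical, so check the Inference.net docs for the real ones), a request for a single frame might be built like this:

```python
import base64

def build_request(frame_bytes: bytes, system_prompt: str, user_prompt: str) -> dict:
    """Build an OpenAI-style chat payload with one base64-encoded frame.

    The model id and field layout here are assumptions, not confirmed API details.
    """
    b64 = base64.b64encode(frame_bytes).decode("ascii")
    return {
        "model": "grassdata/cliptagger-12b",  # hypothetical model id
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            },
        ],
        "response_format": {"type": "json_object"},  # ask for JSON output
    }

payload = build_request(b"\x89PNG...", "You are a video frame captioner.", "Describe this frame.")
```

Sending the payload is then an ordinary HTTPS POST to the chat-completions endpoint with your API key in the `Authorization` header.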

### Required Prompts
assets/bleu.png
ADDED
(Git LFS)

assets/cliptagger-example.png
ADDED
(Git LFS)

assets/cost.png
ADDED
(Git LFS)

assets/grass-x-inference.png
ADDED
(Git LFS)

assets/judge-score.png
ADDED
(Git LFS)

assets/rouge-1.png
ADDED
(Git LFS)

assets/rouge-L.png
ADDED
(Git LFS)