samhogan committed
Commit fb437aa · 1 Parent(s): 19e5eb4

Create README.md

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/*.png filter=lfs diff=lfs merge=lfs -text
2025-08-14T02-36_export.csv ADDED
@@ -0,0 +1,4 @@
+Model,Avg Judge Score,ROUGE-1,ROUGE-2,ROUGE-L,BLEU,Samples w/ Eval,Samples w/ Caption
+claude_4_sonnet,3.16,0.463,0.179,0.281,0.060,500,500
+cliptagger_12b,3.53,0.674,0.404,0.520,0.267,499,998
+gpt_4.1,3.64,0.581,0.260,0.376,0.119,494,500
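
The CSV above is the raw export behind the benchmark table added to the README below. A minimal sketch for loading and ranking it, assuming pandas is installed and the file sits in the working directory:

```python
# Load the evaluation export and rank models by average judge score.
import pandas as pd

df = pd.read_csv("2025-08-14T02-36_export.csv")
ranked = df.sort_values("Avg Judge Score", ascending=False)
print(ranked[["Model", "Avg Judge Score", "ROUGE-1", "ROUGE-L", "BLEU"]].to_string(index=False))
```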
README.md CHANGED
@@ -1,8 +1,9 @@
-# GrassData/cliptagger-12b
+![GrassData/ClipTagger-12b](./assets/grass-x-inference.png)

-## Model Description
+## Model Information
+
-**GrassData/cliptagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.
+**GrassData/ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.

 The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.
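
The description promises a schema-consistent JSON record per frame. A purely hypothetical illustration of such a record — the field names below are invented, since the actual schema is fixed by the required prompts referenced at the end of the README, which this diff does not include:

```python
# Hypothetical example of a schema-consistent frame caption.
# Field names are illustrative only; the real schema is defined by the
# model's required prompts, which are not shown in this commit.
frame_caption = {
    "description": "A cyclist crosses a rain-slicked intersection at dusk.",
    "objects": ["bicycle", "person", "traffic light"],
    "setting": "urban street",
    "actions": ["cycling", "crossing"],
}
```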
@@ -17,7 +18,7 @@ The model generates structured, schema-consistent JSON outputs for every video f

 ## Architecture

-GrassData/cliptagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.
+GrassData/ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.

 ### Technical Specifications
 - **Parameters**: 12 billion
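
The architecture hunk highlights FP8 quantization on FP8-native GPUs. A minimal local-inference sketch, assuming the checkpoint loads through vLLM's FP8 path — nothing in this commit confirms that serving stack, and the README itself recommends the managed API for production:

```python
# Illustrative local-inference sketch. Assumes vLLM is installed and the
# checkpoint is compatible with its FP8 quantization path; treat the flags
# as assumptions, not a documented serving recipe for this model.
from vllm import LLM, SamplingParams

llm = LLM(model="GrassData/cliptagger-12b", quantization="fp8")
params = SamplingParams(max_tokens=250, temperature=0.0)
outputs = llm.generate(["Describe this frame as JSON."], params)
print(outputs[0].outputs[0].text)
```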
@@ -42,21 +43,51 @@ The model was trained on 1 million carefully curated single-frame samples from p

 Performance metrics on our internal evaluation set:

-| Model Variant | Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
-|--------------|-------------|---------|---------|---------|------|
-| Base Gemma 12B | 3.00 | 0.490 | 0.198 | 0.299 | 0.074 |
-| + 100K samples | 3.29 | 0.649 | 0.367 | 0.490 | 0.232 |
-| + 1M samples (final) | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
+| Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
+|-------|-----------------|---------|---------|---------|------|
+| cliptagger_12b | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
+| claude_4_sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
+| gpt_4.1 | 3.64 | 0.581 | 0.260 | 0.376 | 0.119 |
+
+### Benchmark Visualizations
+
+<div align="center">
+  <img src="./assets/judge-score.png" alt="Average Judge Score Comparison" width="45%" />
+  <img src="./assets/rouge-1.png" alt="ROUGE-1 Score Comparison" width="45%" />
+  <br/>
+  <img src="./assets/rouge-L.png" alt="ROUGE-L Score Comparison" width="45%" />
+  <img src="./assets/bleu.png" alt="BLEU Score Comparison" width="45%" />
+</div>

 FP8 quantization showed no measurable quality degradation compared to bf16 precision.

+## Cost Comparison
+
+GrassData/ClipTagger-12b delivers frontier-quality performance at a fraction of the cost of closed-source alternatives. Based on typical usage patterns (700 input tokens and 250 output tokens per generation), here's how the costs compare:
+
+### Pricing Comparison
+
+| Model | Input Cost/MTok | Output Cost/MTok | Cost per 1M Generations | Cost per Generation |
+|-------|-----------------|------------------|-------------------------|---------------------|
+| ClipTagger-12b | $0.30 | $0.50 | $335 | $0.000335 |
+| GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
+| Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |
+
+*Cost calculations based on 700 input tokens and 250 output tokens per generation.*
+
+<div align="center">
+  <img src="./assets/cost.png" alt="Cost Comparison Per 1 Million Generations" width="80%" />
+</div>
+
+ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost savings** compared to Claude 4 Sonnet, while maintaining comparable quality metrics.
+
 ## Usage

 ### API Access

 For production deployments, we recommend using our managed API service which includes advanced features like batch processing, webhooks, and automatic scaling:

-**[Run GrassData/cliptagger-12b via Inference.net API →](https://localhost:3000/use-cases/video-understanding)**
+**[Run GrassData/ClipTagger-12b via Inference.net API →](https://localhost:3000/use-cases/video-understanding)**

 ### Required Prompts

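The pricing table's per-generation numbers follow directly from the stated per-MTok rates and the 700-input/250-output token assumption in the footnote. The arithmetic, using only figures from the table:

```python
# Reproduce the README's cost table from its stated per-MTok rates,
# assuming 700 input and 250 output tokens per generation (per the footnote).
RATES = {  # (input $/MTok, output $/MTok), taken from the pricing table
    "ClipTagger-12b": (0.30, 0.50),
    "GPT-4.1": (3.00, 12.00),
    "Claude 4 Sonnet": (3.00, 15.00),
}

for model, (in_rate, out_rate) in RATES.items():
    per_gen = (700 * in_rate + 250 * out_rate) / 1_000_000
    print(f"{model}: ${per_gen:.6f}/generation, ${per_gen * 1_000_000:,.0f} per 1M generations")
```

The quoted savings then fall out as 5,100 / 335 ≈ 15.2x versus GPT-4.1 and 5,850 / 335 ≈ 17.5x versus Claude 4 Sonnet.
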
assets/bleu.png ADDED

Git LFS Details

  • SHA256: 9a47848051f78a55358dc3c11b0b256d90842e271ceb31687bd1c237ba4bcecf
  • Pointer size: 130 Bytes
  • Size of remote file: 36.6 kB
assets/cliptagger-example.png ADDED

Git LFS Details

  • SHA256: 584f2603f53570a6a41c1136384afba9c6c71318ebf58c4a91b821f2e1575b27
  • Pointer size: 132 Bytes
  • Size of remote file: 6.24 MB
assets/cost.png ADDED

Git LFS Details

  • SHA256: 690196a06399b5aa5d32b4e796712e2aa29c997a3e694569510a908e282c9505
  • Pointer size: 130 Bytes
  • Size of remote file: 45.3 kB
assets/grass-x-inference.png ADDED

Git LFS Details

  • SHA256: eac7e621481d1f9c1e23d6e528202e8fc59d564bd49a79a2918341adf537ef5d
  • Pointer size: 131 Bytes
  • Size of remote file: 196 kB
assets/judge-score.png ADDED

Git LFS Details

  • SHA256: 01aa1677ceb0e9c4cdad92884ca8cb13384430edbcfe46374d6e0695ab30c386
  • Pointer size: 130 Bytes
  • Size of remote file: 44.4 kB
assets/rouge-1.png ADDED

Git LFS Details

  • SHA256: 33f41d383f66a8aea68f70904f04dc703a218827dc8dc8c6af6430ba1a4d33c4
  • Pointer size: 130 Bytes
  • Size of remote file: 38.6 kB
assets/rouge-L.png ADDED

Git LFS Details

  • SHA256: 1011f2de9e4fa7963f2e05216e5ff2a4d2339e99e72d6ade4219084b6aa6c872
  • Pointer size: 130 Bytes
  • Size of remote file: 38.7 kB
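
Because the PNGs above are stored via Git LFS, the listed SHA256 values can be checked against fully downloaded copies. A small verification sketch, assuming the repository was cloned with LFS objects pulled; the expected hash is the one listed for assets/bleu.png in this commit:

```python
# Verify a downloaded LFS asset against the SHA256 recorded in its pointer.
import hashlib

EXPECTED = "9a47848051f78a55358dc3c11b0b256d90842e271ceb31687bd1c237ba4bcecf"

with open("assets/bleu.png", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("match" if digest == EXPECTED else f"mismatch: {digest}")
```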