samhogan committed
Commit fb437aa · 1 Parent(s): 19e5eb4

Create README.md

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/*.png filter=lfs diff=lfs merge=lfs -text
2025-08-14T02-36_export.csv ADDED
@@ -0,0 +1,4 @@
+Model,Avg Judge Score,ROUGE-1,ROUGE-2,ROUGE-L,BLEU,Samples w/ Eval,Samples w/ Caption
+claude_4_sonnet,3.16,0.463,0.179,0.281,0.060,500,500
+cliptagger_12b,3.53,0.674,0.404,0.520,0.267,499,998
+gpt_4.1,3.64,0.581,0.260,0.376,0.119,494,500
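
The CSV above is the raw export behind the benchmark table added to the README below. A minimal sketch for loading and ranking it, assuming pandas is installed and the file sits in the working directory:

```python
# Load the evaluation export and rank models by average judge score.
import pandas as pd

df = pd.read_csv("2025-08-14T02-36_export.csv")
ranked = df.sort_values("Avg Judge Score", ascending=False)
print(ranked[["Model", "Avg Judge Score", "ROUGE-1", "ROUGE-L", "BLEU"]].to_string(index=False))
```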
README.md CHANGED
@@ -1,8 +1,9 @@
-# GrassData/cliptagger-12b
+![GrassData/ClipTagger-12b](./assets/grass-x-inference.png)

-## Model Description
+## Model Information
+
-**GrassData/cliptagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.
+**GrassData/ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.

 The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.
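
The description promises a schema-consistent JSON record per frame. A purely hypothetical illustration of such a record — the field names below are invented, since the actual schema is fixed by the required prompts referenced at the end of the README, which this diff does not include:

```python
# Hypothetical example of a schema-consistent frame caption.
# Field names are illustrative only; the real schema is defined by the
# model's required prompts, which are not shown in this commit.
frame_caption = {
    "description": "A cyclist crosses a rain-slicked intersection at dusk.",
    "objects": ["bicycle", "person", "traffic light"],
    "setting": "urban street",
    "actions": ["cycling", "crossing"],
}
```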
@@ -17,7 +18,7 @@ The model generates structured, schema-consistent JSON outputs for every video f

 ## Architecture

-GrassData/cliptagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.
+GrassData/ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.

 ### Technical Specifications
 - **Parameters**: 12 billion
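
The architecture hunk highlights FP8 quantization on FP8-native GPUs. A minimal local-inference sketch, assuming the checkpoint loads through vLLM's FP8 path — nothing in this commit confirms that serving stack, and the README itself recommends the managed API for production:

```python
# Illustrative local-inference sketch. Assumes vLLM is installed and the
# checkpoint is compatible with its FP8 quantization path; treat the flags
# as assumptions, not a documented serving recipe for this model.
from vllm import LLM, SamplingParams

llm = LLM(model="GrassData/cliptagger-12b", quantization="fp8")
params = SamplingParams(max_tokens=250, temperature=0.0)
outputs = llm.generate(["Describe this frame as JSON."], params)
print(outputs[0].outputs[0].text)
```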
@@ -42,21 +43,51 @@ The model was trained on 1 million carefully curated single-frame samples from p

 Performance metrics on our internal evaluation set:

-| Model Variant | Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
-|--------------|-------------|---------|---------|---------|------|
-| Base Gemma 12B | 3.00 | 0.490 | 0.198 | 0.299 | 0.074 |
-| + 100K samples | 3.29 | 0.649 | 0.367 | 0.490 | 0.232 |
-| + 1M samples (final) | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
+| Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
+|-------|-----------------|---------|---------|---------|------|
+| cliptagger_12b | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
+| claude_4_sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
+| gpt_4.1 | 3.64 | 0.581 | 0.260 | 0.376 | 0.119 |
+
+### Benchmark Visualizations
+
+<div align="center">
+  <img src="./assets/judge-score.png" alt="Average Judge Score Comparison" width="45%" />
+  <img src="./assets/rouge-1.png" alt="ROUGE-1 Score Comparison" width="45%" />
+  <br/>
+  <img src="./assets/rouge-L.png" alt="ROUGE-L Score Comparison" width="45%" />
+  <img src="./assets/bleu.png" alt="BLEU Score Comparison" width="45%" />
+</div>

 FP8 quantization showed no measurable quality degradation compared to bf16 precision.

+## Cost Comparison
+
+GrassData/ClipTagger-12b delivers frontier-quality performance at a fraction of the cost of closed-source alternatives. Based on typical usage patterns (700 input tokens and 250 output tokens per generation), here's how the costs compare:
+
+### Pricing Comparison
+
+| Model | Input Cost/MTok | Output Cost/MTok | Cost per 1M Generations | Cost per Generation |
+|-------|-----------------|------------------|-------------------------|---------------------|
+| ClipTagger-12b | $0.30 | $0.50 | $335 | $0.000335 |
+| GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
+| Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |
+
+*Cost calculations based on 700 input tokens and 250 output tokens per generation.*
+
+<div align="center">
+  <img src="./assets/cost.png" alt="Cost Comparison Per 1 Million Generations" width="80%" />
+</div>
+
+ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost savings** compared to Claude 4 Sonnet, while maintaining comparable quality metrics.
+
 ## Usage

 ### API Access

 For production deployments, we recommend using our managed API service which includes advanced features like batch processing, webhooks, and automatic scaling:

-**[Run GrassData/cliptagger-12b via Inference.net API →](https://localhost:3000/use-cases/video-understanding)**
+**[Run GrassData/ClipTagger-12b via Inference.net API →](https://localhost:3000/use-cases/video-understanding)**

 ### Required Prompts

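The pricing table's per-generation numbers follow directly from the stated per-MTok rates and the 700-input/250-output token assumption in the footnote. The arithmetic, using only figures from the table:

```python
# Reproduce the README's cost table from its stated per-MTok rates,
# assuming 700 input and 250 output tokens per generation (per the footnote).
RATES = {  # (input $/MTok, output $/MTok), taken from the pricing table
    "ClipTagger-12b": (0.30, 0.50),
    "GPT-4.1": (3.00, 12.00),
    "Claude 4 Sonnet": (3.00, 15.00),
}

for model, (in_rate, out_rate) in RATES.items():
    per_gen = (700 * in_rate + 250 * out_rate) / 1_000_000
    print(f"{model}: ${per_gen:.6f}/generation, ${per_gen * 1_000_000:,.0f} per 1M generations")
```

The quoted savings then fall out as 5,100 / 335 ≈ 15.2x versus GPT-4.1 and 5,850 / 335 ≈ 17.5x versus Claude 4 Sonnet.
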
assets/bleu.png ADDED

Git LFS Details

  • SHA256: 9a47848051f78a55358dc3c11b0b256d90842e271ceb31687bd1c237ba4bcecf
  • Pointer size: 130 Bytes
  • Size of remote file: 36.6 kB
assets/cliptagger-example.png ADDED

Git LFS Details

  • SHA256: 584f2603f53570a6a41c1136384afba9c6c71318ebf58c4a91b821f2e1575b27
  • Pointer size: 132 Bytes
  • Size of remote file: 6.24 MB
assets/cost.png ADDED

Git LFS Details

  • SHA256: 690196a06399b5aa5d32b4e796712e2aa29c997a3e694569510a908e282c9505
  • Pointer size: 130 Bytes
  • Size of remote file: 45.3 kB
assets/grass-x-inference.png ADDED

Git LFS Details

  • SHA256: eac7e621481d1f9c1e23d6e528202e8fc59d564bd49a79a2918341adf537ef5d
  • Pointer size: 131 Bytes
  • Size of remote file: 196 kB
assets/judge-score.png ADDED

Git LFS Details

  • SHA256: 01aa1677ceb0e9c4cdad92884ca8cb13384430edbcfe46374d6e0695ab30c386
  • Pointer size: 130 Bytes
  • Size of remote file: 44.4 kB
assets/rouge-1.png ADDED

Git LFS Details

  • SHA256: 33f41d383f66a8aea68f70904f04dc703a218827dc8dc8c6af6430ba1a4d33c4
  • Pointer size: 130 Bytes
  • Size of remote file: 38.6 kB
assets/rouge-L.png ADDED

Git LFS Details

  • SHA256: 1011f2de9e4fa7963f2e05216e5ff2a4d2339e99e72d6ade4219084b6aa6c872
  • Pointer size: 130 Bytes
  • Size of remote file: 38.7 kB
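
Because the PNGs above are stored via Git LFS, the listed SHA256 values can be checked against fully downloaded copies. A small verification sketch, assuming the repository was cloned with LFS objects pulled; the expected hash is the one listed for assets/bleu.png in this commit:

```python
# Verify a downloaded LFS asset against the SHA256 recorded in its pointer.
import hashlib

EXPECTED = "9a47848051f78a55358dc3c11b0b256d90842e271ceb31687bd1c237ba4bcecf"

with open("assets/bleu.png", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("match" if digest == EXPECTED else f"mismatch: {digest}")
```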