---
license: gpl-3.0
datasets:
- p1atdev/danbooru-2024
language:
- en
pipeline_tag: image-classification
---

# Camie Tagger v2

An advanced deep learning model for automatically tagging anime/manga illustrations with relevant tags across multiple categories, achieving **67.3% micro F1** (with the micro-optimized threshold profile) and **50.6% macro F1** (with the macro-optimized threshold profile) across **70,527 possible tags** on a test set of 20,116 samples. v2 switches to a Vision Transformer backbone and delivers significantly improved performance on this notoriously long-tailed, sparse dataset.

![Application Interface](images/app_screenshot.png)

## 🚀 What's New in v2

![Performance Bar](images/performance_bar.PNG)

![Performance Dumbbell](images/dumbbell_plot.png)

### Major Performance Improvements:
- **Micro F1**: 58.1% → **67.3%** (+9.2 percentage points)
- **Macro F1**: 33.8% → **50.6%** (+16.8 percentage points)
- **Model Size**: 424M → **143M parameters** (-66% reduction)
- **Architecture**: Switched from EfficientNetV2-L to Vision Transformer (ViT) backbone
- **Simplified Design**: Streamlined from dual-stage to single refined prediction model

### Training Innovations:
- **Multi-Resolution Training**: Progressive scaling from 384px → 512px resolution
- **IRFS (Instance-Aware Repeat Factor Sampling)**: Significant macro F1 improvements for rare tags
- **Adaptive Training**: Models quickly adapt to resolution/distribution changes after initial pretraining
- **Overall, the model is more accurate, runs faster, and needs less code!**

## ✨ Features:
- **Streamlit web interface app and game**: User-friendly UI for uploading and analyzing images and a tag collection game
- **Adjustable threshold profiles**: Micro-optimized, Macro-optimized, Balanced, and Category-specific profiles
- **Fine-grained control**: Per-category threshold adjustments for precision-recall tradeoffs
- **Safetensors and ONNX**: Both formats are available in the main directory (see the inference sketch below)
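
For the ONNX variant, inference needs only `onnxruntime` plus a preprocessed image. Here is a minimal sketch, assuming the exported graph takes one `(1, 3, 512, 512)` float32 input and emits per-tag logits; the file name and input/output layout are assumptions, so query the session rather than hard-coding names:

```python
# Minimal ONNX inference sketch (file name and I/O layout are assumptions).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("camie_tagger_v2.onnx")  # hypothetical filename

# Discover the graph's actual input/output names instead of hard-coding them.
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# `image` stands in for a preprocessed (1, 3, 512, 512) float32 array.
image = np.random.rand(1, 3, 512, 512).astype(np.float32)

logits = session.run([output_name], {input_name: image})[0]
probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid over 70,527 independent tags
print(probs.shape, probs.max())
```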

## 📊 Performance Analysis:

### Complete v1 vs v2 Performance Comparison:

| CATEGORY | v1 Micro F1 | v2 Micro F1 | Micro Δ | v1 Macro F1 | v2 Macro F1 | Macro Δ |
|----------|-------------|-------------|---------|-------------|-------------|---------|
| **Overall** | 61.3% | **67.3%** | **+6.0pp** | 33.8% | **50.6%** | **+16.8pp** |
| **Artist** | 48.0% | **70.0%** | **+22.0pp** | 29.9% | **66.1%** | **+36.2pp** |
| **Character** | 75.7% | **83.4%** | **+7.7pp** | 52.4% | **66.2%** | **+13.8pp** |
| **Copyright** | 79.2% | **86.6%** | **+7.4pp** | 41.9% | **56.2%** | **+14.3pp** |
| **General** | 60.8% | **66.4%** | **+5.6pp** | 21.5% | **34.6%** | **+13.1pp** |
| **Meta** | 60.2% | **61.2%** | **+1.0pp** | 14.5% | **23.7%** | **+9.2pp** |
| **Rating** | 80.8% | **83.1%** | **+2.3pp** | 79.5% | **77.5%** | **-2.0pp** |
| **Year** | 33.2% | **30.8%** | **-2.4pp** | 29.3% | **32.6%** | **+3.3pp** |

*Micro F1 is compared using micro-optimized thresholds and macro F1 using macro-optimized thresholds, so each metric is evaluated under its best-suited profile for a fair comparison.*

### Key Performance Insights:

The v2 model shows remarkable improvements across nearly all categories:

- **Artist Recognition**: Massive +22.0pp micro F1 improvement and +36.2pp macro improvement, indicating much better artist identification
- **Character Detection**: Large +7.7pp micro F1 and +13.8pp macro F1 gains
- **Copyright Recognition**: Excellent +7.4pp micro F1 improvement and +14.3pp macro improvement for series identification  
- **General Tags**: Improved +5.6pp micro F1 and +13.1pp macro F1 for visual attributes
- **Overall Macro F1**: Exceptional +16.8pp improvement shows much better rare tag recognition

Only the year category (micro F1) and the rating category (macro F1) show slight regressions.

### Detailed v2 Performance:

#### MACRO OPTIMIZED (Recommended):

| CATEGORY | THRESHOLD | MICRO-F1 | MACRO-F1 |
|----------|-----------|----------|----------|
| **overall** | 0.492 | **60.9%** | **50.6%** |
| artist | 0.492 | 62.3% | 66.1% |
| character | 0.492 | 79.9% | 66.2% |
| copyright | 0.492 | 81.8% | 56.2% |
| general | 0.492 | 60.2% | 34.6% |
| meta | 0.492 | 56.3% | 23.7% |
| rating | 0.492 | 78.7% | 77.5% |
| year | 0.492 | 37.2% | 32.6% |

#### MICRO OPTIMIZED:

| CATEGORY | THRESHOLD | MICRO-F1 | MACRO-F1 |
|----------|-----------|----------|----------|
| **overall** | 0.614 | **67.3%** | **46.3%** |
| artist | 0.614 | 70.0% | 64.4% |
| character | 0.614 | 83.4% | 64.5% |
| copyright | 0.614 | 86.6% | 53.1% |
| general | 0.614 | 66.4% | 27.4% |
| meta | 0.614 | 61.2% | 19.2% |
| rating | 0.614 | 83.1% | 81.8% |
| year | 0.614 | 30.8% | 21.3% |

The model performs exceptionally well on character identification (83.4% F1 across 26,968 tags), copyright/series detection (86.6% F1 across 5,364 tags), and content rating classification (83.1% F1 across 4 tags).

### Real-world Tag Accuracy:

The macro optimized threshold is recommended as many "false positives" according to the benchmark are actually correct tags missing from the Danbooru dataset. The model frequently identifies appropriate tags that weren't included in the original tagging, making perceived accuracy higher than formal metrics suggest.
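
To make the threshold profiles concrete, here is a hedged sketch of how predictions would be filtered: the overall cutoffs (0.492 macro-optimized, 0.614 micro-optimized) come from the tables above, while the per-category override mechanism and the toy tags are illustrative:

```python
# Sketch of applying a threshold profile to sigmoid tag probabilities.
# Profile values come from the tables above; the per-tag data is illustrative.
import numpy as np

THRESHOLD_PROFILES = {
    "macro_optimized": 0.492,  # best macro F1 (recommended)
    "micro_optimized": 0.614,  # best micro F1
}

def select_tags(probs, tag_names, profile="macro_optimized",
                per_category=None, categories=None):
    """Return (tag, prob) pairs above the profile threshold.

    per_category: optional {category: threshold} overrides, mirroring the
    app's category-specific profile; categories maps tag index -> category.
    """
    base = THRESHOLD_PROFILES[profile]
    picked = []
    for i, p in enumerate(probs):
        thr = base
        if per_category is not None and categories is not None:
            thr = per_category.get(categories[i], base)
        if p >= thr:
            picked.append((tag_names[i], float(p)))
    return sorted(picked, key=lambda t: -t[1])

# Toy usage with three fake tags:
probs = np.array([0.91, 0.55, 0.20])
print(select_tags(probs, ["1girl", "smile", "hat"], profile="macro_optimized"))
```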

**If you'd like to support further training on the complete dataset or my future projects, consider supporting me here: [https://ko-fi.com/camais](https://ko-fi.com/camais). Your support will directly enable longer training runs and better models!**

## 🧠 Architecture Overview:

### Vision Transformer Backbone:
- **Base Model**: Vision Transformer (ViT) with patch-based image processing
- **Dual Output**: Patch feature map + CLS token for comprehensive image understanding
- **Efficient Design**: 86.4M backbone parameters, versus the 214M+ of classifier layers in the previous version

### Refined Prediction Pipeline:
1. **Feature Extraction**: ViT processes image into patch tokens and global CLS token
2. **Global Pooling**: Combines mean-pooled patches with CLS token (dual-pool approach)  
3. **Initial Predictions**: Shared weights between tag embeddings and classification layer
4. **Candidate Selection**: Top-K tag selection based on initial confidence
5. **Cross-Attention**: Tag embeddings attend to image patch features
6. **Final Scoring**: Refined predictions for selected candidate tags
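
To make the data flow concrete, here is a minimal PyTorch sketch of steps 2-6, assuming illustrative hyperparameters (embedding dim 768, 8 attention heads, top-512 candidates) and a stand-in ViT; the released model's exact shapes and layers will differ:

```python
# Illustrative sketch of the refined prediction pipeline; hyperparameters
# are assumptions, not the released model's exact configuration.
import torch
import torch.nn as nn

class RefinedTaggerSketch(nn.Module):
    def __init__(self, vit, num_tags=70_527, dim=768, top_k=512):
        super().__init__()
        self.vit = vit                      # any backbone returning (patches, cls)
        self.top_k = top_k
        # Shared weights: one embedding matrix both scores the initial
        # predictions and supplies queries for cross-attention refinement.
        self.tag_embed = nn.Embedding(num_tags, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, images):
        patches, cls = self.vit(images)                    # (B, N, D), (B, D)
        pooled = patches.mean(dim=1) + cls                 # dual-pool (combination assumed)
        initial_logits = pooled @ self.tag_embed.weight.T  # initial predictions (B, num_tags)
        top_idx = initial_logits.topk(self.top_k, dim=1).indices  # candidate selection
        queries = self.tag_embed(top_idx)                  # (B, K, D)
        refined, _ = self.attn(queries, patches, patches)  # tags attend to patch features
        refined_logits = self.score(refined).squeeze(-1)   # final scoring (B, K)
        return initial_logits, top_idx, refined_logits

class DummyViT(nn.Module):
    """Stand-in backbone so the sketch runs end to end."""
    def forward(self, x):
        b = x.shape[0]
        return torch.randn(b, 196, 768), torch.randn(b, 768)

initial, idx, refined = RefinedTaggerSketch(DummyViT())(torch.randn(2, 3, 512, 512))
```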

### Key Improvements:
- **Shared Weights**: Tag embeddings directly used for initial classification
- **Simplified Pipeline**: Single refined prediction stage (vs previous initial + refined)
- **Native PyTorch**: Uses optimized MultiheadAttention instead of Flash Attention
- **Custom Embeddings**: No dependency on external models like CLIP
- **Gradient Checkpointing**: Memory-efficient training on consumer hardware

## ๐Ÿ› ๏ธ Training Details:

### Multi-Resolution Training Strategy:

The model was trained using a multi-resolution approach:

1. **Phase 1**: 3 epochs at 384px resolution with learning rate 1e-4
2. **Phase 2**: IRFS (Instance-Aware Repeat Factor Sampling) - addresses long-tailed distribution imbalance  
3. **Phase 3**: 512px resolution fine-tuning with learning rate 5e-5
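
One way to encode that schedule is as a simple phase table driving the training loop. Durations for phases 2 and 3 are assumptions (the card only states "3+ epochs" overall), as is whether IRFS remains active during the 512px phase:

```python
# Phase table mirroring the multi-resolution strategy above; durations for
# phases 2 and 3 are assumptions (the card only states "3+ epochs" total).
TRAINING_PHASES = [
    dict(name="pretrain", epochs=3.0, resolution=384, lr=1e-4, use_irfs=False),
    dict(name="irfs",     epochs=0.5, resolution=384, lr=1e-4, use_irfs=True),
    dict(name="high-res", epochs=0.5, resolution=512, lr=5e-5, use_irfs=False),
]

for phase in TRAINING_PHASES:
    # In a real loop: rebuild the dataloader at phase["resolution"] (with IRFS
    # sampling when enabled), reset the optimizer LR, then fine-tune briefly.
    print(f"{phase['name']}: {phase['epochs']} epoch(s) @ {phase['resolution']}px, "
          f"lr={phase['lr']}, IRFS={phase['use_irfs']}")
```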

### Key Training Insights:

**Rapid Adaptation**: Once the model learns good general features during initial pretraining, it adapts to resolution changes and distribution shifts very quickly - often within a fraction of an epoch rather than requiring full retraining.

**IRFS Benefits**: Instance-Aware Repeat Factor Sampling provided substantial macro F1 improvements by addressing the long-tailed distribution of anime tags, where instance counts vary dramatically between classes even with similar image counts.
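
For intuition, the sketch below follows standard repeat factor sampling: a tag whose frequency falls below a target `t` gets an oversampling factor of sqrt(t / f), and each image is repeated according to its rarest tag. The instance-aware blend here (a geometric mean of image-level and instance-level frequency) is an illustrative assumption; see the linked IRFS paper for the exact formulation:

```python
# Repeat-factor sampling sketch; the instance-aware blend is an assumption.
import math
from collections import Counter

def repeat_factors(image_tags, instance_counts=None, t=1e-3):
    """image_tags: list[set[str]] of tags per image.
    instance_counts: optional Counter of per-tag instance totals."""
    n = len(image_tags)
    img_freq = Counter(tag for tags in image_tags for tag in tags)
    total_instances = sum(instance_counts.values()) if instance_counts else None

    def tag_factor(tag):
        f = img_freq[tag] / n
        if instance_counts:
            f_ins = instance_counts[tag] / total_instances
            f = math.sqrt(f * f_ins)       # assumed blend of the two frequencies
        return max(1.0, math.sqrt(t / f))  # RFS: oversample tags rarer than t

    # Each image is repeated according to its rarest tag.
    return [max(tag_factor(tag) for tag in tags) for tags in image_tags]

# Toy usage (t raised so the effect is visible on 3 images): the image
# carrying the rare tag gets the largest repeat factor.
imgs = [{"1girl", "rare_artist"}, {"1girl"}, {"1girl", "smile"}]
print(repeat_factors(imgs, t=0.5))
```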

**Efficient Scaling**: Because the ViT backbone transfers its learned features across resolutions and sampling distributions, each incremental training phase needs only a brief adaptation period, making the multi-phase schedule highly efficient.

#### Training Data:
- **Training subset**: 2,000,000 images
- **Training duration**: 3+ epochs with multi-resolution scaling
- **Final resolution**: 512x512 pixels

## 🎮 Tag Collector Game (Camie Collector)

Introducing the Tag Collector Game - a gamified approach to anime image tagging that helps you understand the model's performance and limits. This was a shower thought gone too far! Lots of Project Moon references.

### How to Play:
1. Upload an image
2. Scan for tags to discover them
   ![Collect Tags Tab](images/collect_tags.PNG)
3. Earn TagCoins for new discoveries
4. Spend TagCoins on upgrades to lower the threshold
   ![Upgrades Tab](images/upgrades.PNG)
5. Lower thresholds reveal rarer tags!
6. Collect sets of related tags for bonuses and reveal unique mosaics!
   ![Mosaics Tab](images/mosaics.PNG)
7. Visit the Library System to discover unique tags (discovery only; these are not collected)
   ![Library Tab](images/library.PNG)
8. Use collected tags to either inspire new searches or generate essence
9. Use Enkephalin to generate Tag Essences
   ![Essence Tab](images/essence_tab.PNG)
10. Use the Tag Essence Generator to collect a tag along with the tags related to it. Lamp Essence:
    ![Lamp Essence](images/lamp_essence.jpg)

## ๐Ÿ–ฅ๏ธ Web Interface Guide

The interface is divided into three main sections:

1. **Model Selection** (Sidebar):
   - Choose between the Full Model, the Initial-only Model, or ONNX-accelerated inference (initial-only)
   - View model information and memory usage

2. **Image Upload** (Left Panel):
   - Upload your own images or select from examples
   - View the selected image

3. **Tagging Controls** (Right Panel):
   - Select threshold profile
   - Adjust thresholds for precision-recall and micro/macro tradeoff
   - Configure display options
   - View predictions organized by category

### Display Options:

- **Show all tags**: Display all tags including those below threshold
- **Compact view**: Hide progress bars for cleaner display
- **Minimum confidence**: Filter out low-confidence predictions
- **Category selection**: Choose which categories to include in the summary

### Interface Screenshots:

![Application Interface](images/app_screenshot.png)

*Note the rare characters and tags identified. Some have only hundreds of samples on Danbooru!*

![Tag Results Example](images/tag_results_example.png)

### ๐Ÿ› ๏ธ Requirements

- **Python 3.11.9 specifically** (newer versions are incompatible)
- PyTorch 1.10+
- Streamlit
- PIL/Pillow
- NumPy

### 🔧 Usage

Set up the application and game by executing `setup.bat`, which installs the required virtual environment. Once everything is installed, the app lets you:

- Upload your own images or select from example images
- Choose different threshold profiles
- Adjust category-specific thresholds
- View predictions organized by category
- Filter and sort tags based on confidence

Launch the app with `run_app.bat` and the game with `run_game.bat`.

## 🧠 Training Details

### Dataset

The model was trained on a carefully filtered subset of the [Danbooru 2024 dataset](https://huggingface.co/datasets/p1atdev/danbooru-2024), which contains a vast collection of anime/manga illustrations with comprehensive tagging.

#### Filtering Process:

The dataset was filtered with the following constraints:

```python
# Minimum tags per category required for each image
min_tag_counts = {
    'general': 25, 
    'character': 1, 
    'copyright': 1, 
    'artist': 0, 
    'meta': 0
}

# Minimum samples per tag required for tag to be included
min_tag_samples = {
    'general': 20, 
    'character': 40, 
    'copyright': 50, 
    'artist': 200, 
    'meta': 50
}
```

This filtering process:
1. First removed low-sample tags (tags with fewer occurrences than specified in `min_tag_samples`)
2. Then removed images with insufficient tags per category (as specified in `min_tag_counts`)
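
Here is a sketch of that two-step filter, assuming the working representation is a mapping from image ID to per-category tag lists (the notebook's actual data structures may differ):

```python
# Two-step dataset filter: drop rare tags first, then under-tagged images.
from collections import Counter

def filter_dataset(images, min_tag_samples, min_tag_counts):
    """images: {image_id: {category: [tags]}} (assumed working format)."""
    # Step 1: count tag occurrences per category and drop low-sample tags.
    counts = Counter()
    for tags_by_cat in images.values():
        for cat, tags in tags_by_cat.items():
            counts.update((cat, t) for t in tags)
    kept = {
        img_id: {
            cat: [t for t in tags if counts[(cat, t)] >= min_tag_samples.get(cat, 0)]
            for cat, tags in tags_by_cat.items()
        }
        for img_id, tags_by_cat in images.items()
    }
    # Step 2: drop images that no longer meet the per-category minimums.
    return {
        img_id: tags_by_cat
        for img_id, tags_by_cat in kept.items()
        if all(len(tags_by_cat.get(cat, [])) >= n for cat, n in min_tag_counts.items())
    }
```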

#### Training Data:

- **Starting dataset size**: ~3,000,000 filtered images
- **Training subset**: 2,000,000 images (due to storage and time constraints)

#### Preprocessing:

Images were preprocessed with minimal transformations:
- Tensor normalization (scaled to 0-1 range)
- ImageNet normalization
- Resized while maintaining original aspect ratio
- No additional augmentations were applied
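
A sketch of that preprocessing using Pillow and torchvision; the square padding after the aspect-preserving resize is an assumption, since the card does not state how non-square images are fitted to 512x512:

```python
# Preprocessing sketch: 0-1 scaling, ImageNet normalization, aspect-preserving
# resize. The zero-padding to a square canvas is an assumption.
import torch
import torchvision.transforms.functional as TF
from PIL import Image

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def preprocess(image: Image.Image, size: int = 512) -> torch.Tensor:
    image = image.convert("RGB")
    # Resize the long side to `size`, keeping the original aspect ratio.
    w, h = image.size
    scale = size / max(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    tensor = TF.to_tensor(image)                        # scales to [0, 1]
    tensor = TF.normalize(tensor, IMAGENET_MEAN, IMAGENET_STD)
    # Pad right/bottom to a square canvas (padding choice assumed).
    _, h, w = tensor.shape
    tensor = TF.pad(tensor, [0, 0, size - w, size - h])  # left, top, right, bottom
    return tensor.unsqueeze(0)                           # (1, 3, size, size)
```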

#### Tag Categories:

The model recognizes tags across these categories:
- **General**: Visual elements, concepts, clothing, etc. (30,841 tags)
- **Character**: Individual characters appearing in the image (26,968 tags)
- **Copyright**: Source material (anime, manga, game) (5,364 tags)
- **Artist**: Creator of the artwork (7,007 tags)
- **Meta**: Meta information about the image (323 tags)
- **Rating**: Content rating (4 tags)
- **Year**: Year of upload (20 tags)

All supported tags are stored in `model/metadata.json`, which maps tag IDs to their names and categories.
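
As an example of how that file might be used at inference time, here is a hedged sketch of grouping predictions by category; the JSON schema (tag ID mapping to `{"name", "category"}`) is assumed from the description above:

```python
# Sketch of grouping predicted tags by category via model/metadata.json.
# The exact JSON schema is assumed (id -> {"name": ..., "category": ...}).
import json
from collections import defaultdict

with open("model/metadata.json", encoding="utf-8") as f:
    metadata = json.load(f)

def group_by_category(tag_probs, threshold=0.492):
    """tag_probs: {tag_id: probability}; keeps tags above the threshold."""
    grouped = defaultdict(list)
    for tag_id, prob in tag_probs.items():
        if prob >= threshold:
            info = metadata[str(tag_id)]
            grouped[info["category"]].append((info["name"], prob))
    for tags in grouped.values():
        tags.sort(key=lambda t: -t[1])
    return dict(grouped)
```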

### Training Notebooks

The repository includes the main training notebook:

1. **camie-tagger-v2.ipynb**:
   - Main training notebook
   - Dataset loading and preprocessing
   - Model initialization
   - Tag selection optimization
   - Metric tracking and visualization

### Training Monitor

The project includes a real-time training monitor, accessible in a browser at `localhost:5000` while training runs.

#### Performance Tips:

โš ๏ธ **Important**: For optimal training speed, keep VSCode minimized and the training monitor open in your browser. This can improve iteration speed by **3-5x** due to how the Windows/WSL graphics stack handles window focus and CUDA kernel execution.

#### Monitor Features:

The training monitor provides three main views:

##### 1. Overview Tab:

![Overview Tab](images/training_monitor_overview.png)

- **Training Progress**: Real-time metrics including epoch, batch, speed, and time estimates
- **Loss Chart**: Training and validation loss visualization
- **F1 Scores**: Initial and refined F1 metrics for both training and validation

##### 2. Predictions Tab:

![Predictions Tab](images/training_monitor_predictions.png)

- **Image Preview**: Shows the current sample being analyzed
- **Prediction Controls**: Toggle between initial and refined predictions
- **Tag Analysis**: 
  - Color-coded tag results (correct, incorrect, missing)
  - Confidence visualization with probability bars
  - Category-based organization
  - Filtering options for error analysis

##### 3. Selection Analysis Tab:

![Selection Analysis Tab](images/training_monitor_selection.png)

- **Selection Metrics**: Statistics on tag selection quality
  - Ground truth recall
  - Average probability for ground truth vs. non-ground truth tags
  - Unique tags selected
- **Selection Graph**: Trends in selection quality over time
- **Selected Tags Details**: Detailed view of model-selected tags with confidence scores

The monitor provides invaluable insights into how the prediction pipeline is performing, particularly how the tag selection step bridges the initial and refined prediction stages.

### Training Notes:

- Training notebooks may require WSL and 32GB+ of RAM to handle the dataset
- With more computational resources, the model could be trained longer on the full dataset

## ๐Ÿ™ Acknowledgments

- Claude Sonnet 3.5, 4 and ChatGPT 5 Thinking for development assistance and brainstorming
- [Vision Transformer](https://arxiv.org/abs/2010.11929) for the foundational architecture
- [Danbooru](https://danbooru.donmai.us/) for the comprehensive tagged anime image dataset
- [p1atdev](https://huggingface.co/p1atdev) for the processed Danbooru 2024 dataset
- [IRFS paper](https://arxiv.org/abs/2305.08069) for Instance-Aware Repeat Factor Sampling methodology
- PyTorch team for optimized attention implementations and gradient checkpointing
- The open-source ML community for foundational tools and methods