camie-tagger-v2 / README.md

Camais03

metadata

license: gpl-3.0
datasets:
  - p1atdev/danbooru-2024
language:
  - en
pipeline_tag: image-classification

Camie Tagger v2

An advanced deep learning model for automatically tagging anime/manga illustrations with relevant tags across multiple categories. It achieves a 67.3% micro F1 score (50.6% macro F1 with the macro-optimized threshold preset) across 70,527 possible tags on a test set of 20,116 samples. v2 switches to a Vision Transformer backbone and significantly improves performance.

🚀 What's New in v2

Major Performance Improvements

Micro F1: 58.1% → 67.3% (+9.2 percentage points)
Macro F1: 31.5% → 50.6% (+19.1 percentage points)
Model Size: 424M → 143M parameters (66% reduction)
Architecture: Switched from EfficientNetV2-L to Vision Transformer (ViT) backbone
Simplified Design: Streamlined from dual-stage to single refined prediction model

Training Innovations

Multi-Resolution Training: Progressive scaling from 384px → 512px resolution
IRFS (Instance-Aware Repeat Factor Sampling): Significant macro F1 improvements for rare tags
Adaptive Training: Models quickly adapt to resolution/distribution changes after initial pretraining

v2 demonstrates that Vision Transformers can achieve superior anime image tagging performance with fewer parameters and a cleaner architecture.

🔑 Key Highlights

Efficient Training: Completed on just a single RTX 3060 GPU (12GB VRAM)
Fast Adaptation: Models adapt to new resolutions/distributions within partial epochs after pretraining
Comprehensive Coverage: 70,527 tags across 7 categories (general, character, copyright, artist, meta, rating, year)
Modern Architecture: Vision Transformer backbone with cross-attention refinement
User-Friendly Interface: Easy-to-use application with customizable thresholds and tag collection game

✨ Features

Multi-category tagging system: Handles general tags, characters, copyright (series), artists, meta information, content ratings, and upload year
High performance: 67.3% micro F1 score (50.6% macro F1) across 70,527 possible tags
Windows compatibility: Works on Windows without Flash Attention requirements
Streamlit web interface: User-friendly UI for uploading and analyzing images and a tag collection game
Adjustable threshold profiles: Micro, Macro, Balanced, Category-specific, High Precision, and High Recall profiles
Fine-grained control: Per-category threshold adjustments for precision-recall tradeoffs
Safetensors and ONNX: Models are provided in Safetensors and ONNX formats; the original pickle files are available in /models
Vision Transformer Backbone: Modern architecture with superior performance-to-parameter ratio

📊 Performance Analysis

Complete v1 vs v2 Performance Comparison

CATEGORY	v1 Micro F1	v2 Micro F1	Micro Δ	v1 Macro F1	v2 Macro F1	Macro Δ
Overall	58.1%	67.3%	+9.2pp	31.5%	50.6%	+19.1pp
Artist	47.4%	70.0%	+22.6pp	29.8%	64.4%	+34.6pp
Character	74.6%	83.4%	+8.8pp	47.8%	64.5%	+16.7pp
Copyright	76.3%	86.6%	+10.3pp	37.7%	53.1%	+15.4pp
General	57.6%	66.4%	+8.8pp	20.4%	27.4%	+7.0pp
Meta	55.7%	61.2%	+5.5pp	14.4%	19.2%	+4.8pp
Rating	77.9%	83.1%	+5.2pp	76.8%	81.8%	+5.0pp
Year	33.1%	30.8%	-2.3pp	28.6%	21.3%	-7.3pp

Both models were evaluated using the balanced preset.

Key Performance Insights

The v2 model shows remarkable improvements across nearly all categories:

Artist Recognition: Massive +22.6pp micro F1 improvement, indicating much better artist identification
Character Detection: Strong +8.8pp micro F1 and +16.7pp macro F1 gains
Copyright Recognition: Excellent +10.3pp micro F1 improvement for series identification
General Tags: Consistent +8.8pp micro F1 improvement for visual attributes
Overall Macro F1: Exceptional +19.1pp improvement shows much better rare tag recognition

Only the year category regresses (-2.3pp micro F1, -7.3pp macro F1), likely because the reduced model complexity makes temporal classification more challenging.

Detailed v2 Performance

MACRO OPTIMIZED (Recommended)

CATEGORY	THRESHOLD	MICRO-F1	MACRO-F1
overall	0.492	60.9%	50.6%
artist	0.492	62.3%	66.1%
character	0.492	79.9%	66.2%
copyright	0.492	81.8%	56.2%
general	0.492	60.2%	34.6%
meta	0.492	56.3%	23.7%
rating	0.492	78.7%	77.5%
year	0.492	37.2%	32.6%

MICRO OPTIMIZED

CATEGORY	THRESHOLD	MICRO-F1	MACRO-F1
overall	0.614	67.3%	46.3%
artist	0.614	70.0%	64.4%
character	0.614	83.4%	64.5%
copyright	0.614	86.6%	53.1%
general	0.614	66.4%	27.4%
meta	0.614	61.2%	19.2%
rating	0.614	83.1%	81.8%
year	0.614	30.8%	21.3%

The model performs exceptionally well on character identification (83.4% micro F1 across 26,968 tags), copyright/series detection (86.6% micro F1 across 5,364 tags), and content rating classification (83.1% micro F1 across 4 tags) at the micro-optimized threshold.
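
To make the micro/macro distinction in the tables above concrete, here is a minimal sketch of how the two averages are commonly computed for multi-label tagging. It is not the project's evaluation code: the function name `micro_macro_f1` and the toy tag sets are illustrative assumptions, and the macro average here only covers tags that occur in the toy data.

```python
from collections import Counter

def micro_macro_f1(predicted, actual):
    """Return (micro_f1, macro_f1) for parallel lists of predicted/actual tag sets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, true in zip(predicted, actual):
        for tag in pred & true:
            tp[tag] += 1
        for tag in pred - true:
            fp[tag] += 1
        for tag in true - pred:
            fn[tag] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all tags, so frequent tags dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average the per-tag F1, so every tag counts equally, however rare.
    tags = set(tp) | set(fp) | set(fn)
    macro = sum(f1(tp[t], fp[t], fn[t]) for t in tags) / len(tags) if tags else 0.0
    return micro, macro

# Toy data (hypothetical): the common tag is always right, the rare tag is always missed.
actual = [{"1girl"}, {"1girl"}, {"1girl"}, {"1girl", "rare_tag"}]
predicted = [{"1girl"}, {"1girl"}, {"1girl"}, {"1girl"}]
micro, macro = micro_macro_f1(predicted, actual)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the single missed rare tag barely moves the micro score but halves the macro score, which is why macro F1 is the more sensitive measure of rare-tag quality.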

Real-world Tag Accuracy

The macro optimized threshold is recommended as many "false positives" according to the benchmark are actually correct tags missing from the Danbooru dataset. The model frequently identifies appropriate tags that weren't included in the original tagging, making perceived accuracy higher than formal metrics suggest.
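
As an illustration of how a threshold profile turns probabilities into tags, the sketch below applies a threshold, with optional per-category overrides, to a list of predictions. The 0.492 and 0.614 values are the overall thresholds from the tables above; the `PROFILES` dictionary, the `select_tags` helper, the override mechanism and the sample probabilities are hypothetical and do not reflect the application's actual code.

```python
# Overall thresholds taken from the tables above; the structure is an assumption.
PROFILES = {"macro_optimized": 0.492, "micro_optimized": 0.614}

def select_tags(predictions, profile, overrides=None):
    """Keep (tag, category, prob) entries whose prob meets the threshold.

    `overrides` optionally maps a category name to its own threshold.
    """
    base = PROFILES[profile]
    overrides = overrides or {}
    kept = [
        (tag, category, prob)
        for tag, category, prob in predictions
        if prob >= overrides.get(category, base)
    ]
    # Highest-confidence tags first.
    return sorted(kept, key=lambda item: item[2], reverse=True)

# Made-up probabilities for demonstration only.
predictions = [
    ("1girl", "general", 0.97),
    ("outdoors", "general", 0.55),
    ("hatsune_miku", "character", 0.71),
    ("vocaloid", "copyright", 0.50),
]

macro_tags = [t for t, _, _ in select_tags(predictions, "macro_optimized")]
micro_tags = [t for t, _, _ in select_tags(predictions, "micro_optimized")]
print(macro_tags)
print(micro_tags)
```

The lower macro-optimized threshold admits more tags (higher recall), while the micro-optimized threshold keeps only the more confident ones (higher precision).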

🧠 Architecture Overview

Vision Transformer Backbone

Base Model: Vision Transformer (ViT) with patch-based image processing
Dual Output: Patch feature map + CLS token for comprehensive image understanding
Efficient Design: 86.4M backbone parameters, compared with 214M+ parameters in the previous model's classifier layers

Refined Prediction Pipeline

Feature Extraction: ViT processes image into patch tokens and global CLS token
Global Pooling: Combines mean-pooled patches with CLS token (dual-pool approach)
Initial Predictions: Shared weights between tag embeddings and classification layer
Candidate Selection: Top-K tag selection based on initial confidence
Cross-Attention: Tag embeddings attend to image patch features
Final Scoring: Refined predictions for selected candidate tags
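
The six steps above can be sketched in PyTorch roughly as follows. This is a simplified illustration, not the released model: the class name `RefinedTagger`, the layer names `pool_proj` and `score`, the way the CLS token and mean-pooled patches are fused (concatenation plus a linear layer), and all sizes are assumptions. The ViT backbone is omitted and replaced by random features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinedTagger(nn.Module):
    """Illustrative prediction head only; sizes and layer choices are assumptions."""

    def __init__(self, dim=64, num_tags=1000, top_k=32, heads=4):
        super().__init__()
        self.top_k = top_k
        self.tag_embed = nn.Embedding(num_tags, dim)  # shared with the classifier
        self.pool_proj = nn.Linear(2 * dim, dim)      # fuses CLS + mean-pooled patches
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, cls_token, patches):
        # cls_token: (B, D), patches: (B, N, D), both from a ViT backbone (not shown).
        pooled = self.pool_proj(torch.cat([cls_token, patches.mean(dim=1)], dim=-1))
        # Initial predictions: tag embeddings double as classifier weights.
        initial_logits = F.linear(pooled, self.tag_embed.weight)  # (B, num_tags)
        # Candidate selection: keep the top-K most confident tags.
        _, candidates = initial_logits.topk(self.top_k, dim=-1)   # (B, K)
        queries = self.tag_embed(candidates)                      # (B, K, D)
        # Cross-attention: candidate tag embeddings attend to patch features.
        attended, _ = self.cross_attn(queries, patches, patches)  # (B, K, D)
        # Final scoring: refined logits for the selected candidates only.
        refined_logits = self.score(attended).squeeze(-1)         # (B, K)
        return initial_logits, candidates, refined_logits

model = RefinedTagger()
cls_token = torch.randn(2, 64)     # stand-in for the backbone's CLS output
patches = torch.randn(2, 196, 64)  # stand-in for the backbone's patch tokens
initial, candidates, refined = model(cls_token, patches)
print(initial.shape, candidates.shape, refined.shape)
```

Restricting cross-attention to the top-K candidates keeps the refinement cost independent of the full tag vocabulary, which is one plausible reason for the candidate selection step.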

Key Improvements

Shared Weights: Tag embeddings directly used for initial classification
Simplified Pipeline: Single refined prediction stage (vs previous initial + refined)
Native PyTorch: Uses optimized MultiheadAttention instead of Flash Attention
Custom Embeddings: No dependency on external models like CLIP
Gradient Checkpointing: Memory-efficient training on consumer hardware

🛠️ Training Details

Multi-Resolution Training Strategy

The model was trained using an innovative multi-resolution approach:

Phase 1: 3 epochs at 384px resolution with learning rate 1e-4
Phase 2: IRFS (Instance-Aware Repeat Factor Sampling) - addresses long-tailed distribution imbalance
Phase 3: 512px resolution fine-tuning with learning rate 5e-5

Key Training Insights

Rapid Adaptation: Once the model learns good general features during initial pretraining, it adapts to resolution changes and distribution shifts very quickly - often within a fraction of an epoch rather than requiring full retraining.

IRFS Benefits: Instance-Aware Repeat Factor Sampling provided substantial macro F1 improvements by addressing the long-tailed distribution of anime tags, where instance counts vary dramatically between classes even with similar image counts.

Efficient Scaling: After a change in resolution or capacity, the ViT carries what it has learned over to the entire dataset, making incremental training highly efficient.
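
For readers unfamiliar with repeat factor sampling, the scheme IRFS builds on, here is a minimal sketch. It follows the usual recipe of oversampling images whose rarest tag falls below a frequency threshold. The instance-aware branch (geometric mean of image and instance frequency) reflects our reading of the IRFS formulation and is an assumption, as are the function name `repeat_factors`, the threshold value and the toy data. The training code actually used for this model is not shown in this document.

```python
import math
from collections import Counter

def repeat_factors(image_tags, threshold, instance_counts=None):
    """Per-image repeat factors in the style of repeat factor sampling.

    image_tags: list of tag lists, one per image.
    threshold: frequency below which a tag starts being oversampled.
    instance_counts: optional {tag: instance count} covering every tag; when
        given, image and instance frequencies are combined by geometric mean
        (assumed instance-aware variant).
    """
    num_images = len(image_tags)
    image_counts = Counter(tag for tags in image_tags for tag in set(tags))
    total_instances = sum(instance_counts.values()) if instance_counts else 0

    def frequency(tag):
        f_img = image_counts[tag] / num_images
        if not instance_counts:
            return f_img
        f_inst = instance_counts[tag] / total_instances
        return math.sqrt(f_img * f_inst)

    tag_factor = {
        tag: max(1.0, math.sqrt(threshold / frequency(tag))) for tag in image_counts
    }
    # An image is repeated as often as its rarest tag requires.
    return [max((tag_factor[t] for t in tags), default=1.0) for tags in image_tags]

# Toy dataset (hypothetical): only the third image carries a rare tag.
images = [["1girl", "solo"], ["1girl"], ["1girl", "rare_tag"], ["1girl", "solo"]]
factors = repeat_factors(images, threshold=0.5)
print(factors)
```

Fractional factors are commonly turned into whole repeats by stochastic rounding each epoch, so an image with a factor of about 1.4 appears once in some epochs and twice in others.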

Training Data:

Training subset: 2,000,000 images
Training duration: 3+ epochs with multi-resolution scaling
Final resolution: 512x512 pixels

🛠️ Requirements

Python 3.11.9 specifically (newer versions are incompatible)
PyTorch 1.10+
Streamlit
PIL/Pillow
NumPy
Flash Attention (note: does not work properly on Windows; it is only needed for the refined model, which I'm not actively supporting anyway)

🔧 Usage

Set up the application and game by running setup.bat, which installs the required virtual environment. Then launch them with run_app.bat and run_game.bat.

The application lets you:

Upload your own images or select from example images
Choose different threshold profiles
Adjust category-specific thresholds
View predictions organized by category
Filter and sort tags based on confidence

🎮 Tag Collector Game (Camie Collector)

Introducing a tagging game: a gamified approach to anime image tagging that helps you understand the performance and limits of the model. This was a shower thought gone too far! Lots of Project Moon references.

How to Play:

Upload an image
Scan for tags to discover them
Earn TagCoins for new discoveries
Spend TagCoins on upgrades to lower the threshold
Lower thresholds reveal rarer tags!
Collect sets of related tags for bonuses and reveal unique mosaics!
Visit the Library System to discover unique tags (discovery only; these are not collected)
Use collected tags to either inspire new searches or generate essence
Use Enkephalin to generate Tag Essences
Use the Tag Essence Generator to collect a tag and its related tags.

🖥️ Web Interface Guide

The interface is divided into three main sections:

Model Selection (Sidebar):
- Choose between Full Model, Initial-only Model or ONNX accelerated (initial only)
- View model information and memory usage
Image Upload (Left Panel):
- Upload your own images or select from examples
- View the selected image
Tagging Controls (Right Panel):
- Select threshold profile
- Adjust thresholds for precision-recall and micro/macro tradeoff
- Configure display options
- View predictions organized by category

Display Options:

Show all tags: Display all tags including those below threshold
Compact view: Hide progress bars for cleaner display
Minimum confidence: Filter out low-confidence predictions
Category selection: Choose which categories to include in the summary

🧠 Training Details

Dataset

The model was trained on a carefully filtered subset of the Danbooru 2024 dataset, which contains a vast collection of anime/manga illustrations with comprehensive tagging.

Filtering Process:

The dataset was filtered with the following constraints:

# Minimum tags per category required for each image
min_tag_counts = {
    'general': 25, 
    'character': 1, 
    'copyright': 1, 
    'artist': 0, 
    'meta': 0
}

# Minimum samples per tag required for tag to be included
min_tag_samples = {
    'general': 20, 
    'character': 40, 
    'copyright': 50, 
    'artist': 200, 
    'meta': 50
}

This filtering process:

First removed low-sample tags (tags with fewer occurrences than specified in min_tag_samples)
Then removed images with insufficient tags per category (as specified in min_tag_counts)
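
The two filtering steps can be sketched as follows. This is an illustrative reimplementation, not the author's preprocessing script: the function name `filter_dataset`, the per-image dictionary layout and the toy thresholds are assumptions chosen to keep the example small.

```python
from collections import Counter

def filter_dataset(images, min_tag_samples, min_tag_counts):
    """images: list of {category: [tags]} dicts. Returns the filtered list."""
    # Step 1: count how many images carry each tag, then drop low-sample tags.
    counts = Counter(
        (category, tag)
        for image in images
        for category, tags in image.items()
        for tag in set(tags)
    )
    pruned = [
        {
            category: [
                t for t in tags
                if counts[(category, t)] >= min_tag_samples.get(category, 0)
            ]
            for category, tags in image.items()
        }
        for image in images
    ]
    # Step 2: drop images left with too few tags in any category.
    return [
        image for image in pruned
        if all(len(image.get(c, [])) >= n for c, n in min_tag_counts.items())
    ]

# Toy thresholds so the example stays small; the real values are listed above.
toy_samples = {"general": 2, "character": 1}
toy_counts = {"general": 2, "character": 1}
images = [
    {"general": ["1girl", "solo", "smile"], "character": ["hatsune_miku"]},
    {"general": ["1girl", "solo"], "character": ["hatsune_miku"]},
    {"general": ["1girl", "hat"], "character": ["kagamine_rin"]},
]
kept = filter_dataset(images, toy_samples, toy_counts)
print(len(kept))
```

The order matters: because rare tags are pruned first, an image can fall below the per-category minimum only after losing them, as the third toy image does here.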

Training Data:

Starting dataset size: ~3,000,000 filtered images
Training subset: 2,000,000 images (due to storage and time constraints)
Training duration: 3.5 epochs

Preprocessing:

Images were preprocessed with minimal transformations:

Tensor normalization (scaled to 0-1 range)
ImageNet normalization (per-channel mean and standard deviation)
Resized while maintaining original aspect ratio
No additional augmentations were applied
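
A rough sketch of these preprocessing steps in plain Python is shown below. The mean and standard deviation are the standard ImageNet values. Scaling the longer side to the 512px target is an assumption, since the document does not say how the aspect-preserving resize is done or whether padding follows, and the helper names `fit_within` and `normalize_pixel` are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def fit_within(width, height, target=512):
    """Scale (width, height) so the longer side equals `target`, keeping aspect ratio."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to the 0-1 range, then apply ImageNet normalization."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

print(fit_within(1200, 800))
print(normalize_pixel((255, 255, 255)))
```

In practice these operations would be applied to whole image tensors with an image library rather than pixel by pixel; the per-pixel form is used here only to keep the arithmetic visible.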

Tag Categories:

The model recognizes tags across these categories:

General: Visual elements, concepts, clothing, etc. (30,841 tags)
Character: Individual characters appearing in the image (26,968 tags)
Copyright: Source material (anime, manga, game) (5,364 tags)
Artist: Creator of the artwork (7,007 tags)
Meta: Meta information about the image (323 tags)
Rating: Content rating (4 tags)
Year: Year of upload (20 tags)

All supported tags are stored in model/metadata.json, which maps tag IDs to their names and categories.

Training Notebooks

The repository includes the main training notebook:

camie-tagger-v2.ipynb:
- Main training notebook
- Dataset loading and preprocessing
- Model initialization
- Initial training loop with DeepSpeed integration
- Tag selection optimization
- Metric tracking and visualization

Training Monitor

The project includes a real-time training monitor accessible via browser at localhost:5000 during training.

Performance Tips:

⚠️ Important: For optimal training speed, keep VSCode minimized and the training monitor open in your browser. This can improve iteration speed by 3-5x due to how the Windows/WSL graphics stack handles window focus and CUDA kernel execution.

Monitor Features:

The training monitor provides three main views:

1. Overview Tab:

Training Progress: Real-time metrics including epoch, batch, speed, and time estimates
Loss Chart: Training and validation loss visualization
F1 Scores: Initial and refined F1 metrics for both training and validation

2. Predictions Tab:

Image Preview: Shows the current sample being analyzed
Prediction Controls: Toggle between initial and refined predictions
Tag Analysis:
- Color-coded tag results (correct, incorrect, missing)
- Confidence visualization with probability bars
- Category-based organization
- Filtering options for error analysis

3. Selection Analysis Tab:

Selection Metrics: Statistics on tag selection quality
- Ground truth recall
- Average probability for ground truth vs. non-ground truth tags
- Unique tags selected
Selection Graph: Trends in selection quality over time
Selected Tags Details: Detailed view of model-selected tags with confidence scores

The monitor provides invaluable insights into how the two-stage prediction model is performing, particularly how the tag selection process is working between the initial and refined prediction stages.
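
The selection metrics listed for the Selection Analysis tab could be computed along these lines. This is a guess at their definitions based on the names above, not the monitor's code: the function `selection_metrics`, its arguments and the sample values are hypothetical.

```python
def selection_metrics(selected, probabilities, ground_truth):
    """Summarize how well a candidate selection covers the ground truth.

    selected: tags chosen as candidates; probabilities: {tag: prob} for them.
    """
    selected_set, truth = set(selected), set(ground_truth)
    hits = selected_set & truth     # selected tags that are ground truth
    others = selected_set - truth   # selected tags that are not

    def mean(tags):
        return sum(probabilities[t] for t in tags) / len(tags) if tags else 0.0

    return {
        "ground_truth_recall": len(hits) / len(truth) if truth else 1.0,
        "avg_prob_ground_truth": mean(hits),
        "avg_prob_other": mean(others),
        "unique_selected": len(selected_set),
    }

# Made-up values for demonstration only.
metrics = selection_metrics(
    selected=["1girl", "solo", "hat"],
    probabilities={"1girl": 0.9, "solo": 0.7, "hat": 0.2},
    ground_truth=["1girl", "solo", "smile", "outdoors"],
)
print(metrics)
```

A low ground-truth recall at this stage would cap what the refined stage can achieve, since tags that are never selected as candidates cannot be rescored.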

Training Notes:

Training notebooks may require WSL and 32GB+ of RAM to handle the dataset
With more computational resources, the model could be trained longer on the full dataset

🙏 Acknowledgments

Claude Sonnet 3.5 and 4 for development assistance and architectural insights
Vision Transformer for the foundational architecture
Danbooru for the comprehensive tagged anime image dataset
p1atdev for the processed Danbooru 2024 dataset
IRFS paper for Instance-Aware Repeat Factor Sampling methodology
PyTorch team for optimized attention implementations and gradient checkpointing
The open-source ML community for foundational tools and methods