# FaceGuard – ViT (20 CelebA IDs)
A Vision Transformer (ViT-Base) fine-tuned for identity classification on a 20-identity subset of the CelebA dataset.
This model predicts anonymized `celeb_id` integers (not celebrity names).
It powers the demo Space: https://huggingface.co/spaces/hudaakram/FaceGuard-demo
## Model Details
### Model Description
- Architecture: `google/vit-base-patch16-224` (pretrained on ImageNet-21k, fine-tuned on ImageNet-1k)
- Fine-tuned for: 20-class identity classification (CelebA `celeb_id`s)
- Input: RGB image (face crop), resized and normalized to 224×224
- Output: Probability distribution over 20 anonymized IDs
- Parameters: ~86M
### Sources
- Base model: https://huggingface.co/google/vit-base-patch16-224
- Demo Space: https://huggingface.co/spaces/hudaakram/FaceGuard-demo
- Dataset: CelebA (community mirror on the Hub)
## Uses
### Direct Use
- Research demo for identity classification with anonymized CelebA IDs
- Educational example of fine-tuning ViT for image classification
### Downstream Use
- As a starting point for transfer learning to other small identity classification tasks (see the sketch after this list)
- As an educational reference for hackathons, workshops, or courses
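A minimal transfer-learning sketch, assuming a hypothetical downstream task with 10 new identities (the label count and the downstream dataset are placeholders, not part of this model):

```python
from transformers import AutoImageProcessor, ViTForImageClassification

model_id = "hudaakram/FaceGuard-20ID-ViT"
processor = AutoImageProcessor.from_pretrained(model_id)

# Reload the checkpoint with a fresh classification head sized for the new task.
# num_labels=10 is a placeholder; ignore_mismatched_sizes=True discards the
# 20-class head and initializes a new one to be fine-tuned on the new identities.
model = ViTForImageClassification.from_pretrained(
    model_id,
    num_labels=10,
    ignore_mismatched_sizes=True,
)
```

From there, fine-tuning proceeds as in the Training Details section below.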
### Out-of-Scope Use
- ❌ Production face recognition / surveillance
- ❌ Identifying real celebrity names (dataset only provides integer IDs)
- ❌ Any high-stakes application involving privacy or personal data
## Bias, Risks, and Limitations
- Bias: CelebA contains celebrity faces, which are not representative of all demographics.
- Limitations: Trained on only 20 identities (~600 images total) → limited generalization.
- Privacy: CelebA IDs are anonymized integers, not real names. The model is not capable of returning actual identities.
Recommendation: Use strictly for research/educational purposes.
## How to Get Started
Use the code below to get started with the model.
```python
from transformers import ViTForImageClassification, AutoImageProcessor
from PIL import Image
import torch

model_id = "hudaakram/FaceGuard-20ID-ViT"
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

# Load a face crop and preprocess it to the 224x224 input the model expects
img = Image.open("face.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

# Map label indices back to the anonymized CelebA celeb_ids
id2label = {int(k): v for k, v in model.config.id2label.items()}

top5 = probs.topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"Label {idx.item()} (celeb_id {id2label[idx.item()]}): {score.item():.3f}")
```
## Training Details
### Training Data
- Dataset: CelebA (top 20 identities by frequency)
- Splits: Stratified 80% train / 10% validation / 10% test (a split sketch follows this list)
- Sizes: Train 501, Val 60, Test 77
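A minimal sketch of the stratified split, assuming a pandas DataFrame `df` with one row per image and a `celeb_id` column (the DataFrame and column names are illustrative, not taken from the released training code):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the images, stratified by identity, then split that 20%
# evenly into validation and test so every ID appears in every split.
# Seed 42 is reused from the training procedure for illustration.
train_df, rest_df = train_test_split(
    df, test_size=0.20, stratify=df["celeb_id"], random_state=42
)
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["celeb_id"], random_state=42
)
```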
### Training Procedure
- Seed: 42
- Epochs: 4
- Batch size: 16
- Learning rate: 5e-5
- Optimizer: AdamW
- Weight decay: 0.01
- Precision: FP16 on GPU (Colab)
- Head resized: from 1000 classes → 20 classes (a fine-tuning sketch follows this list)
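A minimal sketch of the fine-tuning setup with the hyperparameters listed above (the output directory and the `train_dataset` / `val_dataset` objects are placeholders; this is not the exact training script):

```python
from transformers import (
    ViTForImageClassification,
    TrainingArguments,
    Trainer,
    set_seed,
)

set_seed(42)

# Start from the ImageNet checkpoint and replace the 1000-class head with a 20-class head.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=20,
    ignore_mismatched_sizes=True,
)

args = TrainingArguments(
    output_dir="faceguard-vit",        # placeholder output path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,                 # AdamW is the default optimizer
    fp16=True,                         # mixed precision on the Colab T4 GPU
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,       # preprocessed splits, assumed defined elsewhere
    eval_dataset=val_dataset,
)
trainer.train()
```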
#### Preprocessing
- Images resized + center-cropped to 224×224
- Normalized to ImageNet mean/std
- Labels mapped from CelebA `celeb_id` → contiguous 0–19 (a preprocessing sketch follows this list)
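A minimal sketch of the preprocessing and label mapping, assuming the split DataFrames from the earlier sketch (the `image_path` column and the helper function are illustrative; the image processor carries the resize and ImageNet normalization settings from the base checkpoint):

```python
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Map the 20 raw CelebA celeb_ids to contiguous labels 0-19.
selected_ids = sorted(train_df["celeb_id"].unique())
celeb_id_to_label = {cid: i for i, cid in enumerate(selected_ids)}

def encode_example(row):
    # The processor resizes the crop to 224x224 and normalizes with ImageNet mean/std.
    image = Image.open(row["image_path"]).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"][0]
    return {"pixel_values": pixel_values, "label": celeb_id_to_label[row["celeb_id"]]}
```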
#### Training Hyperparameters
- Training regime: fp16 mixed precision on GPU
- Total epochs: 4 (~3 minutes each on Colab T4)
#### Speeds, Sizes, Times
- Checkpoint size: ~343 MB
- Throughput: ~10 samples/sec (Colab T4)
## Evaluation
- Validation Accuracy: ~0.93
- Test Accuracy: ~0.83 (an evaluation sketch follows this list)
- Macro AUC: (see ROC below)
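A minimal sketch for reproducing the test accuracy, assuming the fine-tuned `model` and a `test_loader` DataLoader that yields batches with `pixel_values` and `labels` tensors (both assumed defined elsewhere):

```python
import torch

model.eval()
correct = total = 0
with torch.no_grad():
    for batch in test_loader:
        logits = model(pixel_values=batch["pixel_values"]).logits
        preds = logits.argmax(dim=-1)
        correct += (preds == batch["labels"]).sum().item()
        total += batch["labels"].numel()
print(f"Test accuracy: {correct / total:.3f}")
```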
### Split Summary
| Split | #Images | #Classes | Min/Class | Median/Class | Max/Class |
|---|---|---|---|---|---|
| Train | 501 | 20 | 24 | 24 | 28 |
| Val | 60 | 20 | 3 | 3 | 3 |
| Test | 77 | 20 | 3 | 4 | 4 |
### Results
Confusion Matrix (normalized):
## Environmental Impact
- Hardware: Google Colab T4 GPU
- Training time: ~12 minutes total (4 epochs)
- Carbon emissions: negligible (short fine-tuning run)
## Technical Specifications
### Model Architecture and Objective
- Vision Transformer (ViT-Base, patch16, 224×224)
- Objective: Cross-entropy classification across 20 labels (the loss is written out below)
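For reference, the objective is the standard multi-class cross-entropy over the 20 labels, where $p_\theta(y_i \mid x_i)$ is the softmax probability the classification head assigns to the true identity $y_i$ of image $x_i$:

$$
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(y_i \mid x_i\right)
$$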
### Compute Infrastructure
- Hardware: Google Colab T4 GPU
- Framework: PyTorch + Hugging Face Transformers
## Citation
CelebA Dataset:
Z. Liu, P. Luo, X. Wang, and X. Tang. Deep Learning Face Attributes in the Wild. ICCV 2015.
ViT:
A. Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
## Model Card Authors
Hackathon submission by Huda Akram
## Contact
- Hugging Face profile: https://huggingface.co/hudaakram