EffiSwinT: Efficient Deep Fake Detection using EfficientNet-Swin Transformer Hybrid Architecture
Abstract
This repository presents EffiSwinT, a novel hybrid architecture combining EfficientNet-B3 and Swin Transformer for robust deepfake detection. The model leverages the complementary strengths of both architectures: EfficientNet's efficient feature extraction and Swin Transformer's hierarchical representation learning capabilities.
Architecture
General DeepFake Architecture
Detailed Architecture to Detect Deepfake Images
Swin Transformer architecture. Layer normalization estimates the normalization statistics without introducing additional dependencies between training examples. Shifted-window multi-head self-attention (SW-MSA) takes the output of the window-based attention (W-MSA) block as its input and shifts the windows.
Region merging for boosting feature dimension. This layer groups each 2x2 neighborhood of four adjacent input patches and concatenates their features, which increases the feature dimension by a factor of 4; a linear layer then reduces it to 2 times the original dimension. The procedure is carried out three times, each time paired with Swin Transformer blocks. Because adjacent patches are merged, each step halves the spatial resolution of the token grid, so later blocks capture progressively more global information.
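The region-merging arithmetic can be traced with a few lines of Python. This is a minimal sketch of the shape bookkeeping only, not the model code: the function names are hypothetical, and the embedding width of 128 is the standard value for Swin-Base, assumed here because the text does not state it.

```python
def patch_merging_shape(height, width, channels):
    """Return (height, width, channels) after one region-merging step.

    Each 2x2 group of neighboring patches is concatenated (4 * channels),
    then a linear layer projects the result down to 2 * channels.
    """
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    concatenated = 4 * channels      # after concatenating the 2x2 neighbors
    projected = concatenated // 2    # after the linear reduction layer
    return height // 2, width // 2, projected


def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=128, merges=3):
    """List the token-grid shape at each stage, starting from patch embedding."""
    side = image_size // patch_size
    shapes = [(side, side, embed_dim)]
    for _ in range(merges):
        shapes.append(patch_merging_shape(*shapes[-1]))
    return shapes


if __name__ == "__main__":
    for stage, shape in enumerate(swin_stage_shapes(), start=1):
        print(f"stage {stage}: {shape}")
```

For a 224x224 input with 4x4 patches, the grid starts at 56x56 tokens and each merge halves the side length while doubling the channel width.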
Complete Block Diagram
The EffiSwinT architecture consists of three main components:
- EfficientNet-B3 Branch: Extracts local features efficiently
- Swin Transformer Branch: Captures global dependencies and hierarchical features
- Fusion Module: Combines features from both branches through concatenation and MLP layers
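The fusion step described above could look roughly like the following PyTorch sketch. It is an illustration under stated assumptions, not the repository's implementation: the class name `FusionHead`, the ReLU activation, the dropout rate, and the two-class output are hypothetical, and the feature widths 1536 and 1024 are the pooled output sizes commonly exposed by EfficientNet-B3 and Swin-Base. Random tensors stand in for the backbone outputs so the example runs without downloading weights.

```python
import torch
from torch import nn


class FusionHead(nn.Module):
    """Concatenate two pooled feature vectors and classify with a small MLP."""

    def __init__(self, cnn_dim=1536, swin_dim=1024, hidden_dim=512,
                 num_classes=2, dropout=0.3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cnn_dim + swin_dim, hidden_dim),  # 512-unit fusion layer
            nn.ReLU(),
            nn.Dropout(dropout),                        # assumed regularization
            nn.Linear(hidden_dim, num_classes),         # real vs. fake logits
        )

    def forward(self, cnn_features, swin_features):
        # Fuse the two branches along the feature dimension.
        fused = torch.cat([cnn_features, swin_features], dim=1)
        return self.mlp(fused)


if __name__ == "__main__":
    head = FusionHead().eval()
    cnn_features = torch.randn(4, 1536)   # stand-in for EfficientNet-B3 output
    swin_features = torch.randn(4, 1024)  # stand-in for Swin-Base output
    with torch.no_grad():
        logits = head(cnn_features, swin_features)
    print(logits.shape)
```

In a full model, the two inputs would come from the pretrained backbones; with timm, `timm.create_model(name, pretrained=True, num_classes=0)` returns a backbone that outputs pooled features instead of class logits.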
Technical Details
- Input Image Size: 224x224
- Backbone Models:
  - EfficientNet-B3 (pretrained)
  - Swin-Base-Patch4-Window7 (pretrained)
- Feature Fusion: Concatenation followed by MLP (512 units)
- Training Augmentations:
  - CutMix with α=1.0
  - Random Horizontal Flip
  - Normalization
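To make the CutMix setting concrete, the sketch below samples the mixing weight and the pasted rectangle for one batch, following the standard CutMix recipe. The function name and argument layout are hypothetical; only α=1.0 and the 224x224 input size come from the details above.

```python
import math
import random


def sample_cutmix_box(height, width, alpha=1.0, rng=random):
    """Sample a CutMix rectangle and the resulting label-mixing weight.

    Returns (top, left, bottom, right, lam), where lam is the fraction of the
    image that keeps its original pixels (and therefore its original label).
    """
    lam = rng.betavariate(alpha, alpha)   # alpha=1.0 gives Uniform(0, 1)
    cut_ratio = math.sqrt(1.0 - lam)      # box area is about (1 - lam) of the image
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)

    # Pick a random center, then clip the box to the image borders.
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)

    # Clipping can shrink the box, so recompute lam from the actual area.
    lam = 1.0 - (bottom - top) * (right - left) / (height * width)
    return top, left, bottom, right, lam


if __name__ == "__main__":
    print(sample_cutmix_box(224, 224, alpha=1.0))
```

Pixels inside the box are copied from a second, randomly paired image, and the training loss is weighted as `lam * loss(first label) + (1 - lam) * loss(second label)`.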
Results
The model achieves competitive results on the Hemg/deepfake-and-real-images dataset:
- Training Accuracy: 91.7%
- Validation Accuracy: 98.9%
Accuracy Plot
Loss Plot
Classification Report
Train & Validation Loss
Dataset
The Hemg/deepfake-and-real-images dataset is used for training and validation. It contains a balanced distribution of real and deepfake images.
Training Details
- Training Epochs: 5
- Batch Size: 32
- Optimizer: AdamW
- Learning Rate: 1e-4
- Scheduler: Cosine Annealing
- Augmentations: CutMix, Random Horizontal Flip, Normalization
The model was trained on a P100 GPU, and training takes around 10 hours.
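As an illustration of the cosine annealing schedule listed above, the closed-form learning rate can be computed directly. This sketch assumes the scheduler is stepped once per epoch over the 5 training epochs and decays to a minimum of 0; neither detail is stated in the settings, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate after `step` scheduler steps of cosine annealing."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in range(6):
        print(epoch, f"{cosine_annealing_lr(epoch, total_steps=5):.2e}")
```

The rate starts at the base value of 1e-4, passes half of it at the midpoint, and reaches the minimum at the final step.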
Implementation Details
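# Example usage (DeepfakeDetector and predict_image are assumed to be defined in this repository's code)
import torch
from PIL import Image

model = DeepfakeDetector()
# map_location lets the checkpoint load on CPU-only machines
model.load_state_dict(torch.load("effiswint_model.pt", map_location="cpu"))
model.eval()  # switch to inference mode before predicting
result, confidence = predict_image("path/to/image.jpg")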
Future Improvements
Data Diversity
- Incorporate multiple deepfake datasets
- Add more diverse real images
- Include different types of manipulations
Hyperparameter Optimization
- Learning rate scheduling strategies
- Batch size optimization
- CutMix probability tuning
- Architecture-specific parameters
Training Enhancements
- Increase training epochs (current: 5)
- Implement gradient accumulation
- Experiment with different optimizers
- Add more augmentation techniques
Model Robustness
- Test on cross-dataset scenarios
- Add adversarial training
- Implement ensemble methods
Dependencies
- PyTorch
- timm
- pytorch-lightning
- transformers
- datasets
- scikit-learn
- seaborn
Citation
@misc{mishra2024robust,
title = {A Robust Approach for Deepfake Detection Using SWIN Transformer},
author = {Mishra, Soumya and Mohapatra, Hitesh and Gourisaria, Mahendra},
year = {2024},
month = {07},
doi = {10.21203/rs.3.rs-4672886/v1}
}
@article{coccomini2021combining,
title={Combining EfficientNet and Vision Transformers for Video Deepfake Detection},
author={Coccomini, Davide and Bechini, Alessio and Bertini, Marco},
journal={arXiv preprint arXiv:2107.02612},
year={2021}
}
@mastersthesis{saha2024deepfake,
title = {Leveraging Ensemble Models for Enhanced Deepfake Detection},
author = {Saha, Shawna},
school = {University at Buffalo, The State University of New York},
year = {2024},
type = {Master's thesis},
url = {https://cse.buffalo.edu/tech-reports/2024-06.pdf}
}
License
MIT
Contact
Contact on [email protected]
Dataset used to train Saqib772/EffiSwinT: Hemg/deepfake-and-real-images
Evaluation results (self-reported, on the Deepfake and Real Images dataset)
- Test Accuracy: 98.9%
- F1 Score: 0.990
- Precision: 0.990
- Recall: 0.990
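For reference, the reported metric types relate to prediction counts as sketched below. The function name and the toy labels are purely illustrative and are not the model's outputs; which class is treated as positive and how scores are averaged across classes are not stated here, so a single positive class is assumed.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels for illustration only (1 = positive class, 0 = negative class).
    print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

scikit-learn, listed in the dependencies, provides the same quantities through `sklearn.metrics.classification_report`.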