metadata

license: mit
datasets:
  - Hemg/deepfake-and-real-images
language:
  - en
tags:
  - deepfake-detection
  - computer-vision
  - ensemble-learning
  - pytorch
  - vision-transformer
  - cnn
  - image-classification
  - swin-transformer
  - EffiSwinT
metrics:
  - accuracy
  - precision
  - recall
  - f1
model-index:
  - name: EffiSwinT-Deepfake-Detector
    results:
      - task:
          type: image-classification
          name: Deepfake Detection
        dataset:
          type: Hemg/deepfake-and-real-images
          name: Deepfake and Real Images Dataset
        metrics:
          - type: accuracy
            value: 98.9
            name: Test Accuracy
          - type: f1
            value: 0.99
            name: F1 Score
          - type: precision
            value: 0.99
            name: Precision
          - type: recall
            value: 0.99
            name: Recall
pipeline_tag: image-classification
library_name: pytorch

EffiSwinT: Efficient Deepfake Detection Using an EfficientNet-Swin Transformer Hybrid Architecture

Abstract

This repository presents EffiSwinT, a novel hybrid architecture combining EfficientNet-B3 and Swin Transformer for robust deepfake detection. The model leverages the complementary strengths of both architectures: EfficientNet's efficient feature extraction and Swin Transformer's hierarchical representation learning capabilities.

Architecture

General DeepFake Architecture

Detailed Architecture to Detect Deepfake Images

Swin Transformer architecture. Layer normalization estimates the normalization statistics without introducing additional dependencies between training examples. Shifted-window multi-head self-attention (SW-MSA) takes the output of window multi-head self-attention (W-MSA) and shifts the window partition so that information can flow between neighboring windows.

Region merging for boosting the feature dimension. This layer groups each 2x2 set of neighboring patches and concatenates their features, which raises the feature dimension by a factor of 4; a linear layer then reduces it to 2 times the original. The procedure is carried out three times, each time paired with Swin Transformer blocks. Each merge halves the spatial resolution and widens the receptive field, which helps the Swin Transformer capture global information.
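The region merging step can be illustrated with a short PyTorch sketch. This is not the code used in this repository; the class name `PatchMerging` and the channels-last tensor layout are assumptions that follow the common Swin Transformer formulation, in which four neighboring patches are concatenated (4C features) and a linear layer reduces them to 2C.

```python
import torch
import torch.nn as nn


class PatchMerging(nn.Module):
    """Illustrative Swin-style merge of each 2x2 group of neighboring patches."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        # Linear layer that reduces the concatenated 4C features to 2C.
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, channels); height and width must be even.
        top_left = x[:, 0::2, 0::2, :]
        bottom_left = x[:, 1::2, 0::2, :]
        top_right = x[:, 0::2, 1::2, :]
        bottom_right = x[:, 1::2, 1::2, :]
        # Concatenate the four neighbors: (B, H/2, W/2, 4C).
        merged = torch.cat([top_left, bottom_left, top_right, bottom_right], dim=-1)
        # Normalize, then project down: (B, H/2, W/2, 2C).
        return self.reduction(self.norm(merged))


if __name__ == "__main__":
    tokens = torch.randn(1, 56, 56, 128)
    print(PatchMerging(128)(tokens).shape)  # torch.Size([1, 28, 28, 256])
```

Each application halves the height and width of the token grid while doubling the channel count, which is why stacking it three times yields a hierarchical feature pyramid.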

Complete Block Diagram

The EffiSwinT architecture consists of three main components:

EfficientNet-B3 Branch: Extracts local features efficiently
Swin Transformer Branch: Captures global dependencies and hierarchical features
Fusion Module: Combines features from both branches through concatenation and MLP layers
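The three components can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual `DeepfakeDetector`: the class name `HybridFusionClassifier`, the helper `build_effiswint_sketch`, the ReLU and dropout in the head, and the two-class output are all hypothetical choices. The timm model names `efficientnet_b3` and `swin_base_patch4_window7_224` are assumed to correspond to the backbones this card lists.

```python
import torch
import torch.nn as nn


class HybridFusionClassifier(nn.Module):
    """Two feature branches whose pooled outputs are concatenated and fed to an MLP."""

    def __init__(self, cnn_branch, transformer_branch, cnn_dim, transformer_dim,
                 hidden_dim=512, num_classes=2, dropout=0.3):
        super().__init__()
        self.cnn_branch = cnn_branch
        self.transformer_branch = transformer_branch
        self.head = nn.Sequential(
            nn.Linear(cnn_dim + transformer_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),  # assumption: dropout rate is not given in this card
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        local_features = self.cnn_branch(images)           # (B, cnn_dim)
        global_features = self.transformer_branch(images)  # (B, transformer_dim)
        fused = torch.cat([local_features, global_features], dim=1)
        return self.head(fused)                            # (B, num_classes) logits


def build_effiswint_sketch(pretrained: bool = False) -> HybridFusionClassifier:
    """Assemble the sketch with timm backbones (requires timm to be installed)."""
    import timm  # imported lazily so the fusion module itself only needs torch

    # num_classes=0 makes timm return pooled features instead of class logits.
    cnn = timm.create_model("efficientnet_b3", pretrained=pretrained, num_classes=0)
    swin = timm.create_model("swin_base_patch4_window7_224", pretrained=pretrained,
                             num_classes=0)
    return HybridFusionClassifier(cnn, swin, cnn.num_features, swin.num_features)
```

Passing the branches in as arguments keeps the fusion logic independent of any particular backbone, so the head can be exercised with small stand-in modules before the full pretrained models are loaded.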

Technical Details

Input Image Size: 224x224
Backbone Models:
- EfficientNet-B3 (pretrained)
- Swin-Base-Patch4-Window7 (pretrained)
Feature Fusion: Concatenation followed by MLP (512 units)
Training Augmentations:
- CutMix with α=1.0
- Random Horizontal Flip
- Normalization
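For readers unfamiliar with CutMix, the sketch below shows one common way to implement it with α=1.0. It is an illustration only; the function names `cutmix` and `cutmix_loss` are hypothetical, and the repository may apply CutMix differently (for example with a probability below 1 or through a library helper).

```python
import torch
import torch.nn.functional as F


def cutmix(images: torch.Tensor, labels: torch.Tensor, alpha: float = 1.0):
    """Paste a random box from shuffled images; return both label sets and the mix weight."""
    batch_size, _, height, width = images.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(batch_size)

    # Box whose area is roughly (1 - lam) of the image, centered at a random point.
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy = torch.randint(height, (1,)).item()
    cx = torch.randint(width, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]

    # Recompute lam from the clipped box so it matches the pasted area exactly.
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (height * width)
    return mixed, labels, labels[perm], lam


def cutmix_loss(logits, labels_a, labels_b, lam):
    """Cross-entropy weighted by the share of each source image."""
    return lam * F.cross_entropy(logits, labels_a) + (1.0 - lam) * F.cross_entropy(logits, labels_b)
```

Because the training loss and accuracy are then computed on mixed images, training metrics measured this way are not directly comparable to metrics on clean validation images.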

Results

The model achieves competitive results on the Hemg/deepfake-and-real-images dataset:

Training Accuracy: 91.7%
Validation Accuracy: 98.9%

Accuracy Plot

Loss Plot

Classification Report

Train & Validation Loss

Dataset

The Hemg/deepfake-and-real-images dataset is used for training and validation. It contains a balanced distribution of real and deepfake images.

Training Details

Training Epochs: 5
Batch Size: 32
Optimizer: AdamW
Learning Rate: 1e-4
Scheduler: Cosine Annealing
Augmentations: CutMix, Random Horizontal Flip, Normalization
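The optimizer and scheduler settings above could be wired up roughly like this. The sketch assumes the scheduler is stepped once per epoch with `T_max` equal to the number of epochs and that AdamW uses its default weight decay; neither detail is stated in this card. The helper names and the small linear stand-in model are hypothetical.

```python
import torch
import torch.nn as nn


def build_optimizer_and_scheduler(model: nn.Module, epochs: int = 5, lr: float = 1e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # Assumption: one scheduler step per epoch, annealing towards zero.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler


def preview_learning_rates(epochs: int = 5):
    """Run a dummy loop and record the learning rate used in each epoch."""
    model = nn.Linear(8, 2)  # stand-in for the real detector
    optimizer, scheduler = build_optimizer_and_scheduler(model, epochs=epochs)
    criterion = nn.CrossEntropyLoss()
    rates = []
    for _ in range(epochs):
        rates.append(optimizer.param_groups[0]["lr"])
        inputs = torch.randn(32, 8)            # batch size 32, as listed above
        targets = torch.randint(0, 2, (32,))
        optimizer.zero_grad()
        criterion(model(inputs), targets).backward()
        optimizer.step()
        scheduler.step()                       # after optimizer.step(), once per epoch
    return rates


if __name__ == "__main__":
    for epoch, rate in enumerate(preview_learning_rates(), start=1):
        print(f"epoch {epoch}: lr = {rate:.2e}")
```

With only 5 epochs, the cosine curve drops the learning rate quickly, so stepping per batch instead of per epoch would give a smoother decay; which variant was used here is not documented.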

The model was trained on a P100 GPU; training takes around 10 hours.

Implementation Details

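# Example usage (DeepfakeDetector and predict_image are assumed to come from this repository's code)
import torch
from PIL import Image  # image loading used by predict_image

model = DeepfakeDetector()
# Load the trained weights; map_location lets this run on CPU-only machines
model.load_state_dict(torch.load("effiswint_model.pt", map_location="cpu"))
model.eval()  # switch dropout and batch norm to inference behavior
result, confidence = predict_image("path/to/image.jpg")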
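The helper `predict_image` is not shown in this card. A hypothetical version, here named `predict_image_sketch` and taking the model as an explicit argument, might look like the following. The ImageNet mean and standard deviation, the default resampling filter, and the class order `("Fake", "Real")` are all assumptions and should be checked against the training code.

```python
import numpy as np
import torch
from PIL import Image

# Assumption: ImageNet statistics, as commonly used with pretrained timm backbones.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image: Image.Image, size: int = 224) -> torch.Tensor:
    """Resize to size x size, scale to [0, 1], normalize, and return a (1, 3, H, W) tensor."""
    image = image.convert("RGB").resize((size, size))
    array = np.asarray(image, dtype=np.float32) / 255.0
    array = (array - IMAGENET_MEAN) / IMAGENET_STD
    return torch.from_numpy(array).permute(2, 0, 1).unsqueeze(0).contiguous()


@torch.no_grad()
def predict_image_sketch(model, image_path, class_names=("Fake", "Real")):
    """Return the predicted class name and its softmax probability."""
    model.eval()
    with Image.open(image_path) as image:
        batch = preprocess(image)
    probabilities = torch.softmax(model(batch), dim=1)[0]
    index = int(probabilities.argmax())
    return class_names[index], float(probabilities[index])
```

The returned confidence is a softmax probability, which is not necessarily well calibrated, so it is best read as a relative score rather than a true likelihood.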

Future Improvements

Data Diversity
- Incorporate multiple deepfake datasets
- Add more diverse real images
- Include different types of manipulations
Hyperparameter Optimization
- Learning rate scheduling strategies
- Batch size optimization
- CutMix probability tuning
- Architecture-specific parameters
Training Enhancements
- Increase training epochs (current: 5)
- Implement gradient accumulation
- Experiment with different optimizers
- Add more augmentation techniques
Model Robustness
- Test on cross-dataset scenarios
- Add adversarial training
- Implement ensemble methods

Dependencies

PyTorch
timm
pytorch-lightning
transformers
datasets
scikit-learn
seaborn

Citation

@misc{mishra2024robust,
  title  = {A Robust Approach for Deepfake Detection Using SWIN Transformer},
  author = {Mishra, Soumya and Mohapatra, Hitesh and Gourisaria, Mahendra},
  year   = {2024},
  month  = {07},
  doi    = {10.21203/rs.3.rs-4672886/v1}
}

@article{coccomini2021combining,
  title={Combining EfficientNet and Vision Transformers for Video Deepfake Detection},
  author={Coccomini, Davide Alessandro and Messina, Nicola and Gennaro, Claudio and Falchi, Fabrizio},
  journal={arXiv preprint arXiv:2107.02612},
  year={2021}
}

@mastersthesis{saha2024deepfake,
  title     = {Leveraging Ensemble Models for Enhanced Deepfake Detection},
  author    = {Saha, Shawna},
  school    = {University at Buffalo, The State University of New York},
  year      = {2024},
  type      = {Master's thesis},
  url       = {https://cse.buffalo.edu/tech-reports/2024-06.pdf}
}

License

MIT

Contact

Contact on [email protected]