license: mit
datasets:
  - Hemg/deepfake-and-real-images
language:
  - en
tags:
  - deepfake-detection
  - computer-vision
  - ensemble-learning
  - pytorch
  - vision-transformer
  - cnn
  - image-classification
  - swint-transformer
  - EffiSwinT
metrics:
  - accuracy
  - precision
  - recall
  - f1
model-index:
  - name: EffiSwinT-Deepfake-Detector
    results:
      - task:
          type: image-classification
          name: Deepfake Detection
        dataset:
          type: Hemg/deepfake-and-real-images
          name: Deepfake and Real Images Dataset
        metrics:
          - type: accuracy
            value: 98.9
            name: Test Accuracy
          - type: f1
            value: 0.99
            name: F1 Score
          - type: precision
            value: 0.99
            name: Precision
          - type: recall
            value: 0.99
            name: Recall
pipeline_tag: image-classification
library_name: pytorch

EffiSwinT: Efficient Deep Fake Detection using EfficientNet-Swin Transformer Hybrid Architecture

Abstract

This repository presents EffiSwinT, a novel hybrid architecture combining EfficientNet-B3 and Swin Transformer for robust deepfake detection. The model leverages the complementary strengths of both architectures: EfficientNet's efficient feature extraction and Swin Transformer's hierarchical representation learning capabilities.

Architecture

General DeepFake Architecture

[Figure: General DeepFake Architecture]

Detailed Architecture to Detect Deepfake Images

[Figure: Detailed Architecture to Detect Deepfake Images]

SWIN Transformer Architecture

[Figure: SWIN Transformer architecture]

SWIN Transformer: Layer Normalization estimates the normalization statistics without introducing any additional dependencies between training cases. Shifted-window multi-head self-attention (SW-MSA) takes the output of window multi-head self-attention (W-MSA) and applies attention over shifted windows.

[Figure: Region Merging for boosting feature dimension]

Region Merging: This layer combines each group of 4 adjacent patches into one, which increases the feature dimension by 4 times; a linear layer then reduces it to 2 times the original. The procedure is carried out three times, each time paired with SWIN Transformer blocks. Because adjacent patches are merged, each stage halves the spatial resolution and widens the receptive field, which lets the later stages capture global information.
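The region (patch) merging step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in this repository: the function name `patch_merging`, the random projection matrix, and the 56x56x128 input shape are assumptions chosen for the example, and the layer normalization that Swin applies before the linear projection is omitted for brevity.

```python
import numpy as np

def patch_merging(x, weight):
    """Merge each 2x2 group of neighbouring patches.

    x:      feature map of shape (H, W, C) with even H and W
    weight: projection matrix of shape (4 * C, 2 * C)
    returns an array of shape (H / 2, W / 2, 2 * C)
    """
    h, w, _ = x.shape
    assert h % 2 == 0 and w % 2 == 0, "H and W must be even"
    # Pick the four patches of every 2x2 neighbourhood.
    x0 = x[0::2, 0::2, :]  # top-left
    x1 = x[1::2, 0::2, :]  # bottom-left
    x2 = x[0::2, 1::2, :]  # top-right
    x3 = x[1::2, 1::2, :]  # bottom-right
    # Concatenating along channels multiplies the feature dimension by 4.
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)
    # The linear layer reduces 4C to 2C.
    return merged @ weight  # (H/2, W/2, 2C)

rng = np.random.default_rng(0)
features = rng.standard_normal((56, 56, 128))              # illustrative input
projection = rng.standard_normal((4 * 128, 2 * 128)) * 0.02  # illustrative weights
out = patch_merging(features, projection)
print(out.shape)  # (28, 28, 256)
```

The spatial size halves in each direction while the channel count doubles, which is why repeating the step builds a hierarchical representation.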

Complete Block Diagram

[Figure: Complete Block Diagram]

The EffiSwinT architecture consists of three main components:

  1. EfficientNet-B3 Branch: Extracts local features efficiently
  2. Swin Transformer Branch: Captures global dependencies and hierarchical features
  3. Fusion Module: Combines features from both branches through concatenation and MLP layers
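The three components listed above could be wired together roughly as follows. This is a hedged sketch rather than the repository's actual model class: `HybridFusionClassifier`, `tiny_backbone`, the ReLU activation, the dropout rate, and the two-class output are assumptions. Small stand-in backbones are used so the example runs without downloading pretrained weights.

```python
import torch
import torch.nn as nn

class HybridFusionClassifier(nn.Module):
    """Two feature branches, concatenation, then an MLP head (hypothetical name)."""

    def __init__(self, cnn_branch, transformer_branch, cnn_dim, transformer_dim,
                 hidden_dim=512, num_classes=2, dropout=0.3):
        super().__init__()
        self.cnn_branch = cnn_branch
        self.transformer_branch = transformer_branch
        self.head = nn.Sequential(
            nn.Linear(cnn_dim + transformer_dim, hidden_dim),  # 512-unit MLP layer
            nn.ReLU(inplace=True),                              # assumed activation
            nn.Dropout(dropout),                                # assumed regularization
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, images):
        local_feats = self.cnn_branch(images)           # (B, cnn_dim)
        global_feats = self.transformer_branch(images)  # (B, transformer_dim)
        fused = torch.cat([local_feats, global_feats], dim=1)
        return self.head(fused)

def tiny_backbone(out_dim):
    """Stand-in feature extractor that maps (B, 3, H, W) to (B, out_dim)."""
    return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, out_dim))

model = HybridFusionClassifier(tiny_backbone(1536), tiny_backbone(1024), 1536, 1024)
logits = model(torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 2])
```

With timm, the stand-ins could presumably be replaced by `timm.create_model("efficientnet_b3", pretrained=True, num_classes=0)` and `timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=0)`, which return pooled feature vectors of size 1536 and 1024 respectively.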

Technical Details

  • Input Image Size: 224x224
  • Backbone Models:
    • EfficientNet-B3 (pretrained)
    • Swin-Base-Patch4-Window7 (pretrained)
  • Feature Fusion: Concatenation followed by MLP (512 units)
  • Training Augmentations:
    • CutMix with α=1.0
    • Random Horizontal Flip
    • Normalization
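For readers unfamiliar with CutMix, the augmentation with α=1.0 can be sketched as below. This is an illustrative NumPy version, not the training code of this repository; the function name `cutmix` and its return convention are assumptions. A mixing weight is drawn from Beta(α, α), a box covering roughly that share of the image is pasted in from a shuffled copy of the batch, and the weight is then recomputed from the box that was actually pasted.

```python
import numpy as np

def cutmix(images, labels, alpha=1.0, rng=None):
    """Apply CutMix to a batch.

    images: array of shape (B, C, H, W); labels: int array of shape (B,)
    returns (mixed_images, labels_a, labels_b, lam)
    """
    if rng is None:
        rng = np.random.default_rng()
    batch, _, height, width = images.shape
    lam = rng.beta(alpha, alpha)   # share of the original image to keep
    perm = rng.permutation(batch)  # partner image for each sample
    # Box whose area is roughly (1 - lam) of the image, centred at a random point.
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = int(rng.integers(height)), int(rng.integers(width))
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    mixed = images.copy()
    mixed[:, :, top:bottom, left:right] = images[perm][:, :, top:bottom, left:right]
    # Clipping at the border changes the box area, so recompute lam from it.
    lam = 1.0 - (bottom - top) * (right - left) / (height * width)
    return mixed, labels, labels[perm], lam

rng = np.random.default_rng(0)
batch_images = rng.random((8, 3, 224, 224), dtype=np.float32)
batch_labels = rng.integers(0, 2, size=8)
mixed, labels_a, labels_b, lam = cutmix(batch_images, batch_labels, alpha=1.0, rng=rng)
```

The training loss is then typically the weighted sum `lam * loss(pred, labels_a) + (1 - lam) * loss(pred, labels_b)`.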

Results

Confusion Matrix

The model achieves competitive results on the Hemg/deepfake-and-real-images dataset:

  • Training Accuracy: 91.7%
  • Validation Accuracy: 98.9%
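The accuracy, precision, recall, and F1 figures reported for this model all derive from the four counts of a binary confusion matrix. The sketch below shows the arithmetic; the helper name `binary_metrics` is hypothetical, the counts are made up for illustration and are not the model's actual confusion matrix, and treating "fake" as the positive class is an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion-matrix counts.

    tp/fn: positive (assumed: fake) images classified correctly/incorrectly
    tn/fp: negative (assumed: real) images classified correctly/incorrectly
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```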

Accuracy Plot

[Figure: Accuracy Plot]

Loss Plot

[Figure: Loss Plot]

Classification Report

[Figure: Classification Report]

Train & Validation Loss

[Figure: Train & Validation Loss]

Dataset

The Hemg/deepfake-and-real-images dataset is used for training and validation. It contains a balanced distribution of real and deepfake images.


Training Details

  • Training Epochs: 5
  • Batch Size: 32
  • Optimizer: AdamW
  • Learning Rate: 1e-4
  • Scheduler: Cosine Annealing
  • Augmentations: CutMix, Random Horizontal Flip, Normalization

The model was trained on a P100 GPU; training took around 10 hours.
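The optimizer and scheduler settings listed above map onto PyTorch roughly as follows. This is a sketch under stated assumptions: the linear layer is a stand-in for the real model, the scheduler is stepped once per epoch with `T_max` equal to the number of epochs, and cross-entropy is assumed as the loss. None of these choices is confirmed by this README.

```python
import torch
import torch.nn as nn

EPOCHS = 5
BATCH_SIZE = 32
LEARNING_RATE = 1e-4

model = nn.Linear(16, 2)  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
# Assumption: one scheduler step per epoch, annealing over the whole run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

learning_rates = []
for epoch in range(EPOCHS):
    # One dummy batch per epoch keeps the example short.
    inputs = torch.randn(BATCH_SIZE, 16)
    targets = torch.randint(0, 2, (BATCH_SIZE,))
    loss = nn.functional.cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    learning_rates.append(scheduler.get_last_lr()[0])
    scheduler.step()

print(learning_rates)  # starts at 1e-4 and decays along a cosine curve
```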

Implementation Details

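# Example usage
# DeepfakeDetector and predict_image are assumed to be defined elsewhere in this repository.
import torch
from PIL import Image

model = DeepfakeDetector()
# Load the trained weights on CPU so the example also works without a GPU.
model.load_state_dict(torch.load("effiswint_model.pt", map_location="cpu"))
model.eval()  # switch dropout and batch norm to inference mode
result, confidence = predict_image("path/to/image.jpg")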
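The usage example relies on a `predict_image` helper that is not shown in this README. A possible shape for such a helper is sketched below. Everything here is an assumption: unlike the README's call, this version takes the model and a PIL image explicitly, it uses ImageNet normalization statistics, and the label order in `CLASS_NAMES` is hypothetical and must be checked against the training code.

```python
import numpy as np
import torch
from PIL import Image

# Assumptions: ImageNet statistics and this label order are not confirmed by the README.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)
CLASS_NAMES = ("Fake", "Real")

def preprocess(image):
    """Convert a PIL image into a normalized (1, 3, 224, 224) float tensor."""
    image = image.convert("RGB").resize((224, 224))
    pixels = np.asarray(image, dtype=np.float32) / 255.0
    pixels = (pixels - IMAGENET_MEAN) / IMAGENET_STD
    return torch.from_numpy(pixels).permute(2, 0, 1).unsqueeze(0).contiguous()

@torch.no_grad()
def predict_image(model, image):
    """Return (label, confidence) for a single PIL image."""
    model.eval()
    probabilities = torch.softmax(model(preprocess(image)), dim=1)[0]
    index = int(probabilities.argmax())
    return CLASS_NAMES[index], float(probabilities[index])

# Demo with a stand-in model and a synthetic image.
demo_model = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(3, 2)
)
label, confidence = predict_image(demo_model, Image.new("RGB", (64, 48), (120, 30, 200)))
print(label, round(confidence, 3))
```

For a file on disk, the image would be opened first with `Image.open("path/to/image.jpg")`.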

Future Improvements

  1. Data Diversity

    • Incorporate multiple deepfake datasets
    • Add more diverse real images
    • Include different types of manipulations
  2. Hyperparameter Optimization

    • Learning rate scheduling strategies
    • Batch size optimization
    • CutMix probability tuning
    • Architecture-specific parameters
  3. Training Enhancements

    • Increase training epochs (current: 5)
    • Implement gradient accumulation
    • Experiment with different optimizers
    • Add more augmentation techniques
  4. Model Robustness

    • Test on cross-dataset scenarios
    • Add adversarial training
    • Implement ensemble methods
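Of the training enhancements listed above, gradient accumulation is the simplest to illustrate. The sketch below is one common way to do it in PyTorch, not something this repository implements yet; the stand-in model, the micro-batch size, and `ACCUMULATION_STEPS` are assumptions. Gradients from several small batches are summed before each optimizer step, which emulates a larger batch on limited GPU memory.

```python
import torch
import torch.nn as nn

ACCUMULATION_STEPS = 4  # effective batch size = 4 x micro-batch size

model = nn.Linear(16, 2)  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
micro_batches = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

optimizer_steps = 0
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches, start=1):
    loss = nn.functional.cross_entropy(model(inputs), targets)
    # Scale the loss so the accumulated gradient averages over the micro-batches.
    (loss / ACCUMULATION_STEPS).backward()
    if step % ACCUMULATION_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)  # 2
```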

Dependencies

  • PyTorch
  • timm
  • pytorch-lightning
  • transformers
  • datasets
  • scikit-learn
  • seaborn

Citation

@misc{mishra2024robust,
  title  = {A Robust Approach for Deepfake Detection Using SWIN Transformer},
  author = {Mishra, Soumya and Mohapatra, Hitesh and Gourisaria, Mahendra},
  year   = {2024},
  month  = {07},
  doi    = {10.21203/rs.3.rs-4672886/v1}
}

@article{coccomini2021combining,
  title={Combining EfficientNet and Vision Transformers for Video Deepfake Detection},
  author={Coccomini, Davide and Bechini, Alessio and Bertini, Marco},
  journal={arXiv preprint arXiv:2107.02612},
  year={2021}
}

@mastersthesis{saha2024deepfake,
  title     = {Leveraging Ensemble Models for Enhanced Deepfake Detection},
  author    = {Saha, Shawna},
  school    = {University at Buffalo, The State University of New York},
  year      = {2024},
  type      = {Master's thesis},
  url       = {https://cse.buffalo.edu/tech-reports/2024-06.pdf}
}


License

MIT

Contact

Contact on [email protected]