AST-AMVD-SAD-v1

Description

A fine-tuned audio classification model for detecting AI-generated audio content.

Author

AnodHuang
Model Details

Model Description

  • Architecture: Based on the Audio Spectrogram Transformer (AST) architecture from MattyB95/AST-ASVspoof2019-Synthetic-Voice-Detection
  • Input: Audio waveforms converted to mel-spectrogram representations
  • Output: Four-class classification for audio authenticity detection

Intended Use

This model is designed to:

  • Detect AI-generated audio content
  • Identify different types of synthetic audio:
    • Class 0 (H): Real Human Audio
    • Class 1 (C): AI Cloned Audio
    • Class 2 (A): AI Generated Audio
    • Class 3 (Combined): Mixed Human/AI Audio
  • Primary use cases include:
    • Content authenticity verification
    • AI-generated content detection systems
    • Audio forensics applications

Training Data

  • Dataset: AMVD_AS Dataset
  • Data Composition:
    • Balanced samples across four categories
    • Contains both synthetic and genuine human audio samples

Training Procedure

Fine-tuning Parameters

  • Base Model: MattyB95/AST-ASVspoof2019-Synthetic-Voice-Detection
  • Initial Learning Rate: 4e-5 → 1e-5 (linear decay)
  • Total Training Steps: 25,000
  • Batch Size: 32
  • Warmup Steps: 5,000
  • Weight Decay: 0.01
  • Gradient Clip Norm: 1.0
  • Training Duration: ~4.5 hours (A100 GPU)
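One plausible reading of the schedule above (the card does not spell out the exact warmup shape, so this is an assumption): linear warmup from 0 to the 4e-5 peak over the first 5,000 steps, then linear decay to 1e-5 at step 25,000.

```python
# Assumed linear warmup + linear decay schedule built from the card's values.
PEAK_LR = 4e-5
FINAL_LR = 1e-5
WARMUP_STEPS = 5_000
TOTAL_STEPS = 25_000

def lr_at(step):
    """Learning rate at a given training step under the assumed schedule."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay from the peak down to the final learning rate.
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR + frac * (FINAL_LR - PEAK_LR)
```

At step 25,000 this yields 1e-5, matching the final learning rate reported in the evaluation section below.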

Evaluation

Validation Performance

  • Training Loss: 0.0874
  • Eval Loss: 0.07367
  • Eval Accuracy: 0.98109
  • Final Steps per Second: 2.566
  • Final Samples per Second: 10.264
  • Runtime at 25k Steps: 824.1802 s
  • Gradient Norm: 0.000075778
  • Final Learning Rate: 1e-5
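A quick sanity check on the throughput figures above: samples per second divided by steps per second gives the per-step batch during this run (likely the evaluation batch, since the training batch size is 32); this arithmetic is an inference from the reported numbers, not a figure stated in the card.

```python
# Reported throughput figures from the evaluation section.
samples_per_sec = 10.264
steps_per_sec = 2.566

# Samples processed per step ≈ per-step batch size.
batch_per_step = samples_per_sec / steps_per_sec
print(batch_per_step)  # ≈ 4 samples per step
```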
Model Format

  • Format: Safetensors
  • Model size: 86.2M params
  • Tensor type: F32
