Voice-Based Stress Recognition (StudentNet)

Model Card for forwarder1121/voice-based-stress-recognition

Model Details

Model architecture:
A lightweight MLP-based StudentNet distilled from a multimodal TeacherNet trained on the StressID dataset.

  • Inputs: 512-dim audio embedding

  • Embedding Spec:

    This model expects 512-dimensional embeddings generated with fairseq's Wav2Vec2 (base) model (see the Wav2Vec2 Embedding Notice below).

  • Layers (a PyTorch sketch follows the output description below):

    1. Linear(512→128) → ReLU → Dropout(0.3) → LayerNorm
    2. Dropout(0.3) → Linear(128→128) → ReLU → Dropout(0.3)
    3. Linear(128→2) → Softmax

Output:
Two-class stress probability:

  • index 0 → “not stressed”
  • index 1 → “stressed”
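
For orientation, the layer list above corresponds to roughly the following PyTorch sketch. Class and attribute names here are illustrative only; the authoritative implementation is models.py in this repository.

import torch
import torch.nn as nn

class StudentNetSketch(nn.Module):
    """Illustrative re-implementation of the layer list above (not the repository's models.py)."""

    def __init__(self, dim_w2v: int = 512, hidden: int = 128, num_classes: int = 2, p_drop: float = 0.3):
        super().__init__()
        # 1. Linear(512→128) → ReLU → Dropout(0.3) → LayerNorm
        self.block1 = nn.Sequential(
            nn.Linear(dim_w2v, hidden), nn.ReLU(), nn.Dropout(p_drop), nn.LayerNorm(hidden)
        )
        # 2. Dropout(0.3) → Linear(128→128) → ReLU → Dropout(0.3)
        self.block2 = nn.Sequential(
            nn.Dropout(p_drop), nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop)
        )
        # 3. Linear(128→2) → Softmax (applied in forward)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x_w2v: torch.Tensor) -> torch.Tensor:
        # x_w2v: (batch, 512) pre-computed Wav2Vec audio embedding
        h = self.block2(self.block1(x_w2v))
        probs = torch.softmax(self.classifier(h), dim=-1)
        return probs  # probs[:, 0] = "not stressed", probs[:, 1] = "stressed"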

Intended Use & Limitations

Intended use:

  • Real-time binary stress detection on edge devices or mobile apps using only audio input.
  • Lightweight inference where only pre-computed audio embeddings are available.

Limitations:

  • Not designed for multiclass stress intensity prediction.
  • Trained on StressID data — performance may degrade on other languages or recording setups.
  • Assumes clean audio and accurate W2V embeddings; high background noise may reduce accuracy.

Training Data

  • Dataset: StressID
  • Modalities collected: ECG, RR, EDA, face/video, voice
  • Labels: self-assessment on a 0–10 scale, converted to binary stress (0 if < 5, 1 if ≥ 5; see the snippet after this list)
  • Split:
    • Only the train split was used for Teacher training; the test split was held out for final evaluation
    • Ensured no subject’s tasks appeared in more than one split
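
A minimal sketch of that label binarization (the function name is illustrative; the StressID preprocessing code is not part of this repository):

def binarize_stress_label(self_assessment: float) -> int:
    # StressID self-assessment ranges from 0 to 10: scores below 5 map to 0 ("not stressed"),
    # scores of 5 or higher map to 1 ("stressed").
    return 0 if self_assessment < 5 else 1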

Training Procedure

  1. TeacherNet trained on all four modalities (ECG, RR, EDA, Video) with CrossEntropyLoss.
  2. StudentNet trained on audio embeddings with a distillation loss (see the sketch after this list):
    loss = CE(student_logits, labels) + α * MSE(student_features, teacher_features)
    
  • α ∈ {0, 1e−7, 1e−6}, best performance at α = 1e−6
  • Optimizer: AdamW, lr=1e−4, batch_size=8, epochs=100, early stopping patience=100
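
A minimal PyTorch sketch of that objective, assuming the student and teacher expose comparable feature vectors (tensor shapes and the helper name are assumptions; the actual training script is not included in this repository):

import torch.nn.functional as F

def distillation_loss(student_logits, student_features, teacher_features, labels, alpha=1e-6):
    # Hard-label cross-entropy on the student's predictions ...
    ce = F.cross_entropy(student_logits, labels)
    # ... plus a feature-matching term that pulls the student's intermediate
    # features toward the (frozen) teacher's features, weighted by alpha.
    mse = F.mse_loss(student_features, teacher_features.detach())
    return ce + alpha * mse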

Evaluation

  • TeacherNet (multimodal): Accuracy ≈ 0.82, Macro-F1 ≈ 0.80, UAR ≈ 0.79
  • StudentNet (α = 0): Accuracy ≈ 0.65, Macro-F1 ≈ 0.62, UAR ≈ 0.61
  • StudentNet (α = 1e−6): Accuracy ≈ 0.76, Macro-F1 ≈ 0.74, UAR ≈ 0.73
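
For reference, UAR denotes unweighted average recall, i.e. recall macro-averaged over the two classes. A minimal sketch of how these metrics can be computed with scikit-learn (the label arrays are illustrative only, not the reported results):

from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0, 1, 1, 0, 1]   # illustrative ground-truth labels
y_pred = [0, 1, 0, 0, 1]   # illustrative predictions (argmax of the output probabilities)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
uar = recall_score(y_true, y_pred, average="macro")  # unweighted average recall
print(f"Accuracy={accuracy:.2f}, Macro-F1={macro_f1:.2f}, UAR={uar:.2f}")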

⚡️ Wav2Vec2 Embedding Notice

  • Audio input for this model should be converted to a 512-dimensional embedding using fairseq's Wav2Vec2 (base) model (torchaudio.pipelines.WAV2VEC2_BASE).
  • The exact model weights used for embedding extraction during training are provided as wav2vec_large.pt in the root directory of this repository.
  • To use this model for inference on raw audio (a sketch follows these steps):
    1. Load wav2vec_large.pt with torchaudio/fairseq,
    2. Generate the 512-dim audio embedding for your input audio,
    3. Pass this embedding to StudentNet.
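
A minimal sketch of steps 1–2, using the fairseq checkpoint wav2vec_large.pt referenced in step 1. Mean-pooling the frame-level features over time into a single (1, 512) vector is an assumption; the exact pooling used during training is not specified here.

import torch
import fairseq

# 1) Load wav2vec_large.pt with fairseq
cp_path = "wav2vec_large.pt"
w2v_models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp_path])
w2v = w2v_models[0]
w2v.eval()

# 2) Generate a 512-dim embedding from a 16 kHz mono waveform of shape (1, num_samples)
wav_input_16khz = torch.randn(1, 16000)  # replace with your own audio
with torch.no_grad():
    z = w2v.feature_extractor(wav_input_16khz)   # (1, 512, T) frame-level features
    c = w2v.feature_aggregator(z)                # (1, 512, T) context features
    x_w2v = c.mean(dim=-1)                       # (1, 512) mean-pooled embedding (pooling is an assumption)

# 3) Pass x_w2v to StudentNet (see "How to Use" below)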

How to Use

Below is a self-contained example that dynamically downloads both the model code (models.py) and the weights from the Hub, then runs inference via the Hugging Face Transformers API, all in one script:

from huggingface_hub import hf_hub_download
import importlib.util
from transformers import AutoConfig, AutoModelForAudioClassification
import torch
import torch.nn.functional as F

def main():
    repo = "forwarder1121/voice-based-stress-recognition"

    # 1) Dynamically download & load the custom models.py
    code_path = hf_hub_download(repo_id=repo, filename="models.py")
    spec = importlib.util.spec_from_file_location("models", code_path)
    models = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(models)
    # now we have models.StudentForAudioClassification and models.StressConfig

    # 2) Load config & model via Transformers (with remote code trust)
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForAudioClassification.from_pretrained(
        repo,
        trust_remote_code=True,
        torch_dtype="auto"
    )
    model.eval()

    # 3) Prepare a dummy W2V embedding for testing
    #    In real use, replace this with your (1, 512) pre-computed W2V tensor.
    batch_size = 1
    DIM_W2V = 512
    x_w2v = torch.randn(batch_size, DIM_W2V, dtype=next(model.parameters()).dtype)

    # 4) Inference
    with torch.no_grad():
        outputs = model(x_w2v)                # SequenceClassifierOutput
        probs   = F.softmax(outputs.logits, dim=-1)

    print(f"Not stressed: {probs[0,0]*100:.1f}%")
    print(f"Stressed    : {probs[0,1]*100:.1f}%")

if __name__ == "__main__":
    main()

Citation

If you use this model in your research, please cite:

@inproceedings{your2025voice,
  title={Lightweight Audio-Embedding-Based Stress Recognition via Multimodal Knowledge Distillation},
  author={Your Name and …},
  booktitle={Conference/Journal},
  year={2025}
}

Contact: [email protected]. Feel free to open an issue or discussion for questions!

