Voice-Based Stress Recognition (StudentNet)

Model Card for forwarder1121/voice-based-stress-recognition

Model Details

Model architecture:
A lightweight MLP-based StudentNet distilled from a multimodal TeacherNet trained on the StressID dataset.

  • Inputs: 512-dim audio embedding

  • Embedding Spec:

    This model expects 512-dimensional embeddings generated with fairseq's Wav2Vec2 (base) model (see the Wav2Vec2 Embedding Notice below).

  • Layers (a PyTorch sketch follows the output description below):

    1. Linear(512→128) → ReLU → Dropout(0.3) → LayerNorm
    2. Dropout(0.3) → Linear(128→128) → ReLU → Dropout(0.3)
    3. Linear(128→2) → Softmax

Output:
Two-class stress probability:

  • index 0 → “not stressed”
  • index 1 → “stressed”
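
For orientation, the layer list above corresponds to roughly the following PyTorch sketch. Class and attribute names here are illustrative only; the authoritative implementation is models.py in this repository.

import torch
import torch.nn as nn

class StudentNetSketch(nn.Module):
    """Illustrative re-implementation of the layer list above (not the repository's models.py)."""

    def __init__(self, dim_w2v: int = 512, hidden: int = 128, num_classes: int = 2, p_drop: float = 0.3):
        super().__init__()
        # 1. Linear(512→128) → ReLU → Dropout(0.3) → LayerNorm
        self.block1 = nn.Sequential(
            nn.Linear(dim_w2v, hidden), nn.ReLU(), nn.Dropout(p_drop), nn.LayerNorm(hidden)
        )
        # 2. Dropout(0.3) → Linear(128→128) → ReLU → Dropout(0.3)
        self.block2 = nn.Sequential(
            nn.Dropout(p_drop), nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop)
        )
        # 3. Linear(128→2) → Softmax (applied in forward)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x_w2v: torch.Tensor) -> torch.Tensor:
        # x_w2v: (batch, 512) pre-computed Wav2Vec audio embedding
        h = self.block2(self.block1(x_w2v))
        probs = torch.softmax(self.classifier(h), dim=-1)
        return probs  # probs[:, 0] = "not stressed", probs[:, 1] = "stressed"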

Intended Use & Limitations

Intended use:

  • Real-time binary stress detection on edge devices or mobile apps using only audio input.
  • Lightweight inference where only pre-computed audio embeddings are available.

Limitations:

  • Not designed for multiclass stress intensity prediction.
  • Trained on StressID data — performance may degrade on other languages or recording setups.
  • Assumes clean audio and accurate W2V embeddings; high background noise may reduce accuracy.

Training Data

  • Dataset: StressID
  • Modalities collected: ECG, RR, EDA, face/video, voice
  • Labels: self-assessment on a 0–10 scale, converted to binary stress (0 if < 5, 1 if ≥ 5; see the snippet after this list)
  • Split:
    • Only the train split was used for Teacher training; the test split was held out for final evaluation
    • Ensured no subject’s tasks appeared in more than one split
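
A minimal sketch of that label binarization (the function name is illustrative; the StressID preprocessing code is not part of this repository):

def binarize_stress_label(self_assessment: float) -> int:
    # StressID self-assessment ranges from 0 to 10: scores below 5 map to 0 ("not stressed"),
    # scores of 5 or higher map to 1 ("stressed").
    return 0 if self_assessment < 5 else 1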

Training Procedure

  1. TeacherNet trained on all four modalities (ECG, RR, EDA, Video) with CrossEntropyLoss.
  2. StudentNet trained on audio embeddings with a distillation loss (see the sketch after this list):
    loss = CE(student_logits, labels) + α * MSE(student_features, teacher_features)
    
  • α ∈ {0, 1e−7, 1e−6}, best performance at α = 1e−6
  • Optimizer: AdamW, lr=1e−4, batch_size=8, epochs=100, early stopping patience=100
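
A minimal PyTorch sketch of that objective, assuming the student and teacher expose comparable feature vectors (tensor shapes and the helper name are assumptions; the actual training script is not included in this repository):

import torch.nn.functional as F

def distillation_loss(student_logits, student_features, teacher_features, labels, alpha=1e-6):
    # Hard-label cross-entropy on the student's predictions ...
    ce = F.cross_entropy(student_logits, labels)
    # ... plus a feature-matching term that pulls the student's intermediate
    # features toward the (frozen) teacher's features, weighted by alpha.
    mse = F.mse_loss(student_features, teacher_features.detach())
    return ce + alpha * mse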

Evaluation

  • TeacherNet (multimodal): Accuracy ≈ 0.82, Macro-F1 ≈ 0.80, UAR ≈ 0.79
  • StudentNet (α = 0): Accuracy ≈ 0.65, Macro-F1 ≈ 0.62, UAR ≈ 0.61
  • StudentNet (α = 1e−6): Accuracy ≈ 0.76, Macro-F1 ≈ 0.74, UAR ≈ 0.73
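
For reference, UAR denotes unweighted average recall, i.e. recall macro-averaged over the two classes. A minimal sketch of how these metrics can be computed with scikit-learn (the label arrays are illustrative only, not the reported results):

from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0, 1, 1, 0, 1]   # illustrative ground-truth labels
y_pred = [0, 1, 0, 0, 1]   # illustrative predictions (argmax of the output probabilities)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
uar = recall_score(y_true, y_pred, average="macro")  # unweighted average recall
print(f"Accuracy={accuracy:.2f}, Macro-F1={macro_f1:.2f}, UAR={uar:.2f}")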

⚡️ Wav2Vec2 Embedding Notice

  • Audio input for this model should be converted to a 512-dimensional embedding using fairseq's Wav2Vec2 (base) model (torchaudio.pipelines.WAV2VEC2_BASE).
  • The exact model weights used for embedding extraction during training are provided as wav2vec_large.pt in the root directory of this repository.
  • To use this model for inference on raw audio (a sketch follows these steps):
    1. Load wav2vec_large.pt with torchaudio/fairseq,
    2. Generate the 512-dim audio embedding for your input audio,
    3. Pass this embedding to StudentNet.
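
A minimal sketch of steps 1–2, using the fairseq checkpoint wav2vec_large.pt referenced in step 1. Mean-pooling the frame-level features over time into a single (1, 512) vector is an assumption; the exact pooling used during training is not specified here.

import torch
import fairseq

# 1) Load wav2vec_large.pt with fairseq
cp_path = "wav2vec_large.pt"
w2v_models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp_path])
w2v = w2v_models[0]
w2v.eval()

# 2) Generate a 512-dim embedding from a 16 kHz mono waveform of shape (1, num_samples)
wav_input_16khz = torch.randn(1, 16000)  # replace with your own audio
with torch.no_grad():
    z = w2v.feature_extractor(wav_input_16khz)   # (1, 512, T) frame-level features
    c = w2v.feature_aggregator(z)                # (1, 512, T) context features
    x_w2v = c.mean(dim=-1)                       # (1, 512) mean-pooled embedding (pooling is an assumption)

# 3) Pass x_w2v to StudentNet (see "How to Use" below)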

How to Use

Below is a self-contained example that dynamically downloads both the model code (models.py) and the weights from the Hub, then runs inference via the Hugging Face Transformers API, all in one script:

from huggingface_hub import hf_hub_download
import importlib.util
from transformers import AutoConfig, AutoModelForAudioClassification
import torch
import torch.nn.functional as F

def main():
    repo = "forwarder1121/voice-based-stress-recognition"

    # 1) Dynamically download & load the custom models.py
    code_path = hf_hub_download(repo_id=repo, filename="models.py")
    spec = importlib.util.spec_from_file_location("models", code_path)
    models = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(models)
    # now we have models.StudentForAudioClassification and models.StressConfig

    # 2) Load config & model via Transformers (with remote code trust)
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForAudioClassification.from_pretrained(
        repo,
        trust_remote_code=True,
        torch_dtype="auto"
    )
    model.eval()

    # 3) Prepare a dummy W2V embedding for testing
    #    In real use, replace this with your (1, 512) pre-computed W2V tensor.
    batch_size = 1
    DIM_W2V = 512
    x_w2v = torch.randn(batch_size, DIM_W2V, dtype=next(model.parameters()).dtype)

    # 4) Inference
    with torch.no_grad():
        outputs = model(x_w2v)                # SequenceClassifierOutput
        probs   = F.softmax(outputs.logits, dim=-1)

    print(f"Not stressed: {probs[0,0]*100:.1f}%")
    print(f"Stressed    : {probs[0,1]*100:.1f}%")

if __name__ == "__main__":
    main()

Citation

If you use this model in your research, please cite:

@inproceedings{your2025voice,
  title={Lightweight Audio-Embedding-Based Stress Recognition via Multimodal Knowledge Distillation},
  author={Your Name and …},
  booktitle={Conference/Journal},
  year={2025}
}

Contact: [email protected]. Feel free to open an issue or discussion for questions!

