# Voice-Based Stress Recognition (StudentNet)

Model Card for forwarder1121/voice-based-stress-recognition

## Model Details
- Model name: Voice-Based Stress Recognition (StudentNet)
- Repository: https://huggingface.co/forwarder1121/voice-based-stress-recognition
- License: MIT
- Library version: PyTorch ≥1.7
- Model architecture: a lightweight MLP-based StudentNet distilled from a multimodal TeacherNet trained on the StressID dataset.
- Inputs: 512-dim audio embedding
- Embedding spec: this model expects 512-dimensional embeddings generated by fairseq’s Wav2Vec2 (base) model; see the Wav2Vec2 Embedding Notice below.
Layers:
- Linear(512→128) → ReLU → Dropout(0.3) → LayerNorm
- Dropout(0.3) → Linear(128→128) → ReLU → Dropout(0.3)
- Linear(128→2) → Softmax
Output:
Two-class stress probability:
- index 0 → “not stressed”
- index 1 → “stressed”
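For orientation, the layer stack above corresponds roughly to the following PyTorch module. This is a minimal sketch under the dimensions stated in this card, not the released implementation; class and attribute names here are illustrative only.

```python
import torch
import torch.nn as nn


class StudentNetSketch(nn.Module):
    """Minimal sketch of the described MLP StudentNet (illustrative, not the released code)."""

    def __init__(self, dim_w2v: int = 512, hidden: int = 128, num_classes: int = 2, p: float = 0.3):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Linear(dim_w2v, hidden), nn.ReLU(), nn.Dropout(p), nn.LayerNorm(hidden)
        )
        self.block2 = nn.Sequential(
            nn.Dropout(p), nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p)
        )
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 512) pre-computed wav2vec embedding
        h = self.block2(self.block1(x))
        return torch.softmax(self.head(h), dim=-1)  # (batch, 2) class probabilities
```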
## Intended Use & Limitations
Intended use:
- Real-time binary stress detection on edge devices or mobile apps using only audio input.
- Lightweight inference where only pre-computed audio embeddings are available.
Limitations:
- Not designed for multiclass stress intensity prediction.
- Trained on StressID data — performance may degrade on other languages or recording setups.
- Assumes clean audio and accurate W2V embeddings; high background noise may reduce accuracy.
## Training Data
- Dataset: StressID
- Modalities collected: ECG, RR, EDA, face/video, voice
- Labels: Self-assessment on 0–10 scale, converted to binary stress (0 if <5, 1 if ≥5)
- Split:
  - Used only the `train` split for Teacher training; the `test` split was held out for final evaluation.
  - Ensured no subject’s tasks appeared in more than one split (a small sketch of the labeling and splitting follows below).
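The binarization rule and subject-disjoint split described above can be sketched as follows; scikit-learn's `GroupShuffleSplit` and the toy arrays are illustrative choices, not necessarily what the authors used.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins for StressID metadata: one row per recorded task (hypothetical values).
scores   = np.array([2, 7, 5, 3, 9, 4])          # self-assessed stress on a 0-10 scale
subjects = np.array(["s1", "s1", "s2", "s2", "s3", "s3"])

# Binarization rule from this card: 0 if score < 5, 1 if score >= 5.
labels = (scores >= 5).astype(int)

# Subject-disjoint split: every task of a given subject lands in exactly one split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(scores, labels, groups=subjects))
assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```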
## Training Procedure
- TeacherNet trained on all four modalities (ECG, RR, EDA, Video) with CrossEntropyLoss.
- StudentNet trained on audio embeddings with a distillation loss:
  `loss = CE(student_logits, labels) + α * MSE(student_features, teacher_features)`
- α ∈ {0, 1e−7, 1e−6}, best performance at α = 1e−6
- Optimizer: AdamW, lr=1e−4, batch_size=8, epochs=100, early stopping patience=100
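A minimal PyTorch sketch of this distillation objective (function and variable names are illustrative, not taken from the repository):

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, student_features, teacher_features, labels, alpha=1e-6):
    """Cross-entropy on the student's predictions plus an alpha-weighted MSE that pulls
    the student's intermediate features toward the (frozen) teacher's features."""
    ce = F.cross_entropy(student_logits, labels)                # supervised term
    feat_mse = F.mse_loss(student_features, teacher_features)   # feature-matching term
    return ce + alpha * feat_mse
```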
## Evaluation

| Model                    | Accuracy | Macro-F1 | UAR    |
|--------------------------|----------|----------|--------|
| TeacherNet (multimodal)  | ≈ 0.82   | ≈ 0.80   | ≈ 0.79 |
| StudentNet (α = 0)       | ≈ 0.65   | ≈ 0.62   | ≈ 0.61 |
| StudentNet (α = 1e−6)    | ≈ 0.76   | ≈ 0.74   | ≈ 0.73 |
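The reported metrics can be computed from test-set predictions as in the sketch below; scikit-learn is assumed here (the repository does not state its evaluation code), with UAR taken as macro-averaged recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy predictions; replace with the model's outputs on the held-out test split.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
uar = recall_score(y_true, y_pred, average="macro")  # unweighted average recall

print(f"Accuracy={accuracy:.2f}  Macro-F1={macro_f1:.2f}  UAR={uar:.2f}")
```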
## ⚡️ Wav2Vec2 Embedding Notice

- Audio input for this model should be converted to a 512-dimensional embedding using fairseq's Wav2Vec2 (base) model (`torchaudio.pipelines.WAV2VEC2_BASE`).
- The exact model weights used for embedding extraction during training are provided as `wav2vec_large.pt` in the root directory of this repository.
- To use this model for inference on raw audio:
  - Load `wav2vec_large.pt` with torchaudio/fairseq,
  - Generate the 512-dim audio embedding for your input audio,
  - Pass this embedding to `StudentNet`.
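As a rough guide, the embedding-extraction step might look like the sketch below. It assumes fairseq's original wav2vec interface (which emits 512-dim frame features) and mean-pools the frames into one utterance-level vector; the audio file path is a placeholder, and the exact extraction and pooling used during training may differ, so treat this as an illustration only.

```python
import torch
import torchaudio
from fairseq.models.wav2vec import Wav2VecModel  # older fairseq API; newer versions load checkpoints differently

# Load the wav2vec checkpoint shipped in the repository root.
cp = torch.load("wav2vec_large.pt", map_location="cpu")
w2v = Wav2VecModel.build_model(cp["args"], task=None)
w2v.load_state_dict(cp["model"])
w2v.eval()

# Load a waveform, mix down to mono, and resample to 16 kHz ("speech.wav" is a placeholder).
wav, sr = torchaudio.load("speech.wav")
wav = wav.mean(dim=0, keepdim=True)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

with torch.no_grad():
    z = w2v.feature_extractor(wav)   # (1, 512, T) frame-level features
    c = w2v.feature_aggregator(z)    # (1, 512, T) context features
    x_w2v = c.mean(dim=-1)           # (1, 512) utterance-level embedding for StudentNet
```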
## How to Use

Below is a self-contained example that dynamically downloads both the model code (`models.py`) and the weights from the Hub, then runs inference via the Hugging Face Transformers API, all in one script:
```python
from huggingface_hub import hf_hub_download
import importlib.util

from transformers import AutoConfig, AutoModelForAudioClassification
import torch
import torch.nn.functional as F


def main():
    repo = "forwarder1121/voice-based-stress-recognition"

    # 1) Dynamically download & load the custom models.py
    code_path = hf_hub_download(repo_id=repo, filename="models.py")
    spec = importlib.util.spec_from_file_location("models", code_path)
    models = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(models)
    # now we have models.StudentForAudioClassification and models.StressConfig

    # 2) Load config & model via Transformers (with remote code trust)
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForAudioClassification.from_pretrained(
        repo,
        trust_remote_code=True,
        torch_dtype="auto",
    )
    model.eval()

    # 3) Prepare a dummy W2V embedding for testing
    #    In real use, replace this with your (1, 512) pre-computed W2V tensor.
    batch_size = 1
    DIM_W2V = 512
    x_w2v = torch.randn(batch_size, DIM_W2V, dtype=next(model.parameters()).dtype)

    # 4) Inference
    with torch.no_grad():
        outputs = model(x_w2v)  # SequenceClassifierOutput
        probs = F.softmax(outputs.logits, dim=-1)

    print(f"Not stressed: {probs[0, 0] * 100:.1f}%")
    print(f"Stressed    : {probs[0, 1] * 100:.1f}%")


if __name__ == "__main__":
    main()
```
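To run on real audio, replace the dummy `x_w2v` above with an utterance-level embedding produced as described in the Wav2Vec2 Embedding Notice; the rest of the script stays the same.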
## Citation

If you use this model in your research, please cite:

```bibtex
@inproceedings{your2025voice,
  title={Lightweight Audio-Embedding-Based Stress Recognition via Multimodal Knowledge Distillation},
  author={Your Name and …},
  booktitle={Conference/Journal},
  year={2025}
}
```
Contact: [email protected]. Feel free to open an issue or discussion for questions!