WhiSPA: Whisper Semantically and Psychologically Aligned
This model is the speech encoder introduced in the WhiSPA paper.
Description
WhiSPA (Whisper with Semantic-Psychological Alignment) is a speech encoder that uses the Whisper model as a backbone and aligns its audio embeddings with SBERT text representations and psychological embeddings. The alignment is learned through a contrastive student-teacher objective over hundreds of thousands of audio segments from mental health interviews. The goal is to capture both semantic and psychological information in an audio-only encoder, and WhiSPA surpasses state-of-the-art speech models on a range of downstream tasks.
Training Procedure
WhiSPA is trained with a student-teacher contrastive alignment objective: the Whisper-based student is optimized to increase the cosine similarity between its audio embeddings and the teacher's SBERT and psychological embeddings. This alignment lets WhiSPA encode both semantic and psychological information directly from audio.
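As a rough illustration, this kind of objective can be written as a CLIP-style contrastive loss over a batch of paired audio and text embeddings. The sketch below is an assumption, not the paper's exact implementation; the function name, batching scheme, and `temperature` hyperparameter are all illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(student_emb, teacher_emb, temperature=0.07):
    """Sketch of a student-teacher contrastive alignment loss.

    student_emb: (B, D) audio embeddings from the Whisper-based student
    teacher_emb: (B, D) SBERT/psychological embeddings from the teacher
    """
    student = F.normalize(student_emb, dim=-1)
    teacher = F.normalize(teacher_emb, dim=-1)
    # Pairwise cosine similarities; matched pairs lie on the diagonal
    logits = student @ teacher.t() / temperature
    targets = torch.arange(student.size(0), device=student.device)
    # Pull each audio embedding toward its own teacher embedding and
    # push it away from the other segments in the batch
    return F.cross_entropy(logits, targets)
```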
Quickstart
Clone the repo
```bash
git clone https://github.com/humanlab/WhiSPA.git
```
Create a conda environment
```bash
conda env create -f environment.yaml
conda activate whispa
```
Use the model
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from pretrain.whispa_model import WhiSPAModel
from inference.encode import encode

# Load the Whisper backbone and the pretrained WhiSPA checkpoint
processor = WhisperProcessor.from_pretrained('openai/whisper-small')
whisper = WhisperForConditionalGeneration.from_pretrained('openai/whisper-small').to('cuda')
whispa = WhiSPAModel.from_pretrained('Jarhatz/WhiSPA-V1-Small').to('cuda')

audio_paths = [
    '/path/to/audio/file.wav',
    '/path/to/audio/file.mp3',
    '/path/to/audio/file.m4a',
]

# encode returns a dict mapping each audio file to its WhiSPA embedding
audio_embeddings = encode(whispa, whisper, processor, audio_paths)
for name, embedding in audio_embeddings.items():
    print(f'audio: {name} emb: {embedding.shape}')
```
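Once encoded, the embeddings are fixed-size vectors that can be compared directly or used as features for downstream models. A minimal sketch, assuming `encode` returns `torch.Tensor` values:

```python
import torch.nn.functional as F

# Compare two segments via cosine similarity of their WhiSPA embeddings
embs = list(audio_embeddings.values())
similarity = F.cosine_similarity(embs[0].flatten(), embs[1].flatten(), dim=0)
print(f'cosine similarity: {similarity.item():.3f}')
```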