metadata

library_name: hertz-dev
license: apache-2.0
pipeline_tag: audio-to-audio

Hertz-dev

Hertz-dev is an open-source, first-of-its-kind base model for full-duplex conversational audio. It is an 8.5B-parameter transformer trained on 20 million unique hours of high-quality audio data. This repo contains code for both mono- and full-duplex generation; we expect to do a full Transformers library integration in the near future.

Hertz-dev is a base model, without fine-tuning, RLHF, or instruction-following behavior. It can be fine-tuned for almost any audio modeling task, from live translation to classification. Base models excel at faithfully modeling their training set, and accurate maps come from contact with reality.

Trained on the world’s largest known dataset of high-quality real-world conversational audio, hertz-dev exhibits state-of-the-art ability in human-like speech patterns such as pauses and emotional inflections. Hertz-dev has an 80ms theoretical average latency, and benchmarks at 120ms real-world latency on a single RTX 4090, which is 1.5-2x lower than the previous state of the art. Low latency is necessary for natural audio, and we're proud to move the field in this direction.

To learn more, see our blog post at https://si.inc/hertz-dev/

Setup

To get started, clone the git repository and install the requirements:

git clone https://github.com/Standard-Intelligence/hertz-dev
cd hertz-dev
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# sudo apt-get install libportaudio2  # Ubuntu only

Inference is known to work on Python 3.10 and CUDA 12.1. Other versions have not been tested as thoroughly. If you want to use CUDA 12.1, you'll need to install torch with:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

All three scripts will automatically download the models you need.
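Before running inference, it can help to confirm which Python and CUDA versions your environment actually provides. The following is a minimal sketch and not part of the repository: `check_environment` is a hypothetical helper name, and the script only reports versions rather than enforcing the Python 3.10 / CUDA 12.1 pairing mentioned above.

```python
import importlib.util
import sys


def check_environment():
    """Hypothetical helper: report the Python version and, if torch is
    installed, the CUDA version it was built against."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    # Only import torch if it is installed, so the check also runs
    # in a fresh virtual environment.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = torch.__version__
        report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_build` prints something other than 12.1, reinstalling torch from the cu121 index shown above is one way to get back to the tested configuration.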

Usage

We recommend starting by using inference.ipynb to generate one- or two-channel completions from a prompt. Usage on Windows does not work out of the box because the repository tries to use flash attention. Switching to efficient attention in the SDPA backend code may work but is untested.

Then, you can use inference_client.py and inference_server.py to talk to the model live through your microphone. These are currently experimental and have primarily been tested with Ubuntu on the server and macOS on the client.