---
title: Real-time Speech Transcription
emoji: πŸŽ™οΈ
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.19.2
app_file: gradio_app.py
pinned: false
---

# Real-time Transcription with FastRTC

This project implements real-time audio transcription using FastRTC and Gradio, deployed on Hugging Face Spaces.

## Features

- Real-time audio transcription
- Voice Activity Detection (VAD)
- Web-based interface using Gradio
- Deployed on Hugging Face Spaces

## Prerequisites

- Python 3.8 or higher
- Hugging Face account and token
- Git

## Setup

1. Clone the repository:

   ```bash
   git clone <your-repo-url>
   cd realtime-transcription-fastrtc
   ```

2. Create a `.env` file with your Hugging Face credentials:

   ```
   HUGGINGFACE_TOKEN=your_token_here
   HUGGINGFACE_USERNAME=your_username_here
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Deployment

1. Make sure you have set up your `.env` file with the required credentials.

2. Run the deployment script:

   ```bash
   python deploy.py
   ```

   The script will (a rough sketch follows this list):

   - Check for required environment variables
   - Install dependencies
   - Log in to Hugging Face
   - Create a new Space
   - Deploy your application

3. Once deployed, your application will be available at:
   `https://huggingface.co/spaces/<your-username>/realtime-transcription`
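`deploy.py` itself isn't shown here, but a minimal sketch of the Space-creation and upload steps with `huggingface_hub` might look like this (the repo name and token handling are illustrative assumptions, not the actual contents of the script):

```python
import os

from huggingface_hub import HfApi

# Minimal sketch, assuming huggingface_hub; not the actual deploy.py.
api = HfApi(token=os.environ["HUGGINGFACE_TOKEN"])
username = os.environ["HUGGINGFACE_USERNAME"]
repo_id = f"{username}/realtime-transcription"

# Create the Space if it doesn't already exist.
api.create_repo(
    repo_id=repo_id,
    repo_type="space",
    space_sdk="gradio",
    exist_ok=True,
)

# Upload the application files from the current directory.
api.upload_folder(repo_id=repo_id, repo_type="space", folder_path=".")
```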

## Local Development

To run the application locally:

```bash
python app.py
```

The application will be available at `http://localhost:7860`.

## Troubleshooting

If you encounter any issues during deployment:

  1. Check that your Hugging Face token is valid and has the necessary permissions
  2. Ensure all dependencies are installed correctly
  3. Verify that your .env file contains the correct credentials
  4. Check the Hugging Face Spaces logs for any deployment errors

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Technical Details

- Uses FastRTC for WebRTC streaming (see the sketch below)
- Powered by the Whisper large-v3-turbo model
- Voice Activity Detection for optimal transcription
- FastAPI backend with WebSocket support
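As a very rough sketch of how these pieces fit together (illustrative only, assuming FastRTC's `ReplyOnPause` pattern; the handler and placeholder ASR call below are not the project's actual code):

```python
from fastrtc import AdditionalOutputs, ReplyOnPause, Stream

def asr(samples):
    # Placeholder for the actual Whisper call (see the Whisper notes below).
    return "transcribed text"

def transcribe(audio):
    # ReplyOnPause's built-in voice activity detection calls this handler
    # with (sample_rate, samples) once the speaker pauses.
    sample_rate, samples = audio
    yield AdditionalOutputs(asr(samples))  # hand the transcript back

stream = Stream(ReplyOnPause(transcribe), modality="audio", mode="send-receive")
stream.ui.launch()  # serve the built-in Gradio UI
```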

## Environment Variables

The following environment variables can be configured (a sketch of how they might be read follows the list):

- `MODEL_ID`: Hugging Face model ID (default: `openai/whisper-large-v3-turbo`)
- `APP_MODE`: Set to `deployed` for Hugging Face Spaces
- `UI_MODE`: Set to `fastapi` for the custom UI
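A minimal sketch of how these might be read at startup (the use of `python-dotenv` is an assumption):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # pull variables from the .env file, if present

MODEL_ID = os.getenv("MODEL_ID", "openai/whisper-large-v3-turbo")
APP_MODE = os.getenv("APP_MODE", "local")
UI_MODE = os.getenv("UI_MODE", "fastapi")
```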

## Credits

This Space builds on [sofi444/realtime-transcription-fastrtc](https://github.com/sofi444/realtime-transcription-fastrtc); the upstream project's setup instructions are reproduced below.

## System Requirements

- python >= 3.10
- ffmpeg

## Installation

### Step 1: Clone the repository

```bash
git clone https://github.com/sofi444/realtime-transcription-fastrtc
cd realtime-transcription-fastrtc
```

### Step 2: Set up environment

Choose your preferred package manager:

#### πŸ“¦ Using UV (recommended)

Install uv, then:

```bash
uv venv --python 3.11 && source .venv/bin/activate
uv pip install -r requirements.txt
```

#### 🐍 Using pip

```bash
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### Step 3: Install ffmpeg

🍎 macOS:

```bash
brew install ffmpeg
```

🐧 Linux (Ubuntu/Debian):

```bash
sudo apt update
sudo apt install ffmpeg
```
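To confirm the install, a quick check that `ffmpeg` is visible from Python (assuming the app looks it up on `PATH`):

```python
import shutil

# Fails loudly if ffmpeg isn't visible on PATH.
assert shutil.which("ffmpeg") is not None, "ffmpeg not found; install it first"
```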

### Step 4: Configure environment

Create a `.env` file in the project root:

```
UI_MODE=fastapi
APP_MODE=local
SERVER_NAME=localhost
```

- `UI_MODE`: controls which interface is used. If set to `gradio`, the app launches via Gradio with its default UI. If set to anything else (e.g. `fastapi`), it uses the `index.html` file in the root directory to build the UI, which you can customise as you like (default: `fastapi`).
- `APP_MODE`: ignore this if running only locally. If you're deploying, e.g. on Spaces, you need to configure a TURN server; in that case, set it to `deployed` and follow the instructions here (default: `local`).
- `MODEL_ID`: HF model identifier for the ASR model you want to use (see here) (default: `openai/whisper-large-v3-turbo`).
- `SERVER_NAME`: host to bind to (default: `localhost`).
- `PORT`: port number (default: `7860`); a sketch of how the host and port feed the server launch follows this list.
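A minimal sketch of how `SERVER_NAME` and `PORT` might be wired into the server launch, assuming uvicorn given the FastAPI backend (`main:app` is an illustrative module path, not necessarily the project's):

```python
import os

import uvicorn

# Assumed wiring, not the project's actual code: bind the FastAPI app
# to the configured host and port.
host = os.getenv("SERVER_NAME", "localhost")
port = int(os.getenv("PORT", "7860"))

uvicorn.run("main:app", host=host, port=port)
```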

### Step 5: Launch the application

```bash
python main.py
```

Click on the URL that pops up (e.g. `https://localhost:7860`) to start using the app!

## Whisper

Choose the Whisper model version you want to use. See them all here; you can of course also use a non-Whisper ASR model.

On MPS, I can run `whisper-large-v3-turbo` without problems. This is my current favourite as it's lightweight, performant and multi-lingual!

Adjust the parameters as you like, but remember that for real-time, we want the batch size to be 1 (i.e. start transcribing as soon as a chunk is available).

If you want to transcribe different languages, set the `language` parameter to the target language; otherwise Whisper defaults to translating into English (even if you set `transcribe` as the task).
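A minimal sketch of these settings with the `transformers` pipeline (the zeroed audio array and the German target language are illustrative stand-ins; the project's actual loading code may differ):

```python
import numpy as np
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
)

# Illustrative stand-in for one streamed chunk: 16 kHz mono float32 audio.
audio_chunk = np.zeros(16000, dtype=np.float32)

result = asr(
    audio_chunk,
    batch_size=1,  # transcribe each chunk as soon as it arrives
    generate_kwargs={"language": "german", "task": "transcribe"},
)
print(result["text"])
```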