---
title: Real-time Speech Transcription
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.19.2
app_file: gradio_app.py
pinned: false
---
# Real-time Transcription with FastRTC
This project implements real-time audio transcription using FastRTC and Gradio, deployed on Hugging Face Spaces.
## Features
- Real-time audio transcription
- Voice Activity Detection (VAD)
- Web-based interface using Gradio
- Deployed on Hugging Face Spaces
## Prerequisites
- Python 3.8 or higher
- Hugging Face account and token
- Git
## Setup

1. Clone the repository:

   ```bash
   git clone <your-repo-url>
   cd realtime-transcription-fastrtc
   ```

2. Create a `.env` file with your Hugging Face credentials:

   ```
   HUGGINGFACE_TOKEN=your_token_here
   HUGGINGFACE_USERNAME=your_username_here
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
Deployment
Make sure you have set up your
.env
file with the required credentials.Run the deployment script:
python deploy.py
The script will:
- Check for required environment variables
- Install dependencies
- Log in to Hugging Face
- Create a new Space
- Deploy your application
- Once deployed, your application will be available at:
https://huggingface.co/spaces/<your-username>/realtime-transcription
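The exact contents of `deploy.py` are project-specific, but as a rough sketch of the steps above using `huggingface_hub` (the file name and the use of `python-dotenv` are assumptions for illustration):

```python
# deploy_sketch.py - illustrative only; the project's actual deploy.py may differ.
import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

load_dotenv()  # read HUGGINGFACE_TOKEN / HUGGINGFACE_USERNAME from .env

token = os.environ["HUGGINGFACE_TOKEN"]        # fails fast if unset
username = os.environ["HUGGINGFACE_USERNAME"]

api = HfApi(token=token)
repo_id = f"{username}/realtime-transcription"

# Create the Space if it doesn't exist yet (no-op when it already does).
api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)

# Upload the project files to the Space.
api.upload_folder(folder_path=".", repo_id=repo_id, repo_type="space")

print(f"Deployed: https://huggingface.co/spaces/{repo_id}")
```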
## Local Development

To run the application locally:

```bash
python app.py
```

The application will be available at http://localhost:7860.
## Troubleshooting

If you encounter any issues during deployment:

- Check that your Hugging Face token is valid and has the necessary permissions
- Ensure all dependencies are installed correctly
- Verify that your `.env` file contains the correct credentials
- Check the Hugging Face Spaces logs for any deployment errors
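To quickly check the token itself, the `whoami` call in `huggingface_hub` is a convenient probe (this assumes the token lives in your `.env`):

```python
import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

load_dotenv()

# Raises an error if the token is invalid; otherwise prints your account info.
print(HfApi(token=os.environ["HUGGINGFACE_TOKEN"]).whoami())
```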
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Technical Details
- Uses FastRTC for WebRTC streaming
- Powered by Whisper large-v3-turbo model
- Voice Activity Detection for optimal transcription
- FastAPI backend with WebSocket support
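To illustrate how these pieces fit together, here is a minimal sketch in the style of the FastRTC docs; the handler, model wiring, and audio scaling are simplified assumptions, not this project's actual code:

```python
import numpy as np
from fastapi import FastAPI
from fastrtc import ReplyOnPause, Stream
from transformers import pipeline

# Load the ASR model once at startup.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")

def transcribe(audio: tuple[int, np.ndarray]):
    sample_rate, samples = audio
    # Assumes int16 PCM from the browser; scale to float in [-1, 1] for Whisper.
    raw = samples.flatten().astype(np.float32) / 32768.0
    result = asr({"sampling_rate": sample_rate, "raw": raw})
    print(result["text"])
    yield audio  # echo the chunk back so the stream stays open

# ReplyOnPause wraps the handler with voice activity detection:
# `transcribe` runs each time the speaker pauses.
stream = Stream(ReplyOnPause(transcribe), modality="audio", mode="send-receive")

app = FastAPI()
stream.mount(app)  # exposes the WebRTC endpoints on the FastAPI app
```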
## Environment Variables

The following environment variables can be configured:

- `MODEL_ID`: Hugging Face model ID (default: `openai/whisper-large-v3-turbo`)
- `APP_MODE`: Set to `deployed` for Hugging Face Spaces
- `UI_MODE`: Set to `fastapi` for the custom UI
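If you prefer to set these on the Space itself rather than committing them, they can also be managed programmatically; a small sketch using `huggingface_hub` (the repo id is a placeholder, and the Space's Settings page works just as well):

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already logged in (e.g. via `huggingface-cli login`)

# Public configuration variables for the Space.
api.add_space_variable("<your-username>/realtime-transcription", "APP_MODE", "deployed")
api.add_space_variable("<your-username>/realtime-transcription", "UI_MODE", "fastapi")
```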
## Credits
- FastRTC for WebRTC streaming
- Whisper for speech recognition
- Hugging Face for model hosting
## System Requirements
- python >= 3.10
- ffmpeg
## Installation

### Step 1: Clone the repository

```bash
git clone https://github.com/sofi444/realtime-transcription-fastrtc
cd realtime-transcription-fastrtc
```
### Step 2: Set up environment

Choose your preferred package manager:

#### 📦 Using UV (recommended)

```bash
uv venv --python 3.11 && source .venv/bin/activate
uv pip install -r requirements.txt
```

#### 🐍 Using pip

```bash
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
### Step 3: Install ffmpeg

#### 🍎 macOS

```bash
brew install ffmpeg
```

#### 🐧 Linux (Ubuntu/Debian)

```bash
sudo apt update
sudo apt install ffmpeg
```
### Step 4: Configure environment

Create a `.env` file in the project root:

```
UI_MODE=fastapi
APP_MODE=local
SERVER_NAME=localhost
```

- `UI_MODE`: controls the interface to use. If set to `gradio`, the app launches with Gradio's default UI. If set to anything else (e.g. `fastapi`), it uses the `index.html` file in the root directory for the UI, which you can customise as you want (default: `fastapi`).
- `APP_MODE`: ignore this if running only locally. If you're deploying, e.g. on Spaces, you need to configure a TURN server; in that case, set it to `deployed` and follow the instructions here (default: `local`).
- `MODEL_ID`: HF model identifier for the ASR model you want to use (see here) (default: `openai/whisper-large-v3-turbo`).
- `SERVER_NAME`: host to bind to (default: `localhost`).
- `PORT`: port number (default: `7860`).
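How the app reads these is up to `main.py`, but a typical pattern (and a reasonable assumption here) is `python-dotenv` plus `os.getenv` with the defaults listed above:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # pick up the .env file from the project root

# Fall back to the documented defaults when a variable is unset.
UI_MODE = os.getenv("UI_MODE", "fastapi")
APP_MODE = os.getenv("APP_MODE", "local")
MODEL_ID = os.getenv("MODEL_ID", "openai/whisper-large-v3-turbo")
SERVER_NAME = os.getenv("SERVER_NAME", "localhost")
PORT = int(os.getenv("PORT", "7860"))
```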
### Step 5: Launch the application

```bash
python main.py
```

Click on the URL that pops up (e.g. https://localhost:7860) to start using the app!
## Whisper

Choose the Whisper model version you want to use. See them all here; you can of course also use a non-Whisper ASR model.

On MPS, I can run `whisper-large-v3-turbo` without problems. This is my current favourite as it's lightweight, performant and multi-lingual!

Adjust the parameters as you like, but remember that for real-time transcription we want a batch size of 1 (i.e. start transcribing as soon as a chunk is available).

If you want to transcribe different languages, set the language parameter to the target language; otherwise Whisper defaults to translating to English (even if you set `transcribe` as the task).