---
title: Real-time Speech Transcription
emoji: πŸŽ™οΈ
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.19.2
app_file: gradio_app.py
pinned: false
---

# Real-time Transcription with FastRTC

This project implements real-time audio transcription using FastRTC and Gradio, deployed on Hugging Face Spaces.

## Features

- Real-time audio transcription
- Voice Activity Detection (VAD)
- Web-based interface using Gradio
- Deployed on Hugging Face Spaces

## Prerequisites

- Python 3.8 or higher
- Hugging Face account and token
- Git

## Setup

1. Clone the repository:

   ```bash
   git clone <your-repo-url>
   cd realtime-transcription-fastrtc
   ```

2. Create a `.env` file with your Hugging Face credentials:

   ```
   HUGGINGFACE_TOKEN=your_token_here
   HUGGINGFACE_USERNAME=your_username_here
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Deployment

1. Make sure you have set up your `.env` file with the required credentials.

2. Run the deployment script:

   ```bash
   python deploy.py
   ```

   The script will (a rough sketch follows this list):

   - Check for required environment variables
   - Install dependencies
   - Log in to Hugging Face
   - Create a new Space
   - Deploy your application

3. Once deployed, your application will be available at:
   `https://huggingface.co/spaces/<your-username>/realtime-transcription`
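`deploy.py` itself isn't shown here, but a minimal sketch of the Space-creation and upload steps with `huggingface_hub` might look like this (the repo name and token handling are illustrative assumptions, not the actual contents of the script):

```python
import os

from huggingface_hub import HfApi

# Minimal sketch, assuming huggingface_hub; not the actual deploy.py.
api = HfApi(token=os.environ["HUGGINGFACE_TOKEN"])
username = os.environ["HUGGINGFACE_USERNAME"]
repo_id = f"{username}/realtime-transcription"

# Create the Space if it doesn't already exist.
api.create_repo(
    repo_id=repo_id,
    repo_type="space",
    space_sdk="gradio",
    exist_ok=True,
)

# Upload the application files from the current directory.
api.upload_folder(repo_id=repo_id, repo_type="space", folder_path=".")
```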

## Local Development

To run the application locally:

```bash
python app.py
```

The application will be available at `http://localhost:7860`.

## Troubleshooting

If you encounter any issues during deployment:

  1. Check that your Hugging Face token is valid and has the necessary permissions
  2. Ensure all dependencies are installed correctly
  3. Verify that your .env file contains the correct credentials
  4. Check the Hugging Face Spaces logs for any deployment errors

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Technical Details

- Uses FastRTC for WebRTC streaming (see the sketch below)
- Powered by the Whisper large-v3-turbo model
- Voice Activity Detection for optimal transcription
- FastAPI backend with WebSocket support
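As a very rough sketch of how these pieces fit together (illustrative only, assuming FastRTC's `ReplyOnPause` pattern; the handler and placeholder ASR call below are not the project's actual code):

```python
from fastrtc import AdditionalOutputs, ReplyOnPause, Stream

def asr(samples):
    # Placeholder for the actual Whisper call (see the Whisper notes below).
    return "transcribed text"

def transcribe(audio):
    # ReplyOnPause's built-in voice activity detection calls this handler
    # with (sample_rate, samples) once the speaker pauses.
    sample_rate, samples = audio
    yield AdditionalOutputs(asr(samples))  # hand the transcript back

stream = Stream(ReplyOnPause(transcribe), modality="audio", mode="send-receive")
stream.ui.launch()  # serve the built-in Gradio UI
```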

## Environment Variables

The following environment variables can be configured (a sketch of how they might be read follows the list):

- `MODEL_ID`: Hugging Face model ID (default: `openai/whisper-large-v3-turbo`)
- `APP_MODE`: Set to `deployed` for Hugging Face Spaces
- `UI_MODE`: Set to `fastapi` for the custom UI
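A minimal sketch of how these might be read at startup (the use of `python-dotenv` is an assumption):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # pull variables from the .env file, if present

MODEL_ID = os.getenv("MODEL_ID", "openai/whisper-large-v3-turbo")
APP_MODE = os.getenv("APP_MODE", "local")
UI_MODE = os.getenv("UI_MODE", "fastapi")
```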

## Credits

This Space builds on [sofi444/realtime-transcription-fastrtc](https://github.com/sofi444/realtime-transcription-fastrtc); the upstream project's setup instructions are reproduced below.

## System Requirements

- python >= 3.10
- ffmpeg

## Installation

### Step 1: Clone the repository

```bash
git clone https://github.com/sofi444/realtime-transcription-fastrtc
cd realtime-transcription-fastrtc
```

### Step 2: Set up environment

Choose your preferred package manager:

#### πŸ“¦ Using UV (recommended)

Install uv, then:

```bash
uv venv --python 3.11 && source .venv/bin/activate
uv pip install -r requirements.txt
```

#### 🐍 Using pip

```bash
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### Step 3: Install ffmpeg

🍎 macOS:

```bash
brew install ffmpeg
```

🐧 Linux (Ubuntu/Debian):

```bash
sudo apt update
sudo apt install ffmpeg
```
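To confirm the install, a quick check that `ffmpeg` is visible from Python (assuming the app looks it up on `PATH`):

```python
import shutil

# Fails loudly if ffmpeg isn't visible on PATH.
assert shutil.which("ffmpeg") is not None, "ffmpeg not found; install it first"
```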

### Step 4: Configure environment

Create a `.env` file in the project root:

```
UI_MODE=fastapi
APP_MODE=local
SERVER_NAME=localhost
```

- `UI_MODE`: controls which interface is used. If set to `gradio`, the app launches via Gradio with its default UI. If set to anything else (e.g. `fastapi`), it uses the `index.html` file in the root directory to build the UI, which you can customise as you like (default: `fastapi`).
- `APP_MODE`: ignore this if running only locally. If you're deploying, e.g. on Spaces, you need to configure a TURN server; in that case, set it to `deployed` and follow the instructions here (default: `local`).
- `MODEL_ID`: HF model identifier for the ASR model you want to use (see here) (default: `openai/whisper-large-v3-turbo`).
- `SERVER_NAME`: host to bind to (default: `localhost`).
- `PORT`: port number (default: `7860`); a sketch of how the host and port feed the server launch follows this list.
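A minimal sketch of how `SERVER_NAME` and `PORT` might be wired into the server launch, assuming uvicorn given the FastAPI backend (`main:app` is an illustrative module path, not necessarily the project's):

```python
import os

import uvicorn

# Assumed wiring, not the project's actual code: bind the FastAPI app
# to the configured host and port.
host = os.getenv("SERVER_NAME", "localhost")
port = int(os.getenv("PORT", "7860"))

uvicorn.run("main:app", host=host, port=port)
```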

### Step 5: Launch the application

```bash
python main.py
```

Click on the URL that pops up (e.g. `https://localhost:7860`) to start using the app!

## Whisper

Choose the Whisper model version you want to use. See them all here; you can of course also use a non-Whisper ASR model.

On MPS, I can run `whisper-large-v3-turbo` without problems. This is my current favourite as it's lightweight, performant and multi-lingual!

Adjust the parameters as you like, but remember that for real-time, we want the batch size to be 1 (i.e. start transcribing as soon as a chunk is available).

If you want to transcribe different languages, set the `language` parameter to the target language; otherwise Whisper defaults to translating into English (even if you set `transcribe` as the task).
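A minimal sketch of these settings with the `transformers` pipeline (the zeroed audio array and the German target language are illustrative stand-ins; the project's actual loading code may differ):

```python
import numpy as np
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
)

# Illustrative stand-in for one streamed chunk: 16 kHz mono float32 audio.
audio_chunk = np.zeros(16000, dtype=np.float32)

result = asr(
    audio_chunk,
    batch_size=1,  # transcribe each chunk as soon as it arrives
    generate_kwargs={"language": "german", "task": "transcribe"},
)
print(result["text"])
```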