---
title: Real-time Speech Transcription
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.19.2
app_file: gradio_app.py
pinned: false
---

# Real-time Transcription with FastRTC

This project implements real-time audio transcription using FastRTC and Gradio, deployed on Hugging Face Spaces.

## Features

- Real-time audio transcription
- Voice Activity Detection (VAD)
- Web-based interface using Gradio
- Deployed on Hugging Face Spaces

## Prerequisites

- Python 3.10 or higher
- Hugging Face account and token
- Git

## Setup

1. Clone the repository:

```bash
git clone https://github.com/sofi444/realtime-transcription-fastrtc
cd realtime-transcription-fastrtc
```

2. Create a `.env` file with your Hugging Face credentials:

```
HUGGINGFACE_TOKEN=your_token_here
HUGGINGFACE_USERNAME=your_username_here
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```

## Deployment

1. Make sure you have set up your `.env` file with the required credentials.

2. Run the deployment script:

```bash
python deploy.py
```

The script will:
- Check for required environment variables
- Install dependencies
- Log in to Hugging Face
- Create a new Space
- Deploy your application

3. Once deployed, your application will be available at:

```
https://huggingface.co/spaces/<your-username>/realtime-transcription
```

## Local Development

To run the application locally:

```bash
python app.py
```

The application will be available at `http://localhost:7860`.

## Troubleshooting

If you encounter issues during deployment:

1. Check that your Hugging Face token is valid and has the necessary permissions
2. Ensure all dependencies are installed correctly
3. Verify that your `.env` file contains the correct credentials
4. Check the Hugging Face Spaces logs for any deployment errors

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Technical Details

- Uses FastRTC for WebRTC streaming
- Powered by the Whisper large-v3-turbo model
- Voice Activity Detection for optimal transcription
- FastAPI backend with WebSocket support

## Environment Variables

The following environment variables can be configured:

- `MODEL_ID`: Hugging Face model ID (default: `openai/whisper-large-v3-turbo`)
- `APP_MODE`: Set to `deployed` for Hugging Face Spaces
- `UI_MODE`: Set to `fastapi` for the custom UI

## Credits

- [FastRTC](https://fastrtc.org/) for WebRTC streaming
- [Whisper](https://github.com/openai/whisper) for speech recognition
- [Hugging Face](https://huggingface.co/) for model hosting

**System Requirements**
- python >= 3.10
- ffmpeg

## Installation

### Step 1: Clone the repository

```bash
git clone https://github.com/sofi444/realtime-transcription-fastrtc
cd realtime-transcription-fastrtc
```

### Step 2: Set up environment

Choose your preferred package manager:
#### 📦 Using UV (recommended)

[Install `uv`](https://docs.astral.sh/uv/getting-started/installation/)

```bash
uv venv --python 3.11 && source .venv/bin/activate
uv pip install -r requirements.txt
```
#### 🐍 Using pip

```bash
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
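With either package manager, you can sanity-check the install before moving on. The snippet below is a minimal, hypothetical helper (not shipped with the repo); it assumes `requirements.txt` pulls in `fastrtc`, `gradio`, and `transformers`:

```python
# check_env.py -- hypothetical sanity check, not part of the repo
import importlib
import sys

# The project requires Python >= 3.10 (see System Requirements above).
if sys.version_info < (3, 10):
    sys.exit("Python 3.10+ is required")

# Assumed core dependencies from requirements.txt.
for name in ("fastrtc", "gradio", "transformers"):
    try:
        importlib.import_module(name)
        print(f"{name}: OK")
    except ImportError:
        print(f"{name}: missing -- re-run the install step above")
```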
### Step 3: Install ffmpeg
#### 🍎 macOS

```bash
brew install ffmpeg
```
#### 🐧 Linux (Ubuntu/Debian)

```bash
sudo apt update
sudo apt install ffmpeg
```
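On either platform, you can confirm that ffmpeg ended up on your `PATH` where the audio tooling can find it. A quick check using only the standard library:

```python
# Quick check that ffmpeg is installed and visible to Python.
import shutil
import subprocess

ffmpeg = shutil.which("ffmpeg")
if ffmpeg is None:
    raise SystemExit("ffmpeg not found on PATH -- revisit Step 3")

# Print the first line of `ffmpeg -version` as confirmation.
out = subprocess.run([ffmpeg, "-version"], capture_output=True, text=True)
print(out.stdout.splitlines()[0])
```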
### Step 4: Configure environment

Create a `.env` file in the project root:

```env
UI_MODE=fastapi
APP_MODE=local
SERVER_NAME=localhost
```

- **UI_MODE**: controls which interface to use. If set to `gradio`, the app launches with Gradio's default UI. If set to anything else (e.g. `fastapi`), it uses the `index.html` file in the root directory for the UI, which you can customise as you like (default: `fastapi`).
- **APP_MODE**: ignore this if running only locally. If you're deploying, e.g. on Spaces, you need to configure a TURN server; in that case, set it to `deployed` and follow the instructions [here](https://fastrtc.org/deployment/) (default: `local`).
- **MODEL_ID**: HF model identifier for the ASR model you want to use (see [here](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending)) (default: `openai/whisper-large-v3-turbo`)
- **SERVER_NAME**: host to bind to (default: `localhost`)
- **PORT**: port number (default: `7860`)

### Step 5: Launch the application

```bash
python main.py
```

Click on the URL that pops up (e.g. `http://localhost:7860`) to start using the app!

### Whisper

Choose the Whisper model version you want to use. See them all [here](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending&search=whisper) - you can of course also use a non-Whisper ASR model.

On MPS, I can run `whisper-large-v3-turbo` without problems. This is my current favourite as it's lightweight, performant and multilingual!

Adjust the parameters as you like, but remember that for real-time use we want a batch size of 1 (i.e. start transcribing as soon as a chunk is available). If you want to transcribe different languages, set the language parameter to the target language; otherwise Whisper defaults to translating to English (even if you set `transcribe` as the task).
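For reference, here is a minimal sketch of how such a pipeline can be set up with the `transformers` library. This is illustrative, not a copy of `main.py`: the audio file name, device selection, and `language` value are assumptions you should adapt.

```python
# Illustrative ASR pipeline setup -- a sketch, not the project's main.py.
import torch
from transformers import pipeline

MODEL_ID = "openai/whisper-large-v3-turbo"  # default model from the config above

# Pick the best available device (MPS on Apple Silicon, else CUDA, else CPU).
device = "mps" if torch.backends.mps.is_available() else (
    "cuda" if torch.cuda.is_available() else "cpu"
)

asr = pipeline(
    "automatic-speech-recognition",
    model=MODEL_ID,
    device=device,
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
)

# batch_size=1: transcribe each chunk as soon as it is available.
# Set "language" explicitly if you don't want English output.
result = asr(
    "sample.wav",  # assumed: any local audio file
    batch_size=1,
    generate_kwargs={"task": "transcribe", "language": "english"},
)
print(result["text"])
```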