---
license: cc-by-nc-4.0
language:
- en
- de
- hi
- fr
- es
- zh
library_name: faster-whisper
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-large-v3-turbo-german-faster-whisper
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: German ASR Data-Mix
      type: german-asr-mixed
    metrics:
    - type: wer
      value: 2.628
      name: Test WER
base_model:
- primeline/whisper-large-v3-turbo-german
tags:
- whisper
- speech-recognition
- german
- ctranslate2
- faster-whisper
- audio
- transcription
- multilingual
---
# Whisper Large v3 Turbo German - Faster Whisper
## Overview
This repository contains a high-performance German speech recognition model based on OpenAI's Whisper Large v3 Turbo architecture. The model has been optimized using CTranslate2 for faster inference and reduced memory usage, making it ideal for production deployments.
## Original Model
This model is based on the work from [primeline/whisper-large-v3-turbo-german](https://huggingface.co/primeline/whisper-large-v3-turbo-german) and has been converted to CTranslate2 format for optimal performance with faster-whisper.
## Model Details
- **Architecture**: Whisper Large v3 Turbo
- **Language**: Multilingual; fine-tuned primarily on German (de)
- **Parameters**: 809M
- **Format**: CTranslate2 optimized
- **License**: cc-by-nc-4.0 [![CC BY-NC 4.0](https://licensebuttons.net/l/by-nc/4.0/88x31.png)](https://creativecommons.org/licenses/by-nc/4.0/)
**While this model is optimized for German, it can also transcribe multiple languages supported by Whisper Large v3 Turbo, though accuracy may vary depending on the language.**
---
> **User Benchmark: NVIDIA GeForce RTX 4070 Laptop GPU**
| Metric | Value |
|---------------------|--------------------|
| Audio duration | 254.71 seconds |
| Transcription time | 0.57 seconds |
* This result was achieved on an NVIDIA GeForce RTX 4070 Laptop GPU: over 4 minutes of audio transcribed in under a second, demonstrating exceptional performance on this hardware.
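The benchmark above can be expressed as a real-time factor (RTF = transcription time / audio duration; lower is faster), using the two numbers from the table:

```python
# Real-time factor from the benchmark above (lower is faster).
audio_duration = 254.71    # seconds of audio
transcription_time = 0.57  # seconds spent transcribing

rtf = transcription_time / audio_duration
speedup = audio_duration / transcription_time

print(f"RTF: {rtf:.4f}")                     # ~0.0022
print(f"Speedup: {speedup:.0f}x real time")  # ~447x real time
```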
## Performance
The model achieves strong results on German speech recognition, with a Word Error Rate (WER) of 2.628% on the German ASR test data mix.
## Use Cases
This model is designed for various German speech recognition applications:
- **Real-time Transcription**: Live audio transcription for meetings, lectures, and conferences
- **Media Processing**: Automatic subtitle generation for German video content
- **Voice Assistants**: Speech-to-text conversion for voice-controlled applications
- **Call Center Analytics**: Transcription and analysis of customer service calls
- **Accessibility Tools**: Converting spoken German to text for hearing-impaired users
- **Document Creation**: Voice-to-text dictation for content creation
## Installation and Usage
### Prerequisites
```bash
pip install faster-whisper torch
```
### Basic Usage
```python
from faster_whisper import WhisperModel

# Load the model
model = WhisperModel(
    "TheChola/whisper-large-v3-turbo-german-faster-whisper",
    device="cuda",           # Use GPU for speed
    compute_type="float16"   # Use FP16 for efficiency (or "int8" for lower memory)
)

# Transcribe audio file
segments, info = model.transcribe("audio.wav", language="de")

# Print results
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
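Note that `segments` is a lazy generator: transcription happens as you iterate over it. A small illustrative helper (it assumes only that each segment has a `.text` attribute) can collect everything into a single transcript string; the stand-in segments below are just for demonstration:

```python
from types import SimpleNamespace

def collect_transcript(segments):
    """Join segment texts into a single transcript string.

    Iterating consumes the generator, so transcription runs here.
    """
    return "".join(segment.text for segment in segments).strip()

# Stand-in segment objects; real ones come from model.transcribe(...)
fake_segments = [
    SimpleNamespace(text=" Guten Tag."),
    SimpleNamespace(text=" Wie geht es Ihnen?"),
]
print(collect_transcript(fake_segments))  # Guten Tag. Wie geht es Ihnen?
```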
### Advanced Usage with Options
```python
from faster_whisper import WhisperModel

# Load the German-optimized Whisper large-v3 turbo model from Hugging Face
model = WhisperModel(
    "TheChola/whisper-large-v3-turbo-german-faster-whisper",
    device="cuda",           # Use GPU for speed
    compute_type="float16"   # Use FP16 for efficiency (or "int8" for lower memory)
)

# Transcribe with additional options
segments, info = model.transcribe(
    "audio.wav",
    language="de",
    beam_size=5,
    best_of=5,
    temperature=0.0,
    condition_on_previous_text=False,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
print(f"Duration: {info.duration:.2f} seconds")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
## Model Specifications
- **Input**: Audio files (WAV, MP3, FLAC, etc.)
- **Output**: German text transcription with timestamps
- **Sampling Rate**: 16kHz (automatically resampled if needed)
- **Context Length**: 30 seconds per chunk
- **Supported Audio Formats**: All formats supported by FFmpeg
## Hardware Requirements
### Minimum Requirements
- **CPU**: 4 cores, 8GB RAM
- **GPU**: Optional, but recommended for faster inference
### Recommended Requirements
- **CPU**: 8+ cores, 16GB+ RAM
- **GPU**: NVIDIA GPU with 4GB+ VRAM (RTX 3060 or better)
- **Storage**: 2GB free space for model files
## Performance Benchmarks
| Device | Batch Size | Real-time Factor | Memory Usage |
|--------|------------|------------------|--------------|
| CPU (8 cores) | 1 | 0.3x | 2GB |
| RTX 3060 | 4 | 0.1x | 4GB |
| RTX 4080 | 8 | 0.05x | 6GB |
| RTX 4070 Laptop GPU | 1 | ~0.002x | 8GB |
## Model Files
This repository contains the following files:
- `model.bin` - Main model weights in CTranslate2 format
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer configuration
- `vocab.json` - Vocabulary mapping
- Additional configuration files for preprocessing and generation
## License
This work is licensed under a [Creative Commons Attribution-NonCommercial 4.0 International License](https://creativecommons.org/licenses/by-nc/4.0/).
## Changelog
### v1.0.0
- Initial release of CTranslate2 optimized model
- Support for faster-whisper framework
- Optimized for German speech recognition