Leonoravesterbacka committed
Commit 6aafecc · verified · 1 Parent(s): 33cee58

Update README.md

Files changed (1)
  1. README.md +22 -0
README.md CHANGED
@@ -33,6 +33,27 @@ Table: **Word Error Rate (WER)** comparison between KBLab's Whisper models and t
 
 We provide checkpoints in different formats: `Hugging Face`, `whisper.cpp` (GGML), `onnx`, and `ctranslate2` (used in `faster-whisper` and `WhisperX`).
 
+ ### 2025-05-13 Update!
+ The default when loading our models through Hugging Face is **Stage 2**.
+ As of May 2025, there are two additional **Stage 2** versions alongside the default, namely **Subtitle** and **Strict**, which specify the transcription style.
+ Specifying `revision="subtitle"` in `.from_pretrained()` loads the version with a more condensed transcription style.
+ Specifying `revision="strict"` loads the more verbatim version of the model.
+ Below is an example of how this argument is passed to `.from_pretrained()`:
+ ```python
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+ model_id = "KBLab/kb-whisper-large"
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache", revision="strict"
+ )
+ ```
+ The verbosity of the three model versions ranges from the least verbose **Subtitle**, through the default **Stage 2**, to the most verbose **Strict**.
+
+
 #### Hugging Face
 
 Inference example for using `KB-Whisper` with Hugging Face:
@@ -66,6 +87,7 @@ generate_kwargs = {"task": "transcribe", "language": "sv"}
 res = pipe("audio.mp3",
            chunk_length_s=30,
            generate_kwargs={"task": "transcribe", "language": "sv"})
+ print(res)
 ```
 
 #### Faster-whisper
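
For reference, a minimal sketch (not part of this commit) of how the `revision="subtitle"` variant described in the update could be combined with the README's existing pipeline-based inference. Assumptions: `audio.mp3` is a placeholder input, and the processor files are shared with the default revision:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

# Load the more condensed "subtitle" revision instead of the default Stage 2 weights.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, revision="subtitle"
)
# Assumption: the tokenizer and feature extractor are shared across revisions.
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# "audio.mp3" is a placeholder; chunking and generate_kwargs mirror the README example.
res = pipe("audio.mp3",
           chunk_length_s=30,
           generate_kwargs={"task": "transcribe", "language": "sv"})
print(res["text"])
```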