---
datasets:
- rsalshalan/QASR
- DynamicSuperb/DialectIdentification_ADI17
- openslr/librispeech_asr
- LIUM/tedlium
language:
- ar
- en
metrics:
- bleu
- wer
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- meta-llama/Llama-3.2-1B
pipeline_tag: audio-text-to-text
---
# 🐙 Octopus: Towards Building the Arabic Speech LLM Suite
## 📢 Overview
**Octopus** is a bilingual **Audio-Language Model (Audio-LLM)** family developed to understand, transcribe, translate, and reason over Arabic and English speech.
It unifies audio, text, and reasoning within one multimodal framework, supporting:
- **Automatic Speech Recognition (ASR)** for Arabic & English 🗣️
- **Speech Translation** (Arabic → English and vice versa) 🌍
- **Arabic Dialect Identification (DID)** 🏷️
The lightweight variant, **TinyOctopus**, maintains the same modular design but is optimized for efficiency on smaller GPUs.
## 🧩 Architecture
### Core Components
The **Octopus** family scales across several encoder–decoder configurations, combining complementary strengths in acoustic understanding and text generation.
1. **Audio Encoders**
- **Distil-Whisper (distil-large-v3)** → lightweight frozen encoder producing compact speech embeddings.
- **Whisper-large-v3** → high-capacity encoder for robust transcription and multilingual coverage.
- **BEATs (Microsoft)** → self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.
2. **Alignment & Fusion**
- **Cross-Attention Projection Layer** → a trainable bridge that aligns audio representations with the text-language space through cross-modal attention.
3. **Language / Decoder Models**
- **DeepSeek 1.5B** → efficient generative decoder for reasoning, dialogue, and translation.
- **LLaMA 3.2 1B** → compact Arabic–English language model variant optimized for code-switching and reasoning on limited hardware.
- **ALLaM 13B** → large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.
Together, these components enable the **Octopus** line, from **TinyOctopus** (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full **ALLaM-Octopus** (Whisper-large-v3 + BEATs + ALLaM 13B), to handle diverse audio understanding and speech-to-text reasoning tasks across Arabic and English.
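
The cross-attention bridge listed under Alignment & Fusion can be pictured with a minimal, dependency-free sketch. This is not the Octopus implementation: the card does not specify the number of heads, the dimensions, or how queries are built, so the single head, the learned-query setup, the toy sizes, and every name below (`cross_attention`, `random_matrix`, `w_q`, `w_k`, `w_v`) are assumptions made for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply a (n x k) by b (k x m); matrices are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, audio_frames, w_q, w_k, w_v):
    """Single-head cross-attention: text-side queries attend over audio frames."""
    q = matmul(queries, w_q)        # (n_queries, d_text)
    k = matmul(audio_frames, w_k)   # (n_frames, d_text)
    v = matmul(audio_frames, w_v)   # (n_frames, d_text)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # (d_text, n_frames)
    scores = matmul(q, k_t)                       # (n_queries, n_frames)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rng, rows, cols):
    """Hypothetical stand-in for trained weights or encoder outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_audio, d_text, n_frames, n_queries = 8, 4, 6, 2      # toy sizes, not the real ones
audio_frames = random_matrix(rng, n_frames, d_audio)   # pretend frozen-encoder output
queries = random_matrix(rng, n_queries, d_text)        # assumed learned queries
w_q = random_matrix(rng, d_text, d_text)
w_k = random_matrix(rng, d_audio, d_text)
w_v = random_matrix(rng, d_audio, d_text)

fused, weights = cross_attention(queries, audio_frames, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # 2 4
```

Each output row lives in the text-side dimension (`d_text`), which is what lets a decoder consume audio information as if it were a short sequence of token embeddings. In a trained system the three projection matrices would be the trainable part, while the audio encoder could stay frozen.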
## 📚 Training Datasets
The **Octopus** models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, totaling **≈25,000 hours** of high-quality data for ASR, translation, and dialect identification.
| **Task / Domain** | **Dataset** | **Train (h)** | **Dev (h)** | **Description** |
|:------------------|:-------------|:--------------:|:------------:|:----------------|
| **ASR (Arabic)** | [QASR](https://arxiv.org/pdf/2106.13000) | 1,880.5 | 9.6 | Broadcast Arabic from Al-Jazeera; multi-dialect with punctuation and speaker tags. |
| | In-house Arabic Corpus | 13,392.1 | 142.7 | Large internal Arabic dataset across Gulf, Levantine, and North-African dialects. |
| **ASR (English)** | LibriSpeech | 960.0 | 10.5 | Read English corpus for ASR benchmarking. |
| | TED-LIUM | 453.8 | 1.6 | English TED-talk recordings for spontaneous speech recognition. |
| **ASR (Ar–En Code Switching)** | Synthetic (In-house TTS) | 119.5 | – | Synthetic bilingual utterances generated via TTS to strengthen mixed-speech robustness. |
| **Translation (Ar→En)** | Translated QASR (via GPT-4o) | 1,858.4 | 9.6 | QASR corpus automatically translated to English for parallel supervision. |
| | Translated In-house Arabic (via GPT-4o) | 7,229.2 | 141.9 | In-house Arabic dataset machine-translated to English via GPT-4o. |
| **Dialect Identification** | [ADI17](https://swshon.github.io/pdf/shon_2020_adi17.pdf) | 2,241.5 | 19.0 | YouTube-sourced Arabic speech across 17 dialects for dialect recognition and adaptation. |
> **Total Coverage:** ≈25,000 hours of speech across Arabic, English, and mixed-language domains — enabling broad generalization for ASR, translation, and dialect identification.
These datasets jointly provide:
- Balanced representation across dialects.
- Both natural and synthetic speech sources for enhanced robustness.
- Parallel Arabic–English pairs enabling bilingual text generation and translation.
## 🧮 Model Weights & Resources
The full set of model weights (including large checkpoints) is publicly available here:
➡️ [Octopus Model Weights](https://drive.google.com/drive/folders/1602VHm77oyQV4p08x5Xug0ziw7u0p2Ju?usp=sharing)
## ⚙️ Installation & Usage
### **💻 Install Dependencies**
```bash
pip install -r requirements.txt
```
### Inference
```python
from inference import transcribe

audio_path = "path/to/audio.wav"  # Replace with your actual audio file
# Supported tasks: "asr", "translation", "dialect"
output = transcribe(audio_path, task="asr")
print("Generated Text:", output)
```
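
To run all three tasks on one file, a small wrapper along the following lines may help. It is a sketch that assumes `transcribe` accepts `(audio_path, task=...)` exactly as in the snippet above and returns a string; `run_all_tasks` and `fake_transcribe` are hypothetical names introduced here, and the stand-in function exists only so the example runs without model weights.

```python
from typing import Callable, Dict

TASKS = ("asr", "translation", "dialect")  # the task options listed above


def run_all_tasks(audio_path: str, transcribe_fn: Callable[..., str]) -> Dict[str, str]:
    """Call transcribe_fn once per task and collect the outputs by task name."""
    return {task: transcribe_fn(audio_path, task=task) for task in TASKS}


def fake_transcribe(audio_path: str, task: str = "asr") -> str:
    """Hypothetical stand-in for the real transcribe(); returns a placeholder."""
    return f"<{task} output for {audio_path}>"


results = run_all_tasks("path/to/audio.wav", fake_transcribe)
for task, text in results.items():
    print(f"{task}: {text}")
```

With the repository installed, passing the real function (`from inference import transcribe`) in place of `fake_transcribe` should give one output per task, provided its signature matches the assumption above.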
## 🧪 Evaluation Results
### 🎙️ ASR Performance (WER ↓)
| **Dataset** | **Ar-Octopus** | **Bilingual-Octopus** | **Trans-Octopus** | **Whisper-large-v3** | **SeamlessM4T** |
|:-------------|:---------------:|:---------------------:|:-----------------:|:--------------------:|:----------------:|
| **MGB2 (Arabic)** | 16.5 \| 6.5 | 15.2 \| 6.8 | **13.3 \| 5.9** | 16.2 \| 7.9 | 17.2 \| 8.4 |
| **test-clean (English)** | 82.5 \| 92.4 | **2.6 \| 1.4** | 67.3 \| 79.4 | 2.86 \| 0.98 | 2.68 \| 0.88 |
| **test-other (English)** | 86.9 \| 95.1 | **5.1 \| 3.4** | 71.5 \| 87.8 | 5.00 \| 2.05 | **5.07 \| 1.94** |
| **tedlium (English)** | 101.9 \| 77.4 | **5.1 \| 3.9** | 85.2 \| 63.6 | 11.9 \| 4.4 | 86.5 \| 62.2 |
| **Escwa (Code-Switched)** | 42.5 \| 26.3 | **40.8 \| 27.1** | 41.8 \| 25.1 | 47.3 \| 31.0 | 52.0 \| 35.3 |
| **Mixat-ALL (Code-Switched)** | 22.0 \| 9.0 | **23.4 \| 10.3** | 34.1 \| 10.6 | 29.0 \| 15.0 | 32.8 \| 16.9 |
| **Mixat-CS (Code-Switched)** | 26.4 \| 12.4 | **28.5 \| 14.9** | 27.8 \| 13.3 | 34.8 \| 20.6 | 38.2 \| 21.8 |
| **In-house Long-form** | 25.4 \| 13.0 | 24.9 \| 12.5 | **24.1 \| 12.1** | 26.7 \| 15.2 | 29.3 \| 18.6 |
> **+86 % English improvement** observed with the addition of language-tokens for bilingual and translation variants.
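
For readers unfamiliar with the metric, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below shows that textbook definition only; the card does not say which scoring tool or text normalization produced the numbers above, so results from this function are not guaranteed to match them. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion of a reference word
                curr[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),   # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(wer("no it's not too soon", "no its not too soon"))  # 20.0
```

One substituted word out of five reference words gives 20 %. Because insertions count as errors, WER can exceed 100 % when the hypothesis is much longer than the reference.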
---
### 🪶 TinyOctopus & Fine-Tuning (WER ↓)
| **Dataset** | **TinyOctopus LLaMA 3.2 1B** | **Fine-tuned LLaMA 3.2 1B** | **TinyOctopus DeepSeek 1.5B** | **Fine-tuned DeepSeek 1.5B** |
|:-------------|:-------------------------:|:-------------------------:|:-----------------------------:|:-----------------------------:|
| **MGB2 (Arabic)** | 22.6 \| 15.7 | 16.1 \| **9.5** | 23.2 \| 15.8 | **15.5 \| 9.2** |
| **test-clean (English)** | 7.5 \| 5.7 | **3.1 \| 1.3** | 7.7 \| 5.8 | 7.6 \| 5.7 |
| **test-other (English)** | 11.3 \| 8.0 | **6.9 \| 3.5** | 11.5 \| 8.2 | 11.3 \| 8.0 |
| **Escwa (Code-Switched)** | 42.5 \| 26.9 | **40.3 \| 24.4** | 43.6 \| 27.8 | 41.8 \| 26.3 |
| **Mixat-ALL (Code-Switched)** | 35.2 \| 19.6 | **34.1 \| 19.3** | 37.1 \| 21.1 | 35.5 \| 19.9 |
| **Mixat-CS (Code-Switched)** | 40.2 \| 24.2 | **36.2 \| 21.4** | 41.2 \| 25.2 | 39.9 \| 24.2 |
| **In-house Long-form** | 44.3 \| 29.1 | **42.8 \| 26.9** | 47.0 \| 32.7 | 43.7 \| 31.5 |
> **Code-Switch TTS** augmentation yielded **≈ 20 % WER reduction** across multilingual evaluation sets.
---
### 🌍 Translation Performance (BLEU ↑ / BERT-F1 ↑)
| **Model / System** | **CoVoST2 (Ar→En)** | **FLEURS (Ar→En)** |
|:--------------------|:------------------:|:-----------------:|
| Whisper-large-v3 | 28.8 / 0.53 | 15.1 / 0.47 |
| SeamlessM4T | 33.7 / 0.55 | **23.9** / 0.56 |
| **Trans-Octopus** | **38.6 / 0.64** | 23.2 / **0.58** |
| TO-LLaMA-1B | 33.9 / 0.61 | 20.5 / 0.53 |
| TO-DeepSeek-1.5B | 33.6 / 0.61 | 20.8 / 0.53 |
> **Trans-Octopus** achieves the best BLEU and BERT-F1 on **CoVoST2** (38.6 / 0.64 vs. 33.7 / 0.55 for SeamlessM4T). On **FLEURS** it is competitive: it has the highest BERT-F1 (0.58), with BLEU 0.7 points below SeamlessM4T (23.2 vs. 23.9).
---
### 🏷️ Dialect Identification
For **dialect identification**, the **TinyOctopus** models achieved **87–89% accuracy** across all 17 dialects in **ADI17**.
The confusion matrices show clear separation among the **Gulf**, **Levantine**, **North-African**, and **Egyptian** clusters, indicating that even compact models can pick up subtle dialectal cues when trained in a multitask setting.
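
As a reminder of how these two quantities relate, the sketch below computes accuracy and confusion counts from paired gold and predicted labels. The five-item label lists are toy data invented for illustration, not Octopus outputs, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Percentage of utterances whose predicted dialect matches the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


def confusion_counts(gold, predicted):
    """Count (gold, predicted) pairs; off-diagonal keys are the confusions."""
    return Counter(zip(gold, predicted))


# Toy labels for illustration only (not real model predictions).
gold = ["KSA", "KSA", "EGY", "EGY", "MOR"]
predicted = ["KSA", "EGY", "EGY", "EGY", "MOR"]

print(accuracy(gold, predicted))                          # 80.0
print(confusion_counts(gold, predicted)[("KSA", "EGY")])  # 1
```

Grouping the off-diagonal counts by region is one way to check whether errors stay inside a cluster (for example, one Gulf dialect mistaken for another) or cross between clusters.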
## Examples
### Example 1: Arabic Speech Recognition
🎵 **Audio Input (Arabic)**:
<audio controls>
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav">
</audio>
📝 **User Prompt**:
> Transcribe the audio
or
> قم بتفريغ المقطع الصوتي
💡 **System Response**:
> أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
🎵 **Audio Input (English)**:
<audio controls>
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav">
</audio>
📝 **User Prompt**:
> Transcribe the audio
or
> قم بتفريغ المقطع الصوتي
💡 **System Response**:
> NO IT'S NOT TOO SOON
---
### Example 2: Arabic to English Translation
🎵 **Audio Input**:
<audio controls>
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav">
</audio>
📝 **User Prompt**:
> Translate the following Arabic speech into English
or
> قم بترجمة المقطع للإنجليزية
💡 **System Response**:
> I took a loan a certain amount of money to pay off the debt
---
### Example 3: Dialect Identification
🎵 **Audio Input**:
<audio controls>
<source src="https://huggingface.co/ArabicSpeech/Octopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav">
</audio>
📝 **User Prompt**:
> Identify the dialect of the given speech
or
> ماهي لهجة المتحدث؟
💡 **System Response**:
> KSA
---