readme update
README.md (changed)
@@ -52,6 +52,7 @@ Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilin

The model can be used with the following frameworks:

- [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
- [`Transformers` 🤗](https://github.com/huggingface/transformers): See [here](#transformers-🤗)

**Notes**:

@@ -327,4 +328,285 @@ print(30 * "=" + "BOT 1" + 30 * "=")

print(response.choices[0].message.tool_calls)
print("\n\n")
```
</details>

### Transformers 🤗

Voxtral is supported in Transformers natively!

Install Transformers from source:

```bash
pip install git+https://github.com/huggingface/transformers
```
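
Before running the examples below, it can be worth confirming that the source install actually picked up the Voxtral classes. A minimal sanity check, assuming only that it runs in the same environment you just installed into:

```python
# Minimal sanity check: the Voxtral classes only ship in recent transformers
# builds, so an ImportError here means the source install did not take effect.
import transformers
from transformers import VoxtralForConditionalGeneration, AutoProcessor

print(transformers.__version__)  # the dev version string from the install above
```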

#### Audio Instruct

<details>
<summary>➡️ multi-audio + text instruction</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
</details>
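
The snippet above streams audio from URLs; the same `path` field should also accept a file on your own disk. A minimal sketch, assuming local paths are resolved the same way as URLs and reusing `processor`, `model`, and `device` from the example above (the file name is hypothetical):

```python
# Hypothetical local file; swap in any audio file you have on disk.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "/path/to/local_recording.mp3"},
            {"type": "text", "text": "Summarize this recording in one sentence."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```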

<details>
<summary>➡️ multi-turn</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
</details>

<details>
<summary>➡️ text only</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why should AI models be open-sourced?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
</details>

<details>
<summary>➡️ audio only</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
</details>

<details>
<summary>➡️ batched inference</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```
</details>
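
All of the chat examples above call `model.generate` with only `max_new_tokens`, so decoding falls back to the model's default generation settings. If you want sampled responses instead, the standard `generate` sampling arguments work unchanged; the values below are purely illustrative, not recommendations from this card:

```python
# Reuses `inputs` from any of the chat examples above; only decoding settings change.
# The temperature / top_p values are illustrative, not tuned recommendations.
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
```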

#### Transcription

<details>
<summary>➡️ transcribe</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

inputs = processor.apply_transcrition_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```
</details>
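
To transcribe several files, a small wrapper that issues one request per file keeps things conservative. This is a sketch reusing `processor`, `model`, `device`, and `repo_id` from the snippet above, and it keeps the `apply_transcrition_request` spelling exactly as shown there; whether the method also accepts a list of audio inputs in a single call is not covered here:

```python
def transcribe(audio_path_or_url: str, language: str = "en") -> str:
    # One transcription request per file, mirroring the single-file example above.
    inputs = processor.apply_transcrition_request(language=language, audio=audio_path_or_url, model_id=repo_id)
    inputs = inputs.to(device, dtype=torch.bfloat16)
    outputs = model.generate(**inputs, max_new_tokens=500)
    return processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

for url in [
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
    "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
]:
    print(transcribe(url))
    print("=" * 80)
```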