Update README.md
README.md
CHANGED
@@ -1,199 +1,133 @@
(Removed: the previous README was the auto-generated model card template, with empty placeholder sections (Summary, Model Examination, Environmental Impact, Technical Specifications, Citation, Glossary, More Information, Model Card Authors, Model Card Contact), all marked `[More Information Needed]`.)
---
language:
- ms
- en
- zh
base_model:
- mesolitica/Malaysian-Qwen2.5-7B-Audio-Instruct
---

# Malaysian-Audio-Qwen2.5-7B-Speech-Instruct

A speech model built on top of [mesolitica/Malaysian-Qwen2.5-7B-Audio-Instruct](https://huggingface.co/mesolitica/Malaysian-Qwen2.5-7B-Audio-Instruct). It is designed for various audio tasks and is suitable for general voice-assistant question answering.

## How we trained it

It was trained on speech instructions and actual conversations related to coding, politics, chat assistance, and general QA.
- We use a frozen Whisper Large V3 encoder without any pooling, which means 30 seconds of audio consumes 1500 tokens, i.e. 1 token equals 0.02 seconds (a quick sanity check of this budget follows the list).
- The projection, embedding, and LM head layers are trained with full-parameter finetuning.
- LoRA is applied to the other linear layers with rank 64 and alpha 128 (see the config sketch below).
- Training uses multipacking with a 10240 context length.
- WandB at https://wandb.ai/huseinzol05/lora-embedding-64-audio-qwen2.5-7b-malaysian-10k-stage2
- Revision [513a900f40d372e8d7eb774e0561af043c704449](https://huggingface.co/mesolitica/Malaysian-Audio-Qwen2.5-7B-Instruct/commit/513a900f40d372e8d7eb774e0561af043c704449)
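
A quick sanity check of the audio token budget, assuming Whisper's standard 160-sample mel hop at 16 kHz and the encoder's 2x temporal downsampling (the same arithmetic the `process` helper in the usage example applies per clip):

```python
import math

# Whisper Large V3 defaults: 16 kHz input, 160-sample hop -> 10 ms per mel frame.
sr, hop_length = 16000, 160
seconds = 30
mel_frames = min(3000, math.ceil(seconds * sr / hop_length))  # capped at 3000 frames
audio_tokens = (mel_frames - 1) // 2 + 1                      # 2x downsampling -> 1500
print(audio_tokens, seconds / audio_tokens)                   # 1500 tokens, 0.02 s each
```

The LoRA setup could look like the following `peft` sketch; the exact `target_modules` list is an assumption for "other linear layers", not confirmed by the card:

```python
from peft import LoraConfig

# Hypothetical config mirroring the stated rank 64 / alpha 128; the module names
# are the usual Qwen2.5 linear layers and are an assumption, not the card's recipe.
lora_config = LoraConfig(
    r = 64,
    lora_alpha = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save = ["embed_tokens", "lm_head"],  # these layers are fully finetuned
)
```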

#### Dataset

1. [mesolitica/Malaysian-UltraChat-Speech-Multiturn-Instructions](https://huggingface.co/datasets/mesolitica/Malaysian-UltraChat-Speech-Multiturn-Instructions), 1 epoch.
2. [mesolitica/Malaysian-Multiturn-Chat-Assistant](https://huggingface.co/datasets/mesolitica/Malaysian-Multiturn-Chat-Assistant), 1 epoch.
3. [mesolitica/Malaysian-Speech-Instructions](https://huggingface.co/datasets/mesolitica/Malaysian-Speech-Instructions), 1 epoch.
4. [mesolitica/Malaysian-Reasoning-Speech-Instructions](https://huggingface.co/datasets/mesolitica/Malaysian-Reasoning-Speech-Instructions), 1 epoch.
5. [mesolitica/Malaysian-Speech-Description-Timestamp-Instructions](https://huggingface.co/datasets/mesolitica/Malaysian-Speech-Description-Timestamp-Instructions), random sampling, 0.2 epoch.
6. [mesolitica/Cantonese-Radio-Description-Instructions](https://huggingface.co/datasets/mesolitica/Cantonese-Radio-Description-Instructions), random sampling, 0.2 epoch.
7. [mesolitica/Emilia-Mandarin-Description-Instructions](https://huggingface.co/datasets/mesolitica/Emilia-Mandarin-Description-Instructions), random sampling, 0.2 epoch.
8. [mesolitica/Malaysian-SFT/combined-malaysian-sft-5k-sample.jsonl](https://huggingface.co/datasets/mesolitica/Malaysian-SFT/blob/main/combine/combined-malaysian-sft-5k-sample.jsonl), text corpus, 1 epoch.
9. [mesolitica/Malaysian-Instructions/voice_assistant](https://huggingface.co/datasets/mesolitica/Malaysian-Instructions/viewer/default/voice_assistant), text-only instructions, 1 epoch.
10. [mesolitica/Malaysian-Instructions/mixed_manglish](https://huggingface.co/datasets/mesolitica/Malaysian-Instructions/viewer/default/mixed_manglish), text-only instructions, 1 epoch.
11. [mesolitica/Malaysian-Instructions/manglish](https://huggingface.co/datasets/mesolitica/Malaysian-Instructions/viewer/default/manglish), text-only instructions, 1 epoch.
12. [mesolitica/Malaysian-Instructions/longer_respond](https://huggingface.co/datasets/mesolitica/Malaysian-Instructions/viewer/default/longer_respond), text-only instructions, 1 epoch.

In total, 3.14B tokens (including text-only instructions), or 9584.595 hours of audio.
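
As a rough cross-check of those totals under the 0.02 s/token rate above (a ballpark only; it counts audio tokens, ignores per-clip rounding, and the remainder of the 3.14B is text):

```python
# 9584.595 hours of audio at 0.02 seconds per audio token.
audio_tokens = 9584.595 * 3600 / 0.02
print(f"{audio_tokens:,.0f} audio tokens")  # ~1,725,227,100 of the 3.14B total
```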

## Benchmark

## How to use

```python
from transformers import AutoFeatureExtractor, AutoModelForCausalLM, AutoTokenizer
from transformers import TextStreamer
import torch
import librosa
import math

def process(messages, device = 'cpu', dtype = torch.float32):
    # Collect the audio file paths referenced in the chat messages.
    audios = []
    for message in messages:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(ele['audio_url'])

    # Load each audio at 16 kHz and count the encoder tokens it will occupy:
    # mel frames (capped at 3000, i.e. 30 s), halved by the encoder downsampling.
    y = []
    audio_length = 0
    for f in audios:
        y_, _ = librosa.load(f, sr = 16000)
        y.append(y_)
        audio_length += min(3000, math.ceil(len(y_) / feature_extractor.hop_length))
    audio_length = (audio_length - 1) // 2 + 1

    # Expand the single <|AUDIO|> placeholder in the chat template into one
    # placeholder token per encoder output position.
    text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    expanded_audio_token = audio_token * audio_length
    text = text.replace(audio_token, expanded_audio_token)
    inputs = tokenizer(text, return_tensors = 'pt')
    input_ids = inputs['input_ids']
    inputs_audio = feature_extractor(
        y,
        return_attention_mask=True,
        padding="max_length",
        sampling_rate=16000,
        return_tensors = 'pt'
    )
    input_features = inputs_audio['input_features']
    feature_attention_mask = inputs_audio['attention_mask']

    return {
        'input_ids': input_ids.to(device),
        'input_features': input_features.to(dtype).to(device),
        'feature_attention_mask': feature_attention_mask.to(device)
    }

tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-Qwen2.5-7B-Speech-Instruct')
feature_extractor = AutoFeatureExtractor.from_pretrained('mesolitica/Malaysian-Qwen2.5-7B-Speech-Instruct')
streamer = TextStreamer(tokenizer)
audio_token = "<|AUDIO|>"
model = AutoModelForCausalLM.from_pretrained(
    'mesolitica/Malaysian-Qwen2.5-7B-Speech-Instruct',
    torch_dtype = 'auto', trust_remote_code = True,
).cuda()

messages = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": 'speech/coding-pca-pytorch.mp3'},
        {"type": "text", "text": "is the audio about chicken"},
    ]},
]

inputs = process(messages, device = model.device, dtype = model.dtype)
with torch.no_grad():
    generate_kwargs = dict(
        **inputs,
        max_new_tokens=512,
        top_p=0.95,
        top_k=50,
        temperature=0.9,
        do_sample=True,
        repetition_penalty=1.05,
        streamer=streamer
    )
    generation_output = model.generate(**generate_kwargs)
```
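
If you want the final text rather than the streamed output, slice off the prompt and decode only the newly generated tokens (a minimal follow-up to the example above):

```python
# generate() returns the prompt plus the new tokens for decoder-only models.
prompt_length = inputs['input_ids'].shape[1]
new_tokens = generation_output[0, prompt_length:]
print(tokenizer.decode(new_tokens, skip_special_tokens = True))
```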

Output,

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Audio 1: <|audio_bos|><|AUDIO|><|AUDIO|><|AUDIO|>...<|AUDIO|><|AUDIO|><|AUDIO|><|audio_eos|>
is the audio about chicken<|im_end|>
<|im_start|>assistant
The audio is a conversation where someone asks about applying PyTorch for implementing dimensionality reduction techniques with Principal Component Analysis (PCA). The context suggests an academic or technical discussion, possibly in a classroom or seminar setting. The tone of the speaker is likely educational and engaged, seeking to understand how to effectively use PyTorch for data analysis tasks involving dimensionality reduction.<|im_end|>
```

**Generation is stochastic, so expect some randomness when running locally.**
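
If you need repeatable outputs for debugging, fix the RNG seed before generating; `transformers.set_seed` seeds Python, NumPy, and PyTorch in one call:

```python
from transformers import set_seed

set_seed(42)  # makes the sampling above deterministic for a given environment
generation_output = model.generate(**generate_kwargs)
```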

You can try more audio examples at https://github.com/mesolitica/malaya/tree/master/session/audiollm/speech