Malaysian-Audio-Qwen2.5-7B-Instruct

A speech model built on top of mesolitica/Malaysian-Qwen2.5-7B-Instruct.

How did we train it?

We trained in 2 stages:

First stage

Speech understanding: this stage introduces speech datasets to the LLM.

  • We use a frozen Whisper Large V3 encoder without any pooling, which means 30 seconds of audio consumes 1500 tokens (see the sketch after this list).
  • The projection, embedding, and LM head layers are trained with full-parameter finetuning.
  • LoRA is applied to the other linear layers with rank 64 and alpha 128 (see the peft sketch at the end of this stage).
  • Training is done with multipacking at 8192 context length.
  • WandB at https://wandb.ai/huseinzol05/lora-embedding-64-audio-qwen2.5-7b-malaysian-8k
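
A minimal sketch of the audio front end described above, assuming standard transformers and torch APIs. The projector here is illustrative, not the exact implementation:

```python
import torch
from torch import nn
from transformers import WhisperModel

# Frozen Whisper Large V3 encoder: no pooling, so every encoder state
# becomes one audio "token" for the LLM.
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").get_encoder()
for p in encoder.parameters():
    p.requires_grad = False  # the encoder stays frozen in both stages

# Hypothetical projector from Whisper's hidden size (1280) to
# Qwen2.5-7B's hidden size (3584); trained with full-parameter finetuning.
projector = nn.Linear(1280, 3584)

# 30 s of 16 kHz audio -> 3000 mel frames (128 mel bins for Large V3)
# -> 1500 encoder states after the stride-2 convolution.
mel = torch.randn(1, 128, 3000)
with torch.no_grad():
    states = encoder(input_features=mel).last_hidden_state  # (1, 1500, 1280)
audio_embeds = projector(states)  # (1, 1500, 3584): 1500 audio tokens
```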

Dataset

  1. mesolitica/AudioSet-Audio-Instruction, 1 epoch.
  2. mesolitica/Classification-Speech-Instructions, 1 epoch.
  3. mesolitica/Animal-Sound-Instructions, 3 epochs.
  4. mesolitica/Transcription-Instructions, 1 epoch.
  5. mesolitica/Speaker-Diarization-Instructions, 4 epochs.
  6. mesolitica/Speech-Translation-Instructions, 2 epochs.
  7. mesolitica/CoVoST2-Instructions, 1 epoch.
  8. mesolitica/MusicBench-Instructions, 2 epochs.
  9. mesolitica/Classification-Speech-Adversarial-Instructions, 1 epoch.
  10. mesolitica/AudioSet-Audio-Adversarial-Instructions, 1 epoch.
  11. mesolitica/Sampling-Multitask-National-Speech-Corpus-v1, 1 epoch.
  12. mesolitica/Malaysian-Speech-Description-Timestamp-Instructions, 1 epoch.
  13. mesolitica/Cantonese-Radio-Description-Instructions, 1 epoch.
  14. mesolitica/Emilia-Mandarin-Description-Instructions, 1 epoch.
  15. mesolitica/Zeroshot-Audio-Classification-Instructions, 1 epoch.

In total, 6.71B tokens or 25,557.47 hours of audio.
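
To make the trainable-parameter split concrete, here is a hedged sketch with the peft library. The target-module names assume Qwen2's convention; `llm` stands for the Malaysian-Qwen2.5-7B-Instruct backbone with the audio projector attached, and the projector itself would be kept trainable outside this config.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,            # LoRA rank 64
    lora_alpha=128,  # alpha 128
    # LoRA on the other linear layers: Qwen2 attention and MLP projections.
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    # Embedding and LM head are fully finetuned rather than adapted.
    modules_to_save=["embed_tokens", "lm_head"],
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()
```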

Second stage

Speech QA: actual conversations related to coding, politics, chat assistance, and general QA.

  • We use a frozen Whisper Large V3 encoder without any pooling, which means 30 seconds of audio consumes 1500 tokens.
  • The projection, embedding, and LM head layers are trained with full-parameter finetuning.
  • Training is done with multipacking at 10240 context length (see the packing sketch after this list).
  • LoRA is applied to the other linear layers with rank 64 and alpha 128.
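
Multipacking concatenates several short samples into one fixed-length sequence so little context is wasted on padding. A minimal greedy sketch at the 10240-token stage-2 length (a real pipeline also needs block-diagonal attention masks and per-sample position-id resets, omitted here):

```python
def multipack(samples, max_len=10240):
    """Greedily pack tokenized samples into sequences of at most max_len tokens.

    `samples` is a list of token-id lists; samples longer than max_len are
    assumed to be truncated upstream.
    """
    packs, current, used = [], [], 0
    for ids in sorted(samples, key=len, reverse=True):
        if used + len(ids) > max_len and current:
            packs.append([t for s in current for t in s])
            current, used = [], 0
        current.append(ids)
        used += len(ids)
    if current:
        packs.append([t for s in current for t in s])
    return packs
```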

Dataset

  1. mesolitica/Malaysian-UltraChat-Speech-Multiturn-Instructions, 1 epoch.
  2. mesolitica/Malaysian-Speech-Instructions, 1 epoch.
  3. mesolitica/Malaysian-Reasoning-Speech-Instructions, 1 epoch.
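
End to end, the pieces above connect by splicing the projected audio states into the LLM's input embeddings in place of an audio placeholder token. The exact chat template and placeholder are not documented in this card, so the glue below is illustrative only, continuing the earlier sketches (`llm`, `audio_embeds`):

```python
import torch

AUDIO_TOKEN_ID = 0  # hypothetical placeholder id, not the real one

# Replace the placeholder's single embedding slot with 1500 audio tokens.
text_embeds = llm.get_input_embeddings()(input_ids)  # (1, T, 3584)
audio_pos = (input_ids[0] == AUDIO_TOKEN_ID).nonzero().item()
inputs_embeds = torch.cat(
    [text_embeds[:, :audio_pos], audio_embeds, text_embeds[:, audio_pos + 1:]],
    dim=1,
)
logits = llm(inputs_embeds=inputs_embeds).logits  # standard LM forward
```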