Compared to using audio-to-text plus the Qwen2-7B model, does this model have any unique advantages?
#18, opened by yinjun113
After a brief look at this model's pipeline, it seems to be a combination of an audio-to-text stage and the Qwen2-7B model. With a separate audio-to-text step, it seems that more fine-grained results can be obtained, for example more accurate transcription by choosing a dedicated speech-to-text model, or extracting the speaker's tone, gender, voiceprint, and age. Compared with that approach, what are the unique advantages of using Qwen2-Audio?
The biggest advantage is that it is end-to-end in one model.
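To make the contrast concrete, here is a toy sketch of the two designs. It does not load any real model: `AudioClip`, `transcribe`, `text_llm`, `cascaded`, and `end_to_end` are all hypothetical stand-ins, and the "tone" field is only an assumption used to represent the non-text cues that real audio carries in the signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioClip:
    """Toy stand-in for a waveform: spoken words plus one non-text cue."""
    words: str
    tone: str  # e.g. "angry"; in real audio this is implicit in the signal


@dataclass
class ModelView:
    """What the language model actually gets to see in each design."""
    text_seen: str
    tone_seen: Optional[str]


def transcribe(clip: AudioClip) -> str:
    """Hypothetical ASR stage: keeps the words, drops everything else."""
    return clip.words


def text_llm(prompt: str) -> ModelView:
    """Hypothetical text-only LLM: it can only use what the transcript holds."""
    return ModelView(text_seen=prompt, tone_seen=None)


def cascaded(clip: AudioClip) -> ModelView:
    """Audio-to-text followed by a text LLM (two separate models)."""
    return text_llm(transcribe(clip))


def end_to_end(clip: AudioClip) -> ModelView:
    """Hypothetical audio-language model: consumes the audio directly,
    so non-text cues are still available when it generates a reply."""
    return ModelView(text_seen=clip.words, tone_seen=clip.tone)


clip = AudioClip(words="where is my order", tone="angry")
print(cascaded(clip))
print(end_to_end(clip))
```

In the cascaded version, recovering tone, gender, or age means adding a separate extractor for each and passing its output to the LLM as text; in the end-to-end version there is a single model to deploy and no intermediate transcript where that information can be lost.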