Update pipeline tag to audio-text-to-text
This PR updates the pipeline_tag metadata from audio-to-audio to audio-text-to-text. The paper abstract and model card describe a model that takes multimodal input (audio and text) and generates text, and this tag reflects that more accurately. It should also make the model easier to find for users searching the Hugging Face Hub for models with this functionality.
Hey! I appreciate your suggestion. I went back and forth on this when choosing the tag, but I felt it was important to highlight that the model can also generate speech (tokens) as output, unlike some "speech-aware LMs" such as QwenAudio. Arguably, the correct tag would be audio-text-to-audio-text, but to the best of my knowledge such a tag does not exist. Happy to hear your suggestions!