google/gemma-3n-E4B-it · How can I get the model to transcribe long audio files?

The Gemma 3n developer guide states:

At launch time, the Gemma 3n encoder is implemented to process audio clips up to 30 seconds.

 However, this is not a fundamental limitation. The underlying audio encoder is a streaming encoder,
 capable of processing arbitrarily long audios with additional long form audio training.
 Follow-up implementations will unlock low-latency, long streaming applications.

I was hoping to learn whether support for long-form transcription has been added, as well as fine-tuning support for long-form transcription.
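
Until then, the only workaround I can think of is to split long audio into clips of at most 30 seconds and transcribe each clip separately. A minimal sketch of the splitting step, assuming the audio is already decoded into a flat sequence of mono samples at 16 kHz (the sample rate and the helper name `split_into_chunks` are my own assumptions, not something taken from the guide):

```python
from typing import List, Sequence


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int = 16_000,   # assumed sample rate, adjust to your audio
    max_seconds: float = 30.0,   # clip limit quoted from the developer guide
) -> List[Sequence[float]]:
    """Split decoded mono audio into consecutive clips of at most max_seconds."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    chunk_len = int(sample_rate * max_seconds)
    # Step through the samples in fixed-size windows; the last clip may be shorter.
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# Hypothetical usage: 65 seconds of silence becomes clips of 30 s, 30 s and 5 s.
chunks = split_into_chunks([0.0] * (16_000 * 65))
print([len(c) / 16_000 for c in chunks])
```

Each clip would then be passed to the model on its own and the transcripts joined afterwards. Hard cuts like this can split words at clip boundaries, which is part of why native long-form support would be preferable.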

It seems the Mistral team, with the release of Voxtral, has found a smart way of processing audio in a single forward pass of the language decoder.
See the detailed explanation on page 2 of this paper.

I would be happy to raise a PR and contribute to this.