ModernBART wen?
Title is /j, but in all seriousness, is there any interest out there in producing a BART/T5-like encoder-decoder model with the improvements here (Flash Attention, RoPE, etc.)?
(misclick xD)
The encoder-decoder models could even use the current checkpoint, if ModernBERT is supported:
https://github.com/huggingface/transformers/issues/35385
https://discuss.huggingface.co/t/training-modernbert-gpt2/134398/2
Similarly, it would be nice if they added support for using the Llama codebase/arch as the decoder in EncoderDecoder models, so that SmolLM2 etc. could be used. Since ModernBERT's tokenizer is based on OLMo's, adding support for OLMo would also be good: it might then be possible to use a single tokenizer for both encoding and decoding, with OLMo 1B as the decoder.
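For anyone wanting to try the warm-start idea above, here is a rough sketch using the generic `EncoderDecoderModel` class from `transformers`. The function name `warm_start_seq2seq` and the default checkpoint IDs are just my picks for illustration, and whether ModernBERT is actually accepted as the encoder depends on your `transformers` version (that is what the linked issue is about), so treat this as an untested assumption rather than a recipe.

```python
def warm_start_seq2seq(encoder_id="answerdotai/ModernBERT-base", decoder_id="gpt2"):
    """Hypothetical helper: glue a pretrained encoder to a pretrained decoder.

    Assumes the installed transformers version supports both architectures
    inside EncoderDecoderModel; this is NOT guaranteed for ModernBERT.
    """
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import AutoTokenizer, EncoderDecoderModel

    # Loads both checkpoints; the decoder's cross-attention layers are newly
    # initialized, so the combined model needs fine-tuning before it is useful.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(encoder_id, decoder_id)

    enc_tok = AutoTokenizer.from_pretrained(encoder_id)
    dec_tok = AutoTokenizer.from_pretrained(decoder_id)

    # GPT-2 style tokenizers ship without a pad token; reuse EOS as padding.
    if dec_tok.pad_token is None:
        dec_tok.pad_token = dec_tok.eos_token

    # Generation and loss computation need these set on the combined config.
    model.config.decoder_start_token_id = dec_tok.bos_token_id
    model.config.pad_token_id = dec_tok.pad_token_id
    return model, enc_tok, dec_tok
```

Calling `warm_start_seq2seq()` downloads both checkpoints. With two different tokenizers you encode inputs with `enc_tok` and labels with `dec_tok`; the single-tokenizer setup mentioned above would remove that split.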
I've messed around a bit with creating a (more) modern T5 with better data, context length, tokenizer, etc., with medium-ish results. The improvements were decent, and it might need more scaling in terms of data/compute, but the preliminary results didn't impress me enough to invest in that yet. You can find some of them here. Note that the core T5 architecture is the same, so no custom code is needed.
- codebase I used for pretraining: https://github.com/pszemraj/nanoT5/tree/fineweb-edu-test
- other codebases worth looking at that build on nanoT5 and implement some more substantial updates to the arch: https://github.com/catie-aq/flashT5 and https://github.com/Knowledgator/TurboT5
If anyone is interested in collaborating on encoder-decoder model updates/pretraining, feel free to reach out on Discord (username is the same as my HF one).