T5XXL-Unchained
#6
by
woctordho
- opened
It's unnecessary to extend the T5 vocab.
The link you shared is just a method to extend the T5 embedding vocabulary with randomly initialized tensors,
and those new embeddings would have to be thoroughly trained before they're useful.
This method is redundant because MMDiT is already partially a language model (it's a joint language-and-image expert model),
so finetuning only the MMDiT is sufficient.
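For context, a toy sketch of what that kind of vocab extension amounts to (a NumPy stand-in with made-up sizes, not the linked code): new rows are appended to the embedding table with random initialization, so they carry no meaning until trained.

```python
import numpy as np

rng = np.random.default_rng(0)
old_vocab, dim, extra = 100, 16, 8  # toy sizes; T5-XXL is ~32k x 4096

# Trained embedding table (toy stand-in for T5's token embeddings).
emb = rng.normal(size=(old_vocab, dim))

# "Extend vocab": append randomly initialized rows. These rows are
# noise until they are trained, which is the point of the comment.
new_rows = rng.normal(scale=0.02, size=(extra, dim))
extended = np.concatenate([emb, new_rows], axis=0)

# The original rows are untouched; only the tail is new and untrained.
assert extended.shape == (old_vocab + extra, dim)
assert np.array_equal(extended[:old_vocab], emb)
```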
Besides, the T5 architecture is numerically unstable: it's based on the old transformer design that predates numerically stable QK norm and attention logit scaling. I don't want to exacerbate the problem.
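To illustrate the stability point, here is a minimal NumPy sketch (hypothetical, not T5's actual code) of how QK norm bounds attention logits, whereas an unscaled dot product lets them grow with head dimension and activation magnitude:

```python
import numpy as np

def l2norm(x, eps=1e-6):
    """Normalize the last axis to (near) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def attn_logits(q, k, qk_norm=False):
    # Without scaling, logit magnitude grows with head dim and
    # activation scale. With QK norm, each logit is a dot product of
    # (near) unit vectors, so it stays bounded in [-1, 1].
    if qk_norm:
        q, k = l2norm(q), l2norm(k)
    return q @ k.T

rng = np.random.default_rng(0)
d = 64
q = rng.normal(scale=10.0, size=(4, d))  # large activations
k = rng.normal(scale=10.0, size=(4, d))

raw = attn_logits(q, k)                  # unbounded, can blow up in fp16
stable = attn_logits(q, k, qk_norm=True) # bounded regardless of scale

assert np.abs(stable).max() <= 1.0
assert np.abs(raw).max() > np.abs(stable).max()
```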