Can anyone give me a hint on how the Google T5 model is involved in the generation process? It was downloaded during inference, so is it used for prompt upsampling?
T5-XXL is used to encode the text prompt, providing the linguistic context for text conditioning.
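For concreteness, here is a minimal sketch of that encoding step using the `transformers` library. This is not taken from the Cosmos code; the `google/t5-v1_1-xxl` checkpoint id, the prompt, and the variable names are illustrative assumptions (Cosmos may package the T5 weights differently):

```python
from transformers import AutoTokenizer, T5EncoderModel

# Illustrative checkpoint id; Cosmos's actual T5-XXL weights may be
# distributed under a different name/format.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
text_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

prompt = "a robot arm stacking wooden blocks on a table"
inputs = tokenizer(prompt, return_tensors="pt", padding="max_length",
                   max_length=512, truncation=True)

# The encoder's hidden states serve as the text-conditioning signal
# that the generator attends to.
text_embeddings = text_encoder(**inputs).last_hidden_state  # (1, 512, d_model)
```

Note that only the T5 *encoder* is needed here; the decoder is never run, since the embeddings are consumed directly by the diffusion backbone.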
In the architecture, each transformer block consists of a self-attention layer (over the spatiotemporal video tokens), followed by a cross-attention layer (where the semantic context from the T5-XXL embeddings is integrated), followed by an FFN.
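As a rough PyTorch sketch of that block structure (hypothetical class, dimensions, and layer choices, not the actual Cosmos implementation):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Hypothetical sketch of one block: self-attention over spatiotemporal
    video tokens, cross-attention to T5 text embeddings, then an FFN."""

    def __init__(self, dim=1024, n_heads=16, text_dim=4096):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # kdim/vdim let the video tokens attend to text embeddings of a
        # different width; 4096 matches T5-XXL's hidden size.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True,
                                                kdim=text_dim, vdim=text_dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, text_emb):
        # x:        (batch, num_spatiotemporal_tokens, dim) video latents
        # text_emb: (batch, text_seq_len, text_dim) T5-XXL encoder outputs
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        return x + self.ffn(self.norm3(x))
```

The key point is the middle step: the queries come from the video tokens while the keys and values come from the T5 embeddings, which is how the text prompt steers the generation.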
You can refer to the "Cross-attention for text conditioning" part of the architecture section in the Cosmos paper if you like:
https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai
Hope this helped.