Can anyone give me a hint on how the Google T5 model is involved in the generation process? It was downloaded during inference, so is it used for prompt upsampling?
T5-XXL is used to encode the text prompt, providing the linguistic context for text conditioning.
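For concreteness, here is a minimal sketch of that encoding step using the `transformers` library. This is not taken from the Cosmos code; the `google/t5-v1_1-xxl` checkpoint id, the prompt, and the variable names are illustrative assumptions (Cosmos may package the T5 weights differently):

```python
from transformers import AutoTokenizer, T5EncoderModel

# Illustrative checkpoint id; Cosmos's actual T5-XXL weights may be
# distributed under a different name/format.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
text_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

prompt = "a robot arm stacking wooden blocks on a table"
inputs = tokenizer(prompt, return_tensors="pt", padding="max_length",
                   max_length=512, truncation=True)

# The encoder's hidden states serve as the text-conditioning signal
# that the generator attends to.
text_embeddings = text_encoder(**inputs).last_hidden_state  # (1, 512, d_model)
```

Note that only the T5 *encoder* is needed here; the decoder is never run, since the embeddings are consumed directly by the diffusion backbone.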
In the architecture, each transformer block consists of a self-attention layer (over the spatiotemporal video tokens), followed by a cross-attention layer (where the semantic context from the T5-XXL embeddings is integrated), followed by an FFN.
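As a rough PyTorch sketch of that block structure (hypothetical class, dimensions, and layer choices, not the actual Cosmos implementation):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Hypothetical sketch of one block: self-attention over spatiotemporal
    video tokens, cross-attention to T5 text embeddings, then an FFN."""

    def __init__(self, dim=1024, n_heads=16, text_dim=4096):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # kdim/vdim let the video tokens attend to text embeddings of a
        # different width; 4096 matches T5-XXL's hidden size.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True,
                                                kdim=text_dim, vdim=text_dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, text_emb):
        # x:        (batch, num_spatiotemporal_tokens, dim) video latents
        # text_emb: (batch, text_seq_len, text_dim) T5-XXL encoder outputs
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        return x + self.ffn(self.norm3(x))
```

The key point is the middle step: the queries come from the video tokens while the keys and values come from the T5 embeddings, which is how the text prompt steers the generation.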
You can refer to the "Cross-attention for text conditioning" part of the architecture section in the Cosmos paper if you like:
https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai
Hope this helped.