Learning Video Context as Interleaved Multimodal Sequences
Paper • 2407.21757 • Published
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands