Unified multimodal architecture for spatial understanding
Generate detailed descriptions from images and videos
Generate 3D models from text or images