Running on Zero 12 12 Explainable-Vision-Language-Model 🥶 Generate a video visualizing how a multimodal model attends to an image while generating text