--- license: apache-2.0 pipeline_tag: image-text-to-text --- Moondream is a small vision language model designed to run efficiently on edge devices. [Website](https://moondream.ai/) / [Demo](https://moondream.ai/playground) / [GitHub](https://github.com/vikhyat/moondream) This repository contains the latest (**2025-01-09**) release of Moondream, as well as [historical releases](https://huggingface.co/vikhyatk/moondream2/blob/main/versions.txt). The model is updated frequently, so we recommend specifying a revision as shown below if you're using it in a production application. **Usage** ```python from transformers import AutoModelForCausalLM, AutoTokenizer from PIL import Image model = AutoModelForCausalLM.from_pretrained( "vikhyatk/moondream2", revision="2025-01-09", trust_remote_code=True, # Uncomment to run on GPU. # device_map={"": "cuda"} ) # Captioning print("Short caption:") print(model.caption(image, length="short")["caption"]) print("\nNormal caption:") for t in model.caption(image, length="normal", stream=True)["caption"]: # Streaming generation example, supported for caption() and detect() print(t, end="", flush=True) print(model.caption(image, length="normal")) # Visual Querying print("\nVisual query: 'How many people are in the image?'") print(model.query(image, "How many people are in the image?")["answer"]) # Object Detection print("\nObject detection: 'face'") objects = model.detect(image, "face")["objects"] print(f"Found {len(objects)} face(s)") # Pointing print("\nPointing: 'person'") points = model.point(image, "person")["points"] print(f"Found {len(points)} person(s)") ```