---
license: apache-2.0
pipeline_tag: image-text-to-text
---
Moondream is a small vision language model designed to run efficiently everywhere.

[Website](https://moondream.ai/) / [Demo](https://moondream.ai/playground) / [GitHub](https://github.com/vikhyat/moondream)

This repository contains the 2025-04-14 **4-bit** release of Moondream. On an Nvidia RTX 3090, it uses 2,450 MB of VRAM and runs at 184 tokens/second. We used quantization-aware training (QAT) techniques to build this version of the model, achieving a 42% reduction in memory usage with only a 0.6% drop in accuracy.

There's more information about this version of the model in our [release blog post](https://moondream.ai/blog/smaller-faster-moondream-with-qat). Other revisions, as well as release history, can be found [here](https://huggingface.co/vikhyatk/moondream2).
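
For intuition, here is a minimal sketch of the fake-quantization idea behind QAT (an illustrative example, not Moondream's actual training code): during training, weights pass through a quantize-dequantize step in the forward pass, while a straight-through estimator lets gradients flow as if the rounding were the identity, so the network learns weights that stay accurate after the real 4-bit conversion. The `fake_quantize` helper below is hypothetical and assumes symmetric per-tensor quantization.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Hypothetical helper, not part of the Moondream API.
    # Symmetric per-tensor quantization to a `bits`-bit integer grid.
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax  # avoid divide-by-zero
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass sees quantized weights,
    # but the backward pass treats rounding as identity so gradients flow.
    return w + (w_q - w).detach()
```

Roughly speaking, a linear layer trained this way uses `fake_quantize(self.weight)` in its forward pass, so the loss is computed against the weights the deployed 4-bit model will actually use.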
### Usage
Make sure to install the requirements:
```
pip install pillow torchao
```
```python
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream-2b-2025-04-14-4bit",
    trust_remote_code=True,
    device_map={"": "cuda"}
)

# Load the image you want to run inference on (placeholder path).
image = Image.open("path/to/image.jpg")
# Optional, but recommended when running inference on a large number of
# images since it has upfront compilation cost but significantly speeds
# up inference:
model.model.compile()
# Captioning
print("Short caption:")
print(model.caption(image, length="short")["caption"])
print("\nNormal caption:")
# Streaming generation example, supported for caption() and detect()
for t in model.caption(image, length="normal", stream=True)["caption"]:
    print(t, end="", flush=True)
print()
# Visual Querying
print("\nVisual query: 'How many people are in the image?'")
print(model.query(image, "How many people are in the image?")["answer"])
# Object Detection
print("\nObject detection: 'face'")
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s)")
# Pointing
print("\nPointing: 'person'")
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s)")
```
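
The performance numbers above are hardware-dependent. As a rough sanity check of the VRAM figure on your own GPU, here is a sketch that assumes the model and `image` are already loaded as above and the model is running on CUDA (the prompt is arbitrary):

```python
import time
import torch

start = time.perf_counter()
answer = model.query(image, "Describe this image.")["answer"]
elapsed = time.perf_counter() - start

# Allocator peak since process start, including model weights. This is a
# lower bound on true VRAM use; nvidia-smi also counts CUDA context overhead.
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**2:,.0f} MB")
print(f"Query time: {elapsed:.2f}s")
```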