---
license: apache-2.0
pipeline_tag: image-text-to-text
---
Moondream is a small vision language model designed to run efficiently everywhere.

[Website](https://moondream.ai/) / [Demo](https://moondream.ai/playground) / [GitHub](https://github.com/vikhyat/moondream)

This repository contains the 2025-04-14 **4-bit** release of Moondream. On an Nvidia RTX 3090, it uses 2,450 MB of VRAM and runs at 184 tokens/second. We used quantization-aware training (QAT) techniques to build this version of the model, achieving a 42% reduction in memory usage with only a 0.6% drop in accuracy.

There's more information about this version of the model in our [release blog post](https://moondream.ai/blog/smaller-faster-moondream-with-qat). Other revisions, as well as release history, can be found [here](https://huggingface.co/vikhyatk/moondream2).
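
For intuition, here is a minimal sketch of the fake-quantization idea behind QAT (an illustrative example, not Moondream's actual training code): during training, weights pass through a quantize-dequantize step in the forward pass, while a straight-through estimator lets gradients flow as if the rounding were the identity, so the network learns weights that stay accurate after the real 4-bit conversion. The `fake_quantize` helper below is hypothetical and assumes symmetric per-tensor quantization.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Hypothetical helper, not part of the Moondream API.
    # Symmetric per-tensor quantization to a `bits`-bit integer grid.
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax  # avoid divide-by-zero
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass sees quantized weights,
    # but the backward pass treats rounding as identity so gradients flow.
    return w + (w_q - w).detach()
```

Roughly speaking, a linear layer trained this way uses `fake_quantize(self.weight)` in its forward pass, so the loss is computed against the weights the deployed 4-bit model will actually use.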
### Usage
Make sure to install the requirements:
```
pip install pillow torchao
```
```python
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream-2b-2025-04-14-4bit",
    trust_remote_code=True,
    device_map={"": "cuda"}
)

# Load the image you want to run inference on (placeholder path).
image = Image.open("path/to/image.jpg")
# Optional, but recommended when running inference on a large number of
# images since it has upfront compilation cost but significantly speeds
# up inference:
model.model.compile()
# Captioning
print("Short caption:")
print(model.caption(image, length="short")["caption"])
print("\nNormal caption:")
# Streaming generation example, supported for caption() and detect()
for t in model.caption(image, length="normal", stream=True)["caption"]:
    print(t, end="", flush=True)
print()
# Visual Querying
print("\nVisual query: 'How many people are in the image?'")
print(model.query(image, "How many people are in the image?")["answer"])
# Object Detection
print("\nObject detection: 'face'")
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s)")
# Pointing
print("\nPointing: 'person'")
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s)")
```
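
The performance numbers above are hardware-dependent. As a rough sanity check of the VRAM figure on your own GPU, here is a sketch that assumes the model and `image` are already loaded as above and the model is running on CUDA (the prompt is arbitrary):

```python
import time
import torch

start = time.perf_counter()
answer = model.query(image, "Describe this image.")["answer"]
elapsed = time.perf_counter() - start

# Allocator peak since process start, including model weights. This is a
# lower bound on true VRAM use; nvidia-smi also counts CUDA context overhead.
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**2:,.0f} MB")
print(f"Query time: {elapsed:.2f}s")
```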