Getting Started with Moondream on the Pi 5 | Human-Like Computer Vision

Updated 15 September 2025

In this guide, we will be exploring Moondream, a computer vision model that analyses images and applies human-like reasoning to them, allowing you to ask questions about an image in natural human language. We are going to look at what Moondream is and how it differs from models like YOLO, the limitations of the model, and how you can apply it in your maker projects. If you have never used a Vision Language Model like Moondream before, this is going to be a mind-blowingly cool guide. Let's get into it!


What Exactly is Moondream?

Moondream is a vision-language model (VLM) - a newer type of model that can analyse images and connect them with regular human language. In short, it can look at a picture, process what's in it (like objects, people, or actions), and then link that understanding to words. This means you can feed a VLM an image and ask it to do a variety of tasks, such as describing the scene, checking what the things in it are doing, or answering questions about it. For example, we can feed it an image of a person and ask what they are wearing:

An image of a man standing in an office hallway.

PROMPT
Describe the person in this image.


RESULT
The image shows a man standing in an office hallway. He is wearing a dark blue zip-up jacket, a white t-shirt, beige cargo shorts, and red sneakers with white socks. His arms are slightly bent, and his legs are apart, giving him a relaxed posture. The background features a plain white wall and a gray carpeted floor, with a door visible behind him. The man has curly hair and a beard, and he appears to be casually dressed.

This answer was processed and generated entirely on the Pi - how incredible is that? We just got a Pi to look at an image and describe what a person is wearing without any human guidance. VLMs like Moondream really do feel like magic. And Moondream can do more than just rate a fit - here are some examples of it in action:

An image of an open fridge with several food items inside.

PROMPT
Are there any beverages besides milk visible in the refrigerator?


RESULT
No, there are no beverages visible in the refrigerator besides the milk.

An image of a garden with a chicken coop.

PROMPT
Are the chickens inside or outside the coop?


RESULT
The chickens are inside the coop.

An image of a garden with a chicken coop. Chickens are outside the coop and in the garden.

PROMPT
Are the chickens inside or outside the coop?


RESULT
The chickens are outside the coop, standing on a dirt path.

An image of the Core Electronics Warehouse.

PROMPT
My friend works here, what do you think he does?


RESULT
Given the image, it is likely that your friend works in a warehouse or distribution center, handling and organizing various boxes, packages, and other materials. The large warehouse filled with shelves filled with boxes suggests he might be responsible for receiving shipments, organizing inventory, packing products for shipping, or performing inventory management tasks. The presence of a person in the warehouse also supports the idea that he could be involved in operational or administrative tasks related to the storage and handling of goods.

An image of a bed.

PROMPT
Is the dog on the bed?


RESULT
No, there is no dog on the bed in the image.

An image of a bed with a dog laying on it.

PROMPT
Is the dog on the bed?


RESULT
Yes, the dog is laying on the bed.

As we can see, Moondream is an incredibly flexible model with an entire world of possibilities, making it a great tool to have in your maker toolbox. It breezes through these tasks, and although it can be a bit hit and miss (sometimes it detected the mustard jar as a juice box), it is really reliable for simpler tasks like checking if the dog is on the bed. Now, you could achieve a lot of these tasks with a well-trained YOLO model. We could train a YOLO model to detect a bed and a dog, get the position of the two objects when detected, and do a bit of math to guess if the dog is on the bed. But what if the dog is walking past the bed and, from the perspective of the camera, the two objects overlap? We would get a false positive. Moondream doesn't suffer from this, as it analyses the whole image and uses contextual clues to answer the question. And why spend the time training lots of custom models for each specific task when one general-purpose model can do most of them?

If Moondream is so great, why don't we just use it instead of YOLO? Well, it has one massive drawback: processing time. On the Pi 5, a YOLO model can run at 1 to 50 fps, depending on the model size and resolution. With Moondream, on the other hand, you can expect 8 to 90 seconds to process a single image! This is because Moondream is an entirely different tool, with entirely different applications. If you wanted to run object detection on a robot driving around a warehouse, you would likely use a YOLO model. If instead you wanted a way to autonomously monitor your home security system to check if the washing is on the line, if a package has been delivered, or if you took the bins to the curb, you would use Moondream, as these are less time-sensitive tasks. Moondream is a slower, but much more capable, image analysis method.

Now, it isn't complete magic, and we want to stress this: it does have its limits. If we run the dog example again, but with a slightly more involved question, we get some strange outputs:

An image of a bed with a dog laying on it.

PROMPT
Is the dog on the bed? If so, is it my white dog Merkel, or my brown dog Charlie?


RESULT
The dog is on the bed, but it is not possible to determine from the image whether it is your white dog Merkel or a brown dog named Charlie. The image does not provide enough information to identify the dog's breed or color.

But strangely enough, it has no issue identifying whether a dog is on the bed, or what colour the dog is:

An image of a bed with a dog laying on it.

PROMPT
What colour is the dog on the bed?


RESULT
The dog on the bed is white.

If you are chasing repeatable and reliable answers with Moondream, it may be worth asking multiple simple questions rather than one complex one. As we will look at later, asking multiple questions about an image does not increase processing time very much either. Another thing to understand is that Moondream has limits to its training knowledge as well. If we ask it the make, model and colour of this very commonly photographed car, it has no issue:

An image of a yellow mini coupe parked on an european street.

PROMPT
What is the make, model and colour of this car?


RESULT
The car is a Mini Cooper, and it is bright yellow in colour.

But if we were to ask the same question on an image of this iconic Australian car (that wasn't seen much outside of Australia and hence won't have much training data available):

An image of a Ford AU Falcon parked in an Australian street.

PROMPT
What is the make, model and colour of this car?


RESULT
{
    "make": "Ford",
    "model": "Astra",
    "colour": "Silver"
}

We are confidently given an incorrect answer (and in a strange output format, too). It is a Ford, and it is silver, but there is no Ford Astra. Funnily enough, the Astra is another model sold on the Australian market (as the Holden Astra), so it was close!

Overall, Moondream does have limits, and you should always be aware of them when using it in your project - it may be helpful to approach Moondream with the mindset that it is a bit of an idiot that needs to be guided sometimes. And that's because that is exactly what Moondream is: small and dumb enough to run on the Pi, but just smart enough to have reasonable use cases for your projects.


What You Will Need

To follow along with this guide, you will need a:

  • Raspberry Pi 5 - Either an 8GB or 16GB model will work. Running Moondream with a desktop environment and a few other tasks in the background utilised nearly 7GB of RAM! So ensure that you are using a Pi with 8GB or more.
  • Method of cooling the Pi - For the Pi 5, the Active Cooler has never let us down.
  • Pi Camera - We are using the Camera Module V3. The first lot of examples we will be looking at will use an image already captured and stored on the Pi. Then, at the end, we will take photos on the Pi itself and process them.
  • Adapter Cable - The Pi 5 uses a smaller CSI camera connector than previous Pis, and your camera may come with the older, thicker cable, so it's worth double-checking. The Camera Module V3 WILL come with the required adapter.
  • Power Supply
  • Micro SD Card - At least 32GB in size.
  • Monitor and Micro-HDMI to HDMI Cable
  • Mouse and Keyboard

Installing Moondream

Starting in a fresh installation of Pi OS, we are going to go ahead and update our packages with:

sudo apt update

sudo apt upgrade

Then we will go ahead and create a new virtual environment (venv) called "moondream" with the following command:

python3 -m venv --system-site-packages moondream

This "--system-site-packages" ensures that we include any packages that are already installed on our Pi. We do this as we will need to use Picamera2, and it is extremely difficult to install. It comes pre-installed with PiOS, so it's easiest to just include it when we create the venv.

Then we will get our terminal to start working in this venv:

source moondream/bin/activate

You should see the name of the venv in brackets on your new line, like in the image on the right. If you ever close the terminal and want to return here (maybe to install some additional packages), run the source command above again. 

an image of the pi terminal creating a venv

Then we will go ahead and install the Moondream Python package. We will be installing a slightly older version of the package, as newer versions have made it harder to run a model locally on the Pi. This older package will have minimal effect on performance.

pip install moondream==0.0.6

By including the system packages when creating the venv, we have also pulled in a very new version of NumPy alongside Picamera2. However, Moondream requires a slightly older version, which we need to install manually:

pip3 install numpy==1.26.4

And that is all that is required package-wise!

an image of the pi terminal installing moondream pypi package

We will also need to tell our IDE Thonny to use the virtual environment we just created and installed our packages into.

The first time you open Thonny, it will be in the basic, streamlined mode. In the top right, press the button to use the default, advanced mode, and then restart Thonny.

Once you have restarted, select Run > Configure Interpreter and then under Python Executable, press the three dots to select a new one.

We will now need to navigate to /home/pi/moondream/bin. In this folder, you will find a file called "python3" (it may also be listed as just "python"); select it and hit OK.

a screenshot showing the correct venv path selected in thonny.
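If you want to confirm that Thonny really is using the venv, a quick optional sanity check like the one below can save some head scratching later (the exact paths and version numbers on your Pi may differ):

import sys
import numpy
import moondream

# This should print a path inside the "moondream" venv we created, ending in /moondream/bin/python
print(sys.executable)

# This should print the version we pinned earlier (1.26.4)
print(numpy.__version__)

print("Moondream package imported successfully")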

One final step: we need to download the Moondream model itself, which we can do with a simple wget command. We actually have two choices of model size here, dumb and dumber:

2b model. This is the larger and more capable model with better reasoning and contextual analysis. The fastest possible frame processing time is about 20 seconds. All of the examples in the previous section were analysed with the 2b model. Download it with:

wget https://huggingface.co/vikhyatk/moondream2/resolve/9dddae84d54db4ac56fe37817aeaeb502ed083e2/moondream-2b-int8.mf.gz?download=true

0.5b model. This is the smaller and less capable model with limited thinking abilities. The fastest possible frame processing time, however, is about 8 seconds. You can download it by entering:

wget https://huggingface.co/vikhyatk/moondream2/resolve/9dddae84d54db4ac56fe37817aeaeb502ed083e2/moondream-0_5b-int8.mf.gz?download=true

Download both models and give them a go if you wish!

These models will download as compressed .gz files into your Pi's home folder. Right-click the file(s) and hit "Extract Here" to unzip them. This process may take a good minute as these are large files.

Once you have the raw .mf file(s) (Moondream's model format), create a new folder somewhere accessible. We created ours on the desktop. Copy the model file(s) into this folder. In this folder, also create a new folder called "images". We will be placing any images we wish to analyse in this folder. Your project folder should look something like the image on the right.

We are now ready to get into using Moondream!

a screenshot of the project folder structure for moondream to use.


Using Moondream with Python

Open up a new script in Thonny, and paste in the following demo code. Save the script to the project folder we just created. If the script is not saved to this folder, it won't be able to read the models or images.

import moondream as md
from PIL import Image
import time

# Load the local .mf file
model = md.vl(model="./moondream-0_5b-int8.mf")

# Load your image
image = Image.open("./images/crash1.jpg")

# Encode the image 
encoded_image = model.encode_image(image)

# Ask questions about the image
answer = model.query(encoded_image,"Is the car driveable?")["answer"]
print(answer)

As we can see, there is actually not that much code involved in running this model. We start by importing the Moondream package, the Python Imaging Library (PIL), and the time module:

import moondream as md
from PIL import Image
import time

Then we load our model. By default, the code loads the 0.5b model, but you can change it to load the 2b model instead by changing the name to "./moondream-2b-int8.mf". This name simply needs to match the name of the model file in your project folder.

# Load the local .mf file
model = md.vl(model="./moondream-0_5b-int8.mf")
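For example, a minimal way to make the swap easy (assuming both model files are sitting in your project folder under these exact names) is to keep the path in a single variable near the top of your script:

# Swap this path to "./moondream-0_5b-int8.mf" to use the smaller, faster model instead
MODEL_PATH = "./moondream-2b-int8.mf"
model = md.vl(model=MODEL_PATH)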

Then we load the image from our images folder and encode it. This encoding process is a necessary step, as it prepares the image to be analysed by Moondream. It only needs to be done once per image.

# Load your image
image = Image.open("./images/crash1.jpg")

# Encode the image 
encoded_image = model.encode_image(image)

And finally, we take that encoded image and use Moondream to ask a question about it. Once an image is encoded, you can ask as many questions as you want about it.

# Ask questions about the image
answer = model.query(encoded_image,"Is the car driveable?")["answer"]
print(answer)

With our question loaded, let's see how the 0.5b model performs. Based on the image name and the question, there is clearly one appropriate answer.

An image of a crashed car sitting in the grass.

PROMPT
Is this car drivable?

Model Size

0.5b

RESULT
Yes, the car is a small, red, and white vehicle that can be driven and driven.

As you can see, a bit of a nonsensical and wrong answer. If instead we try the 2b model with the same question and image, we get:

An image of a crashed car sitting in the grass.

PROMPT
Is this car drivable?

Model Size

2b

RESULT
No, the car is not driveable. It is a wrecked car that is sitting in the grass.

A much better answer. We ran this example to demonstrate the differences between the 2b and 0.5b models. This is a harder question that requires more contextual clues and a bit more thinking about what's going on. As a result, the smaller model struggles and gives us a nonsensical answer, but the more powerful 2b model passes with ease. The 0.5b model does better on simple questions that can be answered without having to infer any information - things that are very apparent. For example, it will happily answer what colour the car is, what the surrounding environment is, or a question like the following:

An image of a crashed car sitting in the grass.

PROMPT
Is the car sitting in the grass?

Model Size

0.5b

RESULT
Yes, the car is sitting in the grass.

Now, detecting car crashes in the mountains is a little removed from your usual maker project, so let's fire up a practical example, like checking if we took out the recycling bins. Using the same code, we took some photos of our Australian rubbish bins out on the curb - with a similar angle and conditions to those you might use in a real-world setup. Using the 0.5b model, we analysed the following images of different bin configurations to check if the recycling bin is out.

An image of a curbside with Australian bins on it.

PROMPT
Answer only yes or no. Is there a bin on the curb with a yellow lid?

Model Size

0.5b

RESULT
No

An image of a curbside with Australian bins on it.

PROMPT
Answer only yes or no. Is there a bin on the curb with a yellow lid?

Model Size

0.5b

RESULT
Yes

An image of a curbside with Australian bins on it.

PROMPT
Answer only yes or no. Is there a bin on the curb with a yellow lid?

Model Size

0.5b

RESULT
No

As you can see, the 0.5b model is not entirely useless, although it did take a bit of prompting to find a question that worked. We found that we had to mention "on the curb", and we initially asked it to find the green bin with a yellow lid, but it would always return a positive detection regardless of whether the bin was there or not. In this case, the difference between an 8-second and a 20-second processing time is not mission-critical, so it would be most practical to just use the 2b model, which seemed to always get the answer correct without as much "babying".

In these examples, we also told it to "Answer only yes or no." There are two reasons we do this. First, it makes it easy to actually use the response in your project. The answer is stored as a string variable, so you can write code to check it: if it says yes, do something; if no, do something else. You can also ask questions like "Answer only with a single number, how many bottles are in this image?" to get an easy-to-use variable, as in the sketch below.
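Here is a rough sketch of that first idea (the image name and the follow-up actions are just placeholders to adapt to your own project):

import moondream as md
from PIL import Image

model = md.vl(model="./moondream-2b-int8.mf")

image = Image.open("./images/bins1.jpg")   # hypothetical bin photo
encoded_image = model.encode_image(image)

# Yes/no answers come back as plain strings, so we can branch on them directly
answer = model.query(encoded_image, "Answer only yes or no. Is there a bin on the curb with a yellow lid?")["answer"]
if "yes" in answer.lower():
    print("Recycling bin is out - nothing to do.")
else:
    print("Recycling bin is NOT out - time for a reminder!")   # placeholder action

# Numeric answers can be converted too, with a fallback in case the model rambles
count_answer = model.query(encoded_image, "Answer only with a single number, how many bins are on the curb?")["answer"]
try:
    bin_count = int(count_answer.strip().rstrip("."))
except ValueError:
    bin_count = None
print(bin_count)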

The second reason is that it GREATLY speeds up processing time. Much of the time the Pi spends processing an image is spent generating the text response, so the fewer characters it has to generate, the less time spent per image. To analyse this, we have some code here that times each step of the process:

import moondream as md
from PIL import Image
import time

# Start the timer
start_time = time.time()

# Load the local .mf file
model = md.vl(model="./moondream-2b-int8.mf")

model_time = time.time()
print(f"Model Load Time: {model_time - start_time:.2f} seconds")

# Load your image
image = Image.open("./images/flower2.jpg")
# Encode the image 
encoded_image = model.encode_image(image)

encode_time = time.time()
print(f"Image Encode Time: {encode_time - model_time:.2f} seconds")

# Ask questions about the image
answer = model.query(encoded_image,"Answer only yes or no, are there flowers in this image?")["answer"]
print(answer)

answer_time = time.time()
print(f"Answer Time: {answer_time - encode_time:.2f} seconds")
print(f"\nTotal Image Time: {answer_time - model_time:.2f} seconds")

If we run this code on the 2b model, we get the following breakdown of how long each step took with an "Answer only yes or no." question:

  • Model Load Time: 23.47 seconds
  • Image Encode Time: 18.61 seconds
  • Answer Time: 1.34 seconds

First of all, the model load time. This is an unavoidable toll that has to be paid - but it only has to be paid once! Whenever you run a script, the model must be loaded, but once it is loaded, you can analyse as many images as you want in a while True loop. The only time you will need to load the model again is when you restart the script.

Then there is the image encode time. This is another largely unavoidable toll, and it has to be paid for every frame or image you want to analyse. It is also already minimised by feeding Moondream the correct resolution image. Every image we have fed it so far has been 512x512 pixels or smaller. If you feed it an image larger than this, it can take 30 to 45 seconds to encode. We suspect that larger images need to be downsampled to Moondream's processing resolution (likely 512x512 pixels). So, very importantly, ensure that you feed it images that have already been shrunk to 512x512 or smaller - as in the sketch below.
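If your source images are bigger than this, a quick resize with PIL before encoding keeps the encode time down. A minimal sketch (the file name here is just a placeholder):

import moondream as md
from PIL import Image

model = md.vl(model="./moondream-2b-int8.mf")

# Open a hypothetical full-resolution photo and shrink its longest side to 512 pixels,
# keeping the aspect ratio, before handing it to Moondream
image = Image.open("./images/large_photo.jpg")
image.thumbnail((512, 512))

encoded_image = model.encode_image(image)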

Then we have the answer time. This is the part we can control the most, by prompting Moondream to limit the size of its text output. The time you see above is about the quickest you can get, with a tiny answer like a yes or no, or a single number. If we instead asked a question where it generates a paragraph's worth of text, this answer time could take longer than 40 seconds. An example of this:

The side of a house with branches of a flowering bush in shot.

PROMPT
I am vision impaired, describe to me this image in detail.

Model Size

2b

RESULT
The image depicts a vibrant scene of green plants with red flowers growing against a tan or light brown wall. The plants have leaves in various shades of green, some with reddish tones, and some showing signs of drying or wilting. Two prominent red flowers are located near the center, with additional flowers scattered throughout the foliage. The plants are positioned in front of the wall, partially obscuring it. Behind the plants, a white window with horizontal blinds is visible, adding contrast to the scene. The overall composition highlights the natural beauty of the plants and the wall.

It is worth knowing that once you have encoded an image, you can ask multiple questions about it. If we were to ask one yes or no question, the total processing time of an image would be 19.95 seconds (image encode time + answer time). If we were to ask 5 questions about the image, it would instead take around 25.31 seconds.
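Here is a rough sketch of that reuse (the questions are just examples) - the image is encoded once, and each extra short question only adds a second or two:

import moondream as md
from PIL import Image

model = md.vl(model="./moondream-2b-int8.mf")

# Pay the encode cost once...
image = Image.open("./images/flower2.jpg")
encoded_image = model.encode_image(image)

# ...then ask several short questions against the same encoding
questions = [
    "Answer only yes or no. Are there flowers in this image?",
    "Answer only yes or no. Is it daytime?",
    "Answer only with a single number, how many flowers are visible?",
]

for question in questions:
    answer = model.query(encoded_image, question)["answer"]
    print(f"{question} -> {answer}")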

After running the same code with the 0.5b model, we get the following timing breakdown:

  • Model Load Time: 10.68 seconds
  • Image Encode Time: 8.75 seconds
  • Answer Time: 0.63 seconds

As you can see, the 0.5b model is roughly two times quicker across the board - a worthwhile consideration if you are chasing something a little faster. But often the difference between an 8-9 second processing time and a 20-second processing time is minimal, so the 2b model may be your weapon of choice the majority of the time.


Processing Live Images with the Camera

Let's finish this guide by looking at how we can process images coming directly from a camera module plugged into the Raspberry Pi. First of all, ensure you plug your camera cable in the correct orientation, or you may be in for some head scratching.

The image on the right shows the correct way.

Then create a new script, paste in the following code, and save it to the project folder:

import moondream as md
from PIL import Image
import time
from picamera2 import Picamera2

# Initialize the Picamera2
picam2 = Picamera2()
picam2.configure(picam2.create_still_configuration(main={"size": (512,512), "format": "RGB888"}))
picam2.start()

# Load the moondream model
model = md.vl(model="./moondream-2b-int8.mf")

while True:
    # Capture image as numpy array
    image_array = picam2.capture_array()

    # Convert numpy array to PIL Image
    image = Image.fromarray(image_array)

    print("Photo captured. Starting query...")

    # Encode the image
    encoded_image = model.encode_image(image)

    # Ask questions about the image
    answer = model.query(encoded_image, "Is there a man with a blue jacket in this image?")["answer"]
    print(answer)

This code is practically the same as what we have been using, but instead of sourcing the image from the project folder, it uses the camera. It also places everything inside a while True loop, meaning it will keep taking and analysing photos continuously.
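If you drop something like this into a real project, you will probably want to pace the loop and act on the answer rather than just printing it. Here is a rough sketch of that idea (the prompt, the 30-second delay, and the "do something" parts are all placeholders to adapt):

import time
import moondream as md
from PIL import Image
from picamera2 import Picamera2

# Same camera setup as the script above
picam2 = Picamera2()
picam2.configure(picam2.create_still_configuration(main={"size": (512,512), "format": "RGB888"}))
picam2.start()

# Load the moondream model once, outside the loop
model = md.vl(model="./moondream-2b-int8.mf")

last_answer = None

while True:
    # Capture a frame and encode it
    image = Image.fromarray(picam2.capture_array())
    encoded_image = model.encode_image(image)

    answer = model.query(encoded_image, "Answer only yes or no. Is there a man with a blue jacket in this image?")["answer"]

    # Only react when the answer changes, rather than on every single capture
    if answer != last_answer:
        print(f"Answer changed to: {answer}")   # placeholder for your own action
        last_answer = answer

    # Wait a while before the next capture - tune this to suit your project
    time.sleep(30)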


Where to From Here?

We now have a system where we can take an image from a camera, or another source, deeply analyse it with a prompt in natural human language, and get a response back, also in natural human language. All of this runs locally, offline, and on a Pi, which is incredible to think about! Moondream is just such a fantastically cool VLM, and we would like to congratulate the Moondream team for developing such a lightweight yet capable model.

We are also only scratching the surface of what Moondream is capable of; there are a variety of other functions, such as object detection and location, and image captioning. These are more or less applicable to maker projects, but are definitely worth checking out on the PyPI page.
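As a rough taste of what is there (a sketch only - these method names are taken from the Moondream PyPI documentation and may behave differently, or not be available, in the older 0.0.6 package, so double-check against the docs for your installed version):

import moondream as md
from PIL import Image

model = md.vl(model="./moondream-2b-int8.mf")
encoded_image = model.encode_image(Image.open("./images/dog1.jpg"))   # hypothetical image

# Generate a short caption describing the whole image
print(model.caption(encoded_image)["caption"])

# Ask the model to locate a particular object in the image
print(model.detect(encoded_image, "dog"))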

If you are chasing a more real-time solution, check out our YOLOE guide. YOLOE is a promptable YOLO object detection model that lets you describe what to locate. Again, it is a different tool from what we looked at here, with different uses, but you could reasonably get 5-10 fps if your project needs something a little faster.

Another thing worth checking out is setting up a Moondream server. If you take a look at the PyPI page, you will find some documentation on using the API of a Moondream server. At the time of writing, Moondream offers very generous free cloud server usage, and there is also some good documentation on hosting your own locally. We haven't used this very much, and it's outside the scope of this guide, but you could reasonably get processing times down to one or two seconds.

And if you are going to deploy a system like this in the field, ensure you run plenty of testing to check that your prompts are reliable. Take lots of images to get a good sample and run your prompts on them. Don't run the same prompt on the same image over and over, as you will likely get the same response every time; test with different images.
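A minimal way to do that kind of testing (the folder name and prompt here are placeholders) is to run the same prompt over a whole folder of test shots and eyeball the results in one go:

import glob
import moondream as md
from PIL import Image

model = md.vl(model="./moondream-2b-int8.mf")
prompt = "Answer only yes or no. Is there a bin on the curb with a yellow lid?"

# Loop over every test photo collected for this prompt
for path in sorted(glob.glob("./test_images/*.jpg")):   # hypothetical folder of test shots
    image = Image.open(path)
    image.thumbnail((512, 512))   # keep encode times reasonable
    encoded_image = model.encode_image(image)
    answer = model.query(encoded_image, prompt)["answer"]
    print(f"{path}: {answer}")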

Well, that wraps us up for this guide. If you made anything cool with this, found any good prompting tips, or just need a hand with anything from this guide, feel free to drop a comment in the forum topic below. Until next time, happy making!
