In this video, we explore Moondream - a Vision Language Model that perceives the world like humans do. Moondream can be asked questions in natural language, and it generates responses in natural language as well. It uses context and an understanding of the image as a whole to do more advanced tasks than object detection. We will also dive into how you can use it practically in maker projects and how to set it up on the Pi 5.

Transcript

This is an image we took just a few minutes ago, and I'm going to go ahead and ask my Pi 5 here, in plain English, to describe everything that the person is wearing and their features. And after a short while of processing, the Pi is able to describe what's going on. That entire response was generated right here on the Pi, using a vision language model called Moondream. It can obviously do a lot more than just rating a fit. It can check if the laundry is still on the line. Is there a package at my door? Did I leave the fridge open? Is the dog on the bed?

These would be quite difficult things for, you know, a more traditional model like YOLO to answer. But with Moondream, we just need to ask it politely in plain English. How wickedly cool. Welcome back to Core Electronics. Today we are looking at Moondream, a tiny vision language model that is powerful enough to answer some pretty tricky computer vision questions, but efficient enough that you can run it locally on a Pi. Now, this is a computer vision tool that might be a bit different to what you're used to. It's super easy to use, but most of the challenge is understanding what it can actually do.

What's the actual application of this? Moondream is a vision language model, a type of model that is trained to look at an image and understand it in sort of the way that a human does. It's image understanding in natural language. It kind of feels like magic, like the Pi is actually alive, looking at the image and describing it the way a human would. Think of it like that as well: you can show a picture to a person, or the Pi in this case, ask a question about the image, and get a reply. So you could ask, is there a car in this image? And it would say, yes, there is.

Now, you could do this with YOLO as well. YOLO is trained to identify what a car looks like and can detect and locate it in an image. However, let's say we wanted to check, is there a person in the car? Or what color is the car? Or is the car on a road? Or what is the make and model of the car? There is no way that a YOLO model could do these out of the box. But Moondream, well, it isn't always perfect, but it has a far better chance of actually answering these questions. It isn't looking for a specific set of objects. It's looking at the entire image and kind of deducing what's going on. It's got a bit more advanced reasoning built into it.

So why don't we just use Moondream for everything if it's so superior? Well, it's far, far more powerful than things like YOLO, but it runs far, far slower. You can get a YOLO model running on a Pi at maybe 10 to 50 FPS if you really tried. Moondream, on the other hand, takes anywhere between 8 seconds and 90 seconds to process an image, depending on the question and what you're actually doing with it. Now, that seems really slow, and it is, but this is a completely different tool with completely different use cases. If YOLO is like being handed a fork, then Moondream is like suddenly being handed a spoon.

If I had a robot driving around looking for people, avoiding obstacles, you know, all that sort of stuff, I would use something like YOLO without a doubt. You really want that real-time detection. But if I wanted to analyze my home security footage to see if I left the washing out on the line, or to automatically check if the right colored bins are out this week, or if the driveway at home is free, or, another bin example, whether the bin needs to be emptied? These sorts of things, I'd probably use Moondream. Yes, if you really tried, you could train a YOLO model to do a lot of these jobs. However, speed isn't an issue here. A 30-second processing time for these tasks is more than fine.

And why train lots of little specific models when one big model can do the job? Now, it isn't complete magic. It does have its limits. If I ask it for the make and model of this Mini Cooper, it usually works, and for really, really common cars you can get some sort of recognition out of it. But if I ask it to identify an Aussie Holden, it probably doesn't have a clue what's going on. That's most likely down to a lack of training data and processing ability here. There are crazy powerful vision language models out there that could probably do that, but there's no way they would run on a Pi.

And that's probably a good way of thinking about Moondream. It's small enough that it's able to run on a Pi, but it's just powerful enough that it's not completely useless. It is definitely handy and capable, but keep the requests somewhat simple. It's a little bit dumb sometimes. It's really something you've got to play around with to figure out its limits. It also comes in two sizes, dumb and dumber. These are the 2B and 0.5B models respectively, with the B standing for billions of parameters. It's the same model, just in two different sizes. The 2B model is the larger, smarter, more capable one that gives more accurate answers, but the fastest possible response time is about 22 to 25 seconds.

The 0.5B model is dumber, and it can really feel like it's trying to rage bait you sometimes, but you can get response times down to about 8 to 10 seconds. The best way to demonstrate this is by actually using it, so let's go ahead and boot it up. We are of course going to be using the Pi 5 because it's the fastest Pi, and you're either going to need an 8 or 16 gigabyte model here. These models are quite RAM intensive for something so small. The 2B model running on the Pi, with all the desktop background tasks and whatnot going on at once, uses about 7 gigabytes in total.

You're also going to need a way to keep your Pi cool on this one because you may have long periods of maximum CPU utilization. The official active cooler is more than good enough though. And if you want to capture images on the Pi itself, you're also going to need a camera and cable. We're going to be using the Camera Module 3 here. You can find everything you need in the written guide linked below, which is also going to have all the commands and code that we're going to be using if you want to follow along.

Alrighty, I'm here in Pi OS and to start off, I'm just going to create a new virtual environment and then go ahead and enter it. We're going to ensure that we include system site-packages here because Picamera2 is a little bit difficult to install and it already comes on the Pi, so we'll just make sure we can use it from the environment. Alrighty, we're going to go ahead and start by installing the Moondream package like so. Now we are using an older version of this package because it's a bit easier to run the model on the Pi with it. The newer one kind of restricts you a bit more.

Once that's done, we're going to need to go and specifically install a version of NumPy that's going to keep Moondream and Picamera2 happy, like so. And that's really it package-wise. Credit to the Moondream team, super simple installation. Now we're going to need to go ahead and download the models though, which you can do with some wget commands, and you can choose to download either the 0.5B or the 2B model here. These are going to be 500 megabytes and nearly 2 gigabytes respectively. Let's start with the 0.5B model. I'm going to paste this in because I'm not writing this out by hand. Let that go ahead and download.

Just flexing our internet speed. This might be pretty average for the rest of the world, but for Australia, this is pretty lightning quick speed we have here at Core Electronics. Just saying. Sweet. And then we're going to go ahead and download the 2B model like so as well. Again, I'm not typing that out. It's a very long URL. There we go. Alrighty, with them downloaded, I'm going to go ahead and create a folder on my desktop just so we can access all this nice and easily. And then I'm going to go into my home folder and we're going to see the two files that we downloaded here. I'm going to go ahead and extract the 2B model here. This might take a little bit of time on your Pi. It's not very fast at this.

There we go. And we're going to go ahead and do the 0.5B model as well. There we go. We're just going to go ahead and drag those two .mf files into our Moondream folder like so. In our project folder, I'm going to go ahead and create a new folder as well, and I'm going to call it just "images". Now, any images or photos you want to analyze, pop them in here, because that's what the code is going to be looking for: whatever's in this folder. And then I'm going to go ahead and open up Thonny. We're then going to need to hit Run, Configure interpreter. We're going to set up Thonny to use the virtual environment we just created.

Click those three dots, go home. We should have a Moondream folder, which is the virtual environment we made. Go to bin and then look for the file called just python. Click it like so. As you can see, we've got a path set. Click OK and we are all set up. And I'm going to paste in the first bit of demo code. I'm then going to ensure that I go ahead and save that into the same Moondream project folder that we just created. As you can see, very little code to actually operate the Moondream package. We start by importing the Moondream package and the Python Imaging Library. And then we go ahead and load our model. This is the 0.5B model. If we wanted to load the 2B model, we just swap it out like so.

This is just the same name as whatever the model is called in our project folder. And then we go ahead and load in an image from the images folder. And then we're going to encode that image, which is just going to prepare it to be analyzed by Moondream. And then we go ahead and query the model with a question. And you can put whatever you want here, just plain English text. And I've just gone ahead and put a few test images in our images folder like so, just so we can do some demos. So let's go back and use the 0.5B model. And I'm going to ask it a question. Is the car drivable? As you can probably tell by the name of the image that we're loading, there's an obvious answer to this question.
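If you're following along without the written guide open, here's roughly what that first demo script boils down to. Treat it as a sketch rather than the exact guide code: it assumes the older moondream package that can load local .mf files, and the model and image filenames are placeholders you'd swap for whatever your extracted files are actually called.

```python
# Minimal sketch of the first demo script (not the exact guide code).
# Assumes the older moondream package that supports local .mf model files.
import moondream as md
from PIL import Image

# Load the 0.5B model; to use the 2B model, just point this at the other .mf file.
model = md.vl(model="moondream-0_5b-int8.mf")   # placeholder filename

# Load an image from the images folder and encode it (prepare it for the model).
image = Image.open("images/crash1.jpg")         # placeholder image name
encoded = model.encode_image(image)

# Ask a question in plain English about the encoded image.
result = model.query(encoded, "Is the car drivable?")
print(result["answer"])
```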

Let's give that a run. And we're just going to fast forward till it's done. Going to be a lot of that in this video. And there we have our answer. Yes, the car is a small red and white vehicle that can be driven and driven. If we go ahead and look at the image, what was it called? Crash one. That's probably not a car you want to be driving. And we did this to show you that 0.5B is a little bit dumb. I'm now going to go back and ask the exact same question, but I'm going to use the 2B model instead. And that is probably a bit better of an answer. This is an example of a harder question. Knowing whether the car is safe to drive or not requires a bit more thinking about what's actually going on.

And the smaller model struggles with this because it can't understand all the context as well. If we instead asked it some simpler questions, like what color is the car, or is the car in the grass, it gives much better answers, because these are simpler questions that you can just look at and answer without much reasoning.

Now, detecting car crashes in the mountains is probably a little bit removed from your usual maker project, so let's fire up a practical example like checking if we took out the recycling bin or not. Quite a practical image here. This is what your project might see, you know, from your home security system or just the camera on the Pi. So, I'm going to start by telling it to analyze an image with all the bins out, including the recycling one, and I'm just going to ask it, is there a bin with a yellow lid? Let's give that a run, and I'll see you in about nine seconds.

Beautiful. Yes, there is a bin. It's maybe getting a little bit confused because it thinks, oh, it's a bin. It must have rubbish in it. It doesn't actually have rubbish in it, but it says yes. Let's give it a go with an image without any bins in it, and I'm going to get a little bit tricky here and do something that we'll look at in a minute, and I'm going to start by saying, answer only yes or no. Is there a bin on the curb? You can probably guess why we might be doing this. Beautiful. No, there isn't. We're going to do one more test on this. This one has two bins out, but it's not a recycling bin. There's no yellow lid, and let's go ahead and run that.

Ah, how crazy is that? We're just feeding it an image, and it's able to look at it and go, no, there isn't a bin there with a yellow lid. So as you can see, the 0.5B model is not completely useless. It does well in some light tasks, but I found that about two-thirds to three-quarters of the time, I would need to whip out the 2B model to get an accurate result. Honestly, if you can deal with the roughly three times longer processing times, just use the 2B model. Speaking of longer processing times, we also have a second version of this code with some timing measurements built into it.

I'm going to go ahead and paste that in like so, and ensure that I save it to the same folder, or we're not going to be able to read any images. All right, we're going to go ahead and use the 2B model to analyze this picture of flowers, and we're just going to ask, are there flowers in this image? Go ahead and run that, and we should see a breakdown of all the timings. And once that's done, we have a breakdown here of the three big steps and how long they took. The first one is our model load time. This is the most time-intensive part of the entire process, but we only need to do it once.

If you had this in a while true loop analyzing a frame every 10 seconds, the code would start by loading the model. Once it's loaded, you can run and analyze as many frames as you want. The next one is the encode time. This is the biggest killer of them all, and it's the one we can't really minimize. Here it's 18 and a half seconds, and there aren't really many ways we've found to speed this up. And then we have the answer time, which is about eight seconds here. Now, this is actually the part that you can control the most. Most of that time is actually spent generating the text to give to you. So the less text we generate, the quicker it is.

So we were getting clever before, and we said answer only yes or no. Let's give that a go. So: answer only yes or no, are there flowers in this image? Give that a run. And look at that. With a one-word response, we've got our answer time down to 1.34 seconds. Down here, we've also got our total image time, which is just the encode time plus the answer time. So once the model's booted up, how long realistically does it take to analyze a frame? 20 seconds is fairly reasonable for the 2B model, and with a one-word reply and our already minimized encode time, this is about the quickest the 2B model can get.
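For reference, the timing version doesn't need anything fancy. A sketch like the one below, wrapping each of the three big steps with Python's time module, gives the same kind of breakdown; the filenames are placeholders again, and the exact code in the written guide may differ.

```python
# Rough sketch of the timing version: time each of the three big steps separately.
import time
import moondream as md
from PIL import Image

start = time.perf_counter()
model = md.vl(model="moondream-2b-int8.mf")       # swap for the 0.5B file to compare
load_time = time.perf_counter() - start

start = time.perf_counter()
image = Image.open("images/flowers.jpg")          # placeholder image name
encoded = model.encode_image(image)
encode_time = time.perf_counter() - start

start = time.perf_counter()
answer = model.query(encoded, "Answer only yes or no. Are there flowers in this image?")["answer"]
answer_time = time.perf_counter() - start

print(f"Model load time:  {load_time:.2f} s")     # paid once per run
print(f"Encode time:      {encode_time:.2f} s")   # paid once per image
print(f"Answer time:      {answer_time:.2f} s")   # scales with how much text it generates
print(f"Total image time: {encode_time + answer_time:.2f} s")
print("Answer:", answer)
```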

And just so we can get an idea of our 0.5B model as well, we're just going to go ahead and run that. And as you can see, we are much quicker off the mark there, about 10 seconds to boot up the model, about 8 to 10 seconds to encode the image, and then our answer time is half a second. So 2B model, about 20 seconds to 25 seconds, depending on the question you ask it. And then our 0.5B model is 8 to 10 seconds, again, depending on what you ask it. These are the quickest times that you can realistically expect. Something that's also worth knowing, once you've encoded the image, you've prepared that frame for processing.

So if you ask it one question, that'll take half a second with the 0.5B model. If you ask it another question, it's only going to take an extra 0.5 seconds. So once you've encoded an image, ask as many questions about it as you want. And just for the fun of it, I'm going to go back to the 2B model here, and I'm going to ask a very, very in-depth, detailed question to see how long that takes. And there we go. We have a really long and detailed answer, but it took 40 seconds just to generate that, which is crazy. So less text generation, less work, and I think most of the time you'll be using the answer only yes or no.

There's also another upside of doing that: it makes it easier to actually use in your project. That answer is going to be stored in a variable, and you can very easily say, if the answer is yes, do something; if the answer is no, do something else. It's easy to write logic around.
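As a rough sketch of what that logic might look like: the filenames below are placeholders, and notify_me is a hypothetical stand-in for whatever action your project actually takes.

```python
# Sketch of wiring a yes/no answer into project logic.
import moondream as md
from PIL import Image

def notify_me(message):
    print(message)  # replace with an email, a GPIO pin, an MQTT publish, etc.

model = md.vl(model="moondream-2b-int8.mf")                   # placeholder filename
encoded = model.encode_image(Image.open("images/bins1.jpg"))  # placeholder image

answer = model.query(encoded, "Answer only yes or no. Is there a bin with a yellow lid?")["answer"]

if "yes" in answer.strip().lower():
    notify_me("The recycling bin is out.")
else:
    notify_me("No yellow-lid bin spotted - go put it out!")

# Because the image is already encoded, extra questions only cost answer time:
print(model.query(encoded, "What colour are the bin lids?")["answer"])
```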

Another great practical example: let's check for a package on the porch from something like our home camera system. I've got some images here. Let's take a look. We've got porch one all the way to porch five, with different packages on them. This is probably, again, what you would see in the field from your security camera system if you've got it set up at home, or from your Pi camera pointed at the porch. It's a real-world example. So let's try the 0.5B model first, and we're just going to ask, is there a package on the porch? And as you can see, there's definitely not a package there. It might be mistaking the pillars or the sides; they look kind of box-shaped. Maybe it's a limitation of the processing power of the 0.5B model. We can try and get tricky with our prompt though. Let's see if we can do something here with that. This is a little trick that seems to actually work quite well: only answer if you are 100% certain. Let's ask, what does the package look like?

Another weird answer. Let's try the good old trick of answer only yes or no. This seems to make it more reliable sometimes as well. Get rid of that. Let's maybe make that a delivery package, to give it a bit more context of what we're looking for. Hey, look at that, we got the right answer. But let's go and actually check that on an image with a parcel in it to see if it gives us a yes, because otherwise we might have something that just says no all the time. Okay, yeah, it's saying no. Let's test another image. That's porch two. Let's test porch three. Did we just create a no bot? We just created a no bot.

For some reason, the 0.5B model struggles a bit with trying to detect if something is there or not. If it is there, it'll detect it really well and it'll be able to describe it to you. But if it's not there, it'll often still claim it's there and make something up, like, oh, there's a parcel there somewhere. I don't know. Sometimes I'd ask it, is there a parcel there? And it said, yes, there is a parcel there, there's a tree in the garden. Let's instead bring out the big guns and give the 2B model a go. And I don't think we're going to need any of those extra prompt tricks. We can just ask it, is there a delivery on the porch or not? Let's run that.

Beautiful. No, there isn't a package. And just for the fun of it, let's try porch five as well. Obviously, if you're setting this up in the real world, go and run lots of tests. Check that your prompt actually works and is reliable, but don't use the same image. If you use the same image with the same prompt, you're going to get the same output every time. So go and grab a whole bunch of sample images and test it with the prompt to see if your system is going to be reliable. And that is our 2B model. Doesn't even stress at this. We don't really need to prompt it with anything funky. If we really put in the time, we might've gotten the 0.5B model to work with some really clever prompting.
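On that point about testing with a bunch of sample images: a quick loop like the sketch below makes it easy to run one prompt across a whole folder of test shots and eyeball how reliable it is. The folder name, filenames, and prompt here are just assumptions based on this demo.

```python
# Sanity-check a single prompt against a folder of sample images before
# trusting it in a project. Folder, filenames and model path are assumptions.
from pathlib import Path
import moondream as md
from PIL import Image

model = md.vl(model="moondream-2b-int8.mf")
prompt = "Answer only yes or no. Is there a delivery package on the porch?"

for path in sorted(Path("images").glob("porch*.jpg")):
    encoded = model.encode_image(Image.open(path))
    answer = model.query(encoded, prompt)["answer"]
    print(f"{path.name}: {answer}")
```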

There's been some times that I've gotten the 0.5B model to get reasonably accurate with some clever prompting, but look, 2B model just works. 0.5B model can work. You just need to really baby it a bit more. And just as a really funny side tangent, I was trying to get the 0.5B model to detect, is there rubbish in the bin here? And we've got a few images to test it out on. And I would ask it, is there rubbish in the bin? And it'd be like, yes, there is rubbish because it's in the bin. And I'm like, wait, do you see any rubbish in this image? And it'd go, yes, there's rubbish in the bin because it's a bin. No, do you see any rubbish on the bin or around the bin? Do you explicitly see rubbish in the bin? Only answer yes if you're 100% correct.

And it would go, well, there's a bin in the image, so there must be rubbish in there. If there was rubbish in the bin though, it would happily describe all of it to you. But yeah, you had to whip out the 2B model for this one. Alrighty, we have one final bit of demo code here. This time we're going to be taking images directly from the Pi camera and asking questions about them. It's pretty much the same code as last time, except it takes the photo instead of reading it from a file, and now it's all inside of a loop. We initialize our camera, load the model, and then take a photo inside of a while true loop, encode it, and ask a question about it.

And we're just going to be asking, answer only yes or no, is there a smiling man in this image? Something very, very important about this code. Please pay attention to this. Hey, attention. All right. This code is going to be taking images at 512 by 512 pixels. If we take a look at the resolution of all the images we've been looking at so far, they are all 512 by 512 pixels as well. If you are processing images with Moondream on the Pi, ensure that they are at or smaller than 512 by 512. That's the maximum size I would go. Not 100% sure why, but if they're bigger than this, the encoding time takes three to four times longer.
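Here's a rough outline of that camera loop, keeping the capture at 512 by 512 as just discussed. It's a sketch under the same assumptions as before: placeholder model filename, and the exact code in the written guide may differ slightly.

```python
# Sketch of the camera version: grab a 512x512 frame with Picamera2, encode it,
# ask a yes/no question, and repeat.
import time
import moondream as md
from picamera2 import Picamera2

model = md.vl(model="moondream-0_5b-int8.mf")   # placeholder filename

picam2 = Picamera2()
# Keep the capture at (or below) 512x512 so encoding stays as fast as possible.
picam2.configure(picam2.create_still_configuration(main={"size": (512, 512)}))
picam2.start()
time.sleep(1)  # let exposure and white balance settle

while True:
    frame = picam2.capture_image("main").convert("RGB")  # returns a PIL image
    encoded = model.encode_image(frame)
    answer = model.query(encoded, "Answer only yes or no. Is there a smiling man in this image?")["answer"]
    print(answer)
```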

Editor, please bring up the graph. Thank you. That slowdown above 512 by 512 is probably because it's got to downsample the image to the model's processing resolution, which I'm guessing is 512 by 512 pixels. Nonetheless, just ensure the resolution of the images you're using is 512 by 512 or smaller, and you'll have the fastest possible encoding times. Alrighty. To start off, I might just see how the 0.5B model goes with this one and let that run. I'm just going to sit in front of the camera and not smile. We're not going to be able to see what the camera is showing.

I'm just gonna see the yes or no result. No, there's nobody smiling in it. I'm gonna keep frowning again, and we should see a no. Wait, no, so it just took the photo, and now it's processing. So it's gonna say no one more time, but I'm gonna start smiling. Took the photo. Let's see if it gives a yes. Hey, look at that! We got a yes out of that. How crazy is that, that we just change one line in our code, and we can get our Raspberry Pi to detect whether somebody is smiling or not. How flexible is this model? And that was the 0.5B, which is pretty dumb as well. Like, oh, how cool is that?

And that is Moondream. As you can see, it fills a bit of a unique role in computer vision for maker projects. And honestly, the hardest thing about using Moondream is understanding its role and what you can actually use it for. We've probably just scraped the surface here of what you can actually do with it. There are definitely some applications out there a bit more clever than checking whether the work fridge is open or not. Again, it is slow. I wouldn't call this anything near real-time processing. But there is one rabbit hole worth going down that we're not going to cover here; we're just going to mention it so that you know it exists.

You can set up the Pi to send an image to a Moondream processing server and get back the detection data. At the time of writing, Moondream offers a very generous free cloud-based service for this, but they also make it really easy to set up your own server locally, especially on Mac or Linux. With a decent setup, you could feasibly get a response time with the 2B model down to, I want to say, a second or something crazy like that. Again, we're not going to cover it, but the Moondream package we use here does have the ability to do it. Go and check out the API if you're interested.
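Just to give a flavour of that option: from memory, the same client package can point at Moondream's hosted service instead of a local .mf file by passing an API key, along the lines of the sketch below. The api_key argument and exact call shape are assumptions from the package's documented cloud usage, so double-check the current API docs, since this changes between versions.

```python
# Sketch of pointing the same code at Moondream's hosted service instead of a
# local .mf file. Assumes the client's cloud mode; check the current API docs.
import moondream as md
from PIL import Image

model = md.vl(api_key="your-api-key-here")   # key from Moondream's cloud console

encoded = model.encode_image(Image.open("images/porch5.jpg"))  # placeholder image
print(model.query(encoded, "Is there a delivery on the porch?")["answer"])
```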

I am just, as usual, blown away at the state of things right now. This really feels like it works the way someone who is maybe naive about computer vision would imagine that it works. Just feeding an image to a Pi and being like, Jarvis, what is the make and model of this car? How many occupants are there? Is it in the grass? It just works. The fact that you can ask these questions is incredible. And the fact that this can all be run on a Pi as well, crazy. Just fantastic work from the Moondream team. Congrats. Getting a model that powerful onto a Pi is incredible.

If you do make anything cool with this or you find any good prompting tips, head on over to the community forums and let us know about it. There will be a dedicated topic for this video. There you'll also be able to get a hand with anything we covered in this video. Until next time though, happy making.
