Computer vision has become more accessible, and in this video, we're setting up the YOLO object detection model on a Raspberry Pi 5. We'll learn how to optimize the model for better performance on the Pi, control hardware with detection results, and explore YOLO World, an open-vocabulary model that identifies custom objects you describe. To follow along, you'll need a Pi 5, as the Pi 4's slower speeds aren't ideal for processing-intensive vision models like YOLO. You'll also need a camera, such as the Camera Module 3, and possibly an adapter cable, since the Pi 5 has a smaller camera connector. Additionally, you'll need a microSD card for Pi OS, a keyboard, mouse, monitor, and a cooling solution for the Pi. We'll provide links to all necessary items in the written guide below, along with all the code used in this video.
For hardware assembly, connect the camera to the Pi with the ribbon cable, making sure it's plugged in the right way around, as it only works in one orientation. Be gentle with ribbon cables to avoid damage, and you'll also need to mount the camera somewhere. First, use another computer to install Pi OS onto the microSD card using Raspberry Pi Imager; the process is straightforward, and assistance is available in our written guide. After installing Pi OS, complete the first-time setup and connect to the internet. Now, on the desktop, create a virtual environment: an isolated space to work in without risking conflicts with the rest of Pi OS. Open a terminal window and create the virtual environment, which we'll call YOLO Object; the written guide contains all the code and commands for easy copying and pasting. Enter the virtual environment by running the source command, and if you ever close the terminal, you can re-enter it by running that command again.
With the environment set up, make sure the package manager is up to date, then install the Ultralytics package, which pulls in OpenCV and the other components YOLO needs. This installation may take 10 to 20 minutes, so take a break. Errors may occur, but rerunning the install command should resolve them. While waiting, let's explain OpenCV, YOLO, and the COCO library. OpenCV is a framework for vision-related tasks, akin to kitchen tools for food preparation. YOLO is the computer vision model, like a cook in the kitchen, doing the actual work. The COCO library is the dataset YOLO is trained on, similar to a cookbook. Once installed, reboot the Pi. Open Thonny to run our code and set it up to use the virtual environment: switch Thonny to regular mode, restart it, and configure the interpreter to use the Python executable from the YOLO Object folder. Now, paste in the demo code, which is explained in the written guide, and save it to the desktop in a new folder called YOLO Object Detection, since the script may download several models, which can get messy.
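The full demo code and its explanation are in the written guide; as a rough, minimal sketch of what that script boils down to (assuming the Picamera2 library that ships with Pi OS and the Nano model), the loop looks something like this:

```python
# Minimal sketch of the detection loop - the written guide's demo code is the
# reference version. Assumes Picamera2 (preinstalled on Pi OS) and Ultralytics.
import cv2
from picamera2 import Picamera2
from ultralytics import YOLO

# Set up the Pi camera to hand us frames as arrays.
picam2 = Picamera2()
picam2.configure(picam2.create_preview_configuration(
    main={"format": "RGB888", "size": (640, 480)}))
picam2.start()

# Load the model - this single line is what we'll keep changing later.
model = YOLO("yolov8n.pt")  # Nano model, downloaded automatically on first run

while True:
    frame = picam2.capture_array()      # grab a frame from the camera
    results = model(frame)              # run YOLO detection on it
    annotated = results[0].plot()       # draw boxes and confidence labels
    cv2.imshow("YOLO Detection", annotated)
    if cv2.waitKey(1) == ord("q"):      # press Q to quit
        break

cv2.destroyAllWindows()
```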
Run the demo code, and the Ultralytics library will automatically download the necessary models. The detection shows objects with a confidence rating, but the frame rate may be choppy, which we'll address. Because the COCO library only covers 80 classes of objects, the model won't recognize everything, like glasses or some items in the background. We'll discuss ways to improve speed and accuracy. You can change the YOLO model by modifying the line that loads it. YOLOv8 comes in various sizes, from Nano to extra large, and using a larger model means slower performance and a bigger download.
You can actually see how long it's taking to process one frame in the shell; it's about 13 seconds here, but look at that: it picks up the cup on the bench because it detects things more accurately, especially objects further from the camera. More complex scenes are handled a lot better by the extra-large model. To exit, all we need to do is hit Q. So that's how we improve quality: you've got a scale from Nano to extra large, where Nano runs the quickest and extra large is the most powerful. To use our kitchen analogy, the size of the model is like telling the cook how quickly they should work. Nano tells them to do it as fast as possible, which might produce sloppy results, while extra large tells them to really take their time and do it well.
The other thing we can change is the YOLO version. We're currently using YOLOv8, but let's say you wanted to go back to an older version for some reason: swap out that v8 for a v5 and that's it, it will download and use that model. In our kitchen analogy, the version is the skill level of the cook; v8 is a better-trained, more skillful cook than v5 and can cook better food in less time. The beauty of this code is that if, say, a year after this video is released, version 11 comes out, you should be able to just slap v11 in there and start using the newest YOLO version. It's worth noting, though, that at the time of filming, v10 does exist, but we're sticking with v8 because v10 doesn't offer much more performance-wise and v8 works better on the Pi.
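Both of those changes happen on that single model line; as a sketch of how the name encodes the size and the version (the commented-out lines are just the alternatives):

```python
from ultralytics import YOLO

# The letter on the end picks the size: n (Nano), s, m, l, or x (extra large).
model = YOLO("yolov8n.pt")    # fastest, least accurate
# model = YOLO("yolov8x.pt")  # slowest, most accurate, biggest download

# The number picks the YOLO version - swap the v8 for a v5 (or a newer release)
# and Ultralytics will download and use that model instead.
# model = YOLO("yolov5n.pt")
```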
Another option we have is to convert the model to a more efficient format. By default, the model comes in PyTorch format, but we can convert it to NCNN format, which is more optimized for ARM processors like the Pi's. To do so, create a new script, paste in the conversion code, and save it into the same folder as our other code. All you need to do is specify the model you want to convert and the format to output it in. If you run it, it should take anywhere from about 15 seconds to a few minutes if you're converting a large model. If we go to the folder our script is in, you can now see the converted model; it might help to copy its name real quick. Then all we need to do is go back to our main code and tell it to use this model we just created.
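As a rough sketch, assuming the Nano model, the conversion script is only a few lines using the Ultralytics export function:

```python
from ultralytics import YOLO

# Load the PyTorch model we want to convert.
model = YOLO("yolov8n.pt")

# Export it to NCNN - this writes a folder such as "yolov8n_ncnn_model"
# next to the script, which is the name we copy into the main code.
model.export(format="ncnn")
```

Back in the main code, the model line then becomes something like model = YOLO("yolov8n_ncnn_model").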
If we run that code, we should be getting a lot better FPS now, roughly a fourfold improvement; we're getting about six FPS here on average, which is a lot more usable. If your model can be converted to NCNN, you definitely should; it's a free performance boost. All of the v8 models support this, but not every model you find will; for example, v10 doesn't, which is part of why we're not using it here. The other way we can increase FPS is by lowering the processing resolution. This severely reduces detection quality, but as you can see, it can get us a good 20 to 30 FPS. At those frame rates, though, the model starts to get a bit janky, and detection is really hit and miss. We'll have a section in the written guide if you want to give this a go, because there are a few catches with it.
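If you want to experiment with it anyway, here's a hedged sketch of the idea; the imgsz value of 320 is just an example, and the dummy frame stands in for the camera frame from the main loop:

```python
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Stand-in frame just to show the call; in the real loop this is the camera frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)

# Lower the processing resolution to trade detection quality for FPS.
results = model(frame, imgsz=320)

# For an NCNN model, the size is fixed when you export it instead, e.g.:
# YOLO("yolov8n.pt").export(format="ncnn", imgsz=320)
```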
So, to recap: you can change the model size, with Nano being the fastest and extra large the most powerful; you can convert the model to NCNN to speed it up; and if you want to look at changing the resolution, you can trade detection quality for high FPS. If you want to go one step further, a little overclock may see some gains in FPS here too. Now, we've already talked about some of the different models we can use, but there is one that is so incredibly cool it deserves its own section: YOLO World. Let's say I wanted to detect these glasses. That's not in the COCO library that YOLO is pre-trained on, so it's not going to detect them. We could feed in images of glasses and retrain the model to recognize them, but that's a bit of a pain.
This is where YOLO World comes in. It's an open-vocabulary model that only came out in 2024, so it's really fresh: you tell it what to look for, and it tries to find it. We're going to paste in our demo code, and just a warning, the first time you run this it might take a little while, because it downloads about half a gigabyte of libraries, dependencies, and models. In our code, all we need to do is tell it what to look for in this field, and that's it; it will see if it can find it in the image. We're going to tell it to look for glasses, and as you can see, we're detecting glasses without any retraining at all. Change the prompt, and it will look for something else. This is some really, really futuristic stuff.
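The demo code is in the written guide, but the heart of it is just loading a YOLO World model and giving it a list of words to look for; a minimal sketch (the model filename and prompt are only examples):

```python
from ultralytics import YOLOWorld

# Load an open-vocabulary YOLO World model; the small size keeps it usable
# on the Pi. Downloads automatically on first run.
model = YOLOWorld("yolov8s-world.pt")

# Tell it what to look for - no retraining, just change this list and rerun.
model.set_classes(["glasses"])

# From here, the camera loop is the same as the earlier YOLOv8 demo.
```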
Now, this isn't a silver bullet solution to everything. It probably can't recognize specific models of cars, but it does a really good job on common household objects like glasses or a camera. Sometimes it's weird: it couldn't really recognize a hammer, which is a common object, and I had to tell it to look for a hammer with an orange handle, at which point I think it was just looking for the colour orange. Sometimes it found it, sometimes it didn't; it's a bit dodgy. And yes, you can prompt it with a description instead of just one word. For more obscure things you're at the mercy of the algorithm, but we've gotten it to recognize some pretty crazy things. Just a few notes: it only comes in small, medium, and large sizes, and you can't convert it to NCNN, so you'll need to drop the resolution to get better performance. Just have a play around with it.
It is a bit slower than YOLOv8, but it is so damn powerful to be able to look for something completely new without having to retrain anything. It's just good fun, and it can be put to use in some really niche projects, so see if you can make it fit your needs. Now, so far we've been identifying objects and drawing boxes around them in a preview window, but how do we actually use this? What can we do with it? Well, there is a rich amount of information coming out of the YOLO model, and it's all stored in a single variable: the results variable that we're saving the output to. If we run the model for just a frame or two and quit by pressing Q, we've still got access to the most recent results in the shell.
If we type in results and grab the zeroth element, which is the latest frame that was processed, we can see all the data we can pull out of it. An important thing here is that we can see the COCO library's list of names. Each item that COCO can detect has a number associated with it: zero is a person, one is a bicycle, two is a car, and so on. This number is usually what we'll use to know what was detected in that frame. For example, if we type in results[0].boxes.cls, we get a list of everything detected in the last frame. We only detected a person, so there's only a zero here. This is how we can act on our detection results.
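Typed into the Thonny shell after quitting the script, that inspection looks roughly like this (assuming the output is stored in a variable called results, as in our demo code):

```python
r = results[0]   # the most recently processed frame

r.names          # dict of COCO class numbers to names: 0 'person', 1 'bicycle', 2 'car', ...
r.boxes.cls      # class numbers detected in this frame, e.g. tensor([0.]) for one person
r.boxes.conf     # confidence score for each detection
r.boxes.xyxy     # bounding-box corners for each detection
```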
We have demo code in the written guide that controls a GPIO pin on the Pi 5 when a person is detected, and you can use that to do things like drive a solenoid open and closed, turn on a light for automatic lighting, or build a security system that emails you when it detects a person or an animal.
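The guide's code is the reference for this, but a minimal sketch of the pin-control idea looks like the following; GPIO 17 and the gpiozero library are just assumptions for the example:

```python
from gpiozero import LED
from picamera2 import Picamera2
from ultralytics import YOLO

pin = LED(17)               # whatever you've wired up: a relay, solenoid driver, light...
model = YOLO("yolov8n.pt")

picam2 = Picamera2()
picam2.configure(picam2.create_preview_configuration(
    main={"format": "RGB888", "size": (640, 480)}))
picam2.start()

while True:
    frame = picam2.capture_array()
    results = model(frame)
    detected = results[0].boxes.cls.tolist()   # class numbers found in this frame
    if 0 in detected:                          # 0 is 'person' in the COCO list
        pin.on()
    else:
        pin.off()
```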
Yes, COCO includes animals. Well, I hope you enjoyed this guide. We can now use a Pi 5 to detect objects and do something based on that detection; that seems like a simple concept, but it has endless applications in endless projects. If you build something cool or you run into trouble with this guide, feel free to pop a post on our community forums. We're all makers, and we're happy to help. I'd also like to extend a big thanks to the developers and maintainers of the COCO library and OpenCV, as well as Joseph Redmon and Ultralytics, who have brought us these YOLO models.
Makers love reviews as much as you do, so please follow this link to review the products you have purchased.