Have you ever wanted to dive into computer vision? How about on a low-power and portable piece of hardware like a Raspberry Pi?
Well, in this guide we will be setting up object detection with the YOLO vision model family, OpenCV and the COCO object library on the Raspberry Pi 5. We will be taking a look at a few of the different YOLO models available, how to optimise them to balance detection performance and speed, how to control hardware with the detection results, and we will also take a look at YOLO World, an extremely exciting open-vocabulary model that detects objects from a prompt instead of a pre-trained item list.
Computer vision is now easier and more accessible than ever, with high-performance vision models able to run on hardware like the Raspberry Pi. This is thanks to the work of the incredible open-source projects we will be utilising in this guide: the COCO library, OpenCV and YOLO. We will be referring to these a lot, so before we dive in, a quick analogy is in order to explain their roles.
OpenCV is like a kitchen for our computer vision cook. It provides the tools and framework needed to prepare and make food. YOLO is the cook in the kitchen - it is the part of this system doing the actual work, "thinking" and trying to identify objects. And lastly, we have the COCO library. This is the training data that YOLO comes with - it's like the cookbook or the recipes that YOLO will follow. If the COCO library has instructions on how to identify a car, YOLO the cook can identify a car; if it doesn't have those instructions, YOLO can't identify it. So YOLO the cook does its work according to the COCO library cookbook, in the OpenCV kitchen.
Hopefully that made sense. Let's get right into it!
Contents
- What You Will Need
- Hardware Assembly
- Installing Pi OS
- Setting up the Virtual Environment and Installing YOLO
- Running YOLOv8
- Running Other YOLO Models
- Converting to NCNN (FPS Boost)
- Lowering the Resolution
- Controlling Hardware
- Diving Into YOLO World
- Where to From Here
- Acknowledgements
- Appendix: Using a Webcam
What You Will Need
To follow along with this guide you will need a:
- Raspberry Pi 5 - Either a 4GB or 8GB model will work here. Although this could technically be done on a Pi 4, it would be far slower than on the Pi 5 and not a nice experience, so we haven't tested on a Pi 4
- Pi Camera - We are using the Camera Module V3
- Adapter Cable - The Pi 5 comes with a different-sized CSI camera cable and your camera may come with the older thicker one so it's worth double-checking. The Camera Module V3 WILL need one
- Cooling Solution - We are using the active cooler (computer vision will really push your Pi to its limits).
- Power Supply
- Micro SD Card - At least 16GB in size
- Monitor and Micro-HDMI to HDMI Cable
- Mouse and Keyboard
Hardware Assembly
In terms of hardware assembly, it's pretty light here. Connect the thicker side of the cable to the camera, and the thinner side to the Pi 5. These connectors have a tab on them - lift them up, then insert the cable into the slot. Once it is sitting in there nice and square, push the tab back down to clamp the cable into place.
Just keep an eye out as these connectors only work in one orientation, and they can be fragile so avoid bending them tightly (a little bit is okay).
Installing Pi OS
First things first, we need to install Pi OS onto the microSD card. Using the Raspberry Pi Imager, select Raspberry Pi 5 as the Device, Raspberry Pi OS (64-bit) as the Operating System, and your microSD card as the storage device.
NOTE: INSTALLING PI OS ONTO THE MICROSD CARD WILL WIPE ALL DATA ON IT.
This process may take a few minutes to download the OS and install it. Once the process has finished, insert the microSD card into the Pi and boot it up. Your Pi will run through a first-time installation - just ensure that you connect it to the internet during this setup.
You will also need to remember the username you create here as it will be part of the file locations we cover in this guide. For simplicity's sake you can simply call it "pi".
Setting up the Virtual Environment and Installing YOLO
With the introduction of Bookworm OS in 2023, we are now required to use Virtual Environments (or venv). These are just isolated virtual spaces that we can use to run our projects without the risk of breaking the rest of Pi OS and our packages - in other words, we can do whatever we want here without the risk of damaging the rest of Pi OS. It is an extra moving part to learn, but it's incredibly easy.
To create a virtual environment, open a new terminal window and type in:
python3 -m venv --system-site-packages yolo_object
This will create a new virtual environment called "yolo_object". You can find the folder for this virtual environment under /home/pi and it will be called "yolo_object".
After creating the venv, enter into it by typing in:
source yolo_object/bin/activate
After doing so, you will see the name of the virtual environment to the left of the green text - this means we are correctly working within it. If you ever need to re-enter this environment (for example, closing the terminal window will exit the environment), just type in the source command above again.
Now that we are working in a virtual environment, we can start installing the required packages. First, ensure that PIP (the Python package manager) is up to date by entering the three following lines:
sudo apt update
sudo apt install python3-pip -y
pip install -U pip
Then install the Ultralytics Package with:
pip install ultralytics[export]
The lovely folks at Ultralytics have been one of the key developers and maintainers of the newest YOLO models. This package of theirs is going to do much of the heavy lifting and will install OpenCV as well as all the required infrastructure for us to run YOLO.
This process will also install quite a large number of other packages and, as a result, is prone to failing. If your installation fails (it will display a whole wall of red text), just type in the Ultralytics install line again and it should resume. In rare cases, the install line may need to be repeated a few times.
Once that has finished installing, reboot the Raspberry Pi. If you want to be a power user, you can do so by typing into the shell:
reboot
We have one more thing to do, and that is to set up Thonny to use the virtual environment we just created. Thonny is the program we will be running all of our code out of and we need to get it to work out of the same venv so that it has access to the libraries we installed.
The first time you open Thonny it may be in the simplified mode, and you will see a "switch to regular mode" in the top right. If this is present click it and restart Thonny by closing it.
Now enter the interpreter options menu by selecting Run > Configure Interpreter. Under the Python executable option, there is a button with 3 dots. Select it and navigate to the Python executable in the virtual environment we just created.
This will be located under /home/pi/yolo_object/bin and in this folder, you will need to select the file called "python3". Hit okay and you will now be working in this venv.
Whenever you open Thonny, it will now automatically work out of this environment. You can change the environment you are working out of by selecting it from the drop-down menu under the Python executable in the same interpreter options menu. If you wish to exit the virtual environment, select the option bin/python3.
Running YOLOv8
Create a new script in Thonny and paste in the following code:
import cv2
from picamera2 import Picamera2
from ultralytics import YOLO

# Set up the camera with Picam
picam2 = Picamera2()
picam2.preview_configuration.main.size = (1280, 1280)
picam2.preview_configuration.main.format = "RGB888"
picam2.preview_configuration.align()
picam2.configure("preview")
picam2.start()

# Load YOLOv8
model = YOLO("yolov8n.pt")

while True:
    # Capture a frame from the camera
    frame = picam2.capture_array()

    # Run YOLO model on the captured frame and store the results
    results = model(frame)

    # Output the visual detection data, we will draw this on our camera preview window
    annotated_frame = results[0].plot()

    # Get inference time
    inference_time = results[0].speed['inference']
    fps = 1000 / inference_time  # Inference time is in milliseconds
    text = f'FPS: {fps:.1f}'

    # Define font and position
    font = cv2.FONT_HERSHEY_SIMPLEX
    text_size = cv2.getTextSize(text, font, 1, 2)[0]
    text_x = annotated_frame.shape[1] - text_size[0] - 10  # 10 pixels from the right
    text_y = text_size[1] + 10  # 10 pixels from the top

    # Draw the text on the annotated frame
    cv2.putText(annotated_frame, text, (text_x, text_y), font, 1, (255, 255, 255), 2, cv2.LINE_AA)

    # Display the resulting frame
    cv2.imshow("Camera", annotated_frame)

    # Exit the program if q is pressed
    if cv2.waitKey(1) == ord("q"):
        break

# Close all windows
cv2.destroyAllWindows()
Running YOLO code will often involve downloading models, which will save to the same location as the script, so it may be a smart move to create a folder and save your script in there to keep everything neat. We just created a folder on our desktop and saved everything to that. When you save your YOLO code, also ensure that you save it as a ".py" Python file.
Now run the code by hitting the green play button, and after a short period of downloading the model, a window should appear showing your camera feed, as well as anything detected by YOLO.
It will automatically draw a box around anything detected, label it, and give a rating of how confident it is on that detection (with 1.0 being the greatest).
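If you find the detections a little jumpy or full of low-confidence guesses, the Ultralytics inference call also accepts a confidence threshold so that only detections above a certain score are kept. Here is a minimal sketch of that tweak to the detection line (the 0.5 value is just an example to experiment with):

# Only keep detections the model is at least 50% confident about
results = model(frame, conf=0.5)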
And this is our base-level YOLO model working! It may not be the best FPS right now (you should expect about 1.5 FPS), but we will look at how to improve this very soon.
To stop this script, all you need to do is hit the Q key.
Let's quickly take a general look at how this code works so we can somewhat understand what's going on under the hood.
Starting off, we import the required libraries: cv2 is OpenCV, Picamera2 is the library we use to get the video feed from the camera into our code, and Ultralytics is where our YOLO model comes from.
import cv2
from picamera2 import Picamera2
from ultralytics import YOLO
Then we end our set-up phase by initialising the camera with Picamera2 and loading the YOLOv8 model.
# Set up the camera with Picam
picam2 = Picamera2()
picam2.preview_configuration.main.size = (1280, 1280)
picam2.preview_configuration.main.format = "RGB888"
picam2.preview_configuration.align()
picam2.configure("preview")
picam2.start()

# Load YOLOv8
model = YOLO("yolov8n.pt")
Next we enter a while True loop that grabs an image from the camera, feeds it into the YOLOv8 model, and then collects the data it outputs.
while True:
    # Capture a frame from the camera
    frame = picam2.capture_array()

    # Run YOLO model on the captured frame and store the results
    results = model(frame)

    # Output the visual detection data, we will draw this on our camera preview window
    annotated_frame = results[0].plot()
This next bit of code is just some maths to calculate the FPS we are processing at, and then some OpenCV-related code to draw the FPS counter onto the annotated frame from the model.
# Get inference time
inference_time = results[0].speed['inference']
fps = 1000 / inference_time  # Inference time is in milliseconds
text = f'FPS: {fps:.1f}'

# Define font and position
font = cv2.FONT_HERSHEY_SIMPLEX
text_size = cv2.getTextSize(text, font, 1, 2)[0]
text_x = annotated_frame.shape[1] - text_size[0] - 10  # 10 pixels from the right
text_y = text_size[1] + 10  # 10 pixels from the top

# Draw the text on the annotated frame
cv2.putText(annotated_frame, text, (text_x, text_y), font, 1, (255, 255, 255), 2, cv2.LINE_AA)
Then we display the camera feed window with the output data and FPS counter overlaid on it.
# Display the resulting frame
cv2.imshow("Camera", annotated_frame)
And we end it by checking whether "q" has been pressed. If it has, the break statement exits the while True loop and all the windows are closed.
# Exit the program if q is pressed
if cv2.waitKey(1) == ord("q"):
    break

# Close all windows
cv2.destroyAllWindows()
Before we move on to improving YOLO, there is a limitation worth pointing out. Right now your Pi is probably able to detect a person sitting in front of it, as well as common things like chairs, TVs, and cups. However, it is not able to detect everything. In the picture on the right, there are many things that aren't being detected and that is because of the limitations in the COCO library.
YOLOv8 has been trained on the COCO library, which categorises objects into 80 different categories - meaning there are only 80 different types of things it can detect. This might not sound like a lot, but when you consider how common and broad some of these categories are, like "car" or "sports ball", it amounts to quite a lot it can detect. However, if an object is not part of its training library (like glasses), it won't be able to recognise it. This is a foreshadowing of YOLO World, which is going to allow us to get around this limitation.
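If you're curious about exactly what the model can and can't see, you can print its full class list yourself. Here is a small sketch assuming the Ultralytics package from earlier is installed:

from ultralytics import YOLO

# Load the nano model and print its class dictionary (ID: name)
model = YOLO("yolov8n.pt")
print(model.names)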
Running Other YOLO Models
So far we have been running YOLOv8, and one of the beauties of the Ultralytics package is that we can swap out a single line in the code to completely change the model - we can use this to run a more advanced v8 model, or even an older model. All you need to change is this line here in the setup:
# Load YOLOv8
model = YOLO("yolov8n.pt")
This line is currently using the nano model, which is the smallest and least powerful but fastest model of YOLOv8, and we can change this line to run one of the different sizes that this model comes in by changing the single letter after "v8" as shown on the right. If you change this line to another model size and run it, the script will automatically download the new model (which can be hundreds of MB for the larger models).
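For reference, these are the five standard YOLOv8 checkpoints that Ultralytics publishes, from smallest to largest - you would keep only one of these lines in your script:

model = YOLO("yolov8n.pt")  # nano
model = YOLO("yolov8s.pt")  # small
model = YOLO("yolov8m.pt")  # medium
model = YOLO("yolov8l.pt")  # large
model = YOLO("yolov8x.pt")  # extra-large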
As you may have guessed, there is a trade-off here between image processing performance and FPS. If you run the extra-large model, you should be running at less than 0.1 FPS (we will improve this shortly), but you should also notice a great jump in detection performance. These larger models are better at recognising items that are far away from the camera, as demonstrated in the image on the right, and are also better at more reliably recognising objects with greater accuracy. If you hold a cup in your hand, the nano model may jump between recognising it as a phone and a cup, but the extra-large model will more consistently recognise it as a cup.
In our kitchen analogy, this is like telling the cook (YOLOv8) how long to spend preparing a dish: nano is the least amount of time, which may produce some rougher results, while extra-large is a "spend all the time in the world" situation.
There are also models between these two extremes that sit at different points of this detection performance and FPS trade-off, so have a play around and see which one is accurate and fast enough for your needs.
One more thing we can change is the model version. YOLO started at v1 and has now made its way up to v10 (we are using v8 though, as at the time of writing it was just as fast as v10 and could be optimised better for the Pi). These versions have improved both the speed and capabilities of the models - in our kitchen analogy, this is like using a more skilled and experienced cook.
Changing this version is just as easy and if for example, we wanted to use v5 for legacy reasons, we would just modify the same line to be:
# Load YOLOv5
model = YOLO("yolov5n.pt")
If you want to try v10, you would just write:
# Load YOLOv10
model = YOLO("yolov10n.pt")
This guide will eventually be outdated, but there is a good chance that when version 11 or 12 (or whatever number comes after this guide is published) is released, you will be able to use this same code to run the newest model by simply altering that line.
Converting to NCNN (FPS Boost)
There are two main ways to increase FPS, and the most effective is to convert the model to a more efficient format (we will look at the other way next). When you first download the model, it will come in PyTorch format by default. However, we can convert it to the NCNN format, which is better optimised for ARM processors like the Raspberry Pi's. By doing this we make more effective use of the Pi's processing power and get improved speed at no cost to detection performance (literally a free speed boost).
And the process of converting the model is simple. Create a new Thonny script, paste in the following code, and save it to the same folder that the other script is saved in, as we will be creating even more files!
from ultralytics import YOLO

# Load the PyTorch model
model = YOLO("yolov8n.pt")

# Export as an NCNN format
model.export(format="ncnn")
The first time you run this it will download some additional dependencies, then after a few seconds you should see a success message in the shell and it should spit out a folder called "yolov8n_ncnn_model". We don't need to touch this folder, but it's worth knowing that it's been created in the directory we are working out of.
To use this model, all we need to do is specify it in the same model setup line we were editing before. We converted the nano model so our line will look like this:
model = YOLO("yolov8n_ncnn_model")
This line is just the name of the folder that we created in the conversion process. If we convert a different model, it will come out with a similar name, just with a different version number or size.
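For example, converting the small model instead would look something like this, following the same naming pattern as the nano conversion:

from ultralytics import YOLO

# Convert the small model instead of the nano one
model = YOLO("yolov8s.pt")
model.export(format="ncnn")

# The exported folder can then be loaded in the detection script with:
# model = YOLO("yolov8s_ncnn_model")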
After running the script, you should be getting about 4x more FPS and on the nano model, this equates to a more usable 6 FPS. On the extra-large model, you should expect around 0.5 FPS, which may not seem like a lot, but for a low-power computer like the Pi running such an impressively powerful model, it's quite an achievement.
And you should be able to do this with most models - we were able to convert all of the v5 and v8 models to NCNN. However, at the time of writing the v10 models could not be converted to NCNN, which is why we are focusing on the v8 models in this guide. But whichever model you end up using, it's always worth giving NCNN conversion a go, as it is a free increase in FPS without any performance cost.
Lowering the Resolution
The other method of increasing FPS is to drop the processing resolution - the fewer pixels in a frame it has to process, the quicker it can do so. By default, we are processing at 640x640 pixels, which might not sound like a lot, but computer vision models can do quite a lot with very few pixels. This is most likely more pixels than most projects need, and by dropping the resolution you can get a smooth 20 FPS, even up to 50 FPS if you drop it really low. However, lowering the resolution will come with a performance cost.
The other caveat is that you can't convert the model and lower the resolution at the same time. Due to technical limitations, changing the resolution from the script will not work if you are using an NCNN format model. So you can choose to either convert it to an NCNN model to get a 4x FPS increase (without any performance cost) or lower the resolution.
To lower the resolution, all we need to do is modify the following line, here we are setting the resolution to 320x320:
# Run YOLO model on the captured frame and store the results
results = model(frame, imgsz = 320)
Here is what it looks like in the full script:
import cv2
from picamera2 import Picamera2
from ultralytics import YOLO

# Set up the camera with Picam
picam2 = Picamera2()
picam2.preview_configuration.main.size = (1280, 1280)
picam2.preview_configuration.main.format = "RGB888"
picam2.preview_configuration.align()
picam2.configure("preview")
picam2.start()

# Load YOLOv8
model = YOLO("yolov8n.pt")

while True:
    # Capture a frame from the camera
    frame = picam2.capture_array()

    # Run YOLO model on the captured frame and store the results
    results = model(frame, imgsz = 320)

    # Output the visual detection data, we will draw this on our camera preview window
    annotated_frame = results[0].plot()

    # Get inference time
    inference_time = results[0].speed['inference']
    fps = 1000 / inference_time  # Inference time is in milliseconds
    text = f'FPS: {fps:.1f}'

    # Define font and position
    font = cv2.FONT_HERSHEY_SIMPLEX
    text_size = cv2.getTextSize(text, font, 1, 2)[0]
    text_x = annotated_frame.shape[1] - text_size[0] - 10  # 10 pixels from the right
    text_y = text_size[1] + 10  # 10 pixels from the top

    # Draw the text on the annotated frame
    cv2.putText(annotated_frame, text, (text_x, text_y), font, 1, (255, 255, 255), 2, cv2.LINE_AA)

    # Display the resulting frame
    cv2.imshow("Camera", annotated_frame)

    # Exit the program if q is pressed
    if cv2.waitKey(1) == ord("q"):
        break

# Close all windows
cv2.destroyAllWindows()
Note: There is another line in this code that deals with resolution in the Picamera 2 set up:
# Set up the camera with Picam
picam2.preview_configuration.main.size = (1280, 1280)
Changing this line will not change the processing resolution. This will only change the resolution of the camera feed and will not speed up YOLO.
If we run this script at 320x320, you should get about 4-5 FPS - a similar FPS to converting to NCNN. This does unfortunately come with a slight detection performance cost, mainly in that objects will need to be closer to the camera to be accurately detected, but this performance loss at 320x320 is not too noticeable.
If we start dropping the resolution to lower values like 160x160 with:
# Run YOLO model on the captured frame and store the results
results = model(frame, imgsz = 160)
you really will start to see some processing performance loss; however, you should also be getting a good 10-11 FPS.
If we drop it even more to something like 96x96:
# Run YOLO model on the captured frame and store the results
results = model(frame, imgsz = 96)
the model may start struggling to identify objects unless they are close to the camera. In the image on the right, this is the furthest distance at which it could detect a cup at 96x96, and at this point it was struggling to even detect me. However, we were getting a staggering 25-30 FPS.
So, should you lower your resolution or convert it to NCNN? The answer to this is to give both a try! A good course of action might be to convert your model to NCNN first and if that isn't fast enough for your needs, choose a smaller model size (nano, small, etc.). If that still isn't fast enough for your needs, then start dropping the resolution bit by bit until you get the desired results - but watch out for the loss in performance and quality.
One final thing on resolutions: the model is designed to run at a resolution that is a multiple of 32, and not using one can cause some very strange recognition issues. So always ensure that you set it to a multiple of 32 - e.g. 64, 224, or 480.
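If you want a guard against accidentally picking a bad value, a tiny helper like this (just a sketch - any rounding approach works) will snap whatever resolution you have in mind to the nearest multiple of 32:

def nearest_multiple_of_32(size):
    # Round to the closest multiple of 32, with a floor of 32
    return max(32, round(size / 32) * 32)

print(nearest_multiple_of_32(300))  # 288
print(nearest_multiple_of_32(100))  # 96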
Controlling Hardware
Right now we can identify objects and optimise that process for our needs, but how do we actually use it? At the moment we only have a preview window of our camera with our detections overlaid.
All of this data is being retrieved from the results variable. If you run your script for a few seconds and stop it with "q" so that there are a few entries in the shell like in the image on the right, you can type into the shell:
results[0]
You will be able to see the insides of the detection results. The "[0]" on the end means you are getting the results from the newest frame that was analysed. If you scroll through this you should be able to see a section called names containing all of the objects that can be detected by the COCO library as well as the ID associated with these objects (for example, a person is 0, a bicycle is 1, a car is 2 etc). If you want to manually get this list you can type in:
results[0].names
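Because names is just a dictionary mapping IDs to labels, you can also flip it around to look up the ID of a specific object - handy for the hardware control code coming up. A quick sketch you could run in the shell:

# Invert the ID -> name dictionary so we can look up an ID by its label
name_to_id = {name: cls_id for cls_id, name in results[0].names.items()}
print(name_to_id["person"])  # 0
print(name_to_id["book"])    # 73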
In this results[0] we can also get a list of everything that was detected on screen. If you type in:
results[0].boxes.cls
You will get a list of all the IDs of the objects being detected. In my example on the right, we are only detecting a person, which has an ID of 0. This is going to be our main method of utilising our detection results.
Using this "results[0].boxes.cls", we can write code to control hardware with the Raspberry Pi. Here is some demo code that achieves that:
import cv2
from picamera2 import Picamera2
from ultralytics import YOLO
from gpiozero import LED

# Initialize the camera
picam2 = Picamera2()
picam2.preview_configuration.main.size = (1280, 1280)
picam2.preview_configuration.main.format = "RGB888"
picam2.preview_configuration.align()
picam2.configure("preview")
picam2.start()

# initialise output pin
output = LED(14)

# Load the YOLO model
model = YOLO("yolov8n.pt")

# List of class IDs we want to detect
objects_to_detect = [0, 73]  # You can modify this list

while True:
    # Capture a frame from the camera
    frame = picam2.capture_array()

    # Run object detection on the frame
    results = model(frame, imgsz = 160)

    # Get the classes of detected objects
    detected_objects = results[0].boxes.cls.tolist()

    # Check if any of our specified objects are detected
    object_found = False
    for obj_id in objects_to_detect:
        if obj_id in detected_objects:
            object_found = True
            print(f"Detected object with ID {obj_id}!")

    # Control the Pin based on detection
    if object_found:
        output.on()  # Turn on Pin
        print("Pin turned on!")
    else:
        output.off()  # Turn off Pin
        print("Pin turned off!")

    # Display the frame with detection results
    annotated_frame = results[0].plot()
    cv2.imshow("Object Detection", annotated_frame)

    # Break the loop if 'q' is pressed
    if cv2.waitKey(1) == ord("q"):
        break

# Clean up
cv2.destroyAllWindows()
This code is the same as our previous scripts with a few additions. First we import the libraries required to control the Raspberry Pi's pins, and set up pin 14 with the lines:
from gpiozero import LED

# initialise output pin
output = LED(14)
Then we define the IDs of the objects we wish to control the pin with. If either of these is present, the pin will turn on. Right now it has 0 (person) and 73 (book). You can change these to whichever IDs you want, or add more by simply adding another ID separated by a comma:
# List of class IDs we want to detect
objects_to_detect = [0, 73]  # You can modify this list
Then we get the list of detected objects and see if any of the IDs we specify in the previous line match what is being detected in this frame:
# Get the classes of detected objects
detected_objects = results[0].boxes.cls.tolist()

# Check if any of our specified objects are detected
object_found = False
for obj_id in objects_to_detect:
    if obj_id in detected_objects:
        object_found = True
        print(f"Detected object with ID {obj_id}!")
And finally we control Pin 14 and set it on or off if an object is found:
# Control the Pin based on detection
if object_found:
    output.on()  # Turn on Pin
    print("Pin turned on!")
else:
    output.off()  # Turn off Pin
    print("Pin turned off!")
And from here, we can do whatever we want with this. You could control a solenoid like we do in the video guide, or you could control a servo or motor.
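As a rough sketch of what a servo version could look like, the same object_found flag from the demo code could drive a hobby servo with gpiozero instead of a plain output pin (GPIO 14 and the two positions here are just example values):

from gpiozero import Servo

# A servo signal wire connected to GPIO 14 (example pin, adjust for your wiring)
servo = Servo(14)

# Inside the detection loop, swap the LED on/off calls for servo positions
if object_found:
    servo.max()   # Swing to one end when a target object is seen
else:
    servo.min()   # Return to the other end when nothing is detected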
Diving Into YOLO World
YOLO World is an object recognition model that is so incredibly powerful it deserves its own section. Usually, models are built upon a library of training data. YOLOv8 has been trained on the COCO library to identify cats, so it can do so. But it wasn't trained to identify glasses, so no matter how clearly you show them to the camera, it will not identify them. The usual way to get it to recognise glasses would be to retrain it, and this involves a lengthy and processing-intensive process (far too much of a load for our little Pi) where you would label and feed it images of glasses. Only after this whole process would YOLOv8 be able to detect glasses.
This is the brilliance of YOLO World: it's an open-vocabulary model, meaning you describe or prompt it with something to look for and it tries its best to find it. So you can simply tell it to look for glasses and, as you can see in the image on the right, it was able to detect them without any retraining.
This isn't a magical silver bullet that can identify everything, though - it still has a limited set of training data to work from. Long story short, it has learnt from whole images rather than just learning to identify one object in each image. So if something appears commonly in photos - things like cameras, glasses, pens, batteries etc. - it will most likely be able to detect it. For more obscure things that aren't photographed frequently - things like abacuses and Geiger counters - chances are it won't be able to recognise them.
It may also have some issues detecting common things - we couldn't get it to detect a paintbrush or keys, for example.
To use YOLO World we will need to modify our script slightly. Here is some demo code for that:
import cv2
from picamera2 import Picamera2
from ultralytics import YOLO

# Set up the camera with Picam
picam2 = Picamera2()
picam2.preview_configuration.main.size = (1280, 1280)
picam2.preview_configuration.main.format = "RGB888"
picam2.preview_configuration.align()
picam2.configure("preview")
picam2.start()

# Load the YOLO World model
model = YOLO("yolov8s-world.pt")

# Define custom classes
model.set_classes(["person", "glasses"])

while True:
    # Capture a frame from the camera
    frame = picam2.capture_array()

    # Run YOLO model on the captured frame and store the results
    results = model(frame, imgsz = 640)

    # Output the visual detection data, we will draw this on our camera preview window
    annotated_frame = results[0].plot()

    # Get inference time
    inference_time = results[0].speed['inference']
    fps = 1000 / inference_time  # Inference time is in milliseconds
    text = f'FPS: {fps:.1f}'

    # Define font and position
    font = cv2.FONT_HERSHEY_SIMPLEX
    text_size = cv2.getTextSize(text, font, 1, 2)[0]
    text_x = annotated_frame.shape[1] - text_size[0] - 10  # 10 pixels from the right
    text_y = text_size[1] + 10  # 10 pixels from the top

    # Draw the text on the annotated frame
    cv2.putText(annotated_frame, text, (text_x, text_y), font, 1, (255, 255, 255), 2, cv2.LINE_AA)

    # Display the resulting frame
    cv2.imshow("Camera", annotated_frame)

    # Exit the program if q is pressed
    if cv2.waitKey(1) == ord("q"):
        break

# Close all windows
cv2.destroyAllWindows()
Note: The first time you run this script, it may take a few minutes, as the YOLO World models are larger and it will download additional required packages - about 400 MB in total.
Right now this script is looking for the words "glasses" and "person", but we can instantly change that by altering this line. All you need to do is put the object you are looking for in quotation marks and separate it with a comma:
# Define custom classes
model.set_classes(["person", "glasses", "yellow lego man head"])
In the line above we have used another cool feature of YOLO World and prompted it with a description instead of a single word. And as you can see on the right, it is detecting our yellow Lego man head. The detection is a little hit-and-miss, and it probably identifies the yellow colour more than anything, but it's a cool example of what can be achieved.
A few things regarding YOLO World though. First of all, it will only detect what you specify in the line above. If you do not describe an object in that list, it will not recognise it. If you delete the entire line, it will look for everything it can though and behave more like YOLOv8.
It also only comes in 3 sizes - small, medium and large. The demo script above is for the small model, and you can switch to the medium or large model by swapping the s for an m or l. If you wanted the large model, you would write:
# Load the YOLO World large model
model = YOLO("yolov8l-world.pt")
It also can't be converted to NCNN. This is a bit of a shame, but we can at least still drop the resolution to increase FPS - and you may need to, as it runs much slower than the default YOLOv8 model.
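Since NCNN is off the table here, the imgsz argument from the resolution section is the main lever you have, and it works with YOLO World in exactly the same way. The demo script above already passes imgsz = 640; dropping it to something like 320 (keeping it a multiple of 32) would look like this:

# Run the YOLO World model at a lower processing resolution for more FPS
results = model(frame, imgsz = 320)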
All in all, give YOLO World a go for your project. The ability to detect custom objects, or objects based on a prompt, offers a great deal of flexibility and customisation. At the very least, it's a good bit of fun.
Where to From Here?
By now we should have YOLO object detection running on our Pi, whether that's YOLOv8, v5 or YOLO World. This guide is just a starting point though, and we hope that you get out there and start adding computer vision to your projects. Most importantly, we hope you learnt something and had fun in the process.
If you are looking for some extension work or paths to wander down, have a look inside the results variable - you can find a whole heap of cool things, like the x-y coordinates of the boxes being drawn. It is very possible to find the centre of the detected object from this and wire up a servo to track it.
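As a starting point for that kind of extension, here is a small sketch of pulling the box coordinates out of the results and working out each detection's centre point (this would sit inside the main loop, after the model call):

# Each row of results[0].boxes.xyxy is one detection: [x1, y1, x2, y2]
for box in results[0].boxes.xyxy.tolist():
    x1, y1, x2, y2 = box
    centre_x = (x1 + x2) / 2
    centre_y = (y1 + y2) / 2
    print(f"Object centred at ({centre_x:.0f}, {centre_y:.0f})")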
If you are looking to run this system for extended periods of time, it will put a lot of read and write cycles on your microSD card, so setting up an NVMe SSD for longevity may be a beneficial move. We have a great beginner guide on how to do that.
If you want to dive deeper into YOLO, check out the Ultralytics documentation page. They have a whole range of other cool models to check out, and because we have already set up the required infrastructure to use their models, it should be as simple as copying and pasting a demo code that you find there!
Acknowledgements
We would like to first of all thank the many contributors, developers and maintainers of OpenCV and the COCO library. Much of the modern computer vision world has been built upon their efforts.
We would also like to thank Joseph Redmon and Ultralytics for developing and maintaining many of the YOLO models used here. They are fantastically powerful models that have been made able to run on low-power hardware like the Raspberry Pi and that is no small feat!
Appendix: Using a Webcam
This guide was built around using a Raspberry Pi camera module plugged into a CSI port. However, it is also possible to use a USB webcam. The code demonstrated throughout this guide should work out of the box with most webcams - but with a few possible issues. There is a lot of variability in how webcams behave on a Pi (which is why we opted for the camera module in this guide), and one of the common issues is a mismatched colour palette. If the colours in your webcam feed appear to be mixed up or the wrong hue, give this code a go:
import cv2
from picamera2 import Picamera2
from ultralytics import YOLO

# Initialize the Picamera2
picam2 = Picamera2()
picam2.preview_configuration.main.size = (1280, 1280)
picam2.preview_configuration.main.format = "BGR888"  # Change to BGR888
picam2.preview_configuration.align()
picam2.configure("preview")
picam2.start()

# Load the YOLOv8 model
model = YOLO("yolov8s-worldv2.pt")
model.set_classes(["spray bottle"])

while True:
    # Capture frame-by-frame
    frame = picam2.capture_array()

    # Convert BGR to RGB
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Run YOLOv8 inference on the frame, specifying the desired size
    results = model(frame_rgb, imgsz=320)

    # Visualize the results on the frame
    annotated_frame = results[0].plot()

    # Get inference time
    inference_time = results[0].speed['inference']
    fps = 1000 / inference_time  # Inference time is in milliseconds
    text = f'FPS: {fps:.1f}'

    # Define font and position
    font = cv2.FONT_HERSHEY_SIMPLEX
    text_size = cv2.getTextSize(text, font, 1, 2)[0]
    text_x = annotated_frame.shape[1] - text_size[0] - 10  # 10 pixels from the right
    text_y = text_size[1] + 10  # 10 pixels from the top

    # Draw the text on the annotated frame
    cv2.putText(annotated_frame, text, (text_x, text_y), font, 1, (255, 255, 255), 2, cv2.LINE_AA)

    # Display the resulting frame
    cv2.imshow("Camera", annotated_frame)

    # Break the loop if 'q' is pressed
    if cv2.waitKey(1) == ord("q"):
        break

# Close windows
cv2.destroyAllWindows()
There is no guarantee that this will work, but it should for most webcam brands. If it does, you can modify this code as needed, as instructed in the guide.
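If your webcam isn't picked up at all by the Picamera2-based script, one fallback worth trying (not covered in the main guide, so treat it as a rough sketch) is to grab frames with OpenCV's own capture interface instead:

import cv2
from ultralytics import YOLO

# Open the first available webcam with OpenCV instead of Picamera2
cap = cv2.VideoCapture(0)

# Load the same nano model used earlier in the guide
model = YOLO("yolov8n.pt")

while True:
    ret, frame = cap.read()
    if not ret:
        break  # Stop if the webcam didn't return a frame

    # Run detection and draw the results, just like the main script
    results = model(frame)
    annotated_frame = results[0].plot()
    cv2.imshow("Camera", annotated_frame)

    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()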