**Part 1**

YOLOE is one of the most mind-blowing computer vision models to come out in a while. To understand why, we first have to look at the shortcomings of other models. Without YOLOE, you would download a computer vision model onto a Raspberry Pi, maybe YOLO11 or YOLOv8, fire it up, and start detecting objects just fine. You can detect a person, a chair, a keyboard, all those things, because it comes pre-trained to identify them. But what if I wanted it to identify, say, an object like this? Well, we would need to train it to do so. I'd start by taking about 100 photos of it, then go through the fairly involved process of using those images to train YOLO to detect this new object. You can't do this process on a Pi; you need a decently powerful computer, and even with a high-end gaming GPU it takes hours of processing. Once it's finished, I would then put that model back onto my Pi 5 and start detecting. If I changed my mind and wanted to detect another object, I would need to repeat that entire process again.

Now let's look at how to do it with YOLOE. I want it to detect a Pokeball. I change one line of code, run a Python script on my Raspberry Pi, and about five seconds later the model is ready to go. What? Welcome back to Core Electronics. Today we're looking at YOLOE, a promptable vision model: how it works and how you can use it in your projects with a Pi 5. Even if you don't plan to use this in a project, it's such a fun thing to play around with if you have a Pi 5. Let's get into it.

YOLOE is built upon other standard YOLO models, but there is one major difference. Instead of being trained on specific things, it has been trained on visual concepts and ideas. Let's say we gave it an image of a horse. Instead of saying, "This is a horse, learn what a horse looks like," it might instead have been taught the visual ideas behind that horse: what actually makes it look like a horse? It might have been told that it has four legs, it's brown, it's a little bit hairy, it's got a long face, it's a mammal, and so on. Because it has these visual concepts and ideas, it can identify things it has never seen before if you ask it nicely.

Let's say it has never seen a picture of a zebra in its life, but I prompt it to identify one anyway. It starts by breaking that prompt down into visual ideas it knows. It might break the word "zebra" down into four-legged, stripy, black and white, hairy. It knows what all of those visual concepts look like, so it uses them to recognize the zebra. Because it's equipped with these visual ideas, it can identify a lot of things. Normal YOLO models come trained to recognize about 80 things by default, give or take. Now, I am just guessing here, but I think YOLOE could easily be used to recognize 5,000, probably somewhere in the tens of thousands of things, without lengthy retraining. It's also really flexible: you can just say "box" and it will identify boxes just fine, but you can also say "blue box," "brown box," and "clear box," and it's able to identify all of those as well. It understands the visual idea of what colors are.

Now, this is not a magical solution and the end-all be-all of computer vision. It does have its limits. For really obscure or uncommon words or items, chances are it can't break them down into visual concepts it knows, and it won't be able to identify them.
Beast" or "Jeff Geeling" and magically start identifying. Another example is this little 3D printed Minecraft copper golem. YOLO-E does not know what Minecraft nor a copper golem is, so you would struggle to find a prompt to identify this. Although, it does have another mode that we have strategically omitted till now. You can prompt it with text or an image. You can show it a single image of this golem, just one photo, and it breaks it down into visual concepts, then identifies it off that. We'll look at this later in the video when we get to it though. So, YOLO-E can be prompted to identify nearly everything; that's what the E stands for, fun fact. It can be prompted with an image and will probably meet 70 to maybe 80% of your custom detection needs, and it runs as fast as regular YOLO models. I think you can see why this is such an incredible advancement in vision models. Alright, let's take a look at how to get this going. Our device of choice is a Pi 5 because it's a really good candidate for something that you would actually run this on in a maker project, and a 4GB or larger model is probably the safe bet here. You're also going to need a camera module; we're just using the camera module 3. Of course, you'll need everything else to power and run your Pi, including a microSD card that's at least 32GB in size. You can find everything you need in the written version of this guide linked below. There you'll also find a zip file containing all the code we'll be using, as well as some more detailed instructions if you get stuck on any of these steps. You're also going to need to install the YOLO Ultralytics package on your Pi. **Part 2** This has become a little bit more difficult than it used to be, so we have another video guide linked below. It's designed to complement this nicely. Go and watch it. You're going to go install that first. And of course, ensure you've plugged in your camera. Once you've completed and installed everything from that video, head on over to the written guide linked below and download that zip file. Extract all the Python scripts into a file that is somewhat easy to get to; I'm just using my desktop here. You're going to want them all inside a file because this process is going to create lots of little extra files inside of here, and you don't want it to get messy. To begin with, we're going to open up YOLO E run model, which is going to do so in Thonny because hopefully, you've already set up Thonny to use the virtual environment that we created. As you can see, there is not much code required to run all of this. So we go ahead and import everything we need and then we just use PyCamera2 to configure and set up the camera. Then we load the model that we want to be using for our object detection. We then take that image and run object detection with our model. All we're going to do is get the results of that and then put it in a visual thing that we can see. Now, this line here has two options: boxes and masks. Boxes are those identification boxes you'll see in all computer vision. You can turn it off if you want. Mask kind of draws a silhouette around the object that it's detected. I like to leave this off because it can get messy, but you can also use it in your project to kind of figure out the area of the object if you want to have a play around with it. You'll see what I mean. This entire section here is just calculating the FPS and then showing it on there. Half the code is just this FPS, which is completely optional. 
The things you might want to change in this code are the camera resolution and the model. Note that the camera resolution is not the resolution YOLO processes at, so lowering it won't increase performance; we'll take a look at that later. It's just the resolution you want the camera to run at. You'll also want to change the model to whatever you're using, but don't change it now; we'll show you how in a bit.

So let's go ahead and run this code. The first time you run anything new with the Ultralytics package it might take a while, as it downloads things it needs, like the model itself. Ah, and you of course need the internet to download a model. That's probably a very important thing. Let me connect it to the internet. We downloaded our model. Come on, a little bit more. And with that, we are up and running with our YOLOE detection. Hold up. Wait a minute. Where are my prompts at? What is this? Well, this is something called prompt-free mode. It essentially just tries to detect everything it can in an image, like a regular object detection model.

Now, this FPS is pretty bad, so let's see how we can speed it up, as this process is also how we're going to apply our custom prompts. We're going to press Q to stop that from running, then open up the prompt-free ONNX conversion script in Thonny like so. This script takes the model we want to use and converts it to the ONNX file format, which is just a more efficient format for running on devices like the Pi; it's going to give us improved FPS. There is another model format that works well on the Pi, and that is NCNN. Traditionally we have used NCNN, but I didn't really see much of a difference between ONNX and NCNN performance with YOLOE, so we're just going to stick with ONNX. Give it a go for yourself and see what works; it doesn't really matter, and you can follow the guide using ONNX or NCNN from now on. We're going to keep it at ONNX though, like so.

The other option we can change is the resolution of the model, which is another way to increase FPS. By default, your model processes at 640 by 640 pixels. The code takes whatever camera resolution you've set in the main script, in my case 800 by 800, and scales it down to 640 by 640 to be processed by the vision model. By changing this number here, we can tell it to go even lower, say 320 by 320 pixels; it's always square. Fewer pixels to process per frame means more frames per second, so you can drop this pretty low and get some really good FPS. I could set this to 96 and get around 20 FPS, or set it to 128, or whatever number you want between 32 and 640. It just needs to be a multiple of 32; that's very important. Also be aware that there is a big downside to this: the lower the resolution, the shorter your detection range. If I try to detect this phone at the full 640 pixels, I can detect it all the way back here, but if I set it to 128 pixels, I struggle to detect it more than a couple of meters away. We'll look at how to tune these to your needs in a bit. It's also worth knowing that 320 and 128 might not sound like a lot of pixels, but these models can do a lot with very few of them. For now, I'm just going to set my resolution to 192, which gives us a nice middle ground between pretty fast and still able to detect at some distance. Let's go ahead and run that code.
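For reference, that conversion script boils down to something like this. It's just a minimal sketch of the standard Ultralytics export call, not the exact file from the zip, and the model filename is illustrative.

```python
# Rough sketch of a prompt-free ONNX conversion script.
from ultralytics import YOLO

# The prompt-free YOLOE model downloaded by Ultralytics (.pt is the PyTorch format).
model = YOLO("yoloe-11s-seg-pf.pt")

# Export to ONNX. imgsz is the processing resolution: any multiple of 32
# between 32 and 640. Lower means faster, but shorter detection range.
# (format="ncnn" is the other option mentioned above.)
model.export(format="onnx", imgsz=192)
```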
After just a few seconds, that's all it takes to convert it to ONNX. Let's click down. Beautiful. After it's finished, we should be able to open up the folder that all of our scripts are in.

**Part 3**

You'll see a yoloe...onnx file, and that is our converted model. You can also see our .pt model, which is the one we started with and downloaded through the Ultralytics package. Now, to run it, all we need to do is go back to our run model script and, instead of .pt (which is PyTorch, by the way, just the default format all the models come in), delete .pt and set it to .onnx, run it, and let it do its thing real quick. We've got our object detection running and, as you can see, way better FPS. It's struggling a bit to detect the things in the background because the resolution is smaller and they're far away, but I can pick up the phone and it detects phone, beautiful, smartphone, iPhone, webcam, everything. Will it do Pokeball? Moth. I think that's a good example of some of the false positives it gets; it'll identify one object as like 50 different things. What does it detect this little fella as? iPhone. I don't think that's an iPhone, buddy. And that is our first demo done and dusted: a prompt-free mode that identifies whatever it can see.

Now, this might work well for your project, but chances are you're going to need to use a text prompt approach. A good example of why: I'm just going to go ahead and look for a picture of a tiger on my phone, just Googling it, Google Images. I hold that up and, as you can see, it's identifying it as pretty much everything but a tiger. Oh, no, there you go, for one frame it identified it as a tiger. Definitely not a philosopher, definitely not a wolf; lots of false positives. Let's fix this by using a text prompt.

Alrighty, go ahead and open up the text prompt ONNX conversion Python file. This is pretty much the same as the other ONNX conversion file we had before, except that before we convert it, we feed it our prompts. Now, what we're doing here is a bit of a Pi-specific thing for the ONNX file format. Usually, you would just use the default PyTorch model that it downloads, put your prompts straight into the main run model script, run it, and that's it. However, when we convert the PyTorch model to the ONNX format so we can run it on the Pi more efficiently, it hardens, kind of like clay, and we can no longer give it prompts. So we have to put the text prompts into the model before we convert it and it hardens up. This is super simple though: all we need to do is put in our prompt text here, and if we want to add more, so let's say we identify a phone and a hand, I'm going to copy and paste this and say please, pretty please, detect tiger. Let's say I also want to detect beard, I'll put that in as well, and let's also put in pokeball, like so. You can keep chaining these on for as long as you would like. Alrighty, let's export that, let it do its thing, and once it's finished exporting we should see our new file in here. Now, something very important: we've created two here. This one with -pf is the prompt-free model we did just before, and this one without it is the one we just created.
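In code, that text-prompted conversion is roughly the sketch below, assuming the Ultralytics YOLOE text-prompt interface (set_classes and get_text_pe); the prompts and filename are illustrative, not the guide's exact script.

```python
# Sketch of a text-prompted ONNX conversion.
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.pt")

# The prompts get baked in before export; chain on as many as you like.
names = ["phone", "hand", "tiger", "beard", "pokeball"]
model.set_classes(names, model.get_text_pe(names))

# Export with the prompts "hardened" in. Same imgsz rules as before.
model.export(format="onnx", imgsz=192)
```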
Now, if we use this same script to export another model, it's going to create a file with the same name and overwrite the one we just made. So to prevent this from happening, you can change the name. I'm just going to rename it to tiger-seg; I found that you need to keep the -seg on the end to keep it happy. I'll copy that name into the run model script here as well, with .onnx to keep the file format, run that, and we should be running our model. As long as the model is in this folder, you can put whatever the model's name is here and run it.

And like that, we're up and running, so let's get a picture of a tiger up. There we go, and as you can see we're pretty good on those detections now, even the little ones. Let's do a few more pictures of a tiger. Oh, look at that big boy. Yeah, that's a tiger as well. Look at that tiger, look at that boy. And that is the whole process. Have a play around with it and see what it can do. There is a bit of an art to working out what prompt you'll need to identify something, but you can smash out different prompts really quickly. Changing one line of code and exporting the model in five seconds is far quicker than a few hours of retraining with a traditional model. Again though, it's not entirely magic. If I tell it "yellow Lego man head" it can identify the giant Lego head, but I don't think it actually understands Lego or head; it's instead looking for something that's yellow.

Now, if you're struggling to identify something, or it's not giving you a confidence rating that you like (by the way, that little number at the top of the box is the confidence rating, from zero to one), then you can change the model size, which is the final thing we can use to tune our setup. I'm going to keep the exact same prompts and resolution, but change the model size from the small model to the large model by just replacing the S with an L, so we're running the YOLOE 11 large model instead of the small model.

**Part 4**

There's also a medium model, which sits in between large and small. Essentially, a larger model is more powerful in its detections but runs slower; a smaller model runs faster but has less detection power behind it. I'm going to export that, and it's probably going to need to download that large model first. Let's run the small model first as a test: I'm going to show it my hand, and you can see it's a bit hit and miss on the detections. If I hold up my phone, it's about 60 to 80 percent confident that it's a phone. Let's now change it to run the large version of the model we just exported. If I hold my phone up, you can see we are pretty damn confident on our phone here. Hold my hand up. There we go. Nice and solid. Add a bit of an angle, still detecting hand, as long as it's not blurry.

Okay, so we can change the resolution, the model format, and the model size. How do I actually pick these things? Really easy rule of thumb: keep it on ONNX, or NCNN if you want, whatever runs better for you. Then find the model size that works accurately enough for you. A cup is easy to identify and you can probably get away with the small model, but a more complex object might need a larger model. Once you've figured out the model size that works, change the resolution and make it as small as you can get away with. If you're trying to detect something really close, congrats, you can probably use a really low resolution and get good FPS. If it's on the other side of the room or across a hallway, you might need to use the full 640 pixels and just deal with the one to three FPS.
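By the way, if your project needs to act on that confidence rating rather than just read it off the preview, the Ultralytics results object exposes it. Here's a quick sketch with illustrative filenames (it assumes the prompt names carry through the ONNX export's metadata).

```python
# Sketch of reading class names and confidence ratings from a detection result.
from ultralytics import YOLO

model = YOLO("tiger-seg.onnx")        # the renamed export from above (illustrative)
results = model("tiger_photo.jpg")    # illustrative test image; could be a camera frame

for box in results[0].boxes:
    cls_id = int(box.cls[0])          # index into the prompt list
    conf = float(box.conf[0])         # confidence, 0 to 1 (the number on the box)
    print(results[0].names[cls_id], f"{conf:.2f}")
```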
For most maker projects, though, one to three FPS is probably enough. Now, we found text prompting to be the best and most reliable method of using YOLOE, so I would say try that first. However, you might find an object that is very difficult to detect, for example this 3D-printed Minecraft copper golem. I don't think it's possible to use a text prompt to identify this, but this is where image prompting comes in. It's a bit hit and miss: some things work incredibly well and others not so much. So try the text prompt first; if that doesn't work, give this a go and you might get lucky.

First things first, we need an image of our object. We're going to open up, where are we, the image prompt capture script in Thonny; I'm just going to close that old one down. At the top, you'll find the name of the image it's going to save as. I'm just going to save this as golem.jpg instead. Run that script, hold up our little fella, and I'm going to make the preview a little bigger here as well so we can get a good photo of him, like so. I'm going to hold him up nice and square like that, press space to take a photo, and that will have saved to our folder. Let's stop that from running.

Now I'm going to open the image prompt draw box script in Thonny as well and change it to the name of the file we just created. If you're going to use your own custom image for this, you can just drag it into the folder here and continue on with that file name. I'll give that a run. What we need to do is draw a box around the golem, like so, as sketched in the snippet below. I might ignore his antenna because we don't really need it for the detection. If you look in our shell, that's going to print four numbers; we're going to copy them, because they define the box we've just drawn. Really try to draw that box tightly, because everything inside it is what the YOLO model is going to look at and try to break down into visual concepts.

Once we've got the image and the box, we can open up the image prompt ONNX conversion script. This is a little more involved, but it's the exact same kind of ONNX conversion as before. In here, we specify the image and then the box coordinates of the thing we're trying to identify in that image. I'm going to change this to golem.jpg and paste in the coordinates that are still sitting in the shell. Nice. Exactly like the text prompt, you can chain as many of these together as you want; if I copy this and paste it there, we can have a third, fourth, fifth thing, on and on. But we're only doing one, so I'm going to delete the other one, because otherwise it would be looking for an image that doesn't exist. I might go down here and drop the resolution a bit because we don't need the full 640 there; I'm happy with a smaller one. I might go with the large model on this one because it's a bit of a tricky thing to identify. Let's give that a run and let it do its thing. This is going to save with the same file name as the previous export, so either use it as is or rename it to something ending in -seg if you want to keep it. Very important: here you can see golem.jpg has been assigned an ID number. Unfortunately, you can't really assign names with this method; each prompt is just a number. The first thing you prompt with gets ID zero, the second gets ID one, and so on.
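Here's that box-drawing sketch mentioned above: a minimal stand-in using OpenCV's built-in ROI selector. The script in the guide's zip will differ, but the idea is the same, and it prints the same four corner numbers you paste into the conversion script.

```python
# Sketch of the draw-a-box step using OpenCV's ROI selector (illustrative).
import cv2

image = cv2.imread("golem.jpg")

# Draw a tight box around the object, then press ENTER or SPACE to confirm.
x, y, w, h = cv2.selectROI("Draw a box around the object", image)
cv2.destroyAllWindows()

# Print the box as corner coordinates (x1, y1, x2, y2); these four numbers
# are what go into the image prompt conversion script.
print(x, y, x + w, y + h)
```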
Hey, look at that, we are detecting our object, which is pretty crazy: that was just one photo we showed it, and it went, "I kind of understand what ideas make up that image," and now it's doing the exact opposite to identify the object. Nuts, for something we can do in just seconds on a Raspberry Pi.

**Part 5**

One more thing before we leave: included in the zip are two demo scripts to get you started on actually using this in a project. They are just modified versions of the run model script and can serve as the starting point for your own code. They are demo object counter and demo location. Demo object counter lets you specify an object, how many of that object, and how confident you want the detections to be; if it detects that many with that confidence, it lets you do something, whether that's moving a servo, sending an email, or whatever your project needs. Essentially: if a certain amount of a certain object is seen, do something. Demo location, on the other hand, lets you specify an object and tracks the location of that object on the screen, which is super handy for figuring out where something is. A rough sketch of the logic behind both is shown below.

And that about wraps us up for now. You are equipped with the ability to use YOLOE on a Raspberry Pi 5 to detect custom objects on the fly with no retraining, which is a pretty mind-boggling thing to be able to do. We hope you go out and make something cool with this. If you do, post about it on our community forums, or if you need a hand with anything we covered in this video, feel free to head on over there as well. We're all makers and we're happy to help. Till next time, happy making. Is this a tiger? Yes, it is. I don't know what I'm doing.
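As promised, here's a rough sketch of the kind of logic those two demos describe, counting detections of one class above a confidence threshold and grabbing a box centre, using the standard Ultralytics results API. The names, thresholds, and filenames are illustrative; the actual demo scripts are in the guide's zip.

```python
# Illustrative sketch of the object-counter and location ideas (not the guide's demo scripts).
from ultralytics import YOLO

model = YOLO("tiger-seg.onnx")      # illustrative model filename
TARGET = "tiger"                    # object to look for
WANTED = 2                          # how many of it
MIN_CONF = 0.5                      # confidence threshold (0 to 1)

results = model("frame.jpg")        # in a real project this would be a camera frame

count = 0
for box in results[0].boxes:
    name = results[0].names[int(box.cls[0])]
    if name == TARGET and float(box.conf[0]) >= MIN_CONF:
        count += 1
        # Location: centre of the box in pixels, handy for tracking on screen.
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{name} at ({(x1 + x2) / 2:.0f}, {(y1 + y2) / 2:.0f})")

if count >= WANTED:
    print("Condition met: do something (move a servo, send an email, etc.)")
```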
Makers love reviews as much as you do, so please follow this link to review the products you have purchased.