Real-time YOLO object detection running on microcontroller hardware. It sounds like asking your coffee maker to boot up a game of Counter-Strike or something, right? Well, how about that: this is YOLO running on microcontroller hardware. Welcome back to Core Electronics. We are doing yet another YOLO-based computer vision video, though this one's different, so don't click off just yet. This one might be the coolest of them all.
We've done a lot of computer vision videos with the Pi, sometimes with dedicated AI acceleration hardware on it. You might have run some on your computer for a bit of fun, or you might have seen some crazy performance on a high-end NVIDIA Jetson with over 275 teraflops per gigafart of compute power. And big, powerful AI hardware is really cool. But today's video lies on the complete opposite end of that spectrum, in an area where most makers probably don't think to look. What if, instead of trying to squeeze out the most TOPS we can, we tried to use as little power as possible?
Enter the Grove Vision AI V2 by Seeed Studio, a board that only uses 0.35 watts of power while still being able to run a YOLOv8 model at 20-30 FPS. The specs sound like fiction, but the board is very real, so let's see how they pulled it off. If you have a keyboard in front of you that lights up with RGB colors and whatnot, chances are it's thanks to one of these, a small RGB LED. One of these at full white brightness uses about the same amount of power as the Vision board, and your keyboard has one of these LEDs for every single key that lights up. That should give you a good sense of just how little power 0.35 watts actually is.
Now, I'm not going to lie, I found that number a little difficult to believe, so we went ahead, fired up our power analyzer, plugged in the Vision board, ran some object detection, and saw that it was in fact using about 0.35 watts of power. It is actually a tad more, but we can probably chalk that up to the camera using some power. Nonetheless, 20-30 FPS on a YOLOv8 model with 70 milliamps of current being drawn. How does it do this? Well, there are two things making this happen. This board has microcontroller hardware, but that doesn't mean it's the same as your ESP32 or Arduino or Pico, or that you can run these models on them. No, this is specialized microcontroller hardware designed for AI purposes.
In the onboard WiseEye2 processor, you'll find a dual-core Arm Cortex-M55, a high-speed, newer-generation processor with support for machine learning workloads. But the real star of the show is the Ethos-U55, another Arm core, but this one is designed for one thing and one thing only: machine learning acceleration. It's a dedicated processing core for tasks like computer vision. All of these cores come together to make something that's just powerful enough to run computer vision on-device, yet it's microcontroller hardware, so it's efficient enough to sip power like an LED. This is modern edge processing, and it's here, and it's ready to be applied in your maker projects.
The second part of this equation is the model itself. This isn't running a standard off-the-shelf YOLO model that you can pull from Hugging Face; it's been trimmed down and compressed for maximum efficiency. They've essentially taken a car and stripped it down to only the things you really need, like wheels, a drivetrain, an engine, and a steering wheel. You don't need doors, you don't need anything like that. The model runs at 192x192 pixels (the usual default is 640x640), and this does lower detection distance a bit, but it increases speed. The models have been quantized down to int8, which greatly reduces model size and increases speed. They've also been cut down to only detect one type of object per model, which again results in a smaller model and more speed.
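Seeed has its own tooling and documentation for producing these models, so treat the snippet below as nothing more than a rough illustration of the kind of trimming being described, sketched with the Ultralytics exporter: shrink the input size to 192x192 and quantize to int8. Actually getting a model onto this board involves extra steps to target the Ethos-U55, which aren't shown here.

```python
# Illustrative only: this is not Seeed's pipeline, just the same kind of
# trimming shown with the Ultralytics exporter, using a small YOLOv8 model,
# a 192x192 input size, and int8 quantization.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                           # smallest off-the-shelf YOLOv8 model
model.export(format="tflite", imgsz=192, int8=True)  # int8 TFLite at 192x192
```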
So this board is really just a beautiful fusion of new AI-focused microcontroller hardware and state-of-the-art models cut down to run on an embedded system like this. But now the question is, how does it all come together? How does it perform? Let's go ahead and fire it up. The Vision AI board comes either as a kit with a camera, camera cable and a Xiao ESP32, with the idea being that the Vision board does all the detecting and analysis of the frames, and then you send that data to the ESP32 and do whatever you want with it in your project. If you have a different microcontroller in mind, though, you can just buy the AI Vision board by itself and source your own camera module and cable.
There is limited compatibility here: some camera modules work fine, some don't work at all, and some, like the official Pi camera module, turn the whole world green for some reason, though this didn't really affect object detection much. In the written version of this guide linked below, you'll find some cameras that definitely do work, and there you will also find all the demo code that we're going to be using if you want to follow along and get your own going. Assembly is straightforward, just ensure you connect the camera cable the correct way around or you'll be in for some head scratching. Now comes another crazy thing about this board. Of every computer vision related thing that we've done on this channel so far, this is by far the easiest to get going.
You ready? Step 1: plug the board into your computer. Step 2: go to the SenseCraft Studio website and connect your board. Step 3: select the model you want to use. That's literally it. It takes about a minute to flash the model, and if it doesn't automatically start, you can hit Invoke to see a live preview of your camera through the studio as well. Next to it, you should also see this device logger, which shows a bit of the information that we're going to be taking and actually using later. In here, you can see a performance line, which shows you how many milliseconds it took to do all this processing, so 55-ish milliseconds to process an entire frame. And then below is a boxes line, which has all the detection data in it, including the confidence score and whereabouts the box is.
But yeah, that's how incredibly easy it is to get going with this. By the way, Seeed Studio has gone pretty ham here. If you go to Select Model and scroll down, you can see that there are quite a lot of models to actually choose from, and all of these are pre-trained models ready to upload straight to the board like we did. If you've ever wanted to detect the soundtrack from the hit movie Inception on CD, you're in luck. Maybe not, though, because a good chunk of these models are borderline useless. For the life of me, I couldn't detect any CDs with this CD model. I guess a lot of them are just automated through some pipeline and quality control is a little lacking, but a lot of the models do work and they work well, especially for the more common things like people, faces and pets, and community models that people have uploaded to SenseCraft tend to be pretty solid as well.
Be aware that just because a model is on here and ready to go doesn't mean that it actually works at all, so test it before you base your entire project around it. Also be aware that you might struggle to find models that detect multiple things. One of the trade-offs of running on-device like this is that these models tend to detect one thing and one thing only: so only faces, or only apples, or only dogs, stuff like that. Also, side note, Seeed does have some really good documentation for training your own models completely from scratch. Just be aware that it's not very beginner friendly and can be a little bit involved, but it's definitely possible. Overall though, once you find a model that works, this system works well.
It's really incredible to think that this is doing object detection so reliably, and with such a decent frame rate, on barely any power. It just hits that perfect detection-performance-to-power ratio. One more thing before we move on: you can also update your device's firmware here, down at the bottom, and it's definitely worth doing. Our board was running firmware about a year behind, and we got about a 20 to 30% speed boost after updating. Also down here, you can connect the board to Wi-Fi and your MQTT broker to beam data to Home Assistant if you're into that stuff.
All right, we've got our board, we've flashed a model onto it, and our preview shows that it's working well. How do we actually use it? How do we take that detection data and apply it to our projects? Well, we have a few options. It supports I2C communication, and it has a great Arduino library if you want to go and do something in C++. But I am the number one MicroPython shill, and I think about getting that shirt at least once a week, so we're going to connect it to a Pico 2 via UART. I'm going to go ahead and connect ground and 3.3 volts from my Pico to the Vision board. This means we no longer need to power it over USB-C; it can just get its power through the Pico. And then we're just going to connect our UART pins. I'm using pins 0 and 1 on my Pico.
Alrighty, I'm in Thonny and I've got this demo code here that's going to show us what's going on UART-message-wise. A very nice thing I like about this board is that you choose when you want a frame to be processed. We start by setting up our UART comms, which runs at a scarily high baud rate, by the way. Then we simply send this AT+INVOKE command, which tells the board, hey, can you please take a photo, analyze it, and then send that data back to us over UART. The rest of the code simply collects what it sends back and prints it out. Let's give that a run. And as you can see, that's a pretty quick round trip.
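To give you an idea of what that boils down to, here's a minimal sketch of the same round trip. The pins match the wiring above, but the baud rate (we've assumed 921600 here) and the exact AT+INVOKE argument string are assumptions on our part, so copy those from the demo code in the written guide.

```python
# Minimal sketch: ask the Grove Vision AI V2 for one processed frame over UART
# and print whatever it sends back. The baud rate and the AT+INVOKE arguments
# are assumptions; use the values from the demo code in the written guide.
import time
from machine import UART, Pin

uart = UART(0, baudrate=921600, tx=Pin(0), rx=Pin(1))  # GP0 = TX, GP1 = RX, as wired above

uart.write(b"AT+INVOKE=1,0,1\r\n")  # "please take a photo, analyze it, and send back the results"
time.sleep(0.2)                     # give the board a moment to process and reply

while uart.any():                   # print the type 0 and type 1 replies as raw lines
    print(uart.readline())
```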
This is actually two messages that it sends back. When we send that invoke command, it starts by immediately sending back this type 0 message, which is just a confirmation to say: yep, got it, I'm starting, and here are all the settings and config I'm running with. Then, once it's finished processing the frame, it sends back this type 1 message. They're kind of just pushed together here, but they do come through separately. And as you can see, we have our performance line with how long that frame took to process, and we now have information in our boxes, exactly the same information that we saw in SenseCraft Studio earlier. It's wickedly cool that you can control all the timing of this system and all the frame analysis directly from your code.
And obviously, if you put this in a loop and run it as fast as you can, you'd get your 20 to 30 FPS maximum out of this. If you want less FPS, send the command as infrequently as you want, like in the little sketch below. Just be aware, though, that you probably won't be saving much power. Here is a power test where we tell the board to only process one frame a second, and as you can see, there isn't much of a difference between standby and processing. So you're probably not going to save much power doing this; you might as well just run it at the full FPS it can handle. But nonetheless, super easy.
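That throttling is literally just the earlier sketch with a sleep in the loop. Again, the baud rate and command string are assumptions, so match them to the demo code.

```python
# Throttled version of the earlier sketch: only ask the board to process a
# frame once per second. Baud rate and AT+INVOKE arguments are assumptions.
import time
from machine import UART, Pin

uart = UART(0, baudrate=921600, tx=Pin(0), rx=Pin(1))  # GP0 = TX, GP1 = RX

while True:
    uart.write(b"AT+INVOKE=1,0,1\r\n")  # request one processed frame
    time.sleep(1)                       # roughly 1 FPS
    while uart.any():                   # print whatever it sent back
        print(uart.readline())
```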
And really, all you're after are these numbers in the boxes variable here. The first two are the x and y position of the detection box that it draws around the object, so this is actually the center point of your object. The third and fourth numbers are the width and the height of that detection box, and all of this is done on a 480 by 480 pixel image. The fifth number is the last important one, because it's the confidence score of that detection, which is really helpful in your project. If multiple objects are detected, it's going to create another string of numbers in boxes.
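If it helps, here's a tiny sketch of how you might work with one of those detections. The helper names, and the assumption that the confidence score is on a 0 to 100 scale, are ours, so adjust them to match what your board actually reports.

```python
# Small helpers for one detection, assuming the format described above:
# [center_x, center_y, width, height, confidence] on a 480x480 frame.
# The 0-100 confidence scale is an assumption; adjust if yours is 0-1.

MIN_CONFIDENCE = 60  # ignore weak detections

def box_corners(box):
    # Convert a center/size box into top-left and bottom-right corners.
    cx, cy, w, h = box[0], box[1], box[2], box[3]
    return (cx - w // 2, cy - h // 2, cx + w // 2, cy + h // 2)

def is_confident(box):
    # True if the detection's confidence score clears our threshold.
    return box[4] >= MIN_CONFIDENCE

# Example with a made-up detection:
detection = [240, 200, 80, 120, 87]
if is_confident(detection):
    print("center:", detection[0], detection[1], "corners:", box_corners(detection))
```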
I've just put up a generic stock image of some people, and I'm going to point the camera at it and run it. As you can see, we've got multiple boxes for multiple things being detected. Also, just a fun little thing: in this invoke line up here, if you change the last number from a 1 to a 0, run it and point it at me, you can see that our message is suddenly a lot bigger. If we expand that in the shell (let's expand it again), you can see that we get the same message out, but this time the type 1 message actually sends the entire JPEG image it took over UART. And as you can see, it's a huge mess, but it's all just a base64-encoded JPEG.
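If you did want to reconstruct that image on the Pico, the decode itself is only a couple of lines. Here's a quick sketch, with a short stand-in where the real base64 payload from that type 1 message would go:

```python
# Sketch only: turn a base64-encoded JPEG back into raw bytes on the Pico.
# In reality image_b64 would be the long string pulled out of the type 1
# message; this short stand-in just keeps the example runnable.
import binascii

image_b64 = b"/9j/4AAQSkZJRgABAQ=="          # placeholder, not a full image
jpeg_bytes = binascii.a2b_base64(image_b64)  # decode back to raw JPEG bytes

with open("frame.jpg", "wb") as f:           # save to the Pico's filesystem
    f.write(jpeg_bytes)
```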
Probably not helpful in your project, but it's pretty cool that you can actually get that image sent over UART, and if you're a weirdo who's into handling images on microcontrollers, you can reconstruct the image. Ah man, if only there was a second demo code that stripped away all the unnecessary stuff and just gave you the boxes. There is, so let's take a look at it. This code is very similar to the first one, but it's ready to be applied in your projects. Up here you have a function that sends the invoke command, reads both of those messages, and just extracts the boxes information for you. It's also got a timeout feature and error handling, and it's been made nice and robust, ready to use in a real-world project.
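We won't paste the guide's code here, but a rough sketch of that kind of helper looks something like this. The baud rate, the AT+INVOKE string and the JSON key names ("type", "data", "boxes") are assumptions on our part, so match them to the actual demo code in the written guide.

```python
# Rough sketch of a "just give me the boxes" helper: send one invoke command,
# read reply lines until a type 1 (results) message arrives or we time out,
# then return only the boxes list. Baud rate, command string and JSON key
# names are assumptions; match them to the demo code in the written guide.
import json
import time
from machine import UART, Pin

# Extra RX buffer headroom and a short per-line timeout so long replies arrive whole.
uart = UART(0, baudrate=921600, tx=Pin(0), rx=Pin(1), rxbuf=2048, timeout=50)

def get_boxes(timeout_ms=1000):
    uart.write(b"AT+INVOKE=1,0,1\r\n")                    # ask for one processed frame
    deadline = time.ticks_add(time.ticks_ms(), timeout_ms)
    while time.ticks_diff(deadline, time.ticks_ms()) > 0:
        line = uart.readline()
        if not line:
            continue                                      # nothing yet, keep waiting
        try:
            msg = json.loads(line)
        except ValueError:
            continue                                      # skip partial or garbled lines
        if isinstance(msg, dict) and msg.get("type") == 1:  # the results message
            return msg.get("data", {}).get("boxes", [])
    return []                                             # timed out, report no detections

# Usage: poll as fast (or as slowly) as your project needs.
while True:
    boxes = get_boxes()
    if boxes:
        print("detections:", boxes)
```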
So let's go ahead and run that demo code, and as you can see, no objects detected. All right, I'm going to very carefully point the camera at me, and there we go, we've got all the information coming out. If you are interested in using this, we go over it in more depth in the written guide, but just know that if you want to actually use this, we've got some nice code to get you going. Well, that about wraps it up. I've been playing around with this thing for quite a while now, and I'm still amazed by it every time I use it. I can't believe that we've reached a point where we can do this sort of stuff on microcontroller architecture.
The role of microcontrollers is shifting quite quickly as more and more processing power becomes available. In case you haven't seen it, you can now also run OpenCV on a microcontroller in MicroPython, another computer vision thing on microcontroller hardware. This technology is only going to get better from here, and right now it's already more than good enough for use in maker projects. I'm just so excited about where this sort of stuff is going to go. If you make anything cool with this and want to share it, or you just need a hand with anything we covered in this video, head on over to our community forums. We're all makers over there, and we're happy to help. Until next time, happy making.
Makers love reviews as much as you do, so please follow this link to review the products you have purchased.