Getting Started with the Grove Vision AI V2 | Power Efficient Object Detection

Updated 14 December 2025

In this guide, we will be exploring the Grove Vision AI V2 – a microcontroller-based board capable of running YOLO and other computer vision models in real time, all while consuming a relatively tiny amount of power. We’re going to look at how this awesome piece of hardware manages to be so power efficient, how you can get computer vision models running on it in minutes, and how to connect it up to a microcontroller to use the output detection data in your own projects.

This might be one of the most surprising bits of computer vision hardware we’ve played with yet – incredibly low‑power, incredibly simple to use, and incredibly fun to experiment with. Let’s get into it!


What is the Grove Vision AI and How Does it Work?

AI and computer vision running locally have become huge. Over the years, we have produced quite a few computer vision guides - mostly with Raspberry Pis, sometimes with dedicated AI-accelerating hardware. You may have spun up OpenCV or a YOLO model on your computer for fun and accelerated it with a gaming GPU, or you might have seen some crazy performance on a high-end Nvidia Jetson. Big, powerful AI hardware is really cool, but this board lies at the complete opposite end of that spectrum, in an area where most makers probably don't think to look.

The Grove Vision AI V2 asks a question: what if, instead of trying to squeeze the most TOPS out of each watt, we tried to use as few watts as possible? And that is exactly what this board achieves, drawing as little as 0.35 W while still running a trimmed-down YOLOv8 model at 20-30 FPS! The specs sound like fiction, but the board is very real, so let's see how exactly they pulled this off.

a close-up of the Grove Vision AI V2 board.

If you have a keyboard that lights up with RGB colours, chances are it's using small WS2812 RGB LEDs. One of these at full brightness uses about the same amount of power as the vision board... and your keyboard has one of these LEDs for every single key that lights up. Hopefully that gives you a good sense of just how little power 0.35 watts really is.

This is such an incredibly small amount of power for a task like this, so we wanted to have a look at the board's power usage for ourselves. To test its consumption, we fired up the Otii Arc, an extremely accurate tool that can precisely monitor power usage for devices like this. The image on the right is a reading showing the board powering on, idling for a bit, then running a YOLO model. Even with a small chunk of power going to running and capturing images with the camera, it is indeed consuming about 0.35 watts.

25 FPS. Running a YOLOv8 model. 70 milliamps of current. How does it do this? Well, two things make this happen.
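To put that figure in perspective with some quick arithmetic of our own (the 5 V supply and the hypothetical 2000 mAh battery are our assumptions, not measured specs):

```python
# Back-of-the-envelope figures for the board's power draw (approximate).
POWER_W = 0.35    # draw measured while running YOLOv8
SUPPLY_V = 5.0    # assumed USB supply voltage

current_ma = POWER_W / SUPPLY_V * 1000
print("Current draw: %.0f mA" % current_ma)          # 70 mA

# Hypothetical 2000 mAh battery: rough runtime at this constant draw
battery_mah = 2000
runtime_h = battery_mah / current_ma
print("Runtime on 2000 mAh: %.1f h" % runtime_h)     # ~28.6 h
```

That 70 mA figure lines up with what we saw on the Otii Arc.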

an Otii Arc analysis showing the power-on process and power consumption levels of the vision board.

The Vision board is based on microcontroller hardware, but that doesn't mean we can run these models on a standard ESP32, Arduino, or Pico. Instead, this board uses microcontroller hardware designed for AI purposes. In the onboard WiseEye2 microprocessor, you will find a dual-core Cortex-M55 - a high-speed, newer-generation Arm core with support for machine learning workloads. But the real star of the show here is the Ethos-U55, an Arm core designed with one thing in mind: machine learning acceleration. It's a processor designed from the ground up for tasks like the computer vision we wish to run.

All these cores come together to make something that is just powerful enough to run computer vision on-device, yet efficient enough to sip power like an LED - this is modern edge-processing and it's ready to be applied in your maker projects.

A screenshot of SenseCraft Studio showing the Grove running flower detection on a frame of the presenter holding a bouquet of flowers.

The second part of this equation is the model itself. This isn't running an off-the-shelf model you might pull from Hugging Face. Instead, it's been trimmed and compressed down for maximum efficiency - like stripping a car down to the absolute minimum it needs to drive: no seats, no doors, no roof, just an engine with wheels.

The model runs at a resolution of 192x192 (standard YOLO models expect 640x640), which lowers effective detection distance but greatly increases processing speed. The models have also been quantised down to INT8, which shrinks them considerably and speeds up inference, and they have been cut down to detect only a single object class, making them smaller and faster still.
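To put rough numbers on those trims (illustrative arithmetic of ours, not Seeed's benchmarks):

```python
# How much the 192x192 input shrinks the work compared to a 640x640 YOLO input
full_res, tiny_res = 640, 192
pixel_ratio = (tiny_res ** 2) / (full_res ** 2)
print("Input pixels: %.1f%% of a full-size input" % (pixel_ratio * 100))  # 9.0%

# INT8 weights take a quarter of the space of 32-bit floats
fp32_bytes, int8_bytes = 4, 1
print("Weight storage: %d%% of FP32" % (int8_bytes * 100 // fp32_bytes))  # 25%
```

So before any clever architecture changes, the board is already processing roughly a tenth of the pixels with weights a quarter of the size.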

So all this board really is, is new AI-focused microcontroller hardware paired with state-of-the-art models cut down to the bare bones to run on an embedded system. But now the question is: what is the result? How well does it perform? Let's fire one up and see.


What You Will Need

The Vision board comes either as a kit - containing the board itself, a camera, a camera cable and a Xiao ESP32 (which conveniently plugs onto the Vision board), the idea being that the Vision board processes images and then sends the results to the ESP32 - or, if you don't wish to use the Xiao and have a different microcontroller in mind, you can purchase the parts individually. You will need a:

  • Microcontroller - We are going to be using the Raspberry Pi Pico 2 running MicroPython for this guide.
  • Grove Vision AI V2 Module - This is just the board itself.
  • Camera Module - Camera compatibility is limited, and Seeed Studio recommends using an OV5647-based camera module. For this guide, we will be using this inexpensive Arducam module that we know works, but we also managed to get the official Raspberry Pi camera modules to (somewhat) work as well - they just tint the whole image green, which doesn't greatly affect detection results.
  • Camera Cable - Chances are your camera will come with a suitable cable, but it's worth double-checking in case it doesn't.

Uploading Models to the Board

Before we begin, ensure that you have connected your camera to the board. This is fairly straightforward - just ensure you connect the cable the right way around, or you may be in for some head-scratching. The correct orientation is shown in the image on the right.

an image showing the correct orientation of the camera cable

Now comes another incredible feat of this board: how easy it is to get going. Of all the computer vision guides we have ever done, this is by far the simplest and quickest to set up. Ready?

  1. Connect your board via USB and open up the SenseCraft Workspace in your browser.
  2. Press the connect button and select your board from the list of COM Ports.
  3. Press Select model and choose the model you want. This will then upload it to the board.

That's it! That's how fantastically simple it is to use this board. The longest part of this whole process is waiting the minute or two for the model to be uploaded to the board.

a screenshot of the sensecraft studio showing the model flashing process.

If it didn't start automatically, you can hit the invoke button to see a live preview of the camera and its analysis with your model. Outside of this preview, it is difficult to visually check what your camera sees and detects, so this will be your best debugging tool going forward.

On this page, you should also see a device logger window outputting detection results. You should see two lines being printed like the following:

perf: {"preprocess":7, "inference":48, "postprocess":0}
boxes: [[245, 292, 480, 355, 94, 0]]

The perf line contains the processing times (55 ms total in this case), and the boxes line contains information about the detected object, including the bounding box coordinates (the first four numbers), and also the confidence score (the 5th number). We will look at using these values later on as these are what will be sent to our microcontroller.
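If you want to experiment with these values off-device first, both lines parse with ordinary Python JSON tooling. A minimal sketch, based on the log output above (everything after the first `": "` is valid JSON):

```python
import json

# Sample lines in the format the device logger prints
perf_line = 'perf: {"preprocess":7, "inference":48, "postprocess":0}'
boxes_line = "boxes: [[245, 292, 480, 355, 94, 0]]"

# Split off the "perf: " / "boxes: " label, then parse the rest as JSON
perf = json.loads(perf_line.split(": ", 1)[1])
boxes = json.loads(boxes_line.split(": ", 1)[1])

total_ms = sum(perf.values())
x, y, w, h, conf, obj_id = boxes[0]
print("Total processing:", total_ms, "ms")     # 55 ms
print("Centre:", (x, y), "confidence:", conf)
```

This same parsing idea carries straight over to the UART code later in the guide.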

But that's it! That's how simple it is to get going with a detection model, and there are quite a few to choose from. The models page is a little easier to use if you are looking for something specific. Just ensure that you set the filters to only look for models for the Grove - Vision AI V2.

a picture showing the preview camera feed from SenseCraft studio. The frame shows a YOLOv8 model detecting a human.

Just be wary that there are a lot of really unreliable and borderline useless models here, so it's worth trying a model out before you base your entire project around it. It looks like a lot of these were uploaded with little human verification, but community-contributed models and more common detection tasks like people and pets work well.

Seeed Studio does have some wonderful documentation on training your own model from scratch; just be aware that this is not very beginner-friendly and can be a bit involved. However, it is still entirely possible - we don't want to dissuade you from trying!

One more thing before we move on: through this studio you can easily update your device's firmware. This may be worthwhile doing, as after updating our year-out-of-date firmware, we saw a noticeable increase in performance. On this page, you can also connect your device to a Wi-Fi network and an MQTT broker to directly beam the output to Home Assistant or another MQTT project of yours.

a screenshot of the sensecraft studio demonstrating how to easily update the device's firmware, and also connect it to an MQTT broker.


Reading Detection Data via UART

We now have a board with a chosen model flashed onto it, and the preview shows it working as we would like. How do we actually use it in our projects? Well, we have a few options. If you are using C++, Seeed Studio has made a great library that talks to the board through I2C - definitely worth checking out if you are going down that route.

However, for this guide we will be using MicroPython and reading the output from the board's UART. We are using the Pico 2, but it should be a similar process for any other board running MicroPython.

First, we must connect the boards together. Connect a 3.3V pin and a ground pin from the Pico to the Vision board - with the Pico powering it, we no longer need to plug the Vision board into USB. Then we connect the Vision board's TX and RX pins to the Pico's - we are going to use pins 0 and 1 for our demo. The image on the right shows the complete setup.


An image demonstrating the correct wiring to connect the Grove Vision AI's UART to a Raspberry Pi Pico. Ground is connected to ground, the Pico's 3.3V out is connected to the Vision board's 3.3V power, and the Vision board's TX and RX are connected to pins 0 and 1 of the Pico.

To run our code, we are going to be using Thonny IDE. Create a new script and paste in the following demo code:

from machine import Pin, UART
import time

# Pico UART0 (GP0 = TX -> Grove RX, GP1 = RX -> Grove TX)
uart = UART(0, baudrate=921600, tx=Pin(0), rx=Pin(1))

# Request one inference (results only, no image)
uart.write("AT+INVOKE=1,0,1\r")

# Read for up to 1 second to catch both the confirmation and result messages
start = time.ticks_ms()
buffer = b""
while time.ticks_diff(time.ticks_ms(), start) < 1000:
    if uart.any():
        buffer += uart.read()
    else:
        time.sleep_ms(20)

print(buffer)

This code is designed to demonstrate in simple terms how to interact with this board and write code around it. Starting off, it imports all the required libraries and sets up UART communication at a staggeringly high baud rate of 921,600:

from machine import Pin, UART
import time

# Pico UART0 (GP0 = TX -> Grove RX, GP1 = RX -> Grove TX)
uart = UART(0, baudrate=921600, tx=Pin(0), rx=Pin(1))

Then it sends the "AT+INVOKE" command - the essential line in all of this. This command asks the board to take a photo, analyse it, and send the results back over UART. There are three configurable inputs to this command, set through the three numbers after the equals sign:

  • First number: How many frames the board should process. We leave it at 1 as we only want to handle one frame at a time.
  • Second number: Whether the board should only respond when the results have changed - if you detected nothing last frame and nothing again this frame, it won't send anything back, only when something changes. We leave this at 0 (false), as we want the detection results of every frame.
  • Third number: Whether the board should skip sending the image over UART. If set to 0, it will send a huge string of text - the 480x480 pixel image, base64 encoded. We leave this as 1 as we don't need that data, but it's a fun thing to play around with if you want to try reconstructing the images.

uart.write("AT+INVOKE=1,0,1\r")
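If you find yourself tweaking these options, a tiny helper of our own making (not part of any official library - the parameter names are simply our labels for the three numbers described above) keeps the command readable:

```python
# Hypothetical helper to build the invoke command; argument names are ours.
def build_invoke(n_frames=1, changed_only=0, suppress_image=1):
    return "AT+INVOKE={},{},{}\r".format(n_frames, changed_only, suppress_image)

cmd = build_invoke()          # defaults match the command used in this guide
print(repr(cmd))              # 'AT+INVOKE=1,0,1\r'
# uart.write(cmd)             # on the Pico, send it exactly as before
```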

The rest of the code deals with receiving the data from the board:

# Read for up to 1 second to catch both the confirmation and result messages
start = time.ticks_ms()
buffer = b""
while time.ticks_diff(time.ticks_ms(), start) < 1000:
    if uart.any():
        buffer += uart.read()
    else:
        time.sleep_ms(20)

print(buffer)

This isn't your standard UART read, as the reply comes through as two JSON-formatted messages. When you send the invoke command, the board immediately replies with a confirmation message. This message starts with a "type": 0 label and contains information about the model and all the settings it will use. We will largely be ignoring this:

b'\r{"type": 0, "name": "INVOKE", "code": 0, "data": {"model": {"id": 1, "type": 0, "address": 4194304, "size": 0}, "algorithm": {"type": 3, "categroy": 1, "input_from": 1, "config": {"tscore": 39, "tiou": 45}}, "sensor": {"id": 1, "type": 1, "state": 1, "opt_id": 1, "opt_detail": "480x480 \r

Once the board has finished all its processing, it will send the second "Type 1" message containing all the detection data:

{"type": 1, "name": "INVOKE", "code": 0, "data": {"count": 1, "perf": [7, 48, 0], "boxes": [[245, 235, 449, 392, 92, 0]], "resolution": [480, 480]}}\n'

In this demo code, these two messages come through merged into one. The second message is the one we are interested in - mainly the "perf" and "boxes" fields, the same values we saw output in SenseCraft Studio. In reality, though, it is just the boxes data we are after. The breakdown of the six values in each box is as follows:

  • First and Second values: These are the x and y coordinates of the box being drawn. This is the centre point of the detection box in pixels (on a 480x480 pixel image).
  • Third and Fourth values: These are the width and height of the box being drawn. They are less important, but could be used to primitively gauge how far away an object is.
  • Fifth value: This is the confidence score of the detection. The higher the number, the more certain the detection is.
  • Sixth value: This is the ID of the detected object class. It will almost always be 0, as most models can only detect one type of object. In the rare model that can detect multiple object classes, this number differentiates between them - the second class it can detect will have an ID of 1, and so on.
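A small convenience sketch of our own, following the field ordering just described, that gives those six raw values friendlier names:

```python
# Unpack one raw detection into labelled fields (ordering per the list above)
def unpack_box(box):
    x, y, w, h, conf, obj_id = box
    return {"x": x, "y": y, "w": w, "h": h, "conf": conf, "id": obj_id}

det = unpack_box([245, 235, 449, 392, 92, 0])
print(det["x"], det["y"], det["conf"])   # 245 235 92
```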

If multiple objects are detected, the boxes variable will hold multiple entries, like in the following line:

{"type": 1, "name": "INVOKE", "code": 0, "data": {"count": 1, "perf": [7, 48, 0], "boxes": [[57, 209, 115, 156, 43, 0], [62, 235, 94, 230, 77, 0], [141, 245, 83, 230, 87, 0], [214, 256, 62, 172, 81, 0], [266, 256, 62, 172, 81, 0], [319, 256, 62, 172, 83, 0]], "resolution": [480, 480]}}
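When several boxes come back at once, you will usually want to thin them out. One approach, sketched with the multi-detection output above (the 60-point threshold is our arbitrary choice):

```python
# Keep only confident detections and sort them best-first
boxes = [[57, 209, 115, 156, 43, 0], [62, 235, 94, 230, 77, 0],
         [141, 245, 83, 230, 87, 0], [214, 256, 62, 172, 81, 0],
         [266, 256, 62, 172, 81, 0], [319, 256, 62, 172, 83, 0]]

MIN_CONF = 60  # arbitrary cut-off; tune for your model
good = sorted((b for b in boxes if b[4] >= MIN_CONF),
              key=lambda b: b[4], reverse=True)

print(len(good), "of", len(boxes), "detections kept")   # 5 of 6
print("Best:", good[0])                                  # the box with conf 87
```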

It is really nice that you can directly control all the frame analysis and timing from the code itself. If you place the code in a loop, you will effectively keep calling the invoke command and run the system at its maximum FPS. If you want a lower frame rate, just invoke the system as infrequently as you like.

If you were thinking that this is a smart way to save power, you may be disappointed, as there is minimal difference in power consumption between processing and standing by. The image on the right is an Otii Arc analysis of the Vision board periodically analysing frames and standing by.

an image of an Otii Arc analysis of the Grove board showing that the difference in power consumption between processing and standby is minimal

That demo code was handy for exploring how this system works, but the time gap between those two messages can make it a little awkward to build code around. That's why we have written a second piece of demo code that handles the messages more robustly and is ready to apply in your projects:

from machine import Pin, UART
import time, ujson

uart = UART(0, baudrate=921600, tx=Pin(0), rx=Pin(1))

def invoke_once(timeout_ms=200):
    # flush leftover data
    while uart.any():
        uart.read()
    # request inference
    uart.write("AT+INVOKE=1,0,1\r")
    # start timing and set buffer
    start = time.ticks_ms()
    buf = b""
    depth = 0
    # read message
    while time.ticks_diff(time.ticks_ms(), start) < timeout_ms:
        if uart.any():
            data = uart.read()
            for ch in data:
                buf += bytes([ch])
                if ch == ord("{"):
                    depth += 1
                elif ch == ord("}"):
                    depth -= 1
                    if depth == 0:
                        raw = buf[buf.find(b"{"):]
                        try:
                            js = ujson.loads(raw)
                            buf = b""  # reset
                            if js.get("type") == 1:
                                # Always return list of boxes (may be empty)
                                return js["data"].get("boxes", [])
                            # if type==0, just ignore and wait
                        except Exception:
                            # corrupted message - reset the buffer and keep reading
                            buf = b""
        else:
            time.sleep_ms(2)

    return None  # timeout


while True:
    boxes = invoke_once(timeout_ms=200)
    if boxes is None:
        print("Timeout: no result")
    elif len(boxes) == 0:
        print("No objects detected.")
    else:
        number_of_detections = len(boxes)
        print("Detected", number_of_detections, "objects.")
        
        print("x:", boxes[0][0], "y:", boxes[0][1], "conf:", boxes[0][4])


"""
Contained in boxes is all the relevant detection data.
You can access a specific detection like so:
1st detected object boxes[0]
2nd detected object boxes[1]
3rd detected object boxes[2]... and so on.

The detection data contains the x and y coordinates of the centre of the object,
the width and height of the box detected, the confidence score, and the ID of the
detected object (will be 0 nearly always).
Detection information is stored in the boxes like so:
x [0], y [1], w [2], h [3], conf [4], ID [5]

If you wished to get the confidence of the 1st object detected, you would use:
boxes[0][4]

To get the x, y location of the 3rd detected object, we would use:
boxes[2][0], boxes[2][1]

If you try to get the x, y location of boxes[2] and it doesn't exist (not detected),
you will get an error. The following code only accesses it if it exists:

if len(boxes) > 2:
    print(boxes[2][0], boxes[2][1])

"""

This code is pretty similar to the previous one, but it packs everything inside a function and handles the return message better. Most of the function is just juggling the JSON message, looking for the braces that mark where a message starts and ends. It also parses the message and strips out everything unneeded until just the boxes data is left. Additionally, it has a built-in timeout - if a message is sent incorrectly or errors out, you don't want the function waiting forever for the rest of it.

With everything handled in the function, you are left to simply call it like so:

while True:
    boxes = invoke_once(timeout_ms=200)

With this, all of the boxes data will be stored in the variable "boxes". Before you try to use this information, it is worth checking that it exists first. If there is an issue reading the message, or there was a fault outputting the data, boxes will be None:

    if boxes is None:
        print("Timeout: no result")

If there is nothing detected, then boxes won't contain any entries, and the length of it will be 0:

    elif len(boxes) == 0:
        print("No objects detected.")

If neither of these is the case, we can start using the detection results! We can count the number of detections in the frame by checking the length of boxes, which is how many entries it contains:


    else:
        number_of_detections = len(boxes)
        print("Detected", number_of_detections, "objects.")

At the bottom of this code is a comment block demonstrating how to extract the coordinate information or confidence value from a detection:

"""
Contained in boxes is all the relevant detection data.
You can access a specific detection like so:
1st detected object boxes[0]
2nd detected object boxes[1]
3rd detected object boxes[2]... and so on.

The detection data contains the x and y coordinates of the centre of the object,
the width and height of the box detected, the confidence score, and the ID of the
detected object (will be 0 nearly always).
Detection information is stored in the boxes like so:
x [0], y [1], w [2], h [3], conf [4], ID [5]

If you wished to get the confidence of the 1st object detected, you would use:
boxes[0][4]

To get the x, y location of the 3rd detected object, we would use:
boxes[2][0], boxes[2][1]

If you try to get the x, y location of boxes[2] and it doesn't exist (not detected),
you will get an error. The following code only accesses it if it exists:

if len(boxes) > 2:
    print(boxes[2][0], boxes[2][1])

"""
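As a final sketch of our own, here is one way to turn a detection into an action - say, steering a robot or panning a servo toward the object. The dead-band value is an arbitrary choice of ours; box[0] is the detection's centre x described above:

```python
FRAME_W = 480    # detection coordinates are on a 480x480 frame
DEADBAND = 40    # pixels either side of centre treated as "close enough"

def steer(box):
    # Compare the detection's centre x to the middle of the frame
    offset = box[0] - FRAME_W // 2
    if offset < -DEADBAND:
        return "left"
    if offset > DEADBAND:
        return "right"
    return "centre"

print(steer([245, 235, 449, 392, 92, 0]))   # centre - 245 is near 240
```

Swap the returned strings for motor or servo commands and you have the skeleton of an object-tracking project.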

And with that, you now have a jumping-off point to apply the Grove Vision AI V2 in your own projects!


Where to From Here?

We now have a device that can load and run an object detection model and output its detection data over UART. With a power draw of only 0.35 watts, the most obvious home for it is alongside microcontrollers, but since it's plain UART, you could realistically whack this on anything, even a fully-fledged computer.

With a power draw that low, another obvious use case is battery-powered projects. However, since there isn't much difference in power draw between idle and processing, you might want to check out our guide on using a power timer in your project. This will allow your microcontroller to power both the Vision board and itself off.

There is also a wealth of other AT commands you can send to the board that might pique your interest. You can check them out on Seeed Studio's GitHub. And as always, if you are looking to upgrade your microcontroller skills, we have an entire course for the Pico.

If you make anything cool with this and want to share it, or you just need a hand with anything we covered in this guide, feel free to post about it in the forum below. Until next time though, happy making!
