Computer vision has captured the interest of entrepreneurs all over the world, and for good reason: the capabilities of modern AI tech turn previously impossible ideas into real products.
Detecting and classifying objects in photos and videos has found applications in many different areas and hundreds of systems, from security cameras with built-in facial recognition, to disease detection based on X-ray scans, to simple mobile apps.
But enough with the intro. In this article, I want to share the ‘behind the scenes’ of how a computer vision product is developed, especially a complex one.
I had the opportunity to work on Birdsy — a bird recognition app for people who want to see who visits their backyards while they are not looking.
Birdsy is a complex, AI-powered app with real-time object detection and classification that needs to run on weak hardware and identify bird species and sex with high accuracy.
Given all that, the road from the initial idea to publishing the app in app stores was complicated — and captivating at the same time.
We faced many hurdles, on both the business and development sides, which I’ve decided to share in one place to hopefully help entrepreneurs and AI developers who might be working on a similar project.
Birds have had millions of years of evolution to blend in perfectly with their environment to avoid predators and, in this case, bird watchers, making it harder for the latter to admire wildlife.
While observing certain bird species face to… beak can be problematic, monitoring them through a video camera from the comfort of your home is a lovely way to enjoy our winged friends, especially if AI removes the need to look through hours of video footage and instead sends you an alert when a bird enters the camera’s view, automatically detecting which species it is.
There are two parts to Birdsy: an object detection model that runs directly on the camera and spots when a bird or mammal enters the frame, and a classification model that identifies the species and sex.
To make the service approachable and easy to use, any camera can be used for bird watching. This is where we ran into the first problem: low-quality cameras, which are the most affordable and therefore the most widespread.
While the ‘any camera’ policy is great for users, it presented a challenge for us, as the object detection model runs on the camera’s chipset.
Where someone gets a good deal, others get the short end of the stick, in this case ‘others’ being our CV developers. Working with a cheap camera means working with a cheap chipset, which makes it impossible to use the default neural network architecture.
Compared to top-of-the-line, gold-standard hardware for computer vision cameras (the NVIDIA Jetson Nano), which can run around 120 layers of the default YOLO v4, the cameras we had to work with allowed for only 22 layers.
While a full YOLO v4 network provides great recognition results, a stripped-down version performs poorly. We tested both and were unpleasantly surprised by how shallow the model had to be to run on a cheap chipset.
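To make that constraint concrete, here is a minimal sketch, assuming Darknet-style .cfg files (the format YOLO v4 is usually distributed in), that counts the layer sections in a model config. The file paths are placeholders, but comparing a full config against a trimmed variant is how you end up with numbers like 120 versus 22.

```python
# Count layer sections in a Darknet-style .cfg file (illustrative sketch).
# Each layer starts with a [section] header; [net] is the global header,
# not a layer, so it is excluded from the count.

def count_layers(cfg_path: str) -> int:
    layers = 0
    with open(cfg_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("[") and line.endswith("]") and line != "[net]":
                layers += 1
    return layers

if __name__ == "__main__":
    # Placeholder paths: the full model vs. the stripped-down variant
    # that has to fit into the camera chipset's layer budget.
    print("full model layers:   ", count_layers("cfg/yolov4.cfg"))
    print("trimmed model layers:", count_layers("cfg/yolov4-trimmed.cfg"))
```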
We started by training the default YOLO v4 model and testing it on the customer’s dataset. The results were satisfactory: 95% mAP, which in the world of computer vision is more than enough to launch a model into production.
After retraining the model to fit the cameras’ constraints, the detection quality dropped significantly. But where machines fail, humans advance.
We tested the neural network on test data and visually evaluated the false positives and false negatives. This highlighted where the network lacked knowledge and where it made the most mistakes.
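As an illustration of that review step, here is a rough sketch, not the actual Birdsy pipeline, that matches predicted boxes to ground-truth boxes by IoU and copies every image containing a false positive or a false negative into a separate folder for visual inspection. The folder names and the (image, predictions, ground truth) input format are assumptions.

```python
import shutil
from pathlib import Path

IOU_THRESHOLD = 0.5

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sort_for_review(samples, fp_dir="review/false_positives", fn_dir="review/false_negatives"):
    """samples: iterable of (image_path, predicted_boxes, ground_truth_boxes)."""
    Path(fp_dir).mkdir(parents=True, exist_ok=True)
    Path(fn_dir).mkdir(parents=True, exist_ok=True)
    for image_path, predictions, ground_truth in samples:
        matched_gt = set()
        has_fp = False
        for pred in predictions:
            best = max(
                ((iou(pred, gt), i) for i, gt in enumerate(ground_truth)),
                default=(0.0, None),
            )
            if best[0] >= IOU_THRESHOLD and best[1] not in matched_gt:
                matched_gt.add(best[1])
            else:
                has_fp = True  # detection with no matching ground-truth box
        has_fn = len(matched_gt) < len(ground_truth)  # missed ground-truth boxes
        if has_fp:
            shutil.copy(image_path, fp_dir)
        if has_fn:
            shutil.copy(image_path, fn_dir)
```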
The network was eager to detect people, especially people’s hands, as animals (and we don’t blame it, humans ARE animals after all). While this is correct from a biological point of view, the end user is more interested in looking at birds than at their neighbors, so we had to teach the network to ignore people and focus on birds and mammals instead.
To do this, we added negative examples, including pictures of people at various angles as well as human hands.
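In Darknet-style YOLO training, a common way to add negative examples is to include the images in the training list with empty annotation files, so the network learns that nothing in them should be detected. A minimal sketch, with hypothetical folder and file names:

```python
from pathlib import Path

# Hypothetical folder of negative examples: people, hands, empty feeders, etc.
NEGATIVES_DIR = Path("data/negatives")
TRAIN_LIST = Path("data/train.txt")

with TRAIN_LIST.open("a") as train_list:
    for image_path in sorted(NEGATIVES_DIR.glob("*.jpg")):
        # An empty .txt label file tells a Darknet-style trainer that the
        # image contains no objects of interest at all.
        image_path.with_suffix(".txt").write_text("")
        train_list.write(f"{image_path}\n")
```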
The cameras have two modes: a regular daytime mode that produces full-color images, and a nighttime infrared mode that produces black-and-white images. When the camera switched to infrared, the model produced a lot of false positives.
Users would be less than happy to be woken up by a notification, get excited to see an owl or a fox, and end up looking at a recording of a moth banging its body against the camera lens.
To keep sleep interruptions to a minimum, we collected instances of nighttime false positives and labeled them by hand.
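One simple way to pull those nighttime cases out for labeling: infrared frames come out essentially grayscale, so their color saturation is close to zero. A rough sketch using OpenCV follows; the threshold value is an illustrative guess, not the one we actually used.

```python
import cv2
import numpy as np

def looks_infrared(image_path: str, saturation_threshold: float = 10.0) -> bool:
    """Heuristic: IR (night-mode) frames are near-grayscale, i.e. low saturation."""
    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"could not read {image_path}")
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    mean_saturation = float(np.mean(hsv[:, :, 1]))
    return mean_saturation < saturation_threshold

# Frames flagged as infrared can then be routed into a separate pool of
# nighttime hard negatives for manual labeling.
```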
Ever heard of social media being called a ‘highlight reel’ where people present the best version of themselves? Who knew the same could be true for wild animals?
Photos of birds available from open sources like Google Images and YouTube videos are usually high quality and very sharp, and they depict specimens at their best: looking at the camera or at least facing it, with nothing between the bird and the camera obstructing the view.
Reality is not always as pretty. The cameras produce low-quality images that can make it hard to understand what’s going on even for the human eye; bad weather like rain, snow, or dust can obstruct the view; and we are sure birds can sense when someone wants to capture them and position themselves in the most ridiculous way possible.
The dataset the client provided (consisting of sharp images found on the Internet) was not of much use for this project.
We needed to collect a set of images of birds in real conditions using the client’s cameras to show the model what birds really look like, and not how they are presented on social media.
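In practice that mostly means pulling frames straight out of the cameras’ recordings. Below is a minimal frame-sampling sketch with OpenCV; the one-frame-every-few-seconds rate is an arbitrary choice for illustration.

```python
import cv2
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, every_seconds: float = 5.0) -> int:
    """Save one frame every `every_seconds` from a camera recording."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS is unknown
    step = max(1, int(fps * every_seconds))
    saved = frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{frame_index:07d}.jpg", frame)
            saved += 1
        frame_index += 1
    capture.release()
    return saved
```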
So, after doing all of the above (adding negative examples, hand-labeling nighttime false positives, and collecting images of birds in real conditions), we managed to achieve 97.5% mAP for object detection. This is a very high result for a computer vision model, as the unwritten rule for any CV model going into production is to have over 94% mAP.
While the results we have achieved now are more than enough to be used in the final product, there’s still room for improvement.
After enough images for each group are collected, we expect the mAP to increase to 98.5%.
The next step in getting to know your backyard visitors is to pass the image with the bird to an object classification model. Its goal is to recognize the bird’s species and sex.
As some bird species live exclusively on certain continents, we decided to create two models: one for species that live in North America and one for those in Europe.
Initially, the problem of object classification was solved using a ‘head-on’ approach: the network was shown photos of all the different species, both males and females, from which it tried to learn what they look like and how they differ from each other.
This resulted in very poor accuracy scores; in other words, the network made a whole lot of mistakes when identifying bird and mammal species.
The network was trying to learn too many aspects at the same time. Many bird species look very similar to each other and differ from one another by a single patch of differently colored feathers or a differently shaped beak.
Retaining all of this information, along with what the different sexes of the same species look like, is too difficult under the given circumstances. The network would often mix up bird species while determining the broader bird type correctly.
For example, the difference between a hooded warbler and a Kentucky warbler is a patch of black feathers.
The network would label a hooded warbler as a Kentucky warbler, producing a wrong result but being generally correct: both ARE warblers. For the sake of time, the client decided it was more important to detect the overall bird type than the particular species, so that’s where we started.
After evaluating the model, we decided to implement a multistage approach: first determine the broad bird type, and only then the particular species and sex within that type.
By grouping the bird species, we managed to decrease the number of classes from 98 to 49, which greatly improved the accuracy score, as the network simply didn’t have as many classes to choose from.
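Here is a stripped-down sketch of that multistage idea; the `type_model` and `species_models` objects and the species-to-group mapping are placeholders, not the actual Birdsy models. The first stage picks the broad bird type, and the second stage only has to separate the species within that type.

```python
# Illustrative two-stage classification: broad bird type first, then species
# within that type. The model objects stand in for trained classifiers.

SPECIES_BY_TYPE = {
    "warbler": ["hooded warbler", "kentucky warbler"],  # example grouping
    "sparrow": ["house sparrow", "song sparrow"],
}

def classify(image, type_model, species_models):
    # Stage 1: choose among ~49 broad groups instead of ~98 species.
    bird_type, type_confidence = type_model.predict(image)
    # Stage 2: a per-group model only separates a handful of similar species.
    species, species_confidence = species_models[bird_type].predict(image)
    return {
        "type": bird_type,
        "species": species,
        "confidence": type_confidence * species_confidence,
    }
```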
When you come across a new concept, you read books or watch educational videos to grasp it. If you fail, you ask your friend to explain it to you or you visit a seminar on the topic. In other words, you try to accumulate more information about it to understand it better.
The same goes for neural networks. The better you need one to understand what a warbler looks like, the more images of warblers you need to show it. The more data it has seen, the better its accuracy scores will be.
The multistage approach we’ve chosen not only improved the accuracy of the object classification model but also made it possible to analyze the dataset and determine where the network lacked learning data.
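For the dataset-analysis part, even a simple per-class error count over validation predictions shows which classes the network keeps getting wrong and therefore which groups need more images. A small sketch; the prediction pairs would come from a validation run and are only a placeholder here.

```python
from collections import Counter

def per_class_error_rates(pairs):
    """pairs: iterable of (true_label, predicted_label) from a validation run."""
    totals, errors = Counter(), Counter()
    for true_label, predicted_label in pairs:
        totals[true_label] += 1
        if predicted_label != true_label:
            errors[true_label] += 1
    rates = {label: errors[label] / totals[label] for label in totals}
    # Classes at the top of this list are where the network most lacks data.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Example usage with placeholder labels:
# print(per_class_error_rates([("warbler", "warbler"), ("warbler", "sparrow")]))
```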
After the object classification model was launched, we were surprised to receive results that were much worse than what our tests had shown: the model could not determine bird species or type correctly.
The problem ran deeper than that: our computer vision developer in charge of the whole project, who had learned to identify all of the bird species himself while working on it, also failed to determine what the birds were when he reviewed the images the network had labeled incorrectly.
It turned out that July is not the best time to launch a bird classification model, as it is the time when juvenile birds learn to fly and leave their nests.
Remember the ugly duckling story? Well, it’s true for most birds: fledglings look nothing like adult birds, and it is hard to know what bird you are looking at while it’s still young.
We have collected images of young birds over the summer and plan on training the classification network to determine the different bird species at different ages.
Birdwatchers are a passionate bunch; they can identify a bird by the shape of a single feather. They possess knowledge our classification network dreams of having, so why not bring the two together and form a bird-loving alliance the world has never seen before?
Currently, the classification network doesn’t just tell the user the bird species; it shows its degree of confidence along with other guesses.
The user can confirm the network’s guess or correct it, thus helping us train it, one bird at a time. After running the user feedback system for three months, we have collected over 20 thousand images. This data is invaluable to us, since the photos are taken in real-life conditions (bad weather, at night, etc.) and are labeled by experts.
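A rough sketch of what that feedback loop can look like in code; the confidence handling and the CSV layout are illustrative, not the actual Birdsy backend. The app shows the top guesses with their confidence, and the user’s confirmation or correction is stored for the next training round.

```python
import csv
from pathlib import Path

FEEDBACK_FILE = Path("feedback.csv")

def top_guesses(class_probabilities, k=3):
    """class_probabilities: dict of species -> confidence from the classifier."""
    ranked = sorted(class_probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

def record_feedback(image_id, predicted_species, user_species):
    """Append the expert user's confirmation or correction for later retraining."""
    new_file = not FEEDBACK_FILE.exists()
    with FEEDBACK_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["image_id", "predicted", "confirmed_label"])
        writer.writerow([image_id, predicted_species, user_species])
```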
It is worth noting that during this project, we have become the bird experts ourselves. Looking at birds all day long, while essentially educating a virtual child on all the little differences between different types of sparrows, makes one an instant platinum member of the birdwatching community.
If all else fails, our CV team members can easily find themselves in ornithology.
On a serious note: by looking through thousands of images of birds, be it for dataset markup or for analyzing where the network makes the most mistakes, we delved deep into this project and came out the other end not only with a bunch of bird knowledge but with a better understanding of how complex image recognition and classification systems work, how best to implement them, and how to analyze a large dataset and find its weak points.
This project was invaluable for us as an opportunity to research and work with the latest computer vision technologies, work with real-time customer feedback, and polish our problem-solving skills when working with outdated code.