If you’re on this blog, you’ve no doubt learned the importance of remote sensing methods like aerial surveying for monitoring whale populations. By capturing thousands of detailed images covering whale habitat, we can learn about their numbers and their proximity to land, and gain insights into the health of their populations. This is vitally important information for many folks in industries like travel and tourism, port services, fisheries and shipping.
Given the immense volume of imagery involved, analysing it to extract the relevant information on whale presence is a time-consuming, and therefore expensive, effort. Part of this work involves hand-selecting the sections of each image that are land, ice, cloud cover or glare, so that we can accurately estimate the area in which whales could have been found: the key to deriving statistics such as density per unit area. Highlighting regions or objects in an image, like chunks of ice or stretches of rocky terrain, may not sound particularly difficult for a human being, but multiply that by thousands of images, each over 5000 x 7000 pixels, and it becomes a major time sink. Not to mention factors like reader fatigue and experience, which can compromise the quality and repeatability of the results. If these estimates are imprecise, the people relying on this kind of data are likely to be misinformed and may make disastrous decisions, both for their businesses and for the whales.
Photo credit: Fisheries and Oceans Canada
The features of interest here are not too difficult for humans to distinguish, yet nuanced enough that traditional computer vision algorithms, such as threshold-based clustering, will fail. That makes this problem a prime candidate for deep learning. When it comes to segmenting (aka highlighting) regions of interest in images, especially when labels from previously annotated datasets exist, the U-Net architecture is ubiquitous.
The U-Net was developed in the context of medical image interpretation. Surprisingly, it went largely unnoticed by the computer vision community for a couple of years, until it started winning several important data science competitions and its potential and novelty became obvious (to date, the original article has nearly 25000 citations). Across many medical fields there are endless use cases: segmenting organs to calculate volumes or determine surgical boundaries, segmenting nodules to perform similar data derivations, and segmenting lesions between scans to determine the rate of progression.
Additionally, traditional computer vision methods would often fail at these tasks due to the heterogeneity and variability of the input data and the complexity of the features that constitute particular regions of interest. In the push to automate these processes and bring the value of deep learning to clinicians, the U-Net was born.
The U-Net gets its name from its shape in architecture diagrams: a contracting path of convolutional blocks is followed by an expanding path of deconvolutional blocks, with skip connections feeding the outputs of the early convolutional blocks into the corresponding later blocks, tracing out a U.
Source: Dept of Computer Science, University of Freiburg
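For readers who like to see the shape in code, here is a minimal sketch of a U-Net-style network in PyTorch. It is deliberately simplified and hypothetical (two levels instead of the original's four, padded convolutions, made-up channel counts); it is meant only to show the contracting path, the expanding path and the skip connections that give the U its shape:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU: the basic unit on both sides of the U
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """A two-level U-Net sketch: encoder, bottleneck, decoder, skip connections."""
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)            # contracting path
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)        # bottom of the U
        self.up2 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec2 = conv_block(128, 64)              # 128 = 64 upsampled + 64 skipped
        self.up1 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)               # 64 = 32 upsampled + 32 skipped
        self.head = nn.Conv2d(32, n_classes, kernel_size=1)  # per-pixel class scores

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                          # (batch, n_classes, H, W)
```

Feeding a 64 x 64 RGB image through `TinyUNet()` returns a tensor of shape `(1, n_classes, 64, 64)`: one score per class for every pixel, which is exactly what pixel-wise segmentation needs.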
U-Nets are trained on images using masks of the regions of interest as learning targets. What this means is that, for each pixel in the original image, instead of the RGB value, the mask contains that pixel's class label, e.g. 0 for no lung (outside the region of interest), 1 for lung (inside the region of interest). U-Nets are thus trained to spit out digitized, simplified versions of the input images, where every pixel is labelled with the class the model thinks it belongs to, e.g. land, ice, water etc. What also makes the U-Net extremely useful for many real-world problems is that it can be trained on multi-class problems. In other words, where a doctor has annotated the lungs, a nodule, some lung opacity, the heart etc. across a dataset, the U-Net model can learn to segment all of these simultaneously.
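To make that concrete, here is a sketch of what a multi-class training target and loss look like in PyTorch, reusing the `TinyUNet` sketch above. The class names, image size and values are invented for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical class indices for a multi-class aerial-survey mask
CLASSES = {"water": 0, "land": 1, "ice": 2, "glare": 3}

image = torch.randn(1, 3, 64, 64)                    # RGB input: (batch, channels, H, W)
mask = torch.randint(0, len(CLASSES), (1, 64, 64))   # one class label per pixel

model = TinyUNet(in_ch=3, n_classes=len(CLASSES))
logits = model(image)                                # (1, 4, 64, 64) per-pixel scores

# Standard per-pixel cross-entropy: the model learns to reproduce the mask
loss = nn.CrossEntropyLoss()(logits, mask)
loss.backward()

# At inference time, the "digitized" output image is the argmax over classes
predicted_mask = logits.argmax(dim=1)                # (1, 64, 64), values in 0..3
```

Note that adding a new class is just a matter of growing the label set and the final layer; the rest of the network is unchanged.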
That is also why it is such a powerful tool for reducing the burden on aerial survey readers by automating some of their manual work. A model of this type can be trained on one task, such as land detection and segmentation, and then very easily extended to predict other useful regions with the addition of new labelled data. This is the approach used at Whale Seeker. Since we possess a large set of annotations lovingly made by our expert biologists, we can easily convert them to machine-learning-friendly formats and then train a U-Net to segment important regions within aerial images. Initially, this experiment was performed on land detection, but given the flexibility of the model, extending it to other features in the landscape is a simple process.
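As an illustration of that conversion step (not our actual pipeline), annotations stored as labelled polygons can be rasterized into a mask image in a few lines with Pillow. The annotation format, class mapping and file name below are hypothetical:

```python
import numpy as np
from PIL import Image, ImageDraw

# Hypothetical annotations: each region of interest is a polygon with a class label
annotations = [
    {"label": "land", "polygon": [(120, 80), (400, 95), (380, 300), (100, 260)]},
    {"label": "ice",  "polygon": [(500, 40), (620, 60), (600, 180), (480, 150)]},
]
CLASS_IDS = {"water": 0, "land": 1, "ice": 2}  # 0 doubles as the background class

width, height = 700, 500  # size of the corresponding aerial image

# Start from an all-background mask and burn each polygon in as its class index
mask = Image.new("L", (width, height), color=CLASS_IDS["water"])
draw = ImageDraw.Draw(mask)
for ann in annotations:
    draw.polygon(ann["polygon"], fill=CLASS_IDS[ann["label"]])

label_array = np.asarray(mask)   # (height, width) integer array: a training target
mask.save("mask_0001.png")       # lossless PNG keeps the class indices intact
```

Because the mask is just an image of class indices, new annotation layers from the biologists slot straight into the same format.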
This result highlights the potential of the U-Net, despite its humble beginnings in medical AI, to tackle a broad range of applications within computer vision. The cross-over between domains such as medical image interpretation and whale conservation may seem far-fetched, but to the multidisciplinary team at Whale Seeker, these connections are clear.