Recently, we were approached about monitoring seals from aerial images using machine learning. As the name Whale Seeker suggests, our current expertise lies with whales, not seals. However, the premise is similar: training a model to detect relatively small targets in large marine aerial images. We already had the pipelines built; all we needed was a dataset to start building a preliminary model.
Luckily for us, NOAA has recently been pursuing an AI-centric mission and has started making a lot of data publicly available, so that the AI community can pick it up and experiment with it. Among these datasets is the fantastic NOAA Arctic Seals 2019 dataset (more information here), containing 40,000 images with over 14,000 annotated seals. The automated annotation process was carried out in collaboration with the Microsoft AI for Earth program and the University of Washington. While 14,000 is not huge by machine learning standards, it is by wildlife biology standards. This was more than enough to at least start experimenting.
Here you can see why we would want an AI model to process this instead of a human! The seal is the tiny brownish blob on the right, at mid-height. Full resolution image here. Still don’t see the seal? Hover over the image above for a hint!
We applied our model training pipeline to this dataset, holding out a test set to validate our results, and measured the model’s performance. However, something was off: while our model was finding almost every seal in the test set (high recall), it made many false positive predictions (low precision). It’s not unheard of for models to be trigger-happy, but we investigated further to understand what kind of errors it was making.
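For readers less familiar with these metrics, here is a minimal sketch of how precision and recall are computed once each predicted seal has been matched (or not) to an annotation. The matching step itself and the numbers below are purely illustrative assumptions, not our actual results.

```python
def precision_recall(num_true_positives, num_predictions, num_annotated):
    """Precision: fraction of predictions that match an annotation.
    Recall: fraction of annotated seals that the model found."""
    precision = num_true_positives / num_predictions if num_predictions else 0.0
    recall = num_true_positives / num_annotated if num_annotated else 0.0
    return precision, recall

# Hypothetical numbers for illustration only (not our actual results):
# the model finds 95 of 100 annotated seals (high recall),
# but makes 160 predictions overall (low precision).
p, r = precision_recall(num_true_positives=95, num_predictions=160, num_annotated=100)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.59, recall=0.95
```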
It turned out that many of the false positives were real detections! Our model was actually getting good results, but scored low because some of the original annotations were incorrect. So we re-ran the model on the whole dataset and asked our biologist to review all of the “mistakes”. Through this re-annotation process, over 1,500 additional seals were found. While it was surprising to find this many, we were not expecting the annotations to be perfect. We have a lot of experience annotating large datasets ourselves and know how hard a process it is (just look back at the sample image above!). And it’s not just complex datasets: Northcutt et al. (2021) estimate that widely used public datasets contain label errors at rates of at least 3.3%, and often higher. There’s actually a whole website dedicated to displaying errors in large public datasets.
“Our model was actually getting good results, but scored low because some of the original annotations were incorrect.”
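To give a concrete (and simplified) idea of what that review loop can look like, here is a sketch that collects every prediction that doesn’t match an existing annotation and writes it to a CSV for a human to check. The centre-distance matching rule, the 15-pixel threshold, and the field names are illustrative assumptions, not our actual pipeline.

```python
import csv
import math

def matches_any(pred, annotations, max_dist=15.0):
    """A prediction counts as 'already annotated' if its centre lies
    within max_dist pixels of an existing annotation (illustrative rule)."""
    px, py = pred["x"], pred["y"]
    return any(math.hypot(px - a["x"], py - a["y"]) <= max_dist for a in annotations)

def export_candidates_for_review(predictions, annotations, out_path="to_review.csv"):
    """Write every 'false positive' to a CSV so a biologist can confirm
    whether it is a model error or a seal the original labels missed.
    Predictions and annotations are assumed to be dicts with keys
    image, x, y, score (an assumption for this sketch)."""
    candidates = [p for p in predictions if not matches_any(p, annotations)]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["image", "x", "y", "score"])
        writer.writeheader()
        writer.writerows(candidates)
    return len(candidates)
```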
So, with the corrected annotations in hand, we re-evaluated our model’s performance and, as expected, measured a much higher precision. Many of our initial “mistakes” now counted as points instead of penalties. To further increase performance, we retrained our model on the corrected dataset, but we didn’t see a significant improvement. What gives, then? If a model trained on noisy data is as good as one trained on a clean dataset, could we just cut corners and annotate datasets quickly, without even trying to get perfect annotations?
As with everything else in AI, it turns out that is not an easy question with a clear-cut answer. Obviously, if we COULD get perfect annotations on a dataset, we’d take them, but if the effort needed to annotate a dataset with 99% accuracy is 10x that of annotating it to 90%, it does become useful to raise the question of what’s “good enough”.
To make matters worse, not all label noise is created equal. Studies have synthetically added different types of noise to a classification task to measure their effect on learning. The effect is clear: different kinds of noise added during training lead to wildly different results on test accuracy.
Different noise types during training have different impacts on test performance.
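To make “different noise types” concrete, here is a small sketch of how such experiments typically corrupt labels in a generic classification setting: uniform (symmetric) noise flips a label to any other class at random, while class-dependent (asymmetric) noise flips it to a specific confusable class. This is our own illustration, not the exact setup used in those studies.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_uniform_noise(labels, noise_rate, num_classes):
    """Flip a fraction `noise_rate` of labels to a uniformly random other class."""
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    for i in np.where(flip)[0]:
        other_classes = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(other_classes)
    return labels

def add_class_dependent_noise(labels, noise_rate, confusion_map):
    """Flip a fraction `noise_rate` of labels to a fixed 'confusable' class,
    mimicking a systematic annotator error (e.g. seal mistaken for rock)."""
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    for i in np.where(flip)[0]:
        labels[i] = confusion_map.get(labels[i], labels[i])
    return labels

# Example: 10% uniform noise vs. 10% 'class 0 -> class 1' style confusion.
clean = rng.integers(0, 3, size=1000)
noisy_uniform = add_uniform_noise(clean, 0.10, num_classes=3)
noisy_biased = add_class_dependent_noise(clean, 0.10, confusion_map={0: 1})
```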
Another conclusion they come to is that, as expected, the more noise you add, the worse the test performance will be. This aligns with Xu et al. (2019), who draw a similar conclusion, in their case in the context of object detection, which is closer to our application.
As the data quality becomes poorer, the expected performance of the model drops.
However, these results point to something interesting: while not ideal, anything below a 15% noise/missing rate does not actually affect the model’s performance significantly! This is exactly what we observed in the first part of this story. The models learn relatively well, seemingly drowning out the errors in a sea of correct labels. As expected of AI, it always has the solution to every problem! Even with approximate annotations, we can still get good scores on the test set.
That may be, but this is where reminding yourself WHY you’re building a model is important: the REAL goal is not to score high on the test set, it’s to build something that has real-world use. The first paper mentioned above actually has a great take on why this distinction matters: when training a model to solve a problem, researchers will usually train many models, switching up the training parameters (this is called hyperparameter tuning) or the dataset splits (this is cross-validation). Out of the dozens of models trained this way, they will then select the best one. However, if your test set has annotation errors, you are not actually selecting the best model! That 15% error rate becomes critical in the real world, because it will make the difference between selecting a model that really understood the data and another model which just learned to make the same mistakes as your annotators.
“That 15% error rate becomes critical in the real world, because it will make the difference between selecting a model that really understood the data and another model which just learned to make the same mistakes as your annotators.”
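Here is a toy simulation of that selection problem, using invented numbers: candidate A agrees with the true labels 93% of the time, while candidate B agrees with the noisy labels 90% of the time. Scored on the noisy test set, B looks better; scored on the corrected labels, the ranking flips.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 'ground truth' test labels and a corrupted copy with ~15% label errors.
clean_labels = rng.integers(0, 2, size=1000)
noisy_labels = clean_labels.copy()
flip = rng.random(len(noisy_labels)) < 0.15
noisy_labels[flip] = 1 - noisy_labels[flip]

# Candidate A genuinely models the data; candidate B partly learned the
# errors that ended up in the noisy labels.
preds_a = np.where(rng.random(1000) < 0.93, clean_labels, 1 - clean_labels)
preds_b = np.where(rng.random(1000) < 0.90, noisy_labels, 1 - noisy_labels)

def accuracy(preds, labels):
    return (preds == labels).mean()

print("scored on noisy test set: A =", accuracy(preds_a, noisy_labels),
      " B =", accuracy(preds_b, noisy_labels))
print("scored on clean test set: A =", accuracy(preds_a, clean_labels),
      " B =", accuracy(preds_b, clean_labels))
```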
Not only that, but the consensus around noisy annotations seems to be that, without specific techniques to mitigate label noise, model generalization usually suffers. The model’s internal representation of what this or that object looks like is corrupted by the errors in the data. The model can get good results on the test set, which is very similar to the training data (and so contains the same kinds of errors), but might fall apart when faced with new data.
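One commonly cited family of such mitigation techniques is “small-loss” sample selection (used by methods such as co-teaching): assuming that mislabeled examples tend to incur larger losses, each training step keeps only the fraction of the batch the model fits most easily. The sketch below shows just that selection step, framework-agnostic and with an assumed noise rate; it is not part of our pipeline.

```python
import numpy as np

def small_loss_selection(per_sample_losses, assumed_noise_rate):
    """Return indices of the samples to keep for this update:
    the (1 - assumed_noise_rate) fraction with the smallest loss,
    on the assumption that noisy labels tend to produce larger losses."""
    n_keep = int(len(per_sample_losses) * (1.0 - assumed_noise_rate))
    order = np.argsort(per_sample_losses)  # smallest loss first
    return order[:n_keep]

# Example: with an assumed 15% label-noise rate, drop the 15% of a batch
# that the model finds hardest to fit (likely candidates for bad labels).
batch_losses = np.array([0.2, 1.9, 0.4, 0.1, 2.5, 0.3, 0.6, 0.25, 0.35, 0.5])
keep = small_loss_selection(batch_losses, assumed_noise_rate=0.15)
```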
Obviously, this is not an exhaustive literature review on the question, but I hope this mental expedition was interesting food for thought! In the case of our Arctic seals model, we really hope to build a tool that helps wildlife managers assess populations, so we wanted to make sure we had the right annotations for training our model. Not only that, but this was an opportunity to give back to the wildlife AI community and help improve available open-source datasets: once we’ve finished cleaning up the data, we’ll send it to the dataset owners so they can update the download links.