“How accurate is it?” is a pertinent question for new AI applications — but one that can be deceptively hard to answer.
Why? Because the gold standard for accuracy is often human judgement, and human judgements can be messy and even contradictory. Take the use case of detecting whales from aerial images. The goal sounds simple enough: count all the whales present at a given time, and record their locations. But whales only spend a portion of their time at the surface, where humans (or AI) can see them. So what about the whales underwater? One useful way around this “availability bias” is to find out what proportion of its time the species in question spends at the surface, then scale our counts accordingly. For example, if the species spends 40 percent of its time at the surface, and we see 400 whales in a survey, we can assume there are another 600 we’re not seeing, for a total of 1,000. So if we can train an algorithm that recognizes all the whales at the surface, we’re golden. Simple, right…
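If it helps to see the arithmetic, here is a minimal sketch of that availability correction in Python. The function name and the numbers are purely illustrative, not taken from any particular survey toolkit:

```python
def availability_correction(observed_count: float, surface_proportion: float) -> float:
    """Scale a surface count up to an estimated total, assuming animals
    are only detectable the given fraction of the time."""
    if not 0 < surface_proportion <= 1:
        raise ValueError("surface_proportion must be in (0, 1]")
    return observed_count / surface_proportion

# The example from the text: 400 whales seen, species at the surface 40% of the time
estimated_total = availability_correction(400, 0.40)  # -> 1000.0, i.e. 600 whales unseen
```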
But what about the whales in between? Like these ones:
Photo Credit: Department of Fisheries and Oceans Canada.
We counted seven whales here. How many do you see? Should a faint silhouette beneath the waves count the same as a whale with its whole butt up in the air? How do we decide how deep is too deep to count as “at the surface”? In a word, subjectively. And this is just the beginning: as in any area of life, humans find all kinds of ways to disagree when detecting animals from aerial imagery. The stats are murky for whales, but the tendency of two experts looking at the same imagery to disagree has been documented again and again across species. Biologists counting birds, elephant herds, sea turtle tracks and grey whales have all come back with discrepancies in the range of 10-20%. This is where the notion of absolute truth starts to fray: if humans are the gold standard against which AI is measured, don’t we need them to at least agree with each other?
Further down the rabbit hole, it’s worth pointing out that human observers aren’t even completely consistent with themselves. One aerial ecological survey found that even when the same observer analyzed the same images twice, the two counts differed by 6% on average.
There are ways of getting at the question of absolute truth, and of how well humans stack up against it. One study simulated the conditions of an aerial survey using a scaled physical model of a colony of wading birds. Because the model was man-made, the true number of birds was known exactly, so observers’ counts could be checked against it. Even when experienced observers were allowed to comb over high-resolution images of the model, their estimates fell on average 20% above or below the real number (usually below: it’s easier to miss birds than to make them up). But in the wild, the best we can do is compare different observers’ counts with each other, so the “gold standard” really just means making only the mistakes any human would make.
Maybe we need to re-frame our question: instead of asking how close AI is to “the truth”, we should aim to build AI that disagrees with a human observer no more than humans disagree with each other (or with themselves, for that matter). Reframing the accuracy debate helps us focus on where AI can really deliver: speed. If AI can reach human levels of accuracy in a fraction of the time, more data can be analyzed more quickly, and in a field where data collection is outpacing analysis, that is a big deal. It means better decisions for whales and humans alike.
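To make that reframing concrete, here is one way the comparison might look in code. This is only a sketch: the per-image counts are invented, and mean relative disagreement is just one of several metrics you could reasonably choose.

```python
import numpy as np

def mean_relative_disagreement(counts_a, counts_b):
    """Average relative difference between two observers' counts
    over the same set of survey images."""
    a = np.asarray(counts_a, dtype=float)
    b = np.asarray(counts_b, dtype=float)
    denom = np.maximum((a + b) / 2, 1e-9)  # avoid dividing by zero on empty images
    return float(np.mean(np.abs(a - b) / denom))

# Hypothetical whale counts for the same five images
human_1 = [12, 7, 0, 5, 9]
human_2 = [11, 8, 0, 4, 10]
model   = [12, 7, 1, 5, 9]

human_baseline = mean_relative_disagreement(human_1, human_2)
model_vs_human = mean_relative_disagreement(model, human_1)

# The bar to clear: the model disagrees with a human
# no more than humans disagree with each other
print(model_vs_human <= human_baseline)
```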
Does this mean AI can’t help us make gains in consistency? Not exactly. For one thing, while machine learning algorithms may perform better on some datasets than others, their performance on a given dataset won’t change over time, unlike humans. In addition, by only calling on human experts to help the algorithm out on edge cases (the toughest calls), we humans can focus on developing standardized thresholds for detection, while outsourcing the bulk of the “easy” judgements to the AI. Whale Seeker’s new human-in-the-loop tool Mobius integrates human and machine judgements, allowing for huge gains in efficiency without sacrificing reliability.
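For illustration only, a generic human-in-the-loop triage could look something like the sketch below. This is not Mobius’s actual logic; the Detection class and the confidence thresholds are assumptions invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    image_id: str
    confidence: float  # model's confidence that this detection is a whale

def triage(detections, low=0.3, high=0.9):
    """Split detections into automatic accepts/rejects and edge cases
    that get escalated to a human reviewer (illustrative thresholds)."""
    auto_accept, auto_reject, needs_review = [], [], []
    for d in detections:
        if d.confidence >= high:
            auto_accept.append(d)    # clear-cut whale: no human needed
        elif d.confidence <= low:
            auto_reject.append(d)    # clearly not a whale: no human needed
        else:
            needs_review.append(d)   # the toughest calls go to the experts
    return auto_accept, auto_reject, needs_review
```

The design point is that the thresholds, rather than every individual judgement, become the thing humans standardize.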