The importance of data (and lots of it) for AI is old news by now. What’s less well known, but becoming increasingly apparent, is that data availability (not algorithm design) is the biggest obstacle for most AI applications.
The belief that the sheer quantity of data, rather than its quality, makes or breaks AI projects has led to a “race to the bottom,” where labelling work is crowdsourced or outsourced to dedicated labelling “factories”. In either case, these jobs are often menial and poorly paid.
Just as bricks-and-mortar companies are held accountable for the ethics of their whole supply chain, there can be no ethical AI without ethical labelling practices. The Montreal Declaration distills a few key principles that must be adhered to if AI is to be developed responsibly. While many of these principles apply most readily to finished AI applications, the declaration highlights the importance of the Equity Principle throughout the lifecycle of AI:
“Industrial [AI system] development must be compatible with acceptable working conditions at every step of their life cycle […] including data processing.”
As a signatory of the Montreal Declaration, Whale Seeker is committed to upholding this principle. This means everyone who labels data for us is paid at least a living wage for Montreal. This is also one of the ways we’re bringing B Corp values to AI.
There are pragmatic reasons for caring where your labels come from, as well as ethical ones. While it’s easy to get caught up in the race to the bottom, data labelling is often far from a trivial part of the machine learning equation. In many cases it requires a good deal of subject-specific expertise to produce useful results. Incomplete or poorly labelled data can lead to inaccurate or biased performance when the algorithm is applied in the wild, with potentially serious repercussions, depending on the application.
From a purely results-driven point of view, ignoring the importance of data quality places a ceiling on the performance of the resulting AI algorithm, regardless of the brilliance of the engineers training it. Put simply: crap in, crap out.
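To make the ceiling concrete, here is a minimal sketch in Python. It rests on a deliberately simplified assumption that is ours for illustration, not a description of any real training pipeline: the model is an ideal imitator that reproduces exactly the labels it was given. The function names `flip_labels` and `accuracy`, the noise rates, and the synthetic binary labels are all hypothetical. Under that assumption, the model's accuracy against the ground truth can be no better than the accuracy of the labels themselves.

```python
import random


def flip_labels(true_labels, noise_rate, seed=0):
    """Return a copy of binary labels, each flipped with probability noise_rate."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < noise_rate else y for y in true_labels]


def accuracy(predictions, reference):
    """Fraction of predictions that match the reference labels."""
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(reference)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic ground truth: 10,000 binary labels (illustrative only).
    truth = [rng.randint(0, 1) for _ in range(10_000)]

    for noise_rate in (0.0, 0.1, 0.3):
        labels = flip_labels(truth, noise_rate)
        # Assumption: the model imitates its labels perfectly,
        # so its predictions are the labels themselves.
        predictions = labels
        print(noise_rate, round(accuracy(predictions, truth), 3))
```

In this toy setup, the printed accuracy falls roughly in step with the share of wrong labels. Real models do not behave exactly like this, and some training methods tolerate a degree of label noise, but the sketch shows why label errors are not something engineering effort alone can undo.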
For Whale Seeker, the verdict on data labelling is clear: there are no shortcuts that don’t sacrifice quality. That’s why our manually detected data is always labelled by an expert with a Master’s degree or above in biology. Our labellers are also paid for their time, not by the task, a practice which has been shown to improve label accuracy.
We often get questions about outsourcing data labelling, and we hope this explains why we put so much emphasis on in-house expertise, both for technical performance and for ethical standards. In our next blog post, we’ll be discussing the technical setup that allows us to make the most of that expertise.