MIT, NYU Pull Tiny Images AI Training Dataset Due to Racist, Sexist Labels
By John K. Waters
Artificial intelligence (AI) researchers at the Massachusetts Institute of Technology (MIT) and New York University (NYU) this week took down the widely used "80 Million Tiny Images" dataset created to train machine learning systems because it was found to contain a range of racist, sexist and otherwise offensive image labels, including 2,000 uses of the N-word, as well as terms such as "bitch" and "whore," and non-consensual photos taken up women's skirts.
In a letter on the CSAIL mailing list, MIT professors Bill Freeman and Antonio Torralba and NYU professor Rob Fergus apologized for the labels and asked others to refrain from using the dataset and delete any existing copies.
"[B]iases, offensive and prejudicial images, and derogatory terminology alienates an important part of our community," the letter reads, "precisely those that we are making efforts to include. It also contributes to harmful biases in AI systems trained on such data. Additionally, the presence of such prejudicial images hurts efforts to foster a culture of inclusivity in the computer vision community. This is extremely unfortunate and runs counter to the values that we strive to uphold."
The dataset will be taken offline permanently, they wrote, because the images are too small to be reliably inspected and filtered by hand.
Created in 2006, the dataset actually contains 79,302,017 color images, each 32×32 pixels, scaled down from images extracted from the Web using automated search queries on a set of 75,062 non-abstract nouns derived from Princeton University's WordNet lexical database, the professors explained in a white paper. The search terms themselves were used as the labels for the images they returned. The researchers used seven Web search resources for this purpose: Altavista, Ask.com, Flickr, Cydral, Google, Picsearch and Webshots.
The researchers claimed they were unaware of the offensive labels, and that they were "a consequence of the automated data collection procedure that relied on nouns from WordNet."
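The collection procedure described above, in which each search term becomes the label of every image it returns, can be illustrated with a minimal Python sketch. It is not the researchers' code. The function names, the blocklist and the placeholder term are all hypothetical, and a dictionary stands in for the search engines. It shows only that a term left in the query vocabulary flows straight into the labels, and that screening the vocabulary before collection is one way to stop it.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical blocklist; the placeholder stands in for an offensive term.
BLOCKLIST: Set[str] = {"offensive_term_placeholder"}


def screen_vocabulary(
    nouns: Iterable[str], blocklist: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split candidate query nouns into (kept, rejected) using a blocklist."""
    kept: List[str] = []
    rejected: List[str] = []
    for noun in nouns:
        normalized = noun.strip().lower()
        if normalized in blocklist:
            rejected.append(normalized)
        else:
            kept.append(normalized)
    return kept, rejected


def label_images(search_results: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Label every retrieved image with the query noun that returned it."""
    return [
        (image_id, noun)
        for noun, image_ids in search_results.items()
        for image_id in image_ids
    ]


# Assumed example vocabulary (not taken from WordNet).
candidate_nouns = ["Bicycle", "teapot", "Offensive_Term_Placeholder"]
kept, rejected = screen_vocabulary(candidate_nouns, BLOCKLIST)

# Stand-in for search engine output: query noun -> image identifiers.
fake_results = {noun: [f"{noun}_0.png", f"{noun}_1.png"] for noun in kept}
labeled = label_images(fake_results)
```

A blocklist of this kind catches only terms someone has thought to list. It does nothing about offensive images returned for innocuous queries, so it is at best a partial safeguard.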
The revelations about the MIT/NYU dataset underscore a problem with large-scale vision datasets (LSVDs), argued UnifyID AI Labs researcher Vinay Uday Prabhu and University College Dublin researcher Abeba Birhane in a paper. The paper identified "verifiably pornographic" associations and "ethical transgressions" contained within image-recognition datasets.
"LSVDs in general constitute a Pyrrhic win for computer vision," they wrote. "We argue, this win has come at the expense of harm to minoritized groups and further aided the gradual erosion of privacy, consent, and agency of both the individual and the collective."
The paper's authors cited several examples of problems tied to LSVD image labeling, including the systematic under-representation of women in search results for occupations, pedestrian detection with higher error rates for demographic groups with darker skin tones, and classification that is most accurate for lighter-skinned males. (Darker-skinned females suffer the most misclassification, they wrote.)
According to Prabhu and Birhane, LSVDs like 80 Million Tiny Images and the better-known ImageNet have "inherited the shortcomings of the WordNet taxonomy...such as the single label per-image procedure when the real-world images usually contain multiple objects." Unlike ImageNet, the Tiny Images dataset had not been audited or scrutinized until now. The dataset contains 53,464 different nouns directly copied from WordNet.
"We posit that the deeper problems are rooted in the wider structural traditions, incentives, and discourse of a field that treats ethical issues as an afterthought," they wrote. "A field where in the wild is often a euphemism for without consent. We are up against a system that has veritably mastered ethics shopping, ethics bluewashing, ethics lobbying, ethics dumping, and ethics shirking."
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at email@example.com.