In 2012, artificial intelligence researchers engineered a big leap in computer vision thanks, in part, to an unusually large set of images—thousands of everyday objects, people, and scenes in photos that were scraped from the web and labeled by hand. That data set, known as ImageNet, is still used in thousands of AI research projects and experiments today.
But last week every human face included in ImageNet suddenly disappeared—after the researchers who manage the data set decided to blur them.
Just as ImageNet helped usher in a new age of AI, efforts to fix it reflect challenges that affect countless AI programs, data sets, and products.
“We were concerned about the issue of privacy,” says Olga Russakovsky, an assistant professor at Princeton University and one of those responsible for managing ImageNet.
ImageNet was created as part of a challenge that invited computer scientists to develop algorithms capable of identifying objects in images. In 2012, this was a very difficult task. Then a technique called deep learning, which involves “teaching” a neural network by feeding it labeled examples, proved more adept at the task than previous approaches.
Since then, deep learning has driven a renaissance in AI that also exposed the field’s shortcomings. For instance, facial recognition has proven a particularly popular and lucrative use of deep learning, but it’s also controversial. A number of US cities have banned government use of the technology over concerns that it invades citizens’ privacy or is biased, since the programs are less accurate on nonwhite faces.
Today ImageNet contains 1.5 million images with around 1,000 labels. It is largely used to gauge the performance of machine learning algorithms, or to train algorithms that perform specialized computer vision tasks. Blurring the faces affected 243,198 of the images.
Russakovsky says the ImageNet team wanted to determine if it would be possible to blur faces in the data set without changing how well algorithms trained on it recognize objects. “People were incidental in the data since they appeared in the web photos depicting these objects,” she says. In other words, in an image that shows a beer bottle, even if the face of the person drinking it is a pink smudge, the bottle itself remains intact.
In a research paper posted along with the update to ImageNet, the team behind the data set explains that it blurred the faces using Amazon’s AI service Rekognition, then paid Mechanical Turk workers to confirm the selections and adjust them.
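For readers curious what blurring a selected region involves, here is a minimal sketch in plain Python. It is an illustration only, not the ImageNet team’s code and not the Rekognition API: the function name blur_region, the tiny grayscale image, and the box coordinates are all hypothetical, and a simple mean blur stands in for whatever filter the researchers actually applied.

```python
def blur_region(image, box, radius=1):
    """Return a copy of a grayscale image with a mean blur applied inside box.

    image:  list of rows, each a list of ints (0-255)
    box:    (top, left, bottom, right), with bottom and right exclusive
    radius: how many neighboring pixels to average in each direction
    """
    height, width = len(image), len(image[0])
    top, left, bottom, right = box
    out = [row[:] for row in image]  # copy, so the input is left untouched
    for y in range(max(top, 0), min(bottom, height)):
        for x in range(max(left, 0), min(right, width)):
            total = count = 0
            # Average the pixel with its neighbors, clipped at the image edge.
            for ny in range(max(y - radius, 0), min(y + radius + 1, height)):
                for nx in range(max(x - radius, 0), min(x + radius + 1, width)):
                    total += image[ny][nx]
                    count += 1
            out[y][x] = total // count
    return out


# Hypothetical 4x4 image with one bright pixel standing in for a "face".
image = [
    [0, 0, 0, 0],
    [0, 255, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
blurred = blur_region(image, box=(0, 0, 2, 2))
```

Only pixels inside the box change; everything outside it, like the bottle in the example above, is copied through as it was.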
Blurring the faces did not affect the performance of several object-recognition algorithms trained on ImageNet, the researchers say. They also show that other algorithms built with those object-recognition algorithms are similarly unaffected. “We hope this proof-of-concept paves the way for more privacy-aware visual data collection practices in the field,” Russakovsky says.
It isn’t the first effort to adjust the famous library of images. In December 2019, the ImageNet team deleted biased and derogatory terms introduced by human labelers after a project called Excavating AI drew attention to the issue.
In July 2020, Vinay Prabhu, a machine learning scientist at UnifyID, and Abeba Birhane, a PhD candidate at University College Dublin in Ireland, published research showing they could identify individuals, including computer science researchers, in the data set. They also found pornographic images included in it.
Prabhu says blurring faces is good, but he is disappointed that the ImageNet team did not acknowledge the work that he and Birhane did. Russakovsky says a citation will appear in an updated version of the paper.