YOLO (You Only Look Once) is a “state-of-the-art, real-time object detection system.”1 In other words: a machine learning algorithm that recognizes stuff in pictures. YOLO frames what it “sees” in confetti-colored rectangles, each neatly labelled with a single category. In digital hues that contrast with the everyday, they highlight answers but don't give explanations. To the uninitiated, it seems almost magical, or at least akin to some sort of intelligence.
Computers can't really see, of course. They recognize pixel formations statistically similar to the data they have previously learned from. Consider this image: a red blob in the center, surrounded by colder hues; a smaller, green-ish oval in the upper third? Apple, probably. Each prediction by a computer vision algorithm is an encounter with an image collection — a corpus of annotated training data showing what's what in the real world. This encounter is generally invisible, the data seen as a given, the incredible amount of labour that went into its creation unacknowledged.
Declassifier, the first of three pieces in this series, processes pictures using the YOLO object detection algorithm. Instead of showing the program's prediction, the viewer sees the picture superimposed with images from COCO, the training dataset from which the algorithm learned in the first place.
Declassifier exposes the myth of magically intelligent machines, highlighting that prediction is based on others’ past experience; that it takes a thousand dog photos to recognize another one. The piece visualizes exactly which data conditioned a certain prediction and encourages chance encounters with the biases and glitches present in the dataset. Thus, ultimately, it helps intuit how machines see.
COCO is an acronym for common objects in context. Originally published by researchers at Microsoft, the dataset is maintained by, and attributed to, a consortium of 13 contributors (all but one are male) at major American universities and tech companies. COCO consists of 328,000 annotated images containing “91 object types that would be easily recognizable by a 4 year old.” To make sure these types would “form a representative set of all categories” (all presumably understood in the sense of all there is), the authors started by editing an existing list of most frequently used words to come up with an initial list of classes. Then they voted, among themselves, on the best categories, and finally consulted several actual children aged four to eight.2 What a responsibility for a four-year-old! Soccer ball didn’t make the cut, baseball bat did. What is common, of course, depends on who is looking.
The next step was to get pictures. The authors collected the 328,000 images in COCO from the photography platform Flickr, presumably without the knowledge or consent of the amateur photographers who had uploaded photos to the platform. Collected, of course, means scraped: downloading algorithmically at the speed of multiple images per second, the heist was over in less than a day.
For COCO, the authors wanted, say, pictures of cats on couches, but not of cats posing in front of a white background for a feline studio shoot. To avoid images of single objects in isolation, the authors used combinations of object categories as search terms. This is how common objects are put in context: not primarily by any natural context in which they might appear, but explicitly by their appearance together with other COCO classes. It explains why the objects in COCO frequently seem neither common, nor in context.
Considering that Flickr already had ten billion pictures in their database by 2015 and more than a million are uploaded every day, it is reasonable to assume that there would be a photo of almost anything. Of course there are pictures online for the search term cat+sink (COCO contains 198 images showing both categories) and it isn’t at all inconceivable that cats could be in sinks, yet I would argue that they generally have no business in there. Similarly, one finds umbrella+toilet rarely together, or one in the other, unless the encounter is amplified, thus naturalized, by search query.3
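A count such as the 198 cat+sink images can be reproduced from COCO's annotation files. The following is a minimal sketch, assuming the usual COCO instances layout with "categories" and "annotations" lists; the function name count_cooccurrence and the tiny toy dataset are made up for illustration and stand in for the real file.

```python
from collections import defaultdict

def count_cooccurrence(coco, name_a, name_b):
    """Count images that contain at least one instance of both categories."""
    name_to_id = {c["name"]: c["id"] for c in coco["categories"]}
    id_a, id_b = name_to_id[name_a], name_to_id[name_b]

    # Collect the set of category ids that appear in each image.
    cats_per_image = defaultdict(set)
    for ann in coco["annotations"]:
        cats_per_image[ann["image_id"]].add(ann["category_id"])

    return sum(
        1 for cats in cats_per_image.values() if id_a in cats and id_b in cats
    )

# Toy stand-in for a COCO annotation file (illustrative data, not real COCO).
toy = {
    "categories": [
        {"id": 17, "name": "cat"},
        {"id": 81, "name": "sink"},
        {"id": 28, "name": "umbrella"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 17},
        {"image_id": 1, "category_id": 81},
        {"image_id": 2, "category_id": 17},
        {"image_id": 3, "category_id": 81},
        {"image_id": 3, "category_id": 28},
    ],
}

print(count_cooccurrence(toy, "cat", "sink"))  # 1
```

Run against a real annotation file loaded with json.load, the same function would give the number of images in that subset showing both categories.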
None of these instances is particularly concerning in itself. What I am suggesting is that, as a result, COCO represents more than images of objects. It captures a logic of how things should be connected: In this world, umbrellas should be in toilets, and cats in sinks. Innocent, sure. The looming question is whether data or the artificially intelligent systems that learn from them, in their far-reaching entrenchment within people’s lives, are ultimately able to instill their logic in the minds of people subjected to their algorithmic world views.4
Once the images were collected, crowdworkers on Amazon's Mechanical Turk platform spent a combined 70,000 hours on cleaning, segmenting, and annotating the dataset with 2.5 million instances of objects, a task that would have taken a single person working 40-hour weeks, without any time off, over 33 years. This workforce remains anonymous and we can only honor their collective labor, as artist Sebastian Schmieg does with segmentation.network. But for the photographers, whose contributions to artificial intelligence remain equally unacknowledged, we can do better!
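The 33-year figure follows from simple arithmetic, sketched here. The only assumption beyond the numbers in the text is a 52-week year with no holidays.

```python
TOTAL_HOURS = 70_000      # combined crowdworker hours, as stated in the text
HOURS_PER_WEEK = 40       # one person's full-time week
WEEKS_PER_YEAR = 52       # assumption: no time off at all

weeks = TOTAL_HOURS / HOURS_PER_WEEK
years = weeks / WEEKS_PER_YEAR

print(f"{weeks:.0f} weeks, about {years:.1f} years")  # 1750 weeks, about 33.7 years
```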
According to COCO's metadata, many photographers released their photos under a non-commercial license or require attribution, neither of which is respected in the use of the dataset. Although the type of license for each photo is included, COCO does not include picture credits: perhaps to protect the photographers' privacy, perhaps because asking for consent seemed unfeasible, or to avoid alerting them to this mishap. Fortunately, it is possible to reverse-engineer the Flickr username from the included photo URL, as well as to retrieve title and description of any picture that is still public on the site. Reenacting the original heist, I have crawled Flickr once again and created a new dataset called coCOCO, restoring context for Common Objects in Context.5
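A first step of such reverse-engineering might look like the sketch below. It is not necessarily how coCOCO was built. It assumes the photo URL follows Flickr's static image pattern (farm host, server number, then photo id, secret, and an optional size suffix); the function name extract_photo_id and the sample URL are made up for illustration.

```python
import re

# Flickr static image URLs look like:
#   http://farm{N}.staticflickr.com/{server}/{photo_id}_{secret}_{size}.jpg
FLICKR_STATIC_URL = re.compile(
    r"staticflickr\.com/\d+/(?P<photo_id>\d+)_(?P<secret>[0-9a-zA-Z]+)"
    r"(?:_[0-9a-z]{1,2})?\.(?:jpg|png|gif)$"
)

def extract_photo_id(url):
    """Return the Flickr photo id embedded in a static image URL, or None."""
    match = FLICKR_STATIC_URL.search(url)
    return match.group("photo_id") if match else None

# Made-up URL in the same shape as the ones in COCO's metadata.
sample_url = "http://farm4.staticflickr.com/1234/1234567890_abcdef1234_z.jpg"
print(extract_photo_id(sample_url))  # 1234567890
```

With the photo id in hand, one possible route to the username, title, and description is the Flickr API method flickr.photos.getInfo, which returns owner information for photos that are still public.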
Based on this new dataset, the second piece in this series, Humans of AI: Contributors, awards 34,248 certificates of appreciation to all known contributors. This includes several public figures, none of whom was available for comment.
The third piece in this series is a Slideshow of 123,288 COCO images, a seemingly endless stream of the mundane mixed with the absurd, never intended for a human audience. Presented at the pace of a family picture presentation on Christmas Eve, the piece restores previously discarded human-ness: these are the humans of AI. Slideshow attributes authors as requested, states image titles, and links to the Flickr photo page where available.
In their sum, the images in COCO are mere one-dimensional representations. Individually, with their context restored, they become alive, ripe with stories one can reconstruct (and reimagine) from image titles, descriptions, and Flickr pages.
Image #261563, for example, is officially captioned in COCO as “two dogs are playing with a frisbee in tall grass.” Fair enough. The original title, by Flickr user Rebecca Siegel, is “Stella, frisbee, Gryfe.” The dogs, unsurprisingly, have names. They have a family. They are real. The picture was taken in September 2008, when Gryfe was a puppy. Siegel has uploaded 139 more photos to the dog’s dedicated photo album since then. Scrolling through the pictures, I see the animal growing older, often in the loving company of a child who gradually becomes a teenager, then a young adult. In the final picture, taken in October 2018, Gryfe looks back from afar at the camera held close to the forest ground. The title is “Last walk.”
Computer vision, or artificial intelligence in general, certainly isn’t magic. Neither is it merely data. Each prediction by a computer vision algorithm is an encounter with the work of collective memory. Lives and loved ones, travels and homes, become fossils in datasets and AI models relying on vast amounts of data to learn to excavate value from information. Seen another way, their memories, even if neglected and forgotten individually, live on in computer models looking at the world alongside us. Every rectangle with the label “dog” contains a little Gryfe.
This project originated in a New School class, Data, Archives, Infrastructure, taught by the brilliant Prof. Shannon Mattern, who is exceptionally dedicated to her students. The 2019 update was generously supported by NYU ITP and ml5.js. I am grateful for this invaluable support.
2 Lin, Tsung-Yi, et al. "Microsoft COCO: Common Objects in Context." European Conference on Computer Vision. Springer, Cham, 2014.
3 It is worth noting that not every query that appears odd to me is simply normal for someone else. COCO contains images of living spaces, for example, that are computer renderings, not photographs. Some images in the dog category are company logos rather than pictures of an actual animal. As researchers are increasingly using synthetic data to train machine learning models, absurd logic might only be increasing.
4 I have adopted this thinking in reference to another incident. Regarding nonsensical children’s content on YouTube that is generated automatically, following not a narrative arc but popular keyword combinations, the writer Geoff Manaugh suggested that the absurd logic of the content’s creation might change the audience cognitively: http://www.bldgblog.com/2017/11/the-ghost-of-cognition-past-or-thinking-like-an-algorithm
5 CoCOCO was crawled from Flickr over the course of two days in November 2018. The dataset contains metadata for 123,288 COCO images (train/val 2017 subset) by 34,247 known authors. 10,515 entries are without data (deleted or no longer public) and thus not attributable.