Inside the Alexa-Friendly World of Wikidata

Virtual assistants do their jobs better thanks to Wikidata, which aims to (eventually) represent everything in the universe in a way computers can understand.
Illustration: Tyler Gross

Humans pricked by info-hunger pangs used to hunt and peck for scraps of trivia on the savanna of the internet. Now we sit in screen-glow-flooded caves and grunt, “Alexa!” Virtual assistants do the dirty work for us. Problem is, computers can’t really speak the language.

Many of our densest, most reliable troves of knowledge, from Wikipedia to (ahem) the pages of WIRED, are encoded in an ancient technology largely opaque to machines—prose. That’s not a problem when you Google a question. Search engines don’t need to read; they find the most relevant web pages using patterns of links. But when you ask Google Assistant or one of its sistren for a celebrity’s date of birth or the location of a famous battle, it has to go find the answer. Yet no machine can easily or quickly skim meaning from the internet’s tangle of predicates, complements, sentences, and paragraphs. It requires a guide.

Wikidata, an obscure sister project to Wikipedia, aims to (eventually) represent everything in the universe in a way computers can understand. Maintained by an army of volunteers, the database has come to serve an essential yet mostly unheralded purpose as AI and voice recognition expand to every corner of digital life. “Language depends on knowing a lot of common sense, which computers don’t have access to,” says Denny Vrandečić, who founded Wikidata in 2012. A programmer and regular Wikipedia editor, Vrandečić saw the need for a place where humans and bots could share knowledge on more equal terms.

Inside the bot-friendly world of Wikidata, every concept and thing is represented with a numeric code dubbed a QID. WIRED is known, not so snappily, as Q520154. (The Q prefix on every entry is a tribute to Vrandečić’s wife, Qamarniso.) In December, the project added its 60 millionth item—a protein found in the mitochondria of the parasite that causes human malaria, a k a Q133969.

In turn, Q-coded entities are interlinked and categorized by tags called properties, so that computers can parse relationships between them. Instead of having to deduce from Wikipedia whose spirit possessed Harry Potter (Q3244512), a bot can see that the tag for “possessed by spirit” (P4292) points to Lord Voldemort (Q176132). In other cases, a property denoting “disputed by” (P1310) helps Wikidata reflect that not all truths are universally acknowledged, like whether Jerusalem is Israel’s capital.
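The item-and-property structure amounts to a store of (subject, property, object) statements. Here's a toy sketch in Python — not Wikidata's actual software, just an illustration of the model — that records the Harry Potter claim from above and answers it by direct lookup, the way a bot reads a Wikidata statement. The Q and P identifiers are the real ones cited in this article; the labels dictionary is a stand-in for Wikidata's multilingual labels.

```python
# Toy triple store mimicking Wikidata's item/property model.
# Each statement maps (subject QID, property PID) -> object QID.
statements = {
    ("Q3244512", "P4292"): "Q176132",  # Harry Potter --possessed by spirit--> Lord Voldemort
}

# Human-readable labels, standing in for Wikidata's label data.
labels = {
    "Q3244512": "Harry Potter",
    "Q176132": "Lord Voldemort",
    "P4292": "possessed by spirit",
}

def lookup(subject_qid, property_pid):
    """Return the object QID of a statement, or None if no such claim exists."""
    return statements.get((subject_qid, property_pid))

spirit = lookup("Q3244512", "P4292")
print(f"{labels['Q3244512']}, {labels['P4292']}: {labels[spirit]}")
```

No parsing of prose required: the bot never deduces anything, it just follows the P4292 edge from one Q-coded node to another.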

Data can be woven into this tapestry by both people and machines. Human editors add new factoids and provide links to their sources, just as they would in Wikipedia. Some information is piped in automatically from other databases, as when biologists backed by the National Institutes of Health unleashed Wikidata bots to add details of all human and mouse genes and proteins. Institutions like New York’s MoMA and the British Library have used software and crowdsourcing to link their catalogs to Wikidata. Some Wikipedia pages auto-update themselves by drawing on Wikidata.

Wikidata’s regimented representation of the world’s complexity still leaves room for whimsy. Pleasingly, Q1 is assigned to the universe. Author Douglas Adams is Q42, a reference to what his fictional supercomputer Deep Thought calculated to be “the Answer to the Ultimate Question of Life, the Universe, and Everything.” Editors made Q1337 leetspeak, 0f c0urs3, and gave Q13 to triskaidekaphobia. (If you don’t get it, ask Alexa.)

This exercise in robot epistemology can’t yet help computers interpret the staccato vocalizations—see Q170579, laughter—that nerdy Easter eggs can elicit from humans. Making machines more like people isn’t the point; the codes are intended to help machines update, find, and remix knowledge in new ways. The connections forged between nuggets of knowledge in Wikidata allow computers to answer complex questions in fractions of a second, without having to trawl through multiple web pages or databases. How many animal species are named after Barack Obama? Wikidata immediately finds and reports 11, the most of any US president. (Donald Trump currently has two, a blond moth and a sea urchin.)
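Questions like the Obama one are typically posed to Wikidata's public query service in SPARQL. The sketch below builds such a query in Python. The identifiers are real Wikidata IDs (Q76 is Barack Obama, P31 is "instance of", Q16521 is "taxon", P138 is "named after"), but treat the exact query shape as illustrative rather than the assistants' actual pipeline.

```python
# Build a SPARQL query counting taxa (Q16521) "named after" (P138)
# Barack Obama (Q76). The IDs are real Wikidata identifiers; the
# query shape is a sketch, not how any particular assistant does it.
QUERY = """
SELECT (COUNT(?taxon) AS ?species) WHERE {
  ?taxon wdt:P31  wd:Q16521 .   # ?taxon is an instance of "taxon"
  ?taxon wdt:P138 wd:Q76 .      # ...and is named after Barack Obama
}
""".strip()

print(QUERY)
```

Pasting the query into the public editor at query.wikidata.org (or POSTing it to the SPARQL endpoint at https://query.wikidata.org/sparql) returns the live count as JSON — one graph traversal, no web pages skimmed.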

Virtual assistants do their jobs better because of Wikidata. Their corporate creators scrape the data and combine it with other sources—though exactly how they use the information, or to what extent, hasn’t been made public. Siri sometimes cites the database as a source, but Apple declined to discuss its use of Wikidata. So did Amazon, but the company did publish a paper last year on how Wikidata taught Alexa to recognize the pronunciation of song titles in different languages.

That the voice-enabled avatars of the world’s most sophisticated tech companies rely on a collective of unpaid enthusiasts is a reminder that AI is more limited than we are often led to believe. Wikidata is incomplete and messy. A quarter of items lack references. There are many errors, one of which caused Siri to spookily foretell, by four months, the death of 95-year-old comic book legend Stan Lee last year. Apple and others use Wikidata anyway, because our dumb algorithms so desperately need help comprehending the world.

Such dependence may serve us well. The knowledge of future machines could be shaped by you and me, not just tech companies and PhDs. Wikidata is supported by the German chapter of the Wikimedia Foundation, the nonprofit that keeps the server lights blinking for Wikipedia and related projects. After Wikimedia’s executive director, Katherine Maher, called out megacorporations for tapping those free resources without offering much in return, Amazon and Facebook ponied up $1 million each. Google recently announced a $3.1 million donation.

The funds will help the foundation’s efforts to make its communities and information stores more representative. Almost 4 million people have a Wikidata entry listing their gender; only 18 percent are female. The resource’s knowledge of the global south is sketchy. Maher is confident we can fix those blind spots, as long as companies do more than just take from Wikipedia and Wikidata. “The only way that that’s going to happen is if the commons is treated as a renewable resource, not one to be strip-mined,” she says. If society makes a collective effort to build out the informational backbone of AI, we and our future bot friends might just achieve Q238651, world peace.


Tom Simonite (@tsimonite) covers intelligent machines for WIRED.

This article appears in the March issue.
