
By the Numbers

Google Books: A Complex and Controversial Experiment

Credit: Laurent Cilluffo

Google won a decisive victory this month when a United States appeals court ruled that its massive project to digitize all the world’s books did not violate authors’ copyrights. The decision ended nearly a decade of challenges that seemed, as recently as September, to place the fate of Google Books in legal limbo. But while the litigation is over, the value of the project — to Internet users, academics, and the company itself — remains a matter of debate.

Google began its quest to create a universal library in 2004, providing technology and financing to digitize the collections of some of the world’s largest research libraries in exchange for the right to make those scanned books part of its searchable database. The project, which scanned not only titles in the public domain but also those under copyright, was almost immediately beset by controversy.

Throughout all this, the pace and scale of book-scanning only grew. In 2002, when Google began experimenting with book-scanning, it took 40 minutes to scan a 300-page book (roughly 450 pages an hour). Now, a scanning operator can digitize up to 6,000 pages in an hour, according to Maggie Shiels, a Google spokeswoman. In total, more than 25 million volumes have been scanned, including texts in 400 languages from more than 100 countries.

The breadth of knowledge contained in Google Books has given rise to a new field of academic inquiry. Recent papers have used Google “Ngrams,” which track how frequently a word appears, year by year, across the Google Books database, to draw cultural inferences — such as noting a precipitous decline in the use of words like “virtue” and “decency.” The sheer size of what’s in Google Books — 10 billion pages — seems to give such theories credibility.
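For readers curious how such a curve is built, the sketch below shows the basic arithmetic: a word’s count in a given year divided by all the words scanned for that year. The titles, counts, and totals are invented for illustration; Google’s actual corpus and pipeline are far more elaborate and are not public in this form.

```python
# Illustrative sketch only: how a per-year frequency series of the kind the
# Ngram viewer plots can be derived from raw counts. All numbers are made up.

word_counts = {        # hypothetical occurrences of "empathy" per year
    1950: 1_200,
    1980: 4_800,
    2000: 21_000,
}
total_tokens = {       # hypothetical total words scanned for each year
    1950: 80_000_000,
    1980: 120_000_000,
    2000: 350_000_000,
}

for year in sorted(word_counts):
    frequency = word_counts[year] / total_tokens[year]
    print(f"{year}: {frequency:.2e}")   # relative frequency, not a raw tally
```

The point of the normalization is simply that later years contribute far more scanned text, so raw counts alone would rise for almost any word.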

But skeptics say it’s hard to draw conclusions based on Google Books because the company has not shared information about what kinds of texts are in the collection, or how representative they are of culture at large. Geoffrey Nunberg, a linguist at the University of California, Berkeley, has highlighted problems in the way titles are dated and categorized and has called Google Books a “metadata train wreck.”

This month, a team of data scientists at the University of Vermont published an analysis of Google Ngrams that questioned the value of this collection for scholars. “We came to this dataset rubbing our hands together, very excited, because we were going to find out how culture evolves,” said Peter Dodds, one of the paper’s authors. “But when you get into the data, it actually looks like a big mess. It’s hard to use it to justify the popularity of anything.”

He said there were two main problems with using Google Books to understand culture. One is that each book counts only once in the database. “‘Moby Dick’ in principle appears once and so do novels no one has ever read,” he said. “So they both get the same weight.” The other problem is that Google Books appears to contain a large percentage of scientific literature that might skew the Ngram results. In their paper, Mr. Dodds and his colleagues, Eitan Adam Pechenick and Christopher M. Danforth, show that specific terms endemic to scientific literature begin to appear with disproportionate frequency around 1900.
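A tiny, hypothetical sketch of the first problem, with invented titles and figures: when every scanned volume contributes exactly once, an obscure book pulls the same weight as a widely read classic, which is what Mr. Dodds’s group argues makes the corpus a poor proxy for what people actually read.

```python
# Illustrative sketch of the equal-weight problem. Titles, counts, and
# readership figures are invented; this is not Google's or the authors' code.

books = [
    # (title, times "whale" appears in the text, copies read -- hypothetical)
    ("Moby-Dick",              1_700, 5_000_000),
    ("Obscure whaling memoir",   900,       200),
]

# Corpus-style count: each scanned volume contributes exactly once.
unweighted = sum(count for _, count, _ in books)

# Readership-weighted count: widely read books dominate, obscure ones barely register.
weighted = sum(count * readers for _, count, readers in books)

print("equal-weight total:   ", unweighted)  # the memoir has real influence here
print("readership-weighted:  ", weighted)    # Moby-Dick overwhelmingly dominates
```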

Mr. Dodds said the slick interface of the Ngram viewer and the huge number of texts it searches offer a convincing illusion. Type in the word “empathy” and one might assume that the world is becoming crueler or kinder depending on the result. “It’s pretty fraught,” he said. “Because Google made such a nice front-end, we put some trust into it. But we’re pretty concerned because we know it’s much more complicated underneath.”

A version of this article appears in print in The New York Times International Edition.
