The Wayback Machine - https://web.archive.org/web/20070202023101/http://www.newyorker.com:80/fact/content/articles/070205fa_fact_toobin






 



GOOGLE’S MOON SHOT
The quest for the universal library.
by JEFFREY TOOBIN
Issue of 2007-02-05
Posted 2007-01-29

Every weekday, a truck pulls up to the Cecil H. Green Library, on the campus of Stanford University, and collects at least a thousand books, which are taken to an undisclosed location and scanned, page by page, into an enormous database being created by Google. The company is also retrieving books from libraries at several other leading universities, including Harvard and Oxford, as well as the New York Public Library. At the University of Michigan, Google’s original partner in Google Book Search, tens of thousands of books are processed each week on the company’s custom-made scanning equipment.

Google intends to scan every book ever published, and to make the full texts searchable, in the same way that Web sites can be searched on the company’s engine at google.com. At the books site, which is up and running in a beta (or testing) version, at books.google.com, you can enter a word or phrase—say, Ahab and whale—and the search returns a list of works in which the terms appear, in this case nearly eight hundred titles, including numerous editions of Herman Melville’s novel. Clicking on “Moby-Dick, or The Whale” calls up Chapter 28, in which Ahab is introduced. You can scroll through the chapter, search for other terms that appear in the book, and compare it with other editions. Google won’t say how many books are in its database, but the site’s value as a research tool is apparent; on it you can find a history of Urdu newspapers, an 1892 edition of Jane Austen’s letters, several guides to writing haiku, and a Harvard alumni directory from 1919.

No one really knows how many books there are. The most volumes listed in any catalogue is thirty-two million, the number in WorldCat, a database of titles from more than twenty-five thousand libraries around the world. Google aims to scan at least that many. “We think that we can do it all inside of ten years,” Marissa Mayer, a vice-president at Google who is in charge of the books project, said recently, at the company’s headquarters, in Mountain View, California. “It’s mind-boggling to me, how close it is. I think of Google Books as our moon shot.”

Google’s is not the only book-scanning venture. Amazon has digitized hundreds of thousands of the books it sells, and allows users to search the texts; Carnegie Mellon is hosting a project called the Universal Library, which so far has scanned nearly a million and a half books; the Open Content Alliance, a consortium that includes Microsoft, Yahoo, and several major libraries, is also scanning thousands of books; and there are many smaller projects in various stages of development. Still, only Google has embarked on a project of a scale commensurate with its corporate philosophy: “to organize the world’s information and make it universally accessible and useful.”

In part because of that ambition, Google’s endeavor is encountering opposition. A federal court in New York is considering two challenges to the project, one brought by several writers and the Authors Guild, the other by a group of publishers, who are also, curiously, partners in Google Book Search. Both sets of plaintiffs claim that the library component of the project violates copyright law. Like most federal lawsuits, these cases appear likely to be settled before they go to trial, and the terms of any such deal will shape the future of digital books. Google, in an effort to put the lawsuits behind it, may agree to pay the plaintiffs more than a court would require; but, by doing so, the company would discourage potential competitors. To put it another way, being taken to court and charged with copyright infringement on a large scale might be the best thing that ever happens to Google’s foray into the printed word.

Though Google has more than ten thousand employees—about fifty new ones are hired each week—and a market capitalization of more than a hundred and fifty billion dollars, the company cultivates the air of a college campus at its headquarters, in Silicon Valley. Now and then, there are self-consciously wacky stunts, like Pajama Day, which happened to take place when I visited. (The event was to be madcap within reason; supervisors were told to convey the message that “pajamas means ‘pajamas,’ not ‘what you sleep in.’ ”) When I met with Sergey Brin, a co-founder of Google, he was wearing bright-blue p.j.s, with the company’s logo stitched on the breast pocket.

The story of how Brin and Google’s other co-founder, Larry Page, met as graduate students in computer science at Stanford in the mid-nineties, and devised a series of elegant software algorithms that allowed Web searchers to find relevant information quickly and efficiently, has become part of Silicon Valley lore. Less well known is that, at the time, Brin and Page were also working on Stanford’s Digital Library Technologies Project, an attempt, funded by the federal government, to organize different kinds of stored information, including books, articles, and journals, in digital form. “There was an attitude in computer science that putting things on dead trees was obsolete and getting it all into a searchable, digital format was a quest that had to be accomplished someday,” Terry Winograd, a Stanford professor who was a mentor to Page and Brin, said.

After founding Google, in 1998, Page and Brin—who are now in their mid-thirties and worth around fourteen billion dollars each—began to talk about how to include books in the company’s database. Page, in particular, embraced the idea of putting books online; at one point, he set up a primitive lab in his office, with a scanner and a page-turning machine. “I think it was motivating to have those kinds of aspirations, but nobody really took it seriously,” Brin told me. The men were less interested in making it easy for people to obtain the full texts of books online than in making accessible the information those books contained. “We really care about the comprehensiveness of a search,” Brin said. “And comprehensiveness isn’t just about, you know, total number of words or bytes, or whatnot. But it’s about having the really high-quality information. You have thousands of years of human knowledge, and probably the highest-quality knowledge is captured in books. So not having that—it’s just too big an omission.” As Marissa Mayer put it, “Google has become known for providing access to all of the world’s knowledge, and if we provide access to books we are going to get much higher-quality and much more reliable information. We are moving up the food chain.”

In 2002, Google quietly made overtures to several libraries at major universities. The company proposed to digitize the entire collection free of charge, and give the library an electronic copy of each of its books. “Larry is an undergrad alum here at Michigan, and he knew we were already interested in digitizing the library as part of our preservation efforts,” John Wilkin, an associate university librarian at Michigan, told me. “There was a lot of back-and-forth between Google and us in the process. We wanted to insure that the materials wouldn’t be damaged and that what came out could be used as a preservation surrogate. They started experimenting with different ways of copying the images, and we started a pilot project in July, 2004. We’ve been getting better, going faster. We’re doubling our output all the time.” The Michigan library holds seven million volumes, and Wilkin believes that Google will have copied the entire collection in about six years.

Last month, at the New York Public Library, Google hosted a conference on the future of the publishing industry. About four hundred people—mainly publishing executives and agents—attended, most of them grimly aware of the simultaneous lethargy and panic that have characterized their industry’s response to the digital age. Nearly all attempts to sell books in an electronic format have been disappointing, and now Google appeared to be encroaching on the publishers’ domain. The implicit message of the conference was summed up by a quotation from Charles Darwin that was projected on a screen: “It is not the strongest of the species that survive, nor the most intelligent, but the ones most responsive to change.” As Laurence Kirschbaum, a longtime publishing executive who recently became a literary agent, told me at the conference, “Google is now the gatekeeper. They are reaching an audience that we as publishers and authors are not reaching. It makes perfect sense to use the specificity of a search engine as a tool for selling books.”

Google thought so, too, and designed the books project accordingly. In addition to forming partnerships with libraries, the company has signed contracts with nearly every major American publisher. When one of these publishers’ books is called up in response to search queries, Google displays a portion of the total work and shows links to the publisher’s Web site and online shops like Amazon, where users can buy the book. “We are helping the publishers reach consumers that otherwise might not have known about their books and helping them market their books by giving limited but relevant previews of the books,” Jim Gerber, Google’s director of content partnerships, told me. “The Internet and search are custom made for marketing books. When there are a hundred and seventy-five thousand new books each year, you can’t market each one of those books in mass market. When someone goes into a search engine to learn more about a topic, that is a perfect time to make them aware that a given book exists. Publishers know that ‘browse leads to buy.’ ” (Google says that it does not take a cut of sales made through its books site.)

Still, on October 19, 2005, several leading publishers, including Simon & Schuster, the Penguin Group, and McGraw Hill—all of which are partners in Google Book Search—filed a lawsuit against the company, seeking to stop the project. The publishers don’t object to Google’s plan for helping them sell new books, but they assert that the library component of the project is illegal. They claim that Google’s “massive, wholesale and systematic copying of entire books still protected by copyright” infringes on the publishers’ rights. They demand that Google stop further copying and “destroy all unauthorized copies made by Google through the Google Library Project of any copyrighted works.” (The Authors Guild filed its lawsuit around the same time.) The publishers, who have the support of the Association of American Publishers, are suffering from a version of the problem that John Kerry had in the last Presidential campaign: they are for Google Book Search at the same time that they are against it.

Copyright law dates to the birth of the Republic. Article I of the Constitution assigns Congress the right to pass laws “securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” The first copyright law was passed in 1790, and it has been frequently and confusingly amended over the years, most recently in the Sonny Bono Copyright Term Extension Act of 1998, which extended copyright terms by twenty years. (The law is also known as the Mickey Mouse Protection Act, because the Walt Disney Company, seeking to protect its copyright on early animated classics like “Steamboat Willie,” lobbied heavily for it.) The twisted history of copyright law has insured an awkward passage into the digital age.

The legal assertion at the core of Google’s business plan is its purported right to scan millions of copyrighted books without payment to or permission from the copyright owners. Approximately twenty per cent of all books are in the public domain; these include books that were never copyrighted, like government publications, and works whose copyrights have expired, like “Moby-Dick.” Google has simply copied such books and made them available on the Web. Roughly ten per cent of books are copyrighted and in print—that is, actively being sold by publishers. Many of these books are covered by Google’s arrangement with its publisher partners, which allows the company to scan and display parts of the works.

The vast majority of books belong to a third category: still protected by copyright, or of uncertain status, and out of print. These books are at the center of the conflict between Google and the publishers. Google is scanning these books in full but making only “snippets” (the company’s term) available on the Web. (Google searches turn up only the search term and about twenty words on either side of it.) Copyright law has never forbidden all “copying” of a protected work; scholars and journalists have long been allowed to quote portions of copyrighted material under the doctrine of fair use. Google maintains that the chunks of copyrighted material that it makes available on its books site are legal under fair use. “We really analogized book search to Web search, and we rely on fair use every day on Web search,” David C. Drummond, a senior vice-president at Google who is overseeing the response to the lawsuits, told me. “Web sites that we crawl are copyrighted. People expect their Web sites to be found, and Google searches find them. So, by scanning books, we give books the chance to be found, too.” (Google also has an “opt out” policy, which allows copyright holders to request that specific titles be omitted from the company’s database.)

However, according to the plaintiffs in the cases against Google, the act of copying the complete text amounts to an infringement, even if only portions are made available to users. “What they are doing, of course, is scanning literally millions of copyrighted books without permission,” Paul Aiken, the executive director of the Authors Guild, said. “Google is doing something that is likely to be very profitable for them, and they should pay for it. It’s not enough to say that it will help the sales of some books. If you make a movie of a book, that may spur sales, but that doesn’t mean you don’t license the books. Google should pay. We should be finding ways to increase the value of the stuff on the Internet, but Google is saying the value of the right to put books up there is zero.”

Google asserts that its use of the copyrighted books is “transformative,” that its database turns a book into essentially a new product. “A key part of the line between what’s fair use and what’s not is transformation,” Drummond said. “Yes, we’re making a copy when we digitize. But surely the ability to find something because a term appears in a book is not the same thing as reading the book. That’s why Google Books is a different product from the book itself.” In other words, Google says that being able to search books on its site—which it describes as the equivalent of a giant library card catalogue—is not the same as making the books themselves available. But the publishers cite another factor in fair-use analysis: the amount of the copyrighted work that is used in the creation of the new one. Google is copying entire books, which doesn’t sound “fair” to the plaintiff publishers and authors. “Traditional copyright analysis says that a transformation leads to the creation of a new and independent work, like a parody or a work of criticism,” Jane Ginsburg, a professor at Columbia Law School, said. “Copying the entire work, which is what Google is doing, does not preclude a finding of fair use, but it does fall outside the traditional paradigm.”

Harvard, Stanford, and Oxford have prohibited Google from scanning copyrighted works in their collections, limiting the company to books that are in the public domain. Because of the opacity of copyright law, and the extension of protections mandated by the 1998 act, it’s not always clear which works are still protected. (Copyright status can become murky when authors die or publishing houses go out of business.) Stanford has drawn a line at 1964 and prohibited Google from copying most works published since that date. “When Google got sued, we got nervous,” Michael A. Keller, the university librarian at Stanford, told me. “We’re not a public institution. We don’t have any state immunity from being sued ourselves, so we started sorting out the stuff that we know is public domain.” (Several of the public institutions that are Google’s partners, including the Universities of Michigan, California, Virginia, and Texas at Austin, are allowing the scanning of copyrighted material.)

The chief engineer of Google’s system for scanning books in the library collections is Dan Clancy, who joined the company after eight years at NASA, where he supervised teams of Ph.D.s working on problems related to artificial intelligence. Google provides its employees with free food twenty-four hours a day, and Clancy, a tall, shambling man with a shock of white-blond hair, conducted most of our conversations with bits of granola bar clinging to his shirt.

“Previously, when people have done scanning, they always were constrained by their budget and their scale,” Clancy told me. “They had to spend all this time figuring out which were the perfect ten thousand books, so they spent as much time in selection as in scanning. All the technology out there developed solutions for what I’ll call low-rate scanning. There was no need for a company to build a machine that could scan thirty million books. Doing this project just using commercial, off-the-shelf technology was not feasible. So we had to build it ourselves.”



Copyright © CondéNet 2006. All rights reserved.