Inside the Quest to Put the World's Libraries Online

The Digital Public Library of America wants to make millions of books, records, and images available to any American with an Internet connection. Can it succeed where others have failed?

In his short story "The Library of Babel," Jorge Luis Borges imagines the universe as a "total library," whose 410-page books have achieved all possible combinations of letters and punctuation. No two books are the same. Some, of course, are gibberish. But others carry the answer to life's deepest mysteries. In Borges's library can be found every thought ever had, every turn of phrase ever uttered, every masterpiece penned by Shakespeare, and even the ones that he never got to write—simply stated, everything.

Borges's fearsome fantasy builds upon a centuries-old conception of the library as an enclosed instantiation of the universe's mighty sprawl. In Advice on Establishing a Library, a classic manual on the creation of a library, the 17th-century French scholar Gabriel Naudé argued that a library "erected for the public benefit ought to be universal," observing that "there is nothing which renders a Library more recommendable, then when every man finds in it that which he is in search of, and could no where else encounter." This sort of accumulation has sometimes come hand-in-hand with power, as the historian Jacob Soll has shown with his study of Jean-Baptiste Colbert, the finance minister to the great French king Louis XIV who sought to establish a universal library and state archive because he believed it made a firm foundation for national intelligence.

From Colbert to Borges, and still onward from there: The fascination with completeness is as timeless as it is ingrained. In the last decade, the Internet has made the ambition of universality appear closer to realization than ever before: What is the Web, if not a vast collection, and an accessible one? But as with any new frontier, formidable challenges attend exciting possibilities—and nowhere has this been more apparent than in the efforts of the Digital Public Library of America, a coalition spearheading the largest effort yet to curate and make publicly available the "cultural and scientific heritage of humanity," with a focus on materials from the U.S., by harnessing the Internet's capabilities. The DPLA hopes to create a platform that will orchestrate millions of materials—books from public and university libraries, records from local historical societies, museums, and archives—into a single, user-friendly interface accessible to every American with Internet access. It will launch a prototype in April 2013. If successful, the resource has the potential to revolutionize the way information is organized and found online, to radically expand public access to knowledge, and to represent a sharp counterpoint to the model already offered by search-giant Google, whose "Google Books" program is now eight years old.

THE RALLYING CALL FOR THE DPLA was circulated in the fall of 2010. Summoned by Robert Darnton, the great book historian and current director of Harvard's library system, about 40 people came together in October at the Radcliffe Institute for Advanced Study. There had been some concern that the attendees—a diverse group hailing from different parts of the library science universe—might have trouble fusing their agenda. But after a half hour, the effort found solid ground. "We were able to come up with a single sentence: 'It's a worthy effort, and we are willing to work together toward it,' " recalls DPLA chair John Palfrey, who is also director of Harvard's Berkman Center for Internet and Society, and a former professor at Harvard Law School. The spirit of unanimity had legs: A steering committee quickly formed, and the Alfred P. Sloan Foundation, a non-profit organization that supports a variety of digital information projects, offered to fund a planning process. The attendees conceived of

an open, distributed network of comprehensive online resources that would draw on the nation's living heritage from libraries, universities, archives, and museums in order to educate, inform and empower everyone in the current and future generations.

While ambitious, the project was not unprecedented. The creation of a large-scale digital library catering to public access has been attempted for decades, by a cast of characters worth noting. Aside from Google, there's the Internet Archive, a non-profit digital library based in San Francisco that sees itself as a bulwark against a modern-day version of the loss of the Library of Alexandria. Brewster Kahle, who founded the Internet Archive in 1996 and is now on the DPLA steering committee, aims to supplement this digital reserve with a physical copy of every book in existence, collected and stored in a mammoth warehouse in California; he currently has about 500,000 volumes and hopes to reach 10 million one day. His efforts are complemented by the HathiTrust ("Hathi" is the Hindi word for "elephant," an animal that, as the saying goes, never forgets), a digital preservation repository founded in 2008 that has digitized over 10 million volumes contributed by participating research institutions and libraries. The 3 billion-plus pages amount to over 8,000 tons (but weigh close to nothing online, of course). Meanwhile, national institutions like the Library of Congress have been digitizing their in-house materials for years. The DPLA is not the first player to step onto the field.

But that doesn't make it any less of a milestone. Consider these facts: The Library of Congress, the largest library in the world, added 480,000 books to its collections in the last fiscal year alone, and now boasts more than 34 million books and other print materials. Add other items like maps and manuscripts, and the collection towers at 150 million items. And then there's information stored in digital forms (e-mails, websites, even President Obama's Twitter feed), which compounds things astronomically. Speaking in digital terms, the world produced more data in 2009 than in the entire history of mankind through 2008, according to the former chief scientist at Amazon.com. In one way, this explosion and the digital platforms that support it have been a boon for librarians and archivists, who specialize in collecting information and making it available to users. But in others, it has been a scourge, rendering the goal of staying abreast of the world's intellectual output (not to mention the hardware and software needed to store and display it) more quixotic than ever. Simply to reap the accessibility benefits that the Internet so tantalizingly affords, the centuries' worth of items currently extant only in cloth and paper need to be imaged into bits and bytes—a monumental, manpower-intensive, and prohibitively expensive task. And that is to say nothing of figuring out how to cull and catalog the terabytes of information that have spent their whole life in digital format. All of which goes to show that the problem of networking the nation's "living heritage" online has barely begun to be addressed. The problem is one of time, money, and most of all, scale—massive scale.

The DPLA is the most ambitious entrant on the digital library scene precisely because it claims to recognize this need for scale, and to be marshaling its resources and preparing its infrastructure accordingly. With hundreds of librarians, technologists, and academics attending its meetings (and over a thousand people on its email listserv), the DPLA has performed the singular feat of convening into one room the best minds in digital and library sciences. It has endorsement: The Smithsonian Institution, National Archives, Library of Congress, and Council on Library and Information Resources are just some of the big names on board. It has funding: The Sloan Foundation put up hundreds of thousands of dollars in support. It has pedigree: The decorated historian Darnton has the pages of major publications at his disposal; Palfrey is widely known for his scholarship on intellectual property and the Internet; the staging of the first meeting on Harvard's hallowed campus is not insignificant. Ideally, the consolidation of resources—specialized expertise, raw manpower, institutional backing and funding—means that the DPLA can expand its clout within the community, attract better financial support, and direct large-scale digitization projects to move toward a national resource of unparalleled scope and functionality. "We believe that no one entity—not the Library of Congress, not Harvard, not the local public library—could create this system on its own," Palfrey says. "We believe strongly that by working together, we will build something greater."

The DPLA hopes to calibrate its network specifically for growth, and it will provide the armature to ensure that future expansions and assimilations can occur in a coordinated and standardized fashion. This framing caters to a void that is not present in countries—like France and Sweden—with clearly designated national libraries, readymade centers that can extend into digital analogs and serve as overseers for online expansion. In the United States, the Library of Congress does not have the same mandate, and the lack of a center of gravity at the national level has therefore led to fragmented and disorienting results for a library community already known for fierce competition among its silos ("Cooperation is an unnatural act," remarks David Ferriero, the Archivist of the United States, recalling an adage from his tenure at other libraries). "If you think about where we are today in the digital library space, there are a whole lot of efforts that are not pulling in the same direction," says Palfrey, noting the obscurity of digitizing projects that have developed under discrete directives. "I defy you to find them. They tend to be in proprietary repositories that are very hard to find."

Consider how you usually come across digitized materials on the web: Haphazardly, likely with the help of a search engine, without much of a sense for what repositories have made a particular text or image available in the first place, or how best to find similar materials in the future. The DPLA—by taking up the mantle of "national library," of a command center for the country's published heritage—would put its users high above street level, offering easier and more systematic navigation. The DPLA does not plan to supplant or swallow up the institutions already contributing to the digital cause, Palfrey emphasizes. These groups have been building their own digital collections on a scale that befits their resources, leaving the DPLA to hone its responsibility into one of supporting, managing, and organizing—rather than of generating all the raw material. "We want to build infrastructure that will support public and academic libraries," Palfrey says. "We're not building the end-all, be-all digital library." The DPLA's humility in this area may be its very ingenuity: It is happily indebted to the rich, if perhaps messy, constellation of digitized materials that already exist.

But connecting the stars is a job all its own. "In the '60s, our challenge was, 'How do we build roads?' I think, in this day and age, it's the knowledge infrastructure. That is our big challenge," says Carl Malamud, DPLA member and president of public.resource.org, a non-profit group that shares public domain works. "The DPLA is a task force that figures out how to do that, as opposed to building yet another library."

NEARLY TWO YEARS HAVE PASSED since Darnton's call for a first meeting at Harvard, but it remains unclear what the DPLA—as a website, a thing with practical function—will look like when the prototype launches in the spring. "What is the it?" as Palfrey says. But while its form remains elusive, the DPLA has developed a good sense of what it won't be.

Nearly everyone on the steering committee shares a critical opinion of Google Books. On the one hand, Darnton concedes that the DPLA likely would not exist had there been no Google Books, an enterprise of "tremendous imagination and technical virtuosity," he says, that was the first to actualize—if not with perfect success—the idea of a comprehensive digital library. At the same time, the DPLA has derived much of its momentum from how it can rethink those areas in which it perceives that Google has failed. According to those present at the preliminary meeting, the question that drove the discussion was: "Could we do better than Google?" As Palfrey puts it: "The idea of having any single company control such an extraordinary public resource strikes me as a bad idea. Could we do better if we were to have a massive effort toward creating a digital public library that was not driven by a single corporation, but driven by a broad coalition of people?" Palfrey is careful to note that the DPLA is not a replacement for, or adversary to, Google Books (in fact, the DPLA hopes to draw upon Google's digital reserve in some sort of collaborative arrangement). But there's no denying the company's pervasive specter: If Google is a dubious father figure, the DPLA is a son attempting to get out from under the shadow of its forebear.

Charles Nesson, a Harvard Law School professor and intellectual property expert, recalls his shock upon hearing, in 2004, that Harvard had made a deal with Google to allow the Internet giant access to the university's library collections. The idea was for the company to digitize 40,000 of Harvard's materials on site, free of charge, as a pilot program aimed at erecting an online catalog. "Google Print," as it was then known, had made its debut at the Frankfurt Book Fair in October of that year, and by December had announced agreements to begin scanning the holdings of some of the world's greatest libraries—among them, Stanford, Michigan, Oxford, and New York Public Library. Users of Google's resource would be able to view and download full PDF versions of public-domain books, and see "snippets" from works under copyright as a searchable index.

At a Christmas party around the time of the announcement, Nesson raised the issue, which sounded to him like "no deal at all": as he saw it, Harvard had given up access to its world-class collection in return for digital copies whose use would be under constraint. "Don't ask me," said one Harvard administrator, according to Nesson's recollection, pointing him instead to then-university president Lawrence Summers, also at the party: "Ask Larry. He's right over there." But Summers had little to say. "He had signed away the family jewels," Nesson remembers, "and he wasn't on top of the terms on which he'd done it."

Nesson's foreboding would turn out to be warranted, but at Christmastime 2004 he was swimming very much against the tide, as the quick assent of the world's top universities suggested. And why not? Google's pitch—to use technology to make information available on a hitherto unprecedented scale—was promising, and its capacities were prodigious. When librarians at the University of Michigan said that the institution's seven million books would take over a millennium to digitize, Google said it could do it in six years. Librarians at Harvard said they believed making books searchable online would help students to draw information from published sources. Publishers were told that online access offered marketing opportunities. The project barreled ahead, and Google has digitized more than 20 million books to date.

But in time, Nesson's concerns would prove prescient. Some of the less savory elements of the Google Books Library Project passed quietly: Google employees were often negligent, filing Whitman's Leaves of Grass under "Gardening," for example, and what was supposed to have been a "free" program cost Harvard nearly $2 million (the university had to process 850,000 books to be digitized by Google). The more public bungle began in 2005, when the Authors Guild and the Association of American Publishers sued Google for violation of their copyrights. This marked a critical juncture: Google could have made a case for fair use to help expand the public's access to the literature. Instead, the company entered a period of intense, secret negotiations with the plaintiffs.

In October 2008, the groups emerged with a proposed settlement that effectively mutated Google's original vision of a digital archive into its ugly sister: a library and bookstore business. Under the terms of the settlement, users would be able to buy individual e-books, and libraries would be able to purchase a subscription for access to Google's entire catalog of books (books that these libraries, in some cases, had provided and processed themselves). The cryptic-sounding "Book Rights Registry"—a body composed of representatives from the Authors Guild, AAP, and Google—would determine the prices. Thirty-seven percent of the profits would go to Google, and the rest to the authors and publishers. Not surprisingly, the proposed settlement triggered widespread accusations of the commercialization and monopolization of knowledge, and it even prompted an investigation from the Department of Justice about the possible violation of the Sherman Antitrust Act. Harvard backed out of its partnership with Google (many of the other libraries continue to work with the company).

In November 2009, the three groups filed an amended settlement and awaited a decision, which would not come for another few years. Thus the first gathering in October 2010 of what would become the DPLA came at a time of taut energy. With the settlement on the table, along with the corresponding possibility that Google might be closing the door on public access and standing to profit from books that universities had made available, the matter seemed urgent. "I think the main point is that Google turned into a commercial digital library," Darnton says, "one without any constraints on its pricing policy." How had a private company come so close to controlling the fates of millions of books—and of possibly convincing users to agree to this arrangement? Who was defending the public interest?

Fortunately for the DPLA, the timing could not have been better. In March 2011, Judge Denny Chin rejected the amended settlement, arguing that it would give the company "a significant advantage over competitors, rewarding it for engaging in wholesale copying of copyrighted works without permission." The lawsuit continues to this day. But most pundits are pessimistic about the future of Google's legal travails: "The settlement we all grew to know and love, all that high drama, is over. It's not coming back in anything like its old form," says James Grimmelmann, an associate professor at New York Law School, who has been following the case closely. Meanwhile, Google has continued its work: "We continue to scan books with our library and publisher partners around the world," says a company spokesman. "In general we are supportive of efforts to make more books discoverable online, including those of digital libraries."

Google and the DPLA present a study in opposites. The former is a company with an interest in profits; the latter, a non-commercial project that claims as its highest incentive the promotion of a healthy ecology of information production and dispersion. Google is often opaque to the point of unaccountability (many DPLA members whose institutions once worked with the company recall signing nondisclosure agreements); the DPLA is open to the point of exhibitionism ("Everything, everything!" Palfrey exclaims when asked what could be found online about the DPLA, which posts meeting notes and progress reports on its wiki). Google works like a centralized brain (the company declined to comment even on the number of employees involved in the books project); the DPLA prides itself on the diversity of its membership and encourages innovation around the edges (last fall, it held an open competition called the "Beta Sprint," inviting entries on anything from prototype ideas to technical tools that could be used by the DPLA). The DPLA wants to promote public, not privileged, access; open, not privatized, knowledge; many beneficiaries, not just a select few. The DPLA may not have an "it" yet, but that's because it's considering every possible "it" there is. In a way, the "it" has no choice but to be anything but. The term, after all, lacks the flavor of pluralism.

For all their differences, Google and the DPLA do share a major hurdle: Copyright law, which prevents the digitization of orphan works, numbering around 5 million and constituting about 50 to 70 percent of books published after 1923. Orphans are works whose rights holders are not known; they may be dead or unaware of their entitlement. Google's settlement would have given the company license to appropriate orphan works for posterity—a move that would have opened up a trove of previously unavailable works, at the expense of granting Google unprecedented control through litigation. The DPLA faces a similar problem: As some members pointed out in a gathering last year, out-of-print and orphan works—content in the "yellow zone" of copyright—outnumber both public domain and in-copyright works, "making legal reforms necessary for the success of a DPLA," according to meeting notes. Jason Schultz, an assistant professor at UC Berkeley School of Law and a DPLA member focusing on legal issues, says that the coalition wants to strike the right balance between the rights of copyright owners to be properly compensated and the rights of public access. The DPLA will not violate copyright, and it will begin with a foundation of public-domain works. The organization is trying to figure out the best case for fair use of out-of-print or unpublished works to argue that public access to this literature benefits society and serves a "higher" purpose.

ONE MEMBER OF THE DPLA is concerned that the project may move in the very direction it professes to spurn, in spite of—or perhaps, because of—its best intentions. The vision of a platform that supports various distributed networks and marshals them into navigable order is promising. But Kahle, the Internet Archive founder, worries that the model could eat its own tail, betraying the very principles of a distributive model by centralizing too much, making itself monarch and filtering access to the universe of knowledge through a single pair of hands. For Kahle, Darnton's vision of the DPLA, outlined last year in the New York Review of Books, looked "very Google library-like." The DPLA, Darnton wrote, would "contain nearly everything available in the walled-in repositories of human culture"; the library would be "the greatest that ever existed." These sorts of pronouncements—invoking the notion of a universal library in its traditional, unitary form—are unsettling to Kahle, who worries that the spirit of grandeur could be a primrose path to a closed, as opposed to an open, system. The undertone is dark. "One library to rule them all," quips Kahle. "Our opportunity is to build an open system ... The idea is not to build a single library, but to get the library system to go digital."

A decade ago, libraries—particularly those at universities—were willing to accept the restrictions imposed by Google, so firm was their belief that they needed the company's help to go digital, according to Kahle. But he sees the progress of groups like the Internet Archive as proof that "we can actually do this ourselves." So what new purpose will the DPLA serve? There is a fine line between supporting and competing with the digitization efforts of member institutions—including small public libraries, whose own collections are modest in comparison to what is being made available online. (Amy Ryan, president of the Boston Public Library system, is confident that the DPLA will enlarge, not constrict, the public library's capabilities. "This adds to our usefulness," she says of the DPLA. "When leveraged with partnership organizations and new technology, the power [of the library] is really amplified.")

Public libraries constitute just one of many groups that the DPLA must take into consideration as it moves toward a unified model. The coalition boasts a motley cast of characters tethered to external institutions and projects, making them indispensable for their specialized expertise, but licensed to strong, and often clashing, points of view. The conversations are laborious, with an emphasis on exhaustiveness, not efficiency ("Your brain is tired," Ryan says of a meeting's aftermath). This is by design part of the DPLA's distributive, as opposed to top-down, culture: there is "no king or queen or president," as Palfrey says, to make the last call. Palfrey, a soft-spoken man who recently became president of the Phillips Academy in Andover, is regarded as an ideal leader for the DPLA because he would never think of himself as such. He is "the antithesis of somebody who wants to be the Daddy Mac," says Maura Marx, a fellow at the Berkman Center and executive director of the Open Knowledge Commons.

This determination to avoid prescription allows openness, but it has costs for efficiency. "We are going forward in a way that is not especially coordinated," Palfrey admits. "Some people's vision may be incompatible, and that's the job of the process we're in: to come up with the best idea that has the strongest consensus behind it." The indeterminacy of the coalition's identity has manifested in very basic questions about its web architecture. Will the DPLA function as an independent and self-sustaining site, or will it be a "federation of other content repositories," a central platform that links outward? Palfrey envisions a resource that will strike a careful balance between the two models: Ideally, users will be able to scan and share materials on their own through the DPLA's platform, while they also benefit from the digitized content of others. Content production, dissemination, and consumption—all of it would take place on the same hub, one that manages to be user-friendly without compromising the varied utility, the multiplicity, of its functions and users.

This vision is not so new, really. Palfrey says that the model he likes best is that of the Internet itself, a global network of networks so complexly generative, yet somehow functional, that one easily forgets that it has no centralized governing body. Nobody owns the Internet. Instead, it's loosely overseen by three organizations, and several DPLA members point to one in particular—the Internet Engineering Task Force, which develops technical standards for the Internet—as a useful model of governance. Anyone is welcome to join the IETF's unmoderated mailing list, or to attend its "big tent"-style meetings—"gatherings of the tribes," as they're called—where the dress is informal and decisions are made by rough consensus.

The digital reserve of knowledge belongs to nobody and everybody at the same time. This means that it's both too precious to entrust to a single entity, but also too precious to fall nowhere at all. It begs to be left alone and free, but somehow wisely monitored as well. This is the paradoxical task that the DPLA, a universal library for a new age, seeks to address. Knowledge should belong to everyone, but someone has to tailor it for use—and take care to do no more than that. "I don't think we have to have—and we shouldn't have—a completely centralized system. But nor do we have to accept it as fragmented as it is today," says Palfrey. "Think about it [the DPLA] as inter-operable, rather than totally standardized. Think about it as distributed, but also a place where people can come to get resources."

FOR ALL OF THE STERILE SPARKLE that we expect of our technological novelties, digital libraries are built by living, breathing human beings. Tucked somewhere deep in the bowels of the National Archives, Bing, Earl, Naomi, and Peggy—volunteers, all—sit in a brightly lit room. Here, among the warehouse-like vaults, footsteps ring in the dark hallways, where unusually low ceilings attest architecturally to the Archives' chronic space shortage. The Archives, which is responsible for the preservation of government records, holds some 10 billion pieces of paper. It is from the material in rooms like these, far from Cambridge, that the DPLA, if it succeeds, will be pieced together.

In one storeroom, the brown boxes lining the shelves hold 1.28 million files of pension cases of Civil War widows (to prove her relationship to the deceased, one lady sent the government a mole skin, all four polydactyl hands intact, gifted by her beloved). Since 2007, the Archives has been working with two websites that specialize in historical document research, FamilySearch and Fold3, to process and digitize the pension cases. Before any scanning can happen, the "Civil War Conservation Corps," as the volunteers are called, must process and assess the documents, which have crisped brown with age, for conservation concerns, creases, irregularities—anything that might slow the imaging process.

Everything that passes inspection is circulated to the digitization lab, where the camera operators are stationed. Now work moves at a brisker clip. The overhead lights are off, but the soft glow from each camera station collects in a dreamy haze. Rodney, an affable camera operator, sits at a black desk ensconced in a wall of dark fabric that concentrates the light streaming down from lamps positioned high above. A document lies flat on his desk, underneath a camera. With a few snaps and clicks on a set of brightly colored buttons, a precise digital scan pops up on the computer screen before him. "I could do almost two boxes a day—22,400 images," Rodney says. "And I'm a little slow." Nearby, a married couple sits side by side, operating separate stations. "When they put us on one camera, it wasn't very romantic," the woman jokes.

Ferriero, the Archivist of the United States and active member of the DPLA team, wants "every stinking thing" in the National Archives digitized, but the agency has "just a toe in the water" (over 74 million pieces of paper, or less than 1 percent of total holdings, have been digitized). "Everything we do is piece by piece by piece, page by page by page, image by image by image—and that is a huge task," says Brenda Kepley, chief of the processing section at the Archives. To make substantial progress, the Archives has had to forge digitizing partnerships with universities and commercial companies—but the concern that the agency is years behind persists ("You mean, it's not all online?" people asked Archives staff in the mid-1990s, when the Internet was still in its infancy). Ferriero hopes that the DPLA will expedite digitizing at the Archives and draw greater attention to its untapped resources.

Ferriero's sense of urgency, juxtaposed with the unavoidably slow pace of digitizing, underscores the sheer magnitude of the task that lies before the DPLA. There is work to be done, at the ground-level, by real people in real buildings in real cities that exist offline. The DPLA's great challenge—and, perhaps, its eventual success—lies in discovering a way to support this activity from afar without smothering it. Its universalism will not be uniform but ecumenical, not a timeless gift of posterity but a timebound work in progress. The universal has become less of an absolute, and more of a penumbra in constant flux, a shapeless thing of infinite possibility—not infinite being. If it succeeds, the DPLA will give us not an immobile collection of knowledge, but a universe that is always expanding.

Esther Yi is a journalist living in Berlin.