Internet Archive founder Brewster Kahle and some of the Archive's servers in 2006. (AP Photo/Ben Margot)

To most of the web surfing public, the Internet Archive’s Wayback Machine is the face of the Archive’s web archiving activities. Via a simple interface, anyone can type in a URL and see how it has changed over the last 20 years. Yet, behind that simple search box lies an exquisitely complex assemblage of datasets and partners that make possible the Archive’s vast repository of the web. How does the Archive really work, what does its crawl workflow look like, how does it handle issues like robots.txt, and what can all of this teach us about the future of web archiving?
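
For readers who want to explore these captures programmatically rather than through the search box, the Wayback Machine also exposes a public “availability” API. The short Python sketch below is my own illustration rather than the Archive’s tooling; it asks for the capture of a URL closest to a given date:

```python
# Query the Wayback Machine's public availability API for the capture of a
# URL closest to a given timestamp (YYYYMMDD). Illustrative sketch only.
import json
import urllib.parse
import urllib.request

def closest_snapshot(url, timestamp="20060101"):
    """Return the closest archived snapshot of `url` to `timestamp`, if any."""
    params = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    query = f"https://archive.org/wayback/available?{params}"
    with urllib.request.urlopen(query) as response:
        data = json.loads(response.read().decode("utf-8"))
    # "closest" is present only if at least one capture of the URL exists.
    return data.get("archived_snapshots", {}).get("closest", {})

if __name__ == "__main__":
    snap = closest_snapshot("example.com")
    print(snap.get("timestamp"), snap.get("url"))
```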

Perhaps the first and most important detail to understand about the Internet Archive’s web crawling activities is that it operates far more like a traditional library archive than a modern commercial search engine. Most large web crawling operations today run vast farms of standardized crawlers working in unison under a common set of rules and behaviors. They traditionally operate in continuous crawling mode, in which the goal is to scour the web 24/7/365 and attempt to identify and ingest every available URL.

In contrast, the Internet Archive is composed of a myriad of independent datasets, feeds and crawls, each of which has very different characteristics and rules governing its construction, with some run by the Archive and others by its many partners and contributors. In place of a single standardized continuous crawl with stable criteria and algorithms, there is a vibrant collage of inputs that all feed into the Archive’s sum holdings. As Mark Graham, Director of the Wayback Machine, put it in an email, the Internet Archive’s web materials consist of “many different collections driven by many organizations that have different approaches to crawling.” At the time of this writing, the primary web holdings of the Archive total more than 4.1 million items across 7,357 distinct collections, while its Archive-It program has over 440 partner organizations overseeing specific targeted collections. Contributors range from middle school students in Battle Ground, WA to the National Library of France.
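
For a rough sense of where such holdings figures come from, item counts can be pulled from the Archive’s public advancedsearch API. The sketch below is purely illustrative and assumes that a query like “mediatype:web” selects web-crawl items; the exact query behind the figures above is not documented:

```python
# Count archive.org items matching a search query via the public
# advancedsearch API. Assumes a Solr-style "response.numFound" field in the
# JSON output; the query string itself is an assumption for illustration.
import json
import urllib.parse
import urllib.request

def count_items(query):
    """Return the number of archive.org items matching `query`."""
    params = urllib.parse.urlencode({"q": query, "rows": 0, "output": "json"})
    url = f"https://archive.org/advancedsearch.php?{params}"
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode("utf-8"))
    return data["response"]["numFound"]

if __name__ == "__main__":
    print(count_items("mediatype:web"))  # hypothetical stand-in for "web items"
```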

Those 4.1 million items comprise a treasure trove covering nearly every imaginable topic and data type. There are crawls contributed by the Sloan Foundation and Alexa, crawls run by IA on behalf of NARA and the Internet Memory Foundation, mirrors of Common Crawl and even DNS inventories containing more than 2.5 billion records from 2013. Many specialty archives preserve the final snapshots of now-defunct online communities like GeoCities and Wretch. Dedicated Archive-It crawls preserve a myriad of hand-selected or sponsored websites on an ongoing basis, such as the Wake Forest University Archives. These dedicated Archive-It crawls can be accessed directly and in some cases appear to feed into the Wayback Machine, which explains why the Wake Forest site has been captured like clockwork almost every Thursday and Friday for the last two years.

Alexa Internet has been a major source of the Archive’s regular crawl data since 1996, with the Archive’s FAQ page stating “much of our archived web data comes from our own crawls or from Alexa Internet's crawls … Internet Archive's crawls tend to find sites that are well linked from other sites … Alexa Internet uses its own methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar and visit the site you want crawled to make sure they know about it.”

Another prominent source is the Archive’s “Worldwide Web Crawls,” which are described as follows: “Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites. Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as ‘Seed Lists’ … various rules are also applied to the logic of each crawl. Those rules define things like the depth the crawler will try to reach for each host (website) it finds.” With respect to how frequently the Archive crawls each site, the only available insight is that “For the most part a given host will only be captured once per Worldwide Web Crawl, however it might be captured more frequently (e.g. once per hour for various news sites) via other crawls.”
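
To make the seed-list-plus-rules model concrete, the simplified sketch below shows how a seed list and a per-host depth cap shape what a crawl captures. It is not Heritrix or the Archive’s production crawler, merely an illustration of the logic the documentation describes:

```python
# A toy breadth-first crawler driven by a seed list, with a per-host depth
# limit standing in for the "rules applied to the logic of each crawl."
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.request

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_list, max_depth=2):
    queue = deque((url, 0) for url in seed_list)
    seen = set(seed_list)
    while queue:
        url, depth = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # a production crawler would log, retry and rate-limit
        print(f"captured depth={depth} {url}")
        if depth >= max_depth:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # Stay on the same host, mirroring a per-host depth rule.
            if urlparse(absolute).netloc == urlparse(url).netloc and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))

if __name__ == "__main__":
    crawl(["https://example.com/"], max_depth=1)
```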

The most recent crawl appears to be Wide Crawl Number 13, created on January 9, 2015 and running through the present. Few details are available regarding these crawls, though the documentation for the March 2011 crawl (Wide 2) states that it ran from March 9, 2011 to December 23, 2011, capturing 2.7 billion snapshots of 2.3 billion unique URLs from a total of 29 million unique websites. The documentation notes that it used the Alexa Top 1 Million ranking as its seed list and excluded sites with robots.txt directives. As a warning for researchers, the collection notes, “We also included repeated crawls of some Argentinian government sites, so looking at results by country will be somewhat skewed.”

Augmenting these efforts, the Archive’s No More 404 program provides live feeds from the GDELT Project, Wikipedia and WordPress. The GDELT Project provides a daily list of all URLs of online news coverage it monitors around the world, which the Archive then crawls and archives, vastly expanding the Archive’s reach into the non-Western world. The Wikipedia feed monitors the “[W]ikipedia IRC channel for updated article[s], extracts newly added citations, and feed[s] those URLs for crawling,” while the WordPress feed scans “WordPress's official blog update stream, and schedules each permalink URL of new post for crawling.” These greatly expand the Archive’s holdings of news and other material relating to current events.
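
Conceptually, each of these feeds boils down to taking a stream of freshly published URLs and scheduling every one of them for capture. The hedged sketch below illustrates that idea using the Wayback Machine’s public “Save Page Now” endpoint; the Archive’s internal feed ingestion pipeline is surely more sophisticated, and the example URL is hypothetical:

```python
# Submit a list of newly published URLs to the Wayback Machine's public
# "Save Page Now" endpoint. Illustration of the feed-driven idea only.
import time
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def archive_urls(urls, delay_seconds=5):
    """Request a fresh capture of each URL, pausing politely between requests."""
    for url in urls:
        try:
            with urllib.request.urlopen(SAVE_ENDPOINT + url, timeout=60) as response:
                print(f"{response.status} archived {url}")
        except Exception as error:
            print(f"failed {url}: {error}")
        time.sleep(delay_seconds)

if __name__ == "__main__":
    archive_urls(["https://example.com/2016/01/some-article"])  # hypothetical feed entry
```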

Some crawls are designed to make a single one-time capture to ensure that at least one copy of everything on a given site is preserved, while others are designed to intensively recrawl a small subset of hand-selected sites on a regular interval to ensure both that new content is found and that all previously identified content is checked for changes and freshly archived. In terms of how frequently the Archive recrawls a given site, Mr. Graham wrote that “it is a function of the hows, whats and whys of our crawls. The Internet Archive does not crawl all sites equally nor is our crawl frequency strictly a function of how popular a site is.” He goes on to caution, “I would expect any researcher would be remiss to not take the fluid nature of the web, and the crawls of the [Internet Archive], into consideration” with respect to interpreting the highly variable nature of the Archive’s recrawl rate.

Though it acts as the general public’s primary gateway to the Archive’s web materials, the Wayback Machine is merely a public interface to a limited subset of all these holdings. Only a portion of what the Archive crawls or receives from external organizations and partners is made available in the Wayback Machine, though as Mr. Graham noted there is at present “no master flowchart of the source of captures that are available via the Wayback Machine,” so it is difficult to know what percentage of the holdings above can be found through the Wayback Machine’s public interface. Moreover, large portions of the Archive’s holdings carry notices that access to them is restricted, often due to embargoes, license agreements, or other processes and policies of the Archive.

In this way, the Archive is essentially a massive global collage of crawls and datasets, some conducted by the Archive itself, others contributed by partners. Some focus on the open web, some focus on the foundations of the web’s infrastructure, and others focus on very narrow slices of the web as defined by contributing sponsors or Archive staff. Some are obtained through donations, some through targeted acquisitions, and others are compiled by the Archive itself, much in the way a traditional paper archive operates. Indeed, the Archive is even more similar to traditional archives in its use of a dark archive in which only a portion of its holdings are publicly accessible, with the rest carrying various access restrictions and documentation ranging from detailed descriptions to simple item placeholders.

This is in marked contrast to the picture often painted of the Archive by outsiders: a traditional centralized continuous crawl infrastructure, with a large farm of standardized crawlers ingesting the open web and feeding the Wayback Machine, akin to what a commercial search engine might do. The Archive has essentially taken the traditional model of a library archive and brought it into the digital era, rather than taking the model of a search engine and adding a preservation component to it.

There are likely many reasons for this architectural decision. It is certainly not the difficulty of building such systems: there are numerous open source infrastructures and technologies that make it highly tractable to build continuous web-scale crawlers given the amount of hardware available to the Archive. Indeed, I myself have been building global web-scale crawling systems since 1995 and, while still a senior in high school in 2000, launched a whole-of-web continuous crawling system with sideband recrawlers and an array of realtime content analysis and web mining algorithms running at the NSF-supported supercomputing center NCSA.

Why then has the Archive employed such a patchwork approach to web archiving, rather than the established centralized and standardized model of its commercial peers? Part of the answer may go back to the Archive’s roots. When the Internet Archive was first formed, Alexa Internet was the primary source of its collections, donating its daily open crawl data. The Archive therefore had little need to run its own whole-of-web crawls, since it had a large commercial partner providing it such a feed. It could instead focus on supplementing that general feed with specialized crawls focusing on particular verticals and partner with other crawling organizations to mirror their archives.

From the chronology of datasets that make up its web holdings, the Archive appears to have evolved in this way as a central repository and custodian of web data, taking on the role of archivist and curator, rather than trying to build its own centralized continuous crawl of the entire web. Over time it appears to have taken on an ever-expanding collection role of its own, running its own general purpose web-scale crawls and bolstering them with a rapidly growing assortment of specialized crawls.

With all of this data pouring in from across the world, a key question is how the Internet Archive deals with exclusions, especially the ubiquitous “robots.txt” crawler exclusion protocol.

The Internet Archive’s Archive-It program appears to strictly enforce robots.txt files, requiring special permission for a given crawl to ignore them: “By default, the Archive-It crawler honors and respects all robots.txt exclusion requests. On a case by case basis institutions can set up rules to ignore robots.txt blocks for specific sites, but this is not available in Archive-It accounts by default. If you think you may need to ignore robots.txt for a site, please contact the Archive-It team for more information or to enable this feature for your account.”
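
That default-honor-with-override behavior can be illustrated with Python’s standard robots.txt parser. This is a minimal sketch of the policy rather than Archive-It’s actual code, and the user agent string is only an example:

```python
# Honor robots.txt by default, with an explicit per-crawl override flag
# mirroring the case-by-case exception Archive-It describes.
from urllib import robotparser
from urllib.parse import urlparse

def may_fetch(url, user_agent="example-archival-bot", ignore_robots=False):
    """Return True if the crawler is permitted to fetch `url`."""
    if ignore_robots:
        return True  # per-crawl permission to bypass robots.txt
    parsed = urlparse(url)
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        parser.read()
    except Exception:
        return True  # an unreadable robots.txt is commonly treated as "allow"
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(may_fetch("https://example.com/page.html"))
```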

In contrast, the Library of Congress uses a strict opt-in process and “notifies each site that we would like to include in the archive (with the exception of government websites), prior to archiving. In some cases, the e-mail asks permission to archive or to provide off-site access to researchers.” The Library uses the Internet Archive to perform its crawling and ignores robots.txt for those crawls: “The Library of Congress has contracted with the Internet Archive to collect content from websites at regular intervals … the Internet Archive uses the Heritrix crawler to collect websites on behalf of the Library of Congress. Our crawler is instructed to bypass robots.txt in order to obtain the most complete and accurate representation of websites such as yours.” In this case, the Library views the written archival permission as taking precedence over robots.txt directives: “The Library notifies site owners before crawling which means we generally ignore robots.txt exclusions.”

The British Library appears to ignore robots.txt in order to preserve page rendering elements and for selected content deemed culturally important, stating “Do you respect robots.txt? As a rule, yes: we do follow the robots exclusion protocol. However, in certain circumstances we may choose to overrule robots.txt. For instance: if content is necessary to render a page (e.g. Javascript, CSS) or content is deemed of curatorial value and falls within the bounds of the Legal Deposit Libraries Act 2003.”

Similarly, the National Library of France states “In accordance with the Heritage Code (art L132-2-1), the BnF is authorized to disregard the robot exclusion protocol, also called robots.txt. … To accomplish its legal deposit mission, the BnF can choose to collect some of the files covered by robots.txt when they are needed to reconstruct the original form of the website (particularly in the case of image or style sheet files). This non-compliance with robots.txt does not conflict with the protection of private correspondence guaranteed by law, because all data made available on the Internet are considered to be public, whether they are or are not filtered by robots.txt.”

The Internet Archive’s general approach to handling robots.txt exclusions on the open web appears to have evolved over time. The first available snapshot of the Archive’s FAQ, dating to October 4, 2002, states “The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.” This statement was preserved without modification for the next decade, through at least April 2nd, 2013. A few weeks later, on April 20th, 2013, the text had been rewritten to state “You can exclude your site from display in the Wayback Machine by placing a simple robots.txt file on your Web server.” The new language removed the statement “you can exclude your site from being crawled” and replaced it with “you can exclude your site from display,” and this new language has carried through to the present.

From its very first snapshot of October 4, 2002 through sometime the week of November 8th, 2015, the FAQ further stated “Alexa Internet, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If a web site owner decides he / she prefers not to have a web crawler visiting his / her files and sets up robots.txt on the site, the Alexa crawlers will stop visiting those files and will make unavailable all files previously gathered from that site. This means that sometimes, while using the Internet Archive Wayback Machine, you may find a site that is unavailable due to robots.txt.”

Yet, just a few days later, on November 14th, 2015, the FAQ had been revised to state only “Such sites may have been excluded from the Wayback Machine due to a robots.txt file on the site or at a site owner’s direct request. The Internet Archive strives to follow the Oakland Archive Policy for Managing Removal Requests And Preserving Archival Integrity.” The current FAQ points to an archived copy of the Oakland Archive Policy from December 2002 that states “To remove a site from the Wayback Machine, place a robots.txt file at the top level of your site … It will tell the Internet Archive's crawler not to crawl your site in the future” and notes that “ia_archiver” is the proper user agent to name in order to exclude the Archive’s crawlers from accessing a site.
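
Following that archived policy, a site owner wishing to keep the Archive’s crawler out would place a robots.txt along these lines at the top level of the site (e.g. at example.com/robots.txt):

```
User-agent: ia_archiver
Disallow: /
```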

The Archive’s evolving stance with respect to robots.txt files appears to explain why attempting to access the Washington Post through the Wayback Machine yields an error that it has been blocked due to robots.txt, even though the site has been crawled and preserved by the Internet Archive every few days for the last four years. Similarly, accessing USA Today or the Bangkok Post through the Wayback Machine yields the error message “This URL has been excluded from the Wayback Machine,” yet both sites are being preserved through regular snapshots. Here the robots.txt exclusion appears to be used only to govern display in the Wayback Machine’s public interface, with excluded sites continuing to be crawled and preserved in the Archive’s dark archive to ensure they are not lost to posterity.

Despite having several programs dedicated to crawling online news, including both International News Crawls and a special “high-value news sites” collection, not all news sites are equally represented in the Archive’s stand-alone archives, whether or not they have robots.txt exclusions. The Washington Post has over 303 snapshots in its archive, while the New York Times has 124 and the Daily Mail has 196. Yet, Der Spiegel has just 34 captures in its stand-alone archive, dating from 2012 to 2014, with none since. Just two of Japan’s five national newspapers have such archives, Asahi Shimbun (just 64 snapshots since 2012) and Nihon Keizai Shimbun (just 22 snapshots since 2012), while the other three, Mainichi Shimbun, Sankei Shimbun, and Yomiuri Shimbun, have none. In India, of the top three newspapers by circulation as of 2013, The Times of India had just 32 snapshots since 2012, The Hindu has no archive of its own, and the Hindustan Times had 250 snapshots since 2012. In other words, one of India’s top three newspapers is not present at all, and The Times of India has nearly 8 times fewer snapshots than the Hindustan Times, despite having 2.5 times its circulation in 2013.

Each of these newspapers is likely to be captured through any one of the Archive’s many other crawls and feeds, but the lack of standalone dedicated collections for these papers and the apparent Western bias in the existence of such standalone archives suggests further community input may be required. Indeed, it appears that a number of the Archive’s dedicated site archives are driven by their Alexa Top 1 Million rankings.

Why is it important to understand how web archives work? As I pointed out this past November, there has been very little information published in public forums documenting precisely how our major web archives work and what feeds into them. As the Internet Archive and its peers begin to expand their support of researcher use of their collections, it is critically important that we understand precisely how these archives have been built and the implications of those decisions and their biases for the findings we are ultimately able to derive. Moreover, given how fast the web is disappearing before our eyes, greater transparency and community input into our web archives will help ensure that they are not overly biased towards the English-speaking Western world and that they are able to capture the web’s most vulnerable materials.

Greater insight is not an all-or-nothing proposition of either petabytes of crawler log files or no information at all. It is not necessary to have access to a log of every single action taken by any of the Archive’s crawlers in its history. Yet, it is also the case that simply treating archives as black boxes, without the slightest understanding of how they were constructed, and basing our findings on those hidden biases is no longer feasible as the scholarly world of data analysis matures. As web archives transition from being simple “as-is” preservation and retrieval sites towards being our only records of society’s online existence and powering an ever-growing fraction of scholarly research, we need to at least understand how they function at a high level and what data sources they draw from.

Putting this all together, what can we learn from these findings? Perhaps most importantly, we have seen that the Internet Archive operates far more like a traditional library archive than a modern commercial search engine. Rather than a single centralized and standardized continuous crawling farm, the Archive’s holdings are composed of millions of files in thousands of collections from hundreds of partners, all woven together into a rich collage which the Archive preserves as custodian and curator. The Wayback Machine turns out to be merely a public interface to an unknown fraction of these holdings, with the Archive’s real treasure trove of web materials scattered across its traditional item collections. From the standpoint of scholarly research use of the Archive, the patchwork composition of its web holdings and the vast and incredibly diverse landscape of inputs present unique challenges that have not been adequately addressed or discussed. At the same time, those fearful that robots.txt exclusions are leading to whole swaths of the web being lost can breathe a bit easier given the Archive’s evolving treatment of them, which appears to be in line with an industry-wide movement towards ignoring such exclusions when it comes to archival crawling.

In the end, as the Internet Archive turns 20 this year, its evolution over the last two decades offers a fascinating look back at how the web itself has evolved, from its changing views on robots.txt to its gradual transition from custodian to curator to collector. Along the way we get an incredible glimpse at just how hard it really is to archive the whole web in perpetuity, and at the tireless work of the Archive to build one of the Internet’s most unique collections.