
FAQs
For the curious surfer, we've gathered the following commonly asked questions. For the supremely curious, we recommend contacting us directly at wayback@archive.org.
 

 Questions

The Internet Archive Wayback Machine

  1. What is the Internet Archive Wayback Machine?

  2. Can I link to old pages on the Wayback Machine?

  3. I don't want my site's pages in the archive. How do I remove them?

  4. Are other sites available in the Wayback Machine?

  5. What does it mean when a site's archive data has been "updated"?

  6. Who was involved in creating the Internet Archive Wayback Machine?

  7. How was the Internet Archive Wayback Machine made?

  8. How large is the Archive?

  9. Can I search the Archive?

  10. What type of machinery is used in the Internet Archive?

  11. How do you archive dynamic pages?

  12. Why are some sites harder to archive than others?

  13. Some sites are not available because of Robots.txt or other exclusions.
    What does that mean?

  14. How can I get my site included in the Archive?

  15. How can I help?

 Answers

  1. What is the Internet Archive Wayback Machine?
    The Internet Archive Wayback Machine is a service that allows people to visit archived versions of stored websites.  Visitors to the Internet Archive Wayback Machine can type in a URL, select a date, and then begin surfing on an archived version of the web.  Imagine surfing circa 1999 and looking at all the Y2K hype, or revisiting an older copy of your favorite website.  The Internet Archive Wayback Machine can make all of this possible. See the Press Release.

  2. Can I link to old pages on the Internet Archive Wayback Machine?
    Yes! Alexa Internet has built the Internet Archive Wayback Machine so that it can be used and referenced by anybody and everybody. If you find an archived page that you would like to reference on your web page or in an article, you can copy the URL and share it with others. You can even use fuzzy URL matching and date specifications... but that's a bit more advanced.
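    As a sketch of how such a link is put together, archived pages are addressed by a timestamp followed by the original URL. The helper below is illustrative, not part of any official API; the /web/<timestamp>/<url> scheme is assumed from how archive links appear in practice:

    ```python
    # Hypothetical helper for building a link to an archived page,
    # assuming the /web/<YYYYMMDDhhmmss>/<original URL> address scheme.
    def wayback_url(original_url, timestamp):
        """Build a link to an archived copy of original_url.

        timestamp is a YYYYMMDDhhmmss string; a shorter prefix such
        as "1999" asks the archive for a capture near that date.
        """
        return "http://web.archive.org/web/%s/%s" % (timestamp, original_url)

    print(wayback_url("http://example.com/", "19991231"))
    # -> http://web.archive.org/web/19991231/http://example.com/
    ```

    Copying such a URL into a web page or article is all that linking requires; the archive resolves the timestamp to the closest capture it holds.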

  3. I don't want my site's pages in the archive. How do I remove them?
    By installing a robots.txt file on your web server, you can exclude your site's pages from being archived and block access to the copies already in the archive. For more information, see our FAQ about removing documents.

  4. Are other sites available in the Internet Archive Wayback Machine?
    The Internet Archive is attempting to archive the entire publicly available web.  Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl.  It's also possible that some sites were not archived because they were password protected or otherwise inaccessible to our automated systems.

  5. What does it mean when a site's archive data has been "updated"?
    When our automated systems crawl the web every few months or so, we find that only about 50% of all pages on the web have changed since our previous visit.  This means that much of the content in our archive is duplicate material.  A "*" next to an archived document marks a page whose content changed; if you don't see a "*", the content on the archived page is identical to the previously archived copy.

  6. Who was involved in creating the Internet Archive Wayback Machine?
    The original idea for the Internet Archive Wayback Machine began in 1996, when the Internet Archive first began archiving the web.  Now, five years later, with over 100 terabytes and a dozen web crawls completed, the Internet Archive has made the Internet Archive Wayback Machine available to the public.  The Internet Archive has relied on donations of web crawls, technology and expertise from Alexa Internet and others.  The Internet Archive Wayback Machine is owned and operated by the Internet Archive.

  7. How was the Wayback Machine made?
    Over 100 terabytes of data are stored on several dozen modified servers. Alexa Internet, in cooperation with the Internet Archive, has designed a three-dimensional index that allows browsing of web documents over multiple time periods, and turned this unique feature into the Wayback Machine.

  8. How large is the Archive?
    The Internet Archive Wayback Machine contains over 100 terabytes of data and is currently growing at a rate of 12 terabytes per month.  The archive contains multiple copies of the entire publicly available web.  This eclipses the amount of data contained in the world's largest libraries, including the Library of Congress.  If you tried to place the entire contents of the archive onto floppy disks (I don't recommend this!) and laid them end to end, it would stretch from New York, past Los Angeles, and halfway to Hawaii.
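    The floppy-disk comparison can be checked with rough arithmetic. The disk capacity and width below are assumptions about standard 3.5-inch media, not figures from the Archive itself:

    ```python
    # Rough check of the floppy-disk comparison, assuming standard
    # 3.5-inch disks: 1.44 MB capacity, roughly 90 mm wide.
    archive_bytes = 100 * 10**12   # 100 terabytes
    floppy_bytes = 1.44 * 10**6    # 1.44 MB per disk
    disk_width_km = 0.09 / 1000    # 90 mm expressed in kilometres

    disks = archive_bytes / floppy_bytes
    length_km = disks * disk_width_km

    print(int(disks))      # about 69 million disks
    print(int(length_km))  # about 6,250 km
    ```

    At roughly 6,250 km, the chain of disks would indeed run past the ~3,900 km New York to Los Angeles distance with a couple of thousand kilometres left over toward Hawaii.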

  9. Can I search the Archive?
    Using the Internet Archive Wayback Machine, it is possible to search for the names of sites contained in the Archive and to specify date ranges for your search. However, we do not yet have an indexed text search of the documents in the collection. The collection is a bit too large and complicated for that. We continue to work on it and should have a full text search soon.

  10. What type of machinery is used in the Internet Archive?
    The Internet Archive is stored on dozens of slightly modified Hewlett-Packard servers. The computers run the FreeBSD operating system. Each computer has 512 MB of memory and can hold just over 300 gigabytes of data on IDE disks.

  11. How do you archive dynamic pages?
    There are many different kinds of dynamic pages, some of which are easily stored in an archive and some of which fall apart completely. When a dynamic page renders standard HTML, the archive works beautifully. When a dynamic page contains forms, JavaScript, or other elements that require interaction with the originating host, the archive will not accurately reflect the original site's functionality.

  12. Why are some sites harder to archive than others?
    If you look at our collection of archived sites, you will find some broken pages, missing graphics, and some sites that aren't archived at all. We have tried to create a complete archive, but have had difficulties with some sites. Here are some things that make it difficult to archive a web site:

    • Robots.txt -- If our robot crawler is forbidden from visiting a site, we can't archive it.
    • JavaScript -- JavaScript elements are often hard for us to archive, especially when a script generates links without including the full URL in the page. And if JavaScript needs to contact the originating server in order to work, it will fail when archived.
    • Server side image maps -- Like any functionality on the web, if it needs to contact the originating server in order to work, it will fail when archived.
    • Unknown sites -- If Alexa doesn't know about your site, it won't be archived. Use the Alexa service, and we will know about your page. Or you can visit our Archive Your Site page.
    • Orphan pages -- If there are no links to your pages, our robot won't find them (our robots don't enter queries in search boxes).

    As a general rule of thumb, simple HTML is the easiest to archive.

  13. Some sites are not available because of Robots.txt or other exclusions.
    What does that mean?

    The Standard for Robot Exclusion (SRE) is a means by which web site owners can instruct automated systems not to crawl their sites. Web site owners can specify files or directories that are allowed or disallowed in a crawl, and they can even create specific rules for different automated crawlers. All of this information is contained in a file called robots.txt. While robots.txt has been adopted as the universal standard for robot exclusion, compliance with it is strictly voluntary. In fact, most web sites do not have a robots.txt file, and many web crawlers are not programmed to obey its instructions anyway. However, Alexa, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If a web site owner decides they would prefer not to have a crawler visiting their files and sets up robots.txt on the site, the Alexa crawlers will stop visiting those files and will mark all files previously gathered as unavailable. This means that sometimes, while using the Internet Archive Wayback Machine, you may find a site that is unavailable due to robots.txt or other exclusions. Other exclusions? Yes, sometimes a web site owner will contact us directly and ask us to stop crawling or archiving a site. We comply with these requests.
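    As a concrete illustration, a robots.txt file like the following would ask Alexa's crawler to stay away from an entire site while leaving it open to other crawlers. The site is hypothetical, and the crawler's user-agent name (widely reported as ia_archiver) should be treated as an assumption here:

    ```
    # http://example.com/robots.txt -- hypothetical example
    # Rules for Alexa's crawler (user-agent name assumed to be ia_archiver)
    User-agent: ia_archiver
    Disallow: /

    # All other crawlers may visit everything
    User-agent: *
    Disallow:
    ```

    Narrowing the Disallow line to a path such as /private/ would exclude only that directory instead of the whole site.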

  14. How can I get my site included in the Archive?
    Alexa Internet has been crawling the web since 1996, which has resulted in a massive archive. If you have a web site, and you would like to ensure that it is saved for posterity in the Archive, chances are that it's already there. We make every effort to crawl the entire publicly available web. However, if you wish to take extra measures to ensure that we archive your site, you can visit the "Archive Your Site" page.

  15. How can I help?
    The Internet Archive actively seeks donations of digital materials for preservation. Alexa Internet provides access to a web-wide crawl that contains copies of the publicly accessible web. If you have digital materials that may be of interest to future generations, let us know. The Internet Archive is also seeking additional funding to continue this important mission. Please contact us if you wish to make a contribution.

