The Internet Archive discovers and captures web pages through many different web crawls. At any given time, several distinct crawls are running; some run for months, while others run daily or at longer intervals. View the web archive through the Wayback Machine.
Web-wide crawl with an initial seed list and crawler configuration from March 2011. This crawl uses the new HQ software for distributed crawling, written by Kenji Nagahashi.
What's in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
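As a rough illustration of how the figures above relate to each other, the following Python sketch derives a few ratios from them. The variable names are illustrative, and reading "captures minus unique URLs" as repeat captures is an assumption based on the idea that each unique URL was captured at least once.

```python
# Figures copied from the data set summary above.
captures = 2_713_676_341
unique_urls = 2_273_840_159
hosts = 29_032_069

# Average number of captures recorded for each distinct URL.
captures_per_url = captures / unique_urls

# Average number of distinct URLs seen on each host.
urls_per_host = unique_urls / hosts

# Assumption: every unique URL was captured at least once, so the
# remainder are captures of URLs that had already been captured.
repeat_captures = captures - unique_urls

print(f"Captures per unique URL: {captures_per_url:.2f}")
print(f"Unique URLs per host:    {urls_per_host:.1f}")
print(f"Repeat captures:         {repeat_captures:,}")
```

These are averages only; the source gives no per-host or per-URL distribution, so they say nothing about how unevenly captures were spread.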
The seed list for this crawl was a list of Alexa's top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites.
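To show what respecting robots.txt directives means in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. It is not how Heritrix (a Java crawler) implements the check; the robots.txt content, the `example-crawler` user agent, and the `example.com` URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; not taken from the crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
USER_AGENT = "example-crawler"  # hypothetical user agent name

# A polite crawler asks before fetching each URL.
allowed = parser.can_fetch(USER_AGENT, "http://example.com/index.html")
blocked = parser.can_fetch(USER_AGENT, "http://example.com/private/report.html")

print("index.html allowed:", allowed)
print("private/report.html allowed:", blocked)
```

A crawler that honors these rules would skip any URL for which `can_fetch` returns `False`, so such pages never appear among the captures.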
However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, because the URLs for these resources were added to queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them). We also included repeated crawls of some Argentinian government sites, so results broken down by country will be somewhat skewed.
We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available "warts and all" for people to experiment with. We have also done some further analysis of the content.
If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you're hoping to do with it. We may not be able to say "yes" to all requests, since we're just figuring out whether this is a good idea, but everyone will be considered.
The Wayback Machine - https://web.archive.org/web/20110812090033/http://www.boxofficemojo.com/news/?id=2882&p=.htm
The Other Guys saw a lot of action in its opening over the weekend, busting more moves than fellow debut Step Up 3-D, while Inception continued to be the exception with another small decline. Overall weekend business, though, was relatively soft for the time of year and was down nine percent from last year, when G.I. Joe: The Rise of Cobra was unleashed.
Capturing $35.5 million on approximately 4,900 screens at 3,651 locations, The Other Guys opened well above the average for its cop comedy sub-genre, and it nearly doubled the last entry, Cop Out. Backed by a clear, comedic marketing campaign riffing on action conventions, star Will Ferrell batted in his wheelhouse after his Land of the Lost misfire last summer and posted his second highest-grossing start after Talladega Nights: The Ballad of Ricky Bobby. In terms of estimated attendance, though, Other Guys was on par with Anchorman: The Legend of Ron Burgundy, and it was also spot on with Starsky & Hutch. Distributor Sony Pictures' exit polling indicated that the audience was 56 percent male and 55 percent under 25 years old (18 percent was 17 years old and younger).
Inception lost little steam in its fourth weekend, drawing $18.5 million on close to 4,800 screens at 3,418 locations. The dream heist eased 33 percent, which was the best hold of the weekend among major nationwide releases. Its total grew to $227.6 million in 24 days, rising to sixth place among 2010 releases.
With advertising that rested on the brand name and promise of dance moves popping out in 3D, Step Up 3-D slipped up compared to its predecessors, making $15.8 million on around 3,000 screens at 2,435 locations. The first Step Up opened to $20.7 million, while Step Up 2 the Streets bagged $18.9 million, and the disparity was much greater in terms of attendance. Step Up 3-D had an exceptionally high 3D-to-2D screen ratio, and its 3D presentations accounted for 81 percent of its business. Distributor Walt Disney Pictures reported that 64 percent of the audience was female and 67 percent was under 25 years old.
Sliding 44 percent, Salt took another generic hit but fell behind The Bourne Identity's traffic through the same point. It snared $10.9 million, increasing its sum to $91.8 million in 17 days. Dinner for Schmucks showed little traction in its second weekend, down 56 percent to $10.4 million to bring its total to $46.6 million in ten days. Compared to recent late July comedies, Schmucks held better than Funny People but tumbled harder than Step Brothers.
Despicable Me stayed in the game with $9.3 million, off 40 percent for a strong $209.3 million tally in 31 days. On the other hand, Cats & Dogs: The Revenge of Kitty Galore retreated 44 percent to $6.9 million, lifting its total to a mere $26.4 million in ten days. Meanwhile, Charlie St. Cloud sank 62 percent to $4.7 million for a $23.5 million ten-day sum, and The Kids Are All Right performed modestly again with $2.6 million for a $14 million total through its fifth weekend.