Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What's in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
The seed list for this crawl was a list of Alexa's top 1 million web sites, retrieved close to the crawl start date. We used the Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites.
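Heritrix enforces robots.txt compliance internally, but as a rough illustration of what "respected robots.txt directives" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the user-agent string are invented for the example; they are not taken from this crawl's configuration:

```python
from urllib import robotparser

# Hypothetical robots.txt for some site; a polite crawler consults these
# directives before fetching any URL on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A public page is fetchable; anything under /private/ is not.
print(rp.can_fetch("archive.org_bot", "https://example.com/index.html"))
print(rp.can_fetch("archive.org_bot", "https://example.com/private/page"))

# Crawl-delay tells the crawler to wait between requests to this host.
print(rp.crawl_delay("archive.org_bot"))
```

In a real crawl the parser would be populated per host via `rp.set_url(...)` and `rp.read()` rather than from an inline string.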
However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it. For example, in many cases we may not have crawled all of the embedded and linked objects on a page, since the URLs for these resources were added to queues that quickly grew bigger than the intended size of the crawl (so we never got to them). We also included repeated crawls of some Argentinian government sites, so looking at results by country will be somewhat skewed.
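The queue-overflow issue described above can be illustrated with a toy URL frontier that stops admitting new URLs once it hits a size cap, so anything discovered after that point is simply never fetched. This is a deliberately simplified sketch, not the actual HQ software; the class name and sizes are invented:

```python
from collections import deque

class BoundedFrontier:
    """Toy crawl frontier that silently drops URLs once full --
    illustrating (not reproducing) the queue growth problem above."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.queue = deque()   # URLs waiting to be fetched
        self.seen = set()      # URLs ever admitted, for de-duplication
        self.dropped = 0       # URLs discovered but never queued

    def add(self, url):
        if url in self.seen:
            return                      # already queued or fetched
        if len(self.queue) >= self.max_size:
            self.dropped += 1           # past the cap: never crawled
            return
        self.seen.add(url)
        self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

# With a cap of 2, the third discovered URL (e.g. an embedded image) is lost.
frontier = BoundedFrontier(max_size=2)
for u in ["http://a.example/", "http://b.example/", "http://b.example/logo.png"]:
    frontier.add(u)
print(frontier.dropped)     # one URL was discovered but never queued
print(frontier.next_url())  # crawling proceeds in discovery order
```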
We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available "warts and all" for people to experiment with. We have also done some further analysis of the content.
If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you're hoping to do with it. We may not be able to say "yes" to all requests, since we're just figuring out whether this is a good idea, but everyone will be considered.
A new web tool was launched this week. The WikiScanner allows users to track changes made to the phenomenally popular online encyclopaedia, Wikipedia. By comparing those changes with blocks of IP addresses, the editors of Wikipedia entries may be identified according to their location and the organisation from which they post.
The removal of unflattering references to particular corporations has been traced back to computers at the relevant companies. Someone at Labour’s headquarters altered a section about the Labour Students organisation to remove a reference to career politicians.
The development of technology that exposes such shenanigans could be taken as evidence of the self-correcting nature of cyberspace. It ought to be seen instead as a lesson in how easily information can be manipulated in a culture that prizes “user-generated content”.
Wikipedia relies on the wisdom of crowds. Knowledge is fluid. A definition contained in a reference work can never be regarded as complete and definitive. More reliable information emerges through continual revision. Consequently, anyone can edit an entry in Wikipedia. Many articles are plainly useless, but owing to the democratic nature of the medium the way is always open to incremental improvement.
Some may find this a seductive vision of the spread of knowledge. I find it alarming. It combines the free-market dogmatism of the libertarian Right with the anti-intellectualism of the populist Left. There is no necessary reason that Wikipedia’s continual revisions enhance knowledge. It is quite as conceivable that an early version of an entry in Wikipedia will be written by someone who knows the subject, and later editors will dissipate whatever value is there. Wikipedia seeks not truth but consensus, and like an interminable political meeting the end result will be dominated by the loudest and most persistent voices.
This is an inherent flaw. The problem is not that too few voices take part in the editorial process and skew the result, but the opposite. Participation is prized more than competence. When a prominent Wikipedian who claimed to be a tenured professor of divinity was revealed instead to be a young college dropout, the site's founder Jimmy Wales responded that he was unconcerned. The notion that a false claim to knowledge is wrong is not part of Wikipedia's culture.
The WikiScanner is thus an important development in bringing down a pernicious influence on our intellectual life. Critics of the web decry the medium as the cult of the amateur. Wikipedia is worse than that; it is the province of the covert lobby. The most constructive course is to stand on the sidelines and jeer at its pretensions.
© Times Newspapers Ltd 2010