Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What’s in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
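As a quick sanity check on the figures above, the following minimal sketch derives a few ratios from them. The variable names are illustrative only, and the derived quantities (captures per unique URL, unique URLs per host, crawl length in days) are simple arithmetic on the listed numbers, not statistics published with the data set.

```python
from datetime import date

# Figures copied from the data set summary above.
captures = 2_713_676_341
unique_urls = 2_273_840_159
hosts = 29_032_069
crawl_start = date(2011, 3, 9)
crawl_end = date(2011, 12, 23)

# Derived quantities (illustrative, not part of the published summary).
repeat_captures = captures - unique_urls    # captures beyond the first per URL
captures_per_url = captures / unique_urls   # average captures per unique URL
urls_per_host = unique_urls / hosts         # average unique URLs per host
crawl_days = (crawl_end - crawl_start).days

print(f"Repeat captures:       {repeat_captures:,}")
print(f"Captures per URL:      {captures_per_url:.3f}")
print(f"Unique URLs per host:  {urls_per_host:.1f}")
print(f"Crawl length in days:  {crawl_days}")
```

These are averages over the whole crawl, so they hide how unevenly captures are spread across hosts.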
The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites.
However, this was a somewhat experimental crawl for us: we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, since the URLs for these resources were added to queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them). We also included repeated crawls of some Argentinian government sites, so results broken down by country will be somewhat skewed.
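The queue growth described above can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not how HQ or Heritrix schedules URLs, and the function name `simulate_crawl`, the fixed number of links per page, and the fetch budget are all hypothetical values chosen for illustration.

```python
from collections import deque

def simulate_crawl(seeds, links_per_page, fetch_budget):
    """Toy first-in, first-out crawl. Returns (pages fetched, URLs left unvisited)."""
    frontier = deque(f"seed-{i}" for i in range(seeds))
    fetched = 0
    while frontier and fetched < fetch_budget:
        url = frontier.popleft()
        fetched += 1
        # Assumption: every fetched page adds the same number of new URLs
        # (embedded objects and outgoing links) to the back of the queue.
        for j in range(links_per_page):
            frontier.append(f"{url}/{j}")
    return fetched, len(frontier)

fetched, unvisited = simulate_crawl(seeds=10, links_per_page=5, fetch_budget=1000)
print(f"fetched={fetched}, still queued={unvisited}")
```

Because each fetch removes one URL and adds several, the queue outgrows any fixed budget, and URLs discovered late (including a page's own embedded resources) are the ones left unfetched when the budget runs out.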
We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available “warts and all” for people to experiment with. We have also done some further analysis of the content.
If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it. We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.
They claim that Cell Press, part of the Anglo-Dutch Reed Elsevier group, is charging excessive subscription fees and profiteering at the expense of the academic community.
Peter Walter and Keith Yamamoto, from the University of California at San Francisco, are urging fellow scientists not to submit articles to Cell or to five sister publications until the publishers change their policy. “We can all think of better ways to spend our time than providing free services to support a publisher that values profit above its academic mission,” Dr Walter and Dr Yamamoto said.
Behind the dispute lies a growing reluctance by universities to pay the large fees charged by publishers for electronic access to their journals, and the growth of a new type of scholarly publishing online, where the costs are paid by the authors and not the readers.
If this new model succeeds, it will make all scientific information available free on the internet. But the conventional publishers behind hugely prestigious journals such as Cell, Nature, Science, Lancet and The New England Journal of Medicine are hoping that “open access” will come to nothing. Some have moved to weaken its appeal by making their own journal content freely available six months after publication.
The dispute in the United States centres on efforts by the University of California at San Francisco to make Elsevier journals available to all its academics online. In 2002, the university paid Elsevier $8 million (£4.8 million) for this privilege, but the six Cell titles were not included in the package.
This year, Dr Walter and Dr Yamamoto said that Elsevier wanted substantial increases in the fee, plus an additional $90,000 for Cell Press titles. Lynne Herndon, President of Cell Press, said its pricing was fair and reasonable. The library has refused to pay.
A new report by the French bank, BNP Paribas, has estimated that journals in science, technology and medicine generate $8 billion a year. Elsevier accounted for $1.3 billion of that. A Wellcome Trust report said that year-on-year subscription increases have averaged more than 10 per cent.
The Open Access pioneers believe that the internet is the answer. Two groups have established online journals: BioMed Central in Britain, and the Public Library of Science in the US. BioMed Central now has 90 journals, with its top title the Journal of Biology. It remains to be seen whether scientists and doctors are willing to give up the cachet of publication in a top journal.
The company says that once online access to journals is in place, individuals can cancel subscriptions, presenting “the opportunity for savings”.
© Times Newspapers Ltd 2010 Registered in England No. 894646 Registered office: 3 Thomas More Square, London, E98 1XY