ABSTRACT
This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. The resulting link detector and disambiguator performs very well, with recall and precision of almost 75%. This performance is constant whether the system is evaluated on Wikipedia articles or "real world" documents.
This work has implications far beyond enriching documents with explanatory links. It can provide structured knowledge about any unstructured fragment of text. Any task that is currently addressed with bags of words - indexing, clustering, retrieval, and summarization to name a few - could use the techniques described here to draw on a vast network of concepts and semantics.
- Auer, S. and Bizer, C. and Kobilarov, G. and Lehmann, J. and Cyganiak, R. and Ives, Z. (2007) DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference, Busan, Korea. Google ScholarDigital Library
- Banerjee, S. and Ramanathan, K. and Gupta, A. (2007) Clustering short texts using Wikipedia. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam, pp. 787--788. Google ScholarDigital Library
- Barr, J. and Cabrera, L. F. (2006) AI gets a brain. In ACM Queue 4(4), pp. 24--29. Google ScholarDigital Library
- David, C., L. Giroux, S. Bertrand-Gastaldy, and D. Lanteigne (1995) Indexing as problem solving: A cognitive approach to consistency. In Proceedings of the ASIS Annual Meeting, Medford, NJ, pp. 49--55.Google Scholar
- Dolan, S. (2008) Six Degrees of Wikipedia. Retrieved June 2008 from www.netsoc.tcd.ie/~mu/wiki/Google Scholar
- Drenner, S., Harper, M., Frankowski, D., Riedl, J. and Terveen, L. (2006) Insert movie reference here: a system to bridge conversation and item-oriented web sites. In Proceedings of the SIGCHI conference on Human Factors in computing systems, New York, NY, pp. 951--954 Google ScholarDigital Library
- Gabrilovich, E. and Markovitch, S. (2007) Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, Boston, MA.Google Scholar
- Howe, J. (2006) The Rise of Crowdsourcing. In Wired Magazine 14(6).Google Scholar
- Lih, A. (2004) Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluating collaborative media as a news resource. In Proceedings of the 5th International Symposium on Online Journalism, Austin, Texas.Google Scholar
- Maron, M. E. (1977) On indexing, retrieval and the meaning of about. In Journal of the American Society for Information Science 28(1), pp. 38--43Google ScholarCross Ref
- Medelyan, O., Witten, I. H. and Milne, D. (2008) Topic Indexing with Wikipedia. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI 2008), Chicago, IL.Google Scholar
- Mihalcea, R. and Csomai, A. (2007) Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the 16th ACM Conference on Information and Knowledge management (CIKM'07), Lisbon, Portugal, pp. 233--242 Google ScholarDigital Library
- Milne, D., Witten, I. H. and Nichols, D. M. (2007). A Knowledge-Based Search Engine Powered by Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'2007), Lisbon, Portugal. Google ScholarDigital Library
- Milne, D., and Witten, I. H. (2008) An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI 2008), Chicago, IL.Google Scholar
- Mossberg, W. (2001) New Windows XP Feature Can Re-Edit Others' Sites. The Wall Street Journal, June 2001Google Scholar
- Ponzetto, S. P. and Strube, M. (2007) Deriving a Large Scale Taxonomy from Wikipedia. In Proceedings of the 22st National Conference on Artificial Intelligence (AAAI'07), Vancouver, British Columbia, pp. 1440--1445. Google ScholarDigital Library
- Quinlan, J. R. (1993) C4. 5: Programs for Machine Learning. Morgan Kaufmann Google ScholarDigital Library
- Suchanek, F. M. and Kasneci, G. and Weikum, G. (2007) Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web (WWW'07), Alberta, Canada, pp. 697--706. Google ScholarDigital Library
- Völkel, M. and Krötzsch, M. and Vrandecic, D. and Haller, H. and Studer, R. (2006) Semantic Wikipedia. In Proceedings of the 15th international conference on World Wide Web (WWW'06), Edinburgh, Scotland, pp. 585--594 Google ScholarDigital Library
Index Terms
-
Learning to link with wikipedia
-
Recommendations
-
Wikify!: linking documents to encyclopedic knowledge
CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge managementThis paper introduces the use of Wikipedia as a resource for automatic keyword extraction and word sense disambiguation, and shows how this online encyclopedia can be used to achieve state-of-the-art results on both these tasks. The paper also shows how ...
-
Learning multilingual named entity recognition from Wikipedia
We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (ner) by exploiting the text and structure of Wikipedia. Most ner systems rely on statistical models of annotated data to identify ...
-
TAGME: on-the-fly annotation of short text fragments (by wikipedia entities)
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge managementWe designed and implemented TAGME, a system that is able to efficiently and judiciously augment a plain-text with pertinent hyperlinks to Wikipedia pages. The specialty of TAGME with respect to known systems [5,8] is that it may annotate texts which are ...
Comments