ABSTRACT
Wikidata, like Wikipedia, is a knowledge base that anyone can edit. This open collaboration model is powerful because it lowers barriers to participation and allows a large number of people to contribute. However, it also exposes the knowledge base to the risk of vandalism and low-quality contributions. In this work, we build on past work on detecting vandalism in Wikipedia to detect vandalism in Wikidata. The work is novel in that identifying damaging changes in a structured knowledge base requires substantially different feature engineering from that used for a text-based wiki like Wikipedia. We also discuss the utility of these classifiers for reducing the overall workload of vandalism patrollers in Wikidata. We describe a machine classification strategy that catches 89% of vandalism while reducing patrollers' workload by 98%, drawing lightly on contextual features of an edit and heavily on the characteristics of the user making the edit.
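The trade-off between the share of vandalism caught and the share of workload saved can be illustrated with a small sketch. It assumes a classifier that assigns each edit a score, with edits at or above a threshold sent to patrollers for review. The function name, the scores, and the labels below are hypothetical toy values, not the authors' data or implementation.

```python
def recall_and_workload_reduction(scores, labels, threshold):
    """Return (recall, workload_reduction) for a given score threshold.

    scores: classifier scores per edit (higher = more likely vandalism).
    labels: 1 if the edit is vandalism, 0 otherwise.
    Edits with score >= threshold are flagged for human review.
    """
    flagged = [s >= threshold for s in scores]
    total_vandalism = sum(labels)
    caught = sum(1 for f, y in zip(flagged, labels) if f and y)
    # Recall: fraction of all vandalism that lands in the review queue.
    recall = caught / total_vandalism if total_vandalism else 0.0
    # Workload reduction: fraction of edits patrollers no longer need to review.
    workload_reduction = 1 - sum(flagged) / len(scores)
    return recall, workload_reduction


# Toy example (hypothetical values): ten edits, three of them vandalism.
scores = [0.95, 0.90, 0.80, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02, 0.01]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

recall, workload_reduction = recall_and_workload_reduction(scores, labels, 0.5)
print(f"recall={recall:.2f}, workload reduction={workload_reduction:.2f}")
```

Lowering the threshold raises recall but shrinks the workload reduction. The figures reported in the abstract correspond to one operating point on that curve.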
Building Automated Vandalism Detection Tools for Wikidata