DOI: 10.1145/3041021.3053366
research-article

Building Automated Vandalism Detection Tools for Wikidata

Published: 03 April 2017

ABSTRACT

Wikidata, like Wikipedia, is a knowledge base that anyone can edit. This open collaboration model is powerful in that it reduces barriers to participation and allows a large number of people to contribute. However, it also exposes the knowledge base to the risk of vandalism and low-quality contributions. In this paper, we build on prior research on detecting vandalism in Wikipedia to detect vandalism in Wikidata. The work is novel in that identifying damaging changes in a structured knowledge base requires substantially different feature engineering than in a text-based wiki like Wikipedia. We also discuss the utility of these classifiers for reducing the overall workload of vandalism patrollers in Wikidata. We describe a machine classification strategy that catches 89% of vandalism while reducing patrollers' workload by 98%, drawing lightly on contextual features of an edit and heavily on the characteristics of the user making the edit.
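
The two headline figures can be read as a trade-off between recall (the share of vandalism caught) and the share of edits a patroller still has to review. The following is a minimal sketch of that calculation, not the authors' evaluation code: the function name, the toy scores, and the toy labels are all hypothetical, and it assumes a classifier has already assigned each edit a damage score between 0 and 1.

```python
import math


def review_workload_at_recall(scores, labels, target_recall):
    """Return (threshold, achieved_recall, fraction_reviewed).

    scores: classifier scores, higher means more likely damaging (assumed).
    labels: 1 for a damaging edit, 0 for a good edit.
    Edits are reviewed from highest to lowest score until the target
    share of damaging edits has been seen. Ties in score are not
    handled specially in this sketch.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_damaging = sum(labels)
    if total_damaging == 0:
        raise ValueError("at least one damaging edit is required")

    # Number of damaging edits that must be caught to reach the target.
    needed = max(1, math.ceil(target_recall * total_damaging))

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    caught = 0
    for reviewed, (score, is_damaging) in enumerate(ranked, start=1):
        caught += is_damaging
        if caught >= needed:
            return score, caught / total_damaging, reviewed / len(ranked)


# Hypothetical toy data: eight edits, three of them damaging.
toy_scores = [0.95, 0.90, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]

threshold, recall, reviewed = review_workload_at_recall(toy_scores, toy_labels, 1.0)
print(f"threshold={threshold}, recall={recall:.2f}, "
      f"workload reduction={1 - reviewed:.2f}")
```

Under this reading, a strategy that catches 89% of vandalism while reducing workload by 98% is one where reviewing only the top-scored 2% of edits is enough to reach a recall of 0.89.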

Published in

WWW '17 Companion: Proceedings of the 26th International Conference on World Wide Web Companion
April 2017
1738 pages
ISBN: 9781450349147

Publisher

International World Wide Web Conferences Steering Committee
Republic and Canton of Geneva, Switzerland

Acceptance Rates

WWW '17 Companion Paper Acceptance Rate: 164 of 966 submissions, 17%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
