research-article
Open Access

With Few Eyes, All Hoaxes are Deep

Published: 01 November 2018

Abstract

Quality control is critical to open production communities like Wikipedia. Wikipedia editors enact border quality control on edits (counter-vandalism) and new article creations (new page patrolling) shortly after they are saved. In this paper, we describe a long-standing set of inefficiencies that have plagued new page patrolling by drawing a contrast to the more efficient, distributed processes for counter-vandalism. Further, to address this issue, we demonstrate an effective automated topic model based on a labeling strategy that leverages a folksonomy developed by subject-specific working groups in Wikipedia (WikiProject tags) and a flexible ontology (WikiProjects Directory) to arrive at a hierarchical and uniform label set. We are able to attain very high fitness measures (macro ROC-AUC: 95.2%, macro PR-AUC: 74.5%) and real-time performance using word2vec-based features. Finally, we present a proposal for how incorporating this model into current tools will positively shift the dynamics of new article review.
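
As a rough illustration of the two ingredients named in the abstract, the sketch below maps WikiProject tags onto hierarchical labels through a directory and derives a document feature vector from word embeddings. It is not the paper's implementation: the directory entries, the dotted label paths, the toy two-dimensional vectors, and the choice to average word vectors are all assumptions made for this example, and the function names are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical slice of a WikiProjects Directory: WikiProject -> hierarchical label.
DIRECTORY: Dict[str, str] = {
    "WikiProject Physics": "STEM.Physics",
    "WikiProject Chemistry": "STEM.Chemistry",
    "WikiProject Jazz": "Culture.Music",
    "WikiProject Film": "Culture.Media",
}

# Toy vectors standing in for pretrained word2vec embeddings (assumed values).
EMBEDDINGS: Dict[str, List[float]] = {
    "quantum": [1.0, 0.0],
    "particle": [0.8, 0.2],
    "saxophone": [0.0, 1.0],
}


def topic_labels(wikiproject_tags: List[str]) -> Set[str]:
    """Map an article's WikiProject tags to a uniform set of hierarchical labels.

    Tags that are missing from the directory are ignored.
    """
    return {DIRECTORY[tag] for tag in wikiproject_tags if tag in DIRECTORY}


def document_vector(tokens: List[str]) -> List[float]:
    """Average the embeddings of known tokens; unknown tokens are skipped.

    Averaging is an assumption here, one simple way to build
    word2vec-based features for a downstream multi-label classifier.
    """
    dims = len(next(iter(EMBEDDINGS.values())))
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * dims
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


if __name__ == "__main__":
    print(topic_labels(["WikiProject Physics", "WikiProject Jazz"]))
    print(document_vector(["quantum", "particle", "the"]))
```

In a setup like this, the label sets would serve as multi-label training targets and the averaged vectors as inputs, so a new article could be scored for topics as soon as its text is available.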
