Abstract
Quality control is critical to open production communities like Wikipedia. Wikipedia editors enact border quality control on edits (counter-vandalism) and on newly created articles (new page patrolling) shortly after they are saved. In this paper, we describe a long-standing set of inefficiencies that have plagued new page patrolling by drawing a contrast with the more efficient, distributed processes for counter-vandalism. To address this issue, we demonstrate an effective automated topic model based on a labeling strategy that leverages a folksonomy developed by subject-specific working groups in Wikipedia (WikiProject tags) and a flexible ontology (WikiProjects Directory) to arrive at a hierarchical and uniform label set. We attain very high fitness measures (macro ROC-AUC: 95.2%, macro PR-AUC: 74.5%) and real-time performance using word2vec-based features. Finally, we present a proposal for how incorporating this model into current tools will shift the dynamics of new article review positively.
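The macro ROC-AUC figure quoted above is an unweighted mean of per-label ROC-AUC scores, so rare topics count as much as common ones. The sketch below illustrates that averaging on toy data. The topic names, labels, and scores are invented for illustration, and the pairwise formulation is one standard way to compute ROC-AUC, not necessarily the evaluation code used for the paper.

```python
from typing import Dict, List, Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_roc_auc(labels: Dict[str, List[int]], scores: Dict[str, List[float]]) -> float:
    """Unweighted mean of the per-label ROC-AUC values."""
    per_label = [roc_auc(labels[name], scores[name]) for name in labels]
    return sum(per_label) / len(per_label)


# Hypothetical topic labels and classifier scores for four articles.
toy_labels = {
    "Culture.Music": [1, 0, 1, 0],
    "STEM.Physics": [0, 0, 1, 1],
}
toy_scores = {
    "Culture.Music": [0.9, 0.2, 0.4, 0.6],
    "STEM.Physics": [0.1, 0.3, 0.8, 0.7],
}

print(macro_roc_auc(toy_labels, toy_scores))  # 0.875 = (0.75 + 1.0) / 2
```

Because every label gets equal weight, a model that does poorly on a small topic lowers the macro score even when it does well on the largest topics.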
With Few Eyes, All Hoaxes are Deep