With Few Eyes, All Hoaxes are Deep

Authors:
Sumit Asthana, Indian Institute of Technology, Patna, Patna, Bihar, India
Aaron Halfaker, Wikimedia Foundation, San Francisco, CA, USA

Proceedings of the ACM on Human-Computer Interaction, Volume 2, Issue CSCW, Article No. 21, pp. 1–18. https://doi.org/10.1145/3274290

Published: 01 November 2018


Abstract

Quality control is critical to open production communities like Wikipedia. Wikipedia editors enact border quality control with edits (counter-vandalism) and new article creations (new page patrolling) shortly after they are saved. In this paper, we describe a long-standing set of inefficiencies that have plagued new page patrolling by drawing a contrast to the more efficient, distributed processes for counter-vandalism. Further, to address this issue, we demonstrate an effective automated topic model based on a labeling strategy that leverages a folksonomy developed by subject-specific working groups in Wikipedia (WikiProject tags) and a flexible ontology (WikiProjects Directory) to arrive at a hierarchical and uniform label set. We are able to attain very high fitness measures (macro ROC-AUC: 95.2%, macro PR-AUC: 74.5%) and real-time performance using word2vec-based features. Finally, we present a proposal for how incorporating this model into current tools will shift the dynamics of new article review positively.
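
The following is a minimal sketch of the three ideas named in the abstract, not the authors' implementation: rolling WikiProject tags up to a uniform label set through a directory, building a document feature by averaging word vectors, and macro-averaging per-label ROC-AUC. The directory entries, topic paths, function names, and toy embeddings are all hypothetical and chosen only for illustration.

```python
# Hypothetical slice of a WikiProjects Directory: WikiProject tag -> topic path.
# The real directory and its label names are not reproduced here.
DIRECTORY = {
    "WikiProject Physics": "STEM.Physics",
    "WikiProject Chemistry": "STEM.Chemistry",
    "WikiProject Jazz": "Culture.Music",
    "WikiProject Opera": "Culture.Music",
}


def topic_labels(wikiproject_tags):
    """Roll fine-grained WikiProject tags up to a uniform set of topic labels."""
    return sorted({DIRECTORY[t] for t in wikiproject_tags if t in DIRECTORY})


def document_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens (zero vector if none are known)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def roc_auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_roc_auc(scores_by_label, truth_by_label):
    """Unweighted mean of the per-label ROC-AUC values."""
    aucs = [
        roc_auc(scores_by_label[label], truth_by_label[label])
        for label in scores_by_label
    ]
    return sum(aucs) / len(aucs)
```

Macro averaging weights every topic equally, so rare topics count as much as common ones. This is one plausible reading of why a macro PR-AUC can sit well below a macro ROC-AUC on a label set with many infrequent topics.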

Index Terms

1. Human-centered computing
  1. Collaborative and social computing

Published in
Proceedings of the ACM on Human-Computer Interaction, Volume 2, Issue CSCW, November 2018, 4104 pages
EISSN: 2573-0142
DOI: 10.1145/3290265

Editors:
Karrie Karahalios, University of Illinois & Adobe
Andrés Monroy-Hernández, Snap Inc.
Airi Lampinen, Stockholm University
Geraldine Fitzpatrick, Vienna University of Technology

Copyright © 2018 Owner/Author

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs International 4.0 License.
Publisher
Association for Computing Machinery, New York, NY, United States

Author Tags
collaborative review, social recommendation, topic modeling, wikipedia

Qualifiers
research-article