research-article

Error-bounded sampling for analytics on big sparse data

Authors:
Ying Yan (Microsoft Research)
Liang Jeff Chen (Microsoft Research)
Zheng Zhang (Microsoft Research)

Proceedings of the VLDB Endowment, Volume 7, Issue 13, pp. 1508–1519. https://doi.org/10.14778/2733004.2733022

Published: 01 August 2014

Abstract

Aggregation queries are at the core of business intelligence and data analytics. In the big data era, many scalable shared-nothing systems have been developed to process aggregation queries over massive amounts of data. Microsoft's SCOPE is a well-known instance in this category. Nevertheless, aggregation queries are still expensive, because query processing needs to consume the entire data set, which is often hundreds of terabytes. Data sampling is a technique that processes only a small portion of the data and returns an approximate result with an error bound, thereby reducing the query's execution time. While similar problems have been studied in the database literature, we encountered new challenges that rule out most prior approaches: (1) error bounds are dictated by end users and cannot be compromised, and (2) data is sparse, meaning that it has a limited population but a wide range of values. In such cases, conventional uniform sampling often yields high sampling rates and thus delivers limited or no performance gains. In this paper, we propose error-bounded stratified sampling to reduce sample size. The technique relies on the insight that the sampling rate can be reduced only with knowledge of the data distribution. The technique has been implemented in Microsoft's internal search query platform. Results show that the proposed approach can reduce sample size by up to 99% compared with uniform sampling, and that its performance is robust against data volume and other key performance metrics.
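
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the error bound comes from Hoeffding's inequality for the mean of bounded values, that the strata are already given, and that every stratum is estimated to the same absolute error `eps` with the failure probability `delta` split evenly across strata by a union bound. The function names `hoeffding_sample_size`, `uniform_sample_size`, and `stratified_sample_size` are hypothetical.

```python
import math


def hoeffding_sample_size(value_range, eps, delta):
    """Samples needed so the sample mean of values spanning `value_range`
    is within `eps` of the true mean with probability at least 1 - delta
    (Hoeffding's inequality: n >= range^2 * ln(2/delta) / (2 * eps^2))."""
    if value_range == 0:
        return 1  # all values identical: one sample determines the mean
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * eps ** 2))


def uniform_sample_size(values, eps, delta):
    """Sample size for uniform sampling over the whole data set,
    capped at a full scan."""
    n = hoeffding_sample_size(max(values) - min(values), eps, delta)
    return min(n, len(values))


def stratified_sample_size(strata, eps, delta):
    """Total sample size when each stratum is bounded separately.

    If every stratum mean is within `eps`, the population-weighted mean is
    too, because the weights sum to 1. Each stratum gets delta / k so that
    all k bounds hold together with probability at least 1 - delta.
    A stratum that would need more samples than it has rows is scanned fully.
    """
    k = len(strata)
    total = 0
    for stratum in strata:
        spread = max(stratum) - min(stratum)
        n = hoeffding_sample_size(spread, eps, delta / k)
        total += min(n, len(stratum))
    return total


# Synthetic sparse data (an assumption for illustration): many small values
# plus a few very large ones, so the overall range is wide.
low = [i % 10 for i in range(100_000)]
high = [1_000_000 + (i % 10) for i in range(100)]

print("uniform:   ", uniform_sample_size(low + high, eps=1.0, delta=0.05))
print("stratified:", stratified_sample_size([low, high], eps=1.0, delta=0.05))
```

On this synthetic data the uniform bound exceeds the data size, so uniform sampling degenerates into a full scan, while the stratified plan needs only a few hundred rows. The gap comes entirely from the narrower value range inside each stratum, which is the kind of distribution knowledge the abstract refers to.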

Index Terms

Error-bounded sampling for analytics on big sparse data
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. Information systems applications
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published in
Proceedings of the VLDB Endowment, Volume 7, Issue 13
August 2014, 466 pages
ISSN: 2150-8097
Editors: H. V. Jagadish (University of Michigan), Aoying Zhou (East China Normal University, China)
Publisher: VLDB Endowment
