Abstract
Aggregation queries are at the core of business intelligence and data analytics. In the big data era, many scalable shared-nothing systems have been developed to process aggregation queries over massive amounts of data. Microsoft's SCOPE is a well-known instance in this category. Nevertheless, aggregation queries are still expensive, because query processing needs to consume the entire data set, which is often hundreds of terabytes. Data sampling is a technique that processes only a small portion of the data and returns an approximate result with an error bound, thereby reducing the query's execution time. While similar problems have been studied in the database literature, we encountered new challenges that defeat most prior efforts: (1) error bounds are dictated by end users and cannot be compromised, and (2) data is sparse, meaning it has a limited population but a wide range of values. In such cases, conventional uniform sampling often yields high sampling rates and thus delivers limited or no performance gains. In this paper, we propose error-bounded stratified sampling to reduce sample size. The technique relies on the insight that the sampling rate can be reduced only with knowledge of the data distribution. The technique has been implemented in Microsoft's internal search query platform. Results show that the proposed approach can reduce sample size by up to 99% compared with uniform sampling, and that its performance is robust against data volume and other key performance metrics.
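To illustrate why a wide value range inflates the uniform sampling rate, and why knowledge of the distribution helps, the following is a minimal sketch rather than the paper's algorithm. It assumes the query is an AVG with an absolute error bound `eps` and failure probability `delta`, and it sizes samples with Hoeffding's inequality. The function names, the two-stratum split, and all numbers are hypothetical. The stratified variant uses a simple sufficient condition: each stratum mean is estimated to within `eps` (with the failure probability split evenly by a union bound), so the population-weighted combination is also within `eps`.

```python
import math


def hoeffding_sample_size(value_range, eps, delta):
    """Samples needed so the sample mean is within eps of the true mean
    with probability >= 1 - delta, for values spanning value_range."""
    if value_range == 0:
        return 1  # all values identical: one row suffices
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * eps ** 2))


def uniform_sample_size(population, value_range, eps, delta):
    """Uniform sampling: one bound over the full value range,
    capped at the population size (a full scan)."""
    return min(population, hoeffding_sample_size(value_range, eps, delta))


def stratified_sample_size(strata, eps, delta):
    """Stratified sampling: strata is a list of (population, value_range).
    Each stratum gets failure probability delta / len(strata); a stratum
    whose bound exceeds its population is simply scanned in full."""
    per_stratum_delta = delta / len(strata)
    return sum(
        min(pop, hoeffding_sample_size(rng, eps, per_stratum_delta))
        for pop, rng in strata
    )


# Hypothetical sparse data: 1,000,000 rows, almost all in [0, 10],
# plus 100 outliers spread over [10, 1_000_000].
strata = [(999_900, 10), (100, 999_990)]
total_rows = sum(pop for pop, _ in strata)

n_uniform = uniform_sample_size(total_rows, 1_000_000, eps=1.0, delta=0.05)
n_stratified = stratified_sample_size(strata, eps=1.0, delta=0.05)

print(f"uniform:    {n_uniform} rows ({n_uniform / total_rows:.2%})")
print(f"stratified: {n_stratified} rows ({n_stratified / total_rows:.2%})")
```

In this toy setting the wide range forces uniform sampling to read every row, while the stratified plan scans the small outlier stratum in full and draws only a few hundred rows from the dense, narrow one. An even split of `delta` and a per-stratum error of `eps` are conservative choices made for simplicity; a tuned allocation could need fewer rows.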
Index Terms
- Error-bounded sampling for analytics on big sparse data