research-article

Error-bounded sampling for analytics on big sparse data

Published: 01 August 2014
Abstract

Aggregation queries are at the core of business intelligence and data analytics. In the big data era, many scalable shared-nothing systems have been developed to process aggregation queries over massive amounts of data. Microsoft's SCOPE is a well-known instance in this category. Nevertheless, aggregation queries are still expensive, because query processing needs to consume the entire data set, which is often hundreds of terabytes. Data sampling is a technique that processes only a small portion of the data and returns an approximate result with an error bound, thereby reducing the query's execution time. While similar problems have been studied in the database literature, we encountered new challenges that defeat most prior efforts: (1) error bounds are dictated by end users and cannot be compromised, and (2) data is sparse, meaning that it has a limited population but a wide range. In such cases, conventional uniform sampling often yields high sampling rates and thus delivers limited or no performance gains. In this paper, we propose error-bounded stratified sampling to reduce sample size. The technique relies on the insight that the sampling rate can be reduced only with knowledge of the data distribution. The technique has been implemented in Microsoft's internal search query platform. Results show that the proposed approach can reduce sample size by up to 99% compared with uniform sampling, and that its performance is robust against data volume and other key performance metrics.
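To make the abstract's intuition concrete, the following minimal sketch estimates how many rows must be sampled to keep the error of an average within a user-specified bound. It is not the paper's algorithm: it uses Hoeffding's inequality as a stand-in bound, and the function names, the two-stratum split, and all numbers are hypothetical values chosen for illustration. It assumes each stratum's value range is known in advance, which is the kind of distribution knowledge the abstract refers to.

```python
import math


def hoeffding_sample_size(value_range, epsilon, delta):
    """Rows needed so that |sample mean - true mean| <= epsilon holds with
    probability >= 1 - delta, by Hoeffding's inequality for bounded values."""
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * epsilon ** 2))


def uniform_sample_size(population, value_range, epsilon, delta):
    """Uniform sampling over the whole data set; a full scan is the worst case."""
    return min(hoeffding_sample_size(value_range, epsilon, delta), population)


def stratified_sample_size(strata, epsilon, delta):
    """strata is a list of (population, value_range) pairs.

    Conservative scheme (an assumption of this sketch): estimate every stratum
    mean to within epsilon, splitting the failure probability delta evenly
    across strata (union bound). The population-weighted average of the
    stratum means is then also within epsilon of the true mean.
    """
    k = len(strata)
    total = 0
    for population, value_range in strata:
        needed = hoeffding_sample_size(value_range, epsilon, delta / k)
        total += min(needed, population)  # scanning a whole stratum is exact
    return total


# Hypothetical sparse data: 1,000,000 rows spanning [0, 10000], where
# 990,000 rows fall in [0, 100] and 10,000 rows fall in [100, 10000].
uniform_n = uniform_sample_size(1_000_000, 10_000, epsilon=1.0, delta=0.05)
stratified_n = stratified_sample_size(
    [(990_000, 100), (10_000, 9_900)], epsilon=1.0, delta=0.05
)
print(uniform_n, stratified_n)
```

In this made-up setting the wide overall range forces uniform sampling into a full scan, while stratification samples the dense, narrow stratum lightly and scans only the small, wide one. The even split of the failure probability is the simplest choice, not an optimized allocation.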

Published in

Proceedings of the VLDB Endowment, Volume 7, Issue 13
August 2014
466 pages
ISSN: 2150-8097

Publisher

VLDB Endowment