DOI: 10.1145/1553374.1553486

research-article

Large-scale deep unsupervised learning using graphics processors

Published: 14 June 2009

ABSTRACT

The promise of unsupervised learning methods lies in their potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models with millions of free parameters. We consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications (Hinton & Salakhutdinov, 2006; Raina et al., 2007). Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer training examples.
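
For concreteness, sparse coding in the sense used here (Olshausen & Field, 1996; Lee et al., 2006) learns an overcomplete basis so that each input is approximated by a sparse linear combination of basis vectors, with an L1-regularized reconstruction cost per example. The NumPy sketch below only restates that standard objective under illustrative names (x, B, a, beta); it is not the authors' code.

    import numpy as np

    def sparse_coding_objective(x, B, a, beta):
        # x: (n,) input vector; B: (n, k) overcomplete basis (k > n);
        # a: (k,) activation vector, encouraged to be sparse;
        # beta: weight on the L1 sparsity penalty.
        residual = x - B @ a
        return 0.5 * residual @ residual + beta * np.sum(np.abs(a))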

In this paper, we suggest massively parallel methods to help resolve these problems. We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods. We develop general principles for massively parallelizing unsupervised learning tasks using graphics processors. We show that these principles can be applied to successfully scale up learning algorithms for both DBNs and sparse coding. Our implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models. For example, we are able to reduce the time required to learn a four-layer DBN with 100 million free parameters from several weeks to around a single day. For sparse coding, we develop a simple, inherently parallel algorithm that leads to a 5- to 15-fold speedup over previous methods.
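
To give a rough sense of why graphics processors help here: the inner loop of contrastive-divergence learning for the restricted Boltzmann machines that make up a DBN (Hinton, 2002) is dominated by dense matrix products over a minibatch of examples, which is exactly the workload that GPU BLAS routines accelerate. The sketch below is a generic one-step contrastive divergence (CD-1) update written in NumPy with illustrative names; it is not the authors' GPU implementation, only an indication of where the parallelizable work lies.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(V, W, b_vis, b_hid, lr=0.01, rng=None):
        # V: (m, n_vis) minibatch of binary visible vectors.
        # W: (n_vis, n_hid) weight matrix; b_vis, b_hid: bias vectors.
        if rng is None:
            rng = np.random.default_rng(0)
        m = V.shape[0]
        # Positive phase: hidden probabilities and samples given the data.
        H_prob = sigmoid(V @ W + b_hid)                    # dense matmul, (m, n_hid)
        H_samp = (rng.random(H_prob.shape) < H_prob).astype(V.dtype)
        # Negative phase: one Gibbs step back to the visibles and up again.
        V_recon = sigmoid(H_samp @ W.T + b_vis)            # dense matmul, (m, n_vis)
        H_recon = sigmoid(V_recon @ W + b_hid)             # dense matmul, (m, n_hid)
        # Parameter updates from the difference of data and model correlations.
        W = W + lr * (V.T @ H_prob - V_recon.T @ H_recon) / m
        b_vis = b_vis + lr * (V - V_recon).mean(axis=0)
        b_hid = b_hid + lr * (H_prob - H_recon).mean(axis=0)
        return W, b_vis, b_hid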

References

  1. Andrew, G., & Gao, J. (2007). Scalable training of L1-regularized log-linear models. International Conference on Machine Learning (pp. 33--40).
  2. Banko, M., & Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. Annual Meeting of the Association for Computational Linguistics (pp. 26--33).
  3. Bengio, Y. (2007). Speeding up stochastic gradient descent. Neural Information Processing Systems Workshop on Efficient Machine Learning.
  4. Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Neural Information Processing Systems (pp. 153--160).
  5. Bradley, D., & Bagnell, J. A. (2008). Differentiable sparse coding. Neural Information Processing Systems (pp. 113--120).
  6. Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large language models in machine translation. Conference on Empirical Methods in Natural Language Processing (EMNLP-CoNLL).
  7. Catanzaro, B. C., Sundaram, N., & Keutzer, K. (2008). Fast support vector machine training and classification on graphics processors. International Conference on Machine Learning (pp. 104--111).
  8. Chellapilla, K., Puri, S., & Simard, P. (2006). High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition.
  9. Chu, C. T., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G. R., Ng, A. Y., & Olukotun, K. (2006). Map-reduce for machine learning on multicore. Neural Information Processing Systems (pp. 281--288).
  10. Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Operating System Design and Implementation (pp. 137--150).
  11. Desjardins, G., & Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision. Tech report.
  12. Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Ann. Stat., 32, 407.
  13. Frank, D. (2002). Power-constrained CMOS scaling limits. IBM Jour. of Res. and Devel., 46, 235--244.
  14. Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. App. Stat., 2, 302--332.
  15. Gelsinger, P. (2001). Microprocessors for the new millennium: Challenges, opportunities and new frontiers. ISSCC Tech. Digest, 22--25.
  16. Goto, K., & Van De Geijn, R. (2008). High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw., 35, 1--14.
  17. Harris, M. (2008). Many-core GPU computing with NVIDIA CUDA. Int. Conf. Supercomputing (p. 1).
  18. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771--1800.
  19. Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527--1554.
  20. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504--507.
  21. Kavukcuoglu, K., Ranzato, M., & LeCun, Y. (2008). Fast inference in sparse coding algorithms with applications to object recognition. NYU tech report.
  22. Lee, H., Battle, A., Raina, R., & Ng, A. Y. (2006). Efficient sparse coding algorithms. Neural Information Processing Systems (pp. 801--808).
  23. Lee, H., Ekanadham, C., & Ng, A. Y. (2007). Sparse deep belief net model for visual area V2. Neural Information Processing Systems (pp. 873--880).
  24. Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. International Conference on Machine Learning (to appear).
  25. Murray, J. F., & Kreutz-Delgado, K. (2006). Learning sparse overcomplete codes for images. J. VLSI Signal Processing Systems, 45, 97--110.
  26. Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. International Conference on Machine Learning (pp. 78--85).
  27. Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607--609.
  28. Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: Transfer learning from unlabeled data. International Conference on Machine Learning (pp. 759--766).
  29. Ranzato, M. A., & Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. International Conference on Machine Learning (pp. 792--799).
  30. Salakhutdinov, R., & Hinton, G. (2007). Semantic hashing. SIGIR Workshop on Information Retrieval and Applications of Graphical Models.
  31. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B., 58, 267--288.
  32. van Hateren, J. H., & van der Schaaf, A. (1997). Independent component filters of natural images compared with simple cells in primary visual cortex. Royal Soc. Lond. B, 265, 359--366.
  33. Whaley, R. C., Petitet, A., & Dongarra, J. J. (2001). Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27, 3--35.



              • Published in

                ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
                June 2009
                1331 pages
                ISBN: 9781605585161
                DOI: 10.1145/1553374

                Copyright © 2009 by the author(s)/owner(s).

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 14 June 2009


                Qualifiers

                • research-article

                Acceptance Rates

                Overall acceptance rate: 140 of 548 submissions, 26%
