ABSTRACT
The promise of unsupervised learning methods lies in their potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models with millions of free parameters. We consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications (Hinton & Salakhutdinov, 2006; Raina et al., 2007). Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer training examples.
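To make the two models concrete: the basic building block of a DBN is the restricted Boltzmann machine (RBM), typically trained with contrastive divergence (Hinton, 2002). Below is a minimal NumPy sketch of one CD-1 update, not the paper's implementation; the sizes, learning rate, and initialization are illustrative assumptions. Its cost is dominated by a few large matrix products, which is exactly the workload at issue in what follows.

```python
# Minimal sketch of one contrastive-divergence (CD-1) update for a binary
# RBM, the building block of DBN learning. Illustrative only: sizes,
# learning rate, and initialization are assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, batch = 1024, 1024, 256
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b_vis, b_hid, v0, lr=0.01):
    """One CD-1 update on a mini-batch v0 of shape (batch, n_visible)."""
    # Positive phase: hidden probabilities given the data (one large GEMM).
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample hidden units
    # Negative phase: one Gibbs step down to the visibles and back up.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Approximate gradient (Hinton, 2002): a difference of two outer-product
    # statistics, again computed as large matrix products.
    W = W + lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_vis = b_vis + lr * (v0 - p_v1).mean(axis=0)
    b_hid = b_hid + lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

v0 = (rng.random((batch, n_visible)) < 0.5).astype(float)  # toy mini-batch
W, b_vis, b_hid = cd1_step(W, b_vis, b_hid, v0)
```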
In this paper, we suggest massively parallel methods to help resolve these problems. We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods. We develop general principles for massively parallelizing unsupervised learning tasks using graphics processors. We show that these principles can be applied to successfully scale up learning algorithms for both DBNs and sparse coding. Our implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models. For example, we are able to reduce the time required to learn a four-layer DBN with 100 million free parameters from several weeks to around a single day. For sparse coding, we develop a simple, inherently parallel algorithm that leads to a 5- to 15-fold speedup over previous methods.
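The paper's own parallel sparse coding algorithm is not reproduced here; as a hedged illustration of why the activation problem is inherently parallel, the sketch below uses a generic ISTA-style iterative shrinkage update, in which every coefficient of every example is revised independently and can therefore map onto its own GPU thread. All names and sizes (n_bases, beta, and so on) are assumptions made for the example.

```python
# Hedged sketch: batch sparse coding via an ISTA-style iterative shrinkage
# update. Not the paper's algorithm; it only illustrates an "inherently
# parallel" update in which every coefficient is revised independently.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_bases, batch = 512, 1024, 256
B = rng.standard_normal((n_inputs, n_bases))
B /= np.linalg.norm(B, axis=0)                  # unit-norm basis vectors
X = rng.standard_normal((batch, n_inputs))      # toy input batch

def soft_threshold(z, t):
    """Elementwise shrinkage operator induced by the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_codes(X, B, beta=0.1, n_iters=100):
    """Approximately minimize (1/2)||x - B a||^2 + beta ||a||_1 per example."""
    L = np.linalg.norm(B.T @ B, 2)              # Lipschitz constant of the gradient
    A = np.zeros((X.shape[0], B.shape[1]))
    for _ in range(n_iters):
        grad = (A @ B.T - X) @ B                # one large GEMM per iteration
        A = soft_threshold(A - grad / L, beta / L)  # fully parallel update
    return A

A = sparse_codes(X, B)
print("mean nonzeros per example:", (A != 0).sum(axis=1).mean())
```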
REFERENCES
- Andrew, G., & Gao, J. (2007). Scalable training of L1-regularized log-linear models. International Conference on Machine Learning (pp. 33--40).
- Banko, M., & Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. Annual Meeting of the Association for Computational Linguistics (pp. 26--33).
- Bengio, Y. (2007). Speeding up stochastic gradient descent. Neural Information Processing Systems Workshop on Efficient Machine Learning.
- Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Neural Information Processing Systems (pp. 153--160).
- Bradley, D., & Bagnell, J. A. (2008). Differentiable sparse coding. Neural Information Processing Systems (pp. 113--120).
- Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large language models in machine translation. Conference on Empirical Methods in Natural Language Processing (EMNLP-CoNLL).
- Catanzaro, B. C., Sundaram, N., & Keutzer, K. (2008). Fast support vector machine training and classification on graphics processors. International Conference on Machine Learning (pp. 104--111).
- Chellapilla, K., Puri, S., & Simard, P. (2006). High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition.
- Chu, C. T., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G. R., Ng, A. Y., & Olukotun, K. (2006). Map-reduce for machine learning on multicore. Neural Information Processing Systems (pp. 281--288).
- Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Operating System Design and Implementation (pp. 137--150).
- Desjardins, G., & Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision. Technical report.
- Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Ann. Stat., 32, 407--499.
- Frank, D. (2002). Power-constrained CMOS scaling limits. IBM Jour. of Res. and Devel., 46, 235--244.
- Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat., 2, 302--332.
- Gelsinger, P. (2001). Microprocessors for the new millennium: Challenges, opportunities and new frontiers. ISSCC Tech. Digest, 22--25.
- Goto, K., & van de Geijn, R. (2008). High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw., 35, 1--14.
- Harris, M. (2008). Many-core GPU computing with NVIDIA CUDA. Int. Conf. Supercomputing (p. 1).
- Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771--1800.
- Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527--1554.
- Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504--507.
- Kavukcuoglu, K., Ranzato, M., & LeCun, Y. (2008). Fast inference in sparse coding algorithms with applications to object recognition. Technical report, NYU.
- Lee, H., Battle, A., Raina, R., & Ng, A. Y. (2006). Efficient sparse coding algorithms. Neural Information Processing Systems (pp. 801--808).
- Lee, H., Ekanadham, C., & Ng, A. Y. (2007). Sparse deep belief net model for visual area V2. Neural Information Processing Systems (pp. 873--880).
- Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. International Conference on Machine Learning (to appear).
- Murray, J. F., & Kreutz-Delgado, K. (2006). Learning sparse overcomplete codes for images. J. VLSI Signal Processing Systems, 45, 97--110.
- Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. International Conference on Machine Learning (pp. 78--85).
- Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607--609.
- Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: Transfer learning from unlabeled data. International Conference on Machine Learning (pp. 759--766).
- Ranzato, M. A., & Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. International Conference on Machine Learning (pp. 792--799).
- Salakhutdinov, R., & Hinton, G. (2007). Semantic hashing. SIGIR Workshop on Information Retrieval and Applications of Graphical Models.
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B., 58, 267--288.
- van Hateren, J. H., & van der Schaaf, A. (1997). Independent component filters of natural images compared with simple cells in primary visual cortex. Royal Soc. Lond. B, 265, 359--366.
- Whaley, R. C., Petitet, A., & Dongarra, J. J. (2001). Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27, 3--35.