Abstract
The demand for artificial intelligence has grown significantly over the past decade, fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, increasing the quality of predictions and making machine learning solutions feasible for more complex applications requires a substantial amount of training data. Although small machine learning models can be trained with modest amounts of data, the amount of input needed to train larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in the computation power of computing machinery, the machine learning workload needs to be distributed across multiple machines, turning the centralized system into a distributed one. These distributed systems present new challenges: first and foremost, the efficient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state of the art in the field by outlining the challenges and opportunities of distributed machine learning compared with conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and surveying the systems that are available.
Metrics
Citations: 687
References: 171
Details
Published: Mar 20, 2020
Vol/Issue: 53(2)
Pages: 1-33
Cite This Article
Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, et al. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys, 53(2), 1-33. https://doi.org/10.1145/3377454