A Survey on Distributed Machine Learning

Joost Verbraeken; Matthijs Wolting; Jonathan Katzy; Jeroen Kloppenburg; Tim Verbelen; Jan S. Rellermeyer

doi:10.1145/3377454

Journal article · Open Access · Mar 20, 2020

ACM Computing Surveys Vol. 53 No. 2 pp. 1-33 · Association for Computing Machinery (ACM)

Abstract

The demand for artificial intelligence has grown significantly over the past decade, and this growth has been fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, to increase the quality of predictions and render machine learning solutions feasible for more complex applications, a substantial amount of training data is required. Although small machine learning models can be trained with modest amounts of data, the input for training larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in computation power of computing machinery, there is a need for distributing the machine learning workload across multiple machines, turning the centralized system into a distributed one. These distributed systems present new challenges: first and foremost, the efficient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state of the art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and providing an overview of the systems that are available.

Metrics

Citations: 687
References: 171

Details

Published: Mar 20, 2020
Vol/Issue: 53(2)
Pages: 1-33

Authors

Joost Verbraeken
Delft University of Technology, Delft, Netherlands

Matthijs Wolting
Delft University of Technology, Delft, Netherlands

Jonathan Katzy
Delft University of Technology, Delft, Netherlands

Jeroen Kloppenburg
Delft University of Technology, Delft, Netherlands

Tim Verbelen
imec - Ghent University, Ghent, Belgium

Jan S. Rellermeyer
Delft University of Technology, Netherlands

Cite This Article

Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, et al. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys, 53(2), 1-33. https://doi.org/10.1145/3377454
