Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform

Yangguang Li; Zhen Ming (Jack) Jiang; Heng Li; Ahmed E. Hassan; Cheng He; Ruirui Huang; Zhengda Zeng; Mian Wang; Pinan Chen

doi:10.1145/3385187

journal article Apr 29, 2020

ACM Transactions on Software Engineering and Methodology Vol. 29 No. 2 pp. 1-24 · Association for Computing Machinery (ACM)

Abstract

Many software services today are hosted on cloud computing platforms, such as Amazon EC2, due to many benefits like reduced operational costs. However, node failures in these platforms can impact the availability of their hosted services and potentially lead to large financial losses. Predicting node failures before they actually occur is crucial, as it enables DevOps engineers to minimize their impact by performing preventative actions. However, such predictions are hard due to many challenges like the enormous size of the monitoring data and the complexity of the failure symptoms. AIOps (Artificial Intelligence for IT Operations), a recently introduced approach in DevOps, leverages data analytics and machine learning to improve the quality of computing platforms in a cost-effective manner. However, the successful adoption of such AIOps solutions requires much more than a top-performing machine learning model. Instead, AIOps solutions must be trustable, interpretable, maintainable, scalable, and evaluated in context. To cope with these challenges, in this article we report our process of building an AIOps solution for predicting node failures for an ultra-large-scale cloud computing platform at Alibaba. We expect our experiences to be of value to researchers and practitioners, who are interested in building and maintaining AIOps solutions for large-scale cloud computing platforms.

Details

Published: Apr 29, 2020
Vol/Issue: 29(2)
Pages: 1-24

Authors

Yangguang Li

York University, Toronto, Ontario, Canada

Zhen Ming (Jack) Jiang

York University, Toronto, Ontario, Canada

Queen's University
