Abstract
Many software services today are hosted on cloud computing platforms, such as Amazon EC2, due to many benefits like reduced operational costs. However, node failures in these platforms can impact the availability of their hosted services and potentially lead to large financial losses. Predicting node failures before they actually occur is crucial, as it enables DevOps engineers to minimize their impact by performing preventative actions. However, such predictions are hard due to many challenges like the enormous size of the monitoring data and the complexity of the failure symptoms. AIOps (Artificial Intelligence for IT Operations), a recently introduced approach in DevOps, leverages data analytics and machine learning to improve the quality of computing platforms in a cost-effective manner. However, the successful adoption of such AIOps solutions requires much more than a top-performing machine learning model. Instead, AIOps solutions must be trustable, interpretable, maintainable, scalable, and evaluated in context. To cope with these challenges, in this article we report our process of building an AIOps solution for predicting node failures for an ultra-large-scale cloud computing platform at Alibaba. We expect our experiences to be of value to researchers and practitioners, who are interested in building and maintaining AIOps solutions for large-scale cloud computing platforms.
Metrics
Citations: 69
References: 47
Details
Published: Apr 29, 2020
Vol/Issue: 29(2)
Pages: 1-24
Funding: Alibaba Innovative Research Program
Cite This Article
Yangguang Li, Zhen Ming (Jack) Jiang, Heng Li, et al. (2020). Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform. ACM Transactions on Software Engineering and Methodology, 29(2), 1-24. https://doi.org/10.1145/3385187