Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform
AIOps (Artificial Intelligence for IT Operations), a recently introduced approach in DevOps, leverages data analytics and machine learning to improve the quality of computing platforms in a cost-effective manner. However, the successful adoption of such AIOps solutions requires much more than a top-performing machine learning model: AIOps solutions must also be trustable, interpretable, maintainable, scalable, and evaluated in context. To cope with these challenges, in this article we report our process of building an AIOps solution for predicting node failures in an ultra-large-scale cloud computing platform at Alibaba. We expect our experiences to be of value to researchers and practitioners who are interested in building and maintaining AIOps solutions for large-scale cloud computing platforms.
- Published: Apr 29, 2020
- Vol/Issue: 29(2)
- Pages: 1-24