journal article Open Access Dec 27, 2025

Latent Diffusion over Dynamic Service Graphs for Uncertainty-Quantified Failure Propagation Prediction

DOI: 10.71465/ajdsa3450
Abstract
Modern microservice architectures complicate reliability engineering because of their complex dependency structures and dynamic failure propagation patterns. Traditional monitoring approaches struggle to predict cascading failures because they lack a mechanism for modeling the stochastic nature of fault diffusion across service graphs. We propose a framework that integrates latent diffusion models with temporal graph neural networks to predict failure propagation with quantified uncertainty. Our approach constructs dynamic service dependency graphs from distributed tracing data and identifies causal relationships, including happens-before ordering, mutual exclusion patterns, and pipeline structures. A specialized graph neural architecture learns service representations through multi-task objectives spanning node-level, edge-level, and graph-level predictions. Failure propagation dynamics are modeled as a diffusion process in a learned latent space, where refining the graph structure through density-based edge definition and sparsification enables efficient uncertainty quantification. Evaluation on production microservice systems shows a 23% improvement in F1-score over baseline methods and a 71% reduction in detection latency, together with calibrated uncertainty estimates that enable proactive incident management.
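The graph construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Span` record, the fixed-width time windows, and the function names are all assumptions made for illustration, and only parent-child call edges and a simple happens-before check are shown.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified tracing span (not a real tracing-library type)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    start: float
    end: float


def build_dependency_graphs(spans, window):
    """Group cross-service calls into per-window edge counts.

    Returns {window_index: {(caller, callee): call_count}}, where the window
    index is derived from the start time of the child span.
    """
    by_id = {s.span_id: s for s in spans}
    graphs = defaultdict(lambda: defaultdict(int))
    for s in spans:
        parent = by_id.get(s.parent_id)
        # Skip root spans and calls that stay inside one service.
        if parent is None or parent.service == s.service:
            continue
        graphs[int(s.start // window)][(parent.service, s.service)] += 1
    return {w: dict(edges) for w, edges in graphs.items()}


def happens_before(a, b):
    """True when span a finishes no later than span b starts."""
    return a.end <= b.start


# Illustrative data only: two traces through an invented service topology.
spans = [
    Span("1", None, "gateway", 0.0, 9.0),
    Span("2", "1", "orders", 1.0, 4.0),
    Span("3", "1", "payments", 5.0, 8.0),
    Span("4", "2", "inventory", 2.0, 3.0),
    Span("5", None, "gateway", 12.0, 15.0),
    Span("6", "5", "orders", 13.0, 14.0),
]
graphs = build_dependency_graphs(spans, window=10.0)
```

Each window yields one snapshot of the dependency graph, so the sequence of snapshots forms a simple dynamic graph. Sibling spans under the same parent, such as the `orders` and `payments` calls here, are the natural candidates for a happens-before test.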
Details
Published
Dec 27, 2025
Vol/Issue
6(4)
Pages
40-61
Cite This Article
Jonas Richter, Elena Popescu (2025). Latent Diffusion over Dynamic Service Graphs for Uncertainty-Quantified Failure Propagation Prediction. American Journal of Data Science and Analysis, 6(4), 40-61. https://doi.org/10.71465/ajdsa3450