Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries

Pan Hu; Boris Motik

doi:10.1145/3689209

journal article Open Access Sep 17, 2024

Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries

Pan Hu

Boris Motik

ACM Transactions on Database Systems Vol. 49 No. 3 pp. 1-46 · Association for Computing Machinery (ACM)

View at Publisher Save 10.1145/3689209

Abstract

Accurately estimating the cardinality (i.e., the number of answers) of complex queries plays a central role in database systems. This problem is particularly difficult in graph databases, where queries often involve a large number of joins and self-joins. Recently, Park et al. [55] surveyed seven state-of-the-art cardinality estimation approaches for graph queries. The results of their extensive empirical evaluation show that a sampling method based on theWanderJoinonline aggregation algorithm [47] consistently offers superior accuracy.We extended the framework by Park et al. [55] with three additional datasets and repeated their experiments. Our results showed that WanderJoin is indeed very accurate, but it can often take a large number of samples and thus be very slow. Moreover, when queries are complex and data distributions are skewed, it often fails to find valid samples and estimates the cardinality as zero. Finally, complex graph queries often go beyond simple graph matching and involve arbitrary nesting of relational operators such as disjunction, difference, and duplicate elimination. Neither of the methods considered by Park et al. [55] is applicable to such queries.In this article, we present a novel approach for estimating the cardinality of complex graph queries. Our approach is inspired by WanderJoin, but, unlike all approaches known to us, it can process complex queries with arbitrary operator nesting. Our estimator is strongly consistent, meaning that the average of repeated estimates converges with probability one to the actual cardinality. We present optimisations of the basic algorithm that aim to reduce the chance of producing zero estimates and improve accuracy. We show empirically that our approach is both accurate and quick on complex queries and large datasets. Finally, we discuss how to integrate our approach into a simple dynamic programming query planner, and we confirm empirically that our planner produces high-quality plans that can significantly reduce end-to-end query evaluation times.

Topics

No keywords indexed for this article. Browse by subject →

References

77

[1]

10.5555/1325851.1325900

[2]

A. Aboulnaga and S. Chaudhuri. 1999. Self-tuning histograms: Building histograms without looking at data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’99). ACM, 181–192. 10.1145/304182.304198

[3]

S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. 1999. Join synopses for approximate query answering. In Proceedings of the International Conference on Management of Data (SIGMOD’99). ACM Press, 275–286.

[4]

G. Aluç, O. Hartig, M. Tamer Özsu, and K. Daudjee. 2014. Diversified stress testing of RDF data management systems. In Proceedings of the 13th International Semantic Web Conference (ISWC’14). Springer, 197–212.

[5]

10.1007/s10115-014-0770-y

[6]

R. Angles, M. Arenas, P. Barceló, A. Hogan, J. L. Reutter, and D. Vrgoc. 2017. Foundations of modern query languages for graph databases. ACM Comput. Surv. 50, 5 (2017), 68:1–68:40.

[7]

10.1145/2463664.2465216

[8]

10.1016/0743-1066(91)90038-q

[9]

D. P. Bertsekas and J. N. Tsitsiklis. 2008. Introduction to Probability (2nd ed.). Athena Scientific, Belmont, MA, USA.

[10]

N. Bruno. 2003. Statistics on Query Expressions in Relational Database Management Systems. Ph. D. Dissertation. Columbia University.

[11]

N. Bruno and S. Chaudhuri. 2004. Conditional selectivity for statistics on query expressions. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’04). ACM, 311–322. 10.1145/1007568.1007604

[12]

10.1145/376284.375686

[13]

W. Cai, M. Balazinska, and D. Suciu. 2019. Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities. In Proceedings of the 40th International Conference on Management of Data (SIGMOD’19). ACM Press, 18–35.

[14]

10.1007/s007780100049

[15]

M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. 2000. Towards estimation error guarantees for distinct values. In Proceedings of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’00). ACM, 268–279.

[16]

10.14778/3529337.3529339

[17]

X. Chen and J. C. S. Lui. 2016. Mining graphlet counts in online social networks. In Proceedings of the 16th International IEEE Conference on Data Mining (ICDM’16). IEEE Computer Society, 71–80.

[18]

X. Chen and J. C. S. Lui. 2018. Mining graphlet counts in online social networks. ACM Trans. Knowl. Discov. Data 12, 4 (2018), 1–38. 10.1145/3182392

[19]

C. A. Galindo-Legaria, M. Joshi, F. Waas, and M.-C. Wu. 2003. Statistics on views. In Proceedings of the 29th International Conference on Very Large Databases (VLDB’03). Morgan Kaufmann, 952–962.

[20]

H. Garcia-Molina, J. D. Ullman, and J. Widom. 2000. Database System Implementation. Prentice-Hall, Upper Saddle River, NJ, USA.

[21]

M. Garofalakis and P. B. Gibbons. 2002. Wavelet synopses with error guarantees. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’02). ACM, 476–487. 10.1145/564691.564746

[22]

L. Getoor, B. Taskar, and D. Koller. 2001. Selectivity estimation using probabilistic models. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’01). ACM, 461–472.

[23]

D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. 2000. Approximating multi-dimensional aggregate range queries over real attributes. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’00). ACM, 463–474. 10.1145/342009.335448

[24]

10.1007/s00778-003-0090-4

[25]

10.1016/j.websem.2005.06.005

[26]

P. J. Haas. 1997. Large-sample and deterministic confidence intervals for online aggregation. In Proceedings of the 9th International Conference on Scientific and Statistical Database Management (SSDBM’97). IEEE Computer Society, 51–63.

[27]

P. J. Haas and J. M. Hellerstein. 1999. Ripple joins for online aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’99). ACM Press, 287–298. 10.1145/304182.304208

[28]

10.1006/jcss.1996.0041

[29]

P. J. Haas and A. N. Swami. 1992. Sequential sampling procedures for query size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’92). ACM Press, 341–350.

[30]

S. Harris and A. Seaborne. 2013. SPARQL 1.1 Query Language W3C Recommendation. Retrieved from https://www.w3.org/TR/sparql11-query/

[31]

S. Hasan, S. Thirumuruganathan, J. Augustine, N. Koudas, and G. Das. 2020. Deep learning models for selectivity estimation of multi-attribute queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’20). ACM, 1035–1050.

[32]

M. Heimel, M. Kiefer, and V. Markl. 2015. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’15). ACM, 1477–1492. 10.1145/2723372.2749438

[33]

J. M. Hellerstein, P. J. Haas, and H. J. Wang. 1997. Online aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’97). ACM Press, 171–182.

[34]

A. Hertzschuch, G. Moerkotte, W. Lehner, N. May, F. Wolf, and L. Fricke. 2021. Small selectivities matter: Lifting the burden of empty samples. In Proceedings of the International Conference on Management of Data (SIGMOD’21). ACM, 697–709.

[35]

DeepDB

Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa et al.

Proceedings of the VLDB Endowment 10.14778/3384345.3384349

[36]

A Generalization of Sampling Without Replacement from a Finite Universe

D. G. Horvitz, D. J. Thompson

Journal of the American Statistical Association 10.1080/01621459.1952.10483446

[37]

Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries

Pan Hu, Boris Motik

ACM Transactions on Database Systems 10.1145/3689209

[38]

Y. E. Ioannidis. 2003. The history of histograms (abridged). In Proceedings of the 29th International Conference on Very Large Databases (VLDB’03). Morgan Kaufmann, Berlin, Germany, 19–30.

[39]

Y. E. Ioannidis and S. Christodoulakis. 1991. On the propagation of errors in the size of join results. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’91). ACM, 268–277.

[40]

Z. G. Ives and N. E. Taylor. 2008. Sideways information passing for push-style query processing. In Proceedings of the 24th International Conference on Data Engineering (ICDE’08). IEEE Computer Society, 774–783.

[41]

C. Jin, S. S. Bhowmick, B. Choi, and S. Zhou. 2012. PRAGUE: Towards blending practical visual subgraph query formulation and query processing. In Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE’12). IEEE Computer Society, 222–233.

[42]

M. A. Khamis, H. Q. Ngo, and D. Suciu. 2017. What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another? In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’17). ACM, 429–444. 10.1145/3034786.3056105

[43]

A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. 2019. Learned cardinalities: Estimating correlated joins with deep learning. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR’19). Retrieved from www.cidrdb.org

[44]

G. Klyne J. J. Carroll and B. McBride. 2014. RDF 1.1 Concepts and Abstract Syntax. Retrieved from https://www.w3.org/TR/rdf11-concepts/

[45]

J.-H. Lee, D.-H. Kim, and C.-W. Chung. 1999. Multi-dimensional selectivity estimation using compressed histogram information. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’99). ACM, 205–214. 10.1145/304182.304200

[46]

10.14778/2850583.2850594

[47]

Wander Join and XDB

Feifei Li, Bin Wu, Ke Yi et al.

ACM Transactions on Database Systems 10.1145/3284551

[48]

R. J. Lipton and J. F. Naughton. 1990. Query size estimation by adaptive sampling. In Proceedings of the 9th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’90). ACM Press, 40–46. 10.1145/298514.298540

[49]

Y. Liu, T. Safavi, A. Dighe, and D. Koutra. 2018. Graph summarization methods and applications: A survey. ACM Comput. Surv. 51, 3 (2018), 62:1–62:34.

[50]

Y. Matias, J. S. Vitter, and M. Wang. 1998. Wavelet-based histograms for selectivity estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’98). ACM, 448–459. 10.1145/276304.276344

Showing 50 of 77 references

Cited By

4

Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries

Pan Hu, Boris Motik · 2024

ACM Transactions on Database System...

Metrics

4

Citations

77

References

Details

Published: Sep 17, 2024
Vol/Issue: 49(3)
Pages: 1-46
License: View

Authors

P

Pan Hu

School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China

B

Boris Motik

Department of Computer Science, Oxford University, Oxford, United Kingdom of Great Britain and Northern Ireland

Funding

NSFC Award: 62206169

EPSRC Award: EP/P025943/1

Cite This Article

Pan Hu, Boris Motik (2024). Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries. ACM Transactions on Database Systems, 49(3), 1-46. https://doi.org/10.1145/3689209

Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries

You May Also Like