Abstract
User-defined functions (UDFs) extend the capabilities of SQL by improving code reusability and encapsulating complex logic, but can hinder the performance due to optimization and execution inefficiencies. Prior approaches attempt to address this by rewriting UDFs into native SQL, which is then inlined into the SQL queries that invoke them. However, these approaches are either limited to simple pattern matching or require the synthesis of complex verification conditions from procedural code, a process that is brittle and difficult to automate. This limits coverage and makes the translation approaches less extensible to previously unseen procedural constructs. In this work, we present QURE, a framework that (1) leverages large language models (LLMs) to translate UDFs to native SQL, and (2) introduces a novel formal verification method to establish equivalence between the UDF and its translation. QURE uses the semantics of SQL operators to automate the derivation of verification conditions, in turn resulting in broad coverage and high extensibility. We model a large set of imperative constructs, particularly those common in Python and Pandas UDFs, in an intermediate verification language, allowing for the verification of their SQL translation. In our empirical evaluation of Python and Pandas UDFs, equivalence is successfully verified for 88% of UDF-SQL pairs (the rest lack semantically-equivalent SQLs) and LLMs correctly translate 84% of the UDFs. Executing the translated UDFs achieves median performance improvements of 23x on single-node clusters and 12x on 12-node clusters compared to the original UDFs, while also significantly reducing out-of-memory errors.
Topics

No keywords indexed for this article. Browse by subject →

References
55
[1]
Accessed 2025-01--10. Columnstore indexes guide. https://github.com/ByePy/examples.
[2]
Accessed 2025-01--10. Generators. https://wiki.python.org/moin/Generators.
[3]
Accessed: 2025-01--10. Logical and physical showplan operator reference. https://learn.microsoft.com/en-us/sql/ relational-databases/showplan-logical-and-physical-operators-reference?view=sql-server-ver16.
[4]
Accessed: 2025-01--10. NumPy. https://numpy.org/.
[5]
Accessed: 2025-01--10. Online UDF Compiler. https://apfel-db.cs.uni-tuebingen.de/.
[6]
Accessed: 2025-01--10. Pandas Documentation. https://pandas.pydata.org/pandas-docs/stable/index.html.
[7]
Accessed: 2025-01--10. Pandas.DataFrame.groupby. https://pandas.pydata.org/pandas/docs/stable/reference/api/pandas. DataFrame.groupby.html.
[8]
Accessed: 2025-01--10. PySpark Overview. https://spark.apache.org/docs/latest/api/python/index.html.
[9]
Accessed: 2025-01--10. QURE - MSR Webpage. https://www.microsoft.com/en-us/research/project/qure/.
[10]
Accessed: 2025-01--10. Satisfiability Modulo Theories. https://en.wikipedia.org/wiki/Satisfiability_modulo_theories.
[11]
Accessed: 2025-01--10. Spark SQL Built in Functions. https://spark.apache.org/docs/latest/api/sql/index.html.
[12]
Accessed 2025-01--10. Uninterpreted Function. https://en.wikipedia.org/wiki/Uninterpreted_function.
[13]
Accessed: 2025-01--10. What is a DataFrame? https://www.databricks.com/glossary/what-are-dataframes.
[14]
Accessed: 2025-01--10. Z3 Prover. https://github.com/Z3Prover/z3.
[16]
Saikat Chakraborty Shuvendu Lahiri Sarah Fakhoury Madan Musuvathi Akash Lal Aseem Rastogi Nikhil Swamy and Rahul Sharma. 2023. Ranking LLM-Generated Loop Invariants for Program Verification. In 2023 Empirical Methods in Natural Language Processing. 10.18653/v1/2023.findings-emnlp.614
[19]
K Venkatesh Emani, Avrilia Floratou, Carlo Curino, L Tanca, Q Luo, G Polese, L Caruccio, and X Oriol. 2024. PyFroid: Scaling Data Analysis on a Commodity Workstation.. In EDBT. 61--67.
[20]
Gregory Essertel, Ruby Tahboub, James Decker, Kevin Brown, Kunle Olukotun, and Tiark Rompf. 2018. Flare: Optimizing Apache Spark with Native Compilation for {Scale-Up} Architectures and {Medium-Size} Data. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 799--815.
[22]
Tim Fischer. 2023. To Iterate Is Human to Recurse Is Divine -- Mapping Iterative Python to Recursive SQL. In BTW. 1069--1074.
[24]
Kai Franz, Samuel Arch, Denis Hirn, Torsten Grust, Todd C. Mowry, and Andrew Pavlo. 2024. Dear User-Defined Functions, Inlining isn't working out so great for us. Let's try batching to make our relationship work. Sincerely, SQL. In Conference on Innovative Data Systems Research.
[26]
Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei Lin, Jingren Zhou, and Lidong Zhou. 2012. Spotting Code Optimizations in {Data-Parallel} Pipelines through {PeriSCOPE}. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 121--133.
[28]
Stefan Hagedorn, Steffen Kläbe, and Kai-Uwe Sattler. 2021. Putting pandas in a box. In Conference on Innovative Data Systems Research (CIDR);(Online). 15.
[29]
Denis Hirn. 2023. Data is Data and Control Should be Data, Too -- Compiling Iterative Table-valued PL/SQL UDFs into Recursive SQL Code. In VLDB PhD Workshop.
[33]
Catalin Hritcu. 2019. Evolution Semantics and Engineering of the F Verification System. (2019).
[35]
Alekh Jindal et al. 2018. Computation reuse in analytics job service at microsoft. In Proceedings of the 2018 International Conference on Management of Data. 191--203.
[36]
Alekh Jindal, K Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas Mueller, et al. 2021. Magpie: Python at Speed and Scale using Cloud Backends.. In CIDR.
[37]
Adharsh Kamath, Nausheen Mohammed, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma. 2024. Leveraging LLMs for Program Verification. In # PLACEHOLDER_PARENT_METADATA_VALUE#. TU Wien Academic Press, 107--118.
[39]
Erik Nijkamp et al. 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. ICLR (2023).
[40]
Erik Nijkamp et al. 2023. CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. ICLR (2023).
[48]
Baptiste Roziere, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages (NIPS '20). Red Hook, NY, USA.
[49]
K. Rustan and M. Leino. Accessed: 2025-01--10. This is Boogie 2. https://www.microsoft.com/en-us/research/wpcontent/uploads/2016/12/krml178.pdf.
[50]
Hesam Shahrokhi et al. 2024. Pytond: Efficient python data science on the shoulders of databases. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 423--435.

Showing 50 of 55 references

Metrics
2
Citations
55
References
Details
Published
Feb 10, 2025
Vol/Issue
3(1)
Pages
1-26
License
View
Cite This Article
Tarique Siddiqui, Arnd Christian König, Jiashen Cao, et al. (2025). QURE: AI-Assisted and Automatically Verified UDF Inlining. Proceedings of the ACM on Management of Data, 3(1), 1-26. https://doi.org/10.1145/3709716