Abstract
We use machine learning to optimize LSM-tree structure, aiming to reduce the cost of processing various read/write operations. We introduce a new approach CAMAL, which boasts the following features: (1)
ML-Aided
: CAMAL is the first attempt to apply active learning to tune LSM-tree based key-value stores. The learning process is coupled with traditional cost models to improve the training process; (2)
Decoupled Active Learning
: backed by rigorous analysis, CAMAL adopts active learning paradigm based on a decoupled tuning of each parameter, which further accelerates the learning process; (3)
Easy Extrapolation
: CAMAL adopts an effective mechanism to incrementally update the model with the growth of the data size; (4)
Dynamic Mode
: CAMAL is able to tune LSM-tree online under dynamically changing workloads; (5)
Significant System Improvement
: By integrating CAMAL into a full system RocksDB, the system performance improves by 28% on average and up to 8x compared to a state-of-the-art RocksDB design.
Topics

No keywords indexed for this article. Browse by subject →

References
91
[1]
2016. HBase. http://hbase.apache.org/. [Online; accessed 12-January-2022].
[2]
2021. WiredTiger. https://github.com/wiredtiger/wiredtiger.
[4]
Hussam Abu-Libdeh, Deniz Altinbüken, Alex Beutel, Ed H Chi, Lyric Doshi, Tim Kraska, Andy Ly, Christopher Olston, et al. 2020. Learned indexes for a google-scale disk-based database. arXiv preprint arXiv:2012.12501 (2020).
[6]
Wail Y Alkowaileet, Sattam Alsubaiee, and Michael J Carey. 2019. An LSM-based Tuple Compaction Framework for Apache AsterixDB (Extended Version). arXiv preprint arXiv:1910.08185 (2019).
[7]
Apache. 2016. Cassandra. http://cassandra.apache.org.
[9]
Oana Balmau Diego Didona Rachid Guerraoui Willy Zwaenepoel Huapeng Yuan Aashray Arora Karan Gupta and Pavan Konka. 2017. {TRIAD}: Creating Synergies Between Memory Disk and Log in Log Structured Key-Value Stores. In 2017 {USENIX} Annual Technical Conference ({USENIX} {ATC} 17). 363--375.
[10]
Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan Gupta, Ravishankar Chandhiramoorthi, and Diego Didona. 2019. SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores.. In USENIX Annual Technical Conference. 753--766.
[11]
Christian Berthet. 2017. Approximation of LRU caches miss rate: Application to power-law popularities. arXiv preprint arXiv:1705.10738 (2017).
[14]
Zhao Cao, Shimin Chen, Feifei Li, Min Wang, and X Sean Wang. 2013. LogKV: Exploiting key-value stores for event log processing. In Proc. Conf. Innovative Data Syst. Res.
[15]
Helen HW Chan, Yongkun Li, Patrick PC Lee, and Yinlong Xu. 2018. Hashkv: Enabling efficient updates in {KV} storage via hashing. In 2018 {USENIX} Annual Technical Conference ({USENIX} {ATC} 18). 1007--1019.
[19]
Tianqi Chen Tong He Michael Benesty Vadim Khotilovich Yuan Tang Hyunsu Cho Kailong Chen Rory Mitchell Ignacio Cano Tianyi Zhou et al. 2015. Xgboost: extreme gradient boosting. R package version 0.4--2 1 4 (2015) 1--4.
[21]
Yifan Dai, Yien Xu, Aishwarya Ganesan, Ramnatthan Alagappan, Brian Kroth, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. 2020. From wisckey to bourbon: A learned index for log-structured merge trees. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 155--171.
[25]
Spooky

Niv Dayan, Tamar Weiss, Shmuel Dashevsky et al.

Proceedings of the VLDB Endowment 10.14778/3551793.3551853
[26]
Dayan, Niv and Idreos, Stratos. 2019. The log-structured merge-bush & the wacky continuum. In Proceedings of the 2019 International Conference on Management of Data. 449--466.
[30]
Facebook. 2016. RocksDB. https://github.com/facebook/rocksdb.
[32]
Google. 2016. LevelDB. https://github.com/google/leveldb/.
[34]
Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. 2020. Array programming with NumPy. Nature 585, 7825 (2020), 357--362.
[35]
Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. 2009. The elements of statistical learning: data mining, inference, and prediction. Vol. 2. Springer.
[36]
Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2019. DeepDB: Learn from Data, not from Queries! Proceedings of the VLDB Endowment 13, 7 (2019).
[38]
Enhui Huang, Liping Peng, Luciano Di Palma, Ahmed Abdelkafi, Anna Liu, and Yanlei Diao. 2018. Optimization for active learning-based interactive database exploration. Ph.D. Dissertation. Ecole Polytechnique; University of Massachusetts Amherst.
[41]
Andy Huynh Harshal A. Chaudhari Evimaria Terzi and Manos Athanassoulis. 2023. Towards Flexibility and Robustness of LSM Trees. arXiv:2311.10005 [cs.DB] 10.1007/s00778-023-00826-9
[42]
Stratos Idreos Niv Dayan Wilson Qin Mali Akmanalp Sophie Hilgard Andrew Ross James Lennon Varun Jain Harshita Gupta David Li et al. 2019. Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn.. In CIDR.
[43]
William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, et al. 2015. BetrFS: A Right-Optimized Write-Optimized File System.. In FAST, Vol. 15. 301--315.
[49]
Honghao Lin, Tian Luo, and DavidWoodruff. 2022. Learning augmented binary search trees. In International Conference on Machine Learning. PMLR, 13431--13440.

Showing 50 of 91 references

Metrics
5
Citations
91
References
Details
Published
Sep 30, 2024
Vol/Issue
2(4)
Pages
1-26
License
View
Funding
Nanyang Technological University Award: 022029-00001
Cite This Article
Weiping Yu, Siqiang Luo, Zihao Yu, et al. (2024). CAMAL: Optimizing LSM-trees via Active Learning. Proceedings of the ACM on Management of Data, 2(4), 1-26. https://doi.org/10.1145/3677138