journal article Open Access Jun 18, 2021

Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment

Electronics Vol. 10 No. 12 pp. 1471 · MDPI AG
DOI: 10.3390/electronics10121471
Abstract
Data volumes in data-intensive scientific environments are growing continuously. This growth demands data storage systems, which play a pivotal role in data management and analysis for scientific discovery. Redundant Array of Independent Disks (RAID), a well-known storage technology that combines multiple disks into a single large logical volume, has been widely used for data redundancy and performance improvement. However, building a RAID-enabled disk array requires RAID-capable hardware or software, and RAID-based storage is difficult to scale up. To mitigate these problems, many distributed file systems have been developed and are actively used in various environments, especially in data-intensive computing facilities, where a tremendous amount of data has to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre, and EOS, for data-intensive environments. In our experiments, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect read and write performance depending on the features of the data, which have to be considered in data-intensive computing environments.