This article is part of the Academic Alibaba series and is taken from the paper entitled “PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Databases” by Wei Cao, Zhenjun Liu, Peng Wang, Sen Chen, Caifeng Zhu, Song Zheng, Yuhui Wang, and Guoqing Ma, accepted by VLDB 2018. The full paper can be read here.
Recently, cloud computing services have increasingly looked to decouple storage from compute. While this brings a number of benefits, it also poses some key challenges for distributed file systems built on emerging hardware technologies like RDMA, NVMe, and SPDK. In response, the Alibaba tech team has developed a new, bespoke distributed file system called PolarFS to fully tap the potential of these emerging technologies.
A decoupled cloud architecture is more flexible and can exploit shared storage, which has several benefits:
1. Compute and storage can be customized independently and use different types of server hardware.
2. Disks on storage node clusters can form a single storage pool, reducing the risk of fragmentation, disk-usage imbalance, and wasted space.
3. The capacity and throughput of a storage cluster can be scaled out transparently.
4. Compute nodes do not have local persistent state, as all data is stored on the storage cluster. As a result, data reliability is improved through the use of data replication and other high-availability features from the underlying distributed storage system.
Cloud database services also benefit from decoupling. Decoupling allows databases to build on more secure and easily-scalable environments based on virtualization techniques such as Xen, KVM, or Docker. Additionally, key features of databases, such as multiple read-only instances and checkpoints, can be enhanced with the support of back-end storage clusters that provide fast I/O, data sharing, and snapshots.
As data storage technology continues to develop at a rapid pace, the current leading cloud platforms struggle to take full advantage of emerging hardware technologies like RDMA, NVMe, and SPDK. Widely used open-source distributed file systems, such as HDFS and Ceph, exhibit much higher latency than local disks, and on current PCIe SSDs this performance gap widens dramatically. Moreover, relational databases like MySQL running directly on these storage systems perform significantly worse than they do on local PCIe SSDs with identical CPU and memory configurations.
As a workaround, vendors like Google Cloud, AWS, and Alibaba Cloud offer instance store: a local SSD attached to a high-I/O VM instance, used to meet customer demand for high-performance databases. Unfortunately, this approach has several drawbacks:
1. Limited capacity, unsuitable for large database services.
2. Databases must manage data replication by themselves to ensure data reliability.
3. General-purpose file systems (e.g. ext4 or XFS) compromise I/O throughput because of the cost of crossing between kernel space and user space, which becomes significant with low-latency I/O hardware (e.g. RDMA, PCIe SSD).
4. No support for a shared-everything database architecture, a key feature of advanced cloud database services.
In response to the shortcomings of instance store, the Alibaba Tech team has designed and implemented the PolarFS distributed file system, which meets customer requirements for low latency and high throughput in several ways:
1. Uses emerging hardware such as RDMA and NVMe SSDs, together with a lightweight network stack and I/O stack in user space, to avoid trapping into the kernel and contending for kernel locks.
2. Provides a POSIX-like file system API so that the entire I/O path can be kept in user space.
3. Uses a lock-free data-plane I/O model that avoids context switches on the critical path and eliminates unnecessary memory copies, so DMA can move data directly between main memory and the RDMA NIC or NVMe disks.
These features combine to drastically reduce the end-to-end latency of PolarFS, making it perform comparably to a local file system on an SSD.
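To make the idea of a user-space, POSIX-like I/O path more concrete, here is a minimal C sketch of what database code on top of such a library could look like. The pfs_* names, signatures, and stub bodies are assumptions for illustration only, not the verified PolarFS interface; the point is simply that file operations read like ordinary POSIX calls yet never cross into the kernel.

```c
/* Illustrative sketch only: the pfs_* names, signatures, and stub bodies
 * are assumptions standing in for a user-space file system library, not
 * the verified PolarFS interface. */
#include <stddef.h>
#include <stdio.h>

typedef long pfs_ssize_t;

/* Stand-in stubs so the sketch compiles; a real library would talk to the
 * storage nodes directly from user space. */
static int pfs_open(const char *path, int flags) { (void)path; (void)flags; return 3; }
static pfs_ssize_t pfs_write(int fd, const void *buf, size_t len) { (void)fd; (void)buf; return (pfs_ssize_t)len; }
static int pfs_close(int fd) { (void)fd; return 0; }

/* Application code looks like ordinary file I/O, but every call stays in
 * user space: no system call, no kernel context switch, no extra copy. */
static int append_redo_record(const char *path, const void *rec, size_t len)
{
    int fd = pfs_open(path, /* O_WRONLY-style flag */ 1);
    if (fd < 0)
        return -1;
    pfs_ssize_t n = pfs_write(fd, rec, len);
    pfs_close(fd);
    return n == (pfs_ssize_t)len ? 0 : -1;
}

int main(void)
{
    const char rec[] = "redo-record";
    printf("append: %d\n", append_redo_record("/vol1/redo.log", rec, sizeof rec));
    return 0;
}
```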
Despite the performance gains achieved by PolarFS, its I/O scalability was still limited on low-latency hardware by the strictly serialized log of distributed systems built with Raft. To meet this challenge, the Alibaba Tech team proposed ParallelRaft, an enhanced consensus protocol based on Raft. The new protocol allows out-of-order log acknowledgment, commit, and apply while letting PolarFS preserve traditional I/O semantics, significantly improving its parallel I/O concurrency.
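The sketch below illustrates the core idea behind out-of-order acknowledgment and apply. It assumes, for illustration only, that each log entry records which block range it writes and that a node consults a small look-behind window over preceding entries: an entry may proceed ahead of earlier, still-pending entries as long as their block ranges do not overlap. This is not the paper's exact data structure, just the conflict rule the idea relies on.

```c
/* Minimal sketch of the out-of-order apply rule: a log entry may be
 * applied before earlier entries, provided it does not overlap the block
 * ranges of the preceding entries that are still missing or unapplied.
 * The look-behind window size is an assumed parameter. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LOOK_BEHIND 8            /* assumed window size */

struct entry {
    uint64_t lba;                /* first logical block written   */
    uint32_t nblocks;            /* number of blocks written      */
    bool     applied;            /* has this entry been applied?  */
};

static bool ranges_overlap(const struct entry *a, const struct entry *b)
{
    return a->lba < b->lba + b->nblocks && b->lba < a->lba + a->nblocks;
}

/* Entry log[idx] may be applied out of order if none of the up-to-N
 * preceding, still-unapplied entries touches an overlapping block range. */
static bool can_apply_out_of_order(const struct entry *log, int idx)
{
    for (int i = idx - 1; i >= 0 && i >= idx - LOOK_BEHIND; i--)
        if (!log[i].applied && ranges_overlap(&log[i], &log[idx]))
            return false;
    return true;
}

int main(void)
{
    struct entry log[3] = {
        { .lba = 0,   .nblocks = 8, .applied = false },  /* still in flight  */
        { .lba = 100, .nblocks = 8, .applied = false },  /* disjoint range   */
        { .lba = 4,   .nblocks = 8, .applied = false },  /* overlaps entry 0 */
    };
    printf("entry 1: %s\n", can_apply_out_of_order(log, 1) ? "apply now" : "wait");
    printf("entry 2: %s\n", can_apply_out_of_order(log, 2) ? "apply now" : "wait");
    return 0;
}
```

Because a database's concurrent writes usually touch disjoint blocks, relaxing strict log ordering in this way lets many I/O requests proceed in parallel without changing what the storage ultimately contains.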
Alongside this solution, the Tech team implemented POLARDB, a relational database system modified from AliSQL and newly available as a database service on the Alibaba cloud computing platform. POLARDB uses a shared storage architecture and supports multiple read-only instances. Its database nodes are divided into a primary node and read-only (RO) nodes: the primary node handles both read and write queries, while RO nodes serve reads only. Both types of node share redo log files and data files under the same database directory in PolarFS.
POLARDB architecture
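As a rough illustration of the primary/read-only split described above, here is a minimal C sketch of how a front end might route queries in such a cluster. The routing policy and node layout are assumptions made for this example, not POLARDB's actual proxy logic; the key point is that writes go only to the primary, while reads can be served by any node over the same shared files in PolarFS.

```c
/* Minimal sketch of query routing for a primary / read-only cluster.
 * The routing policy and node layout are illustrative assumptions,
 * not POLARDB's actual proxy logic. */
#include <stdbool.h>
#include <stdio.h>

enum node_kind { PRIMARY, READ_ONLY };

struct db_node {
    enum node_kind kind;
    const char    *name;
};

/* Writes must go to the single primary; reads may be served by any node,
 * since all nodes see the same data and redo log files on shared storage.
 * Here reads are spread with a simple round-robin. */
static const struct db_node *route(const struct db_node *nodes, int n,
                                   bool is_write, int *rr)
{
    if (is_write) {
        for (int i = 0; i < n; i++)
            if (nodes[i].kind == PRIMARY)
                return &nodes[i];
        return NULL;
    }
    *rr = (*rr + 1) % n;
    return &nodes[*rr];
}

int main(void)
{
    struct db_node cluster[] = {
        { PRIMARY,   "primary"  },
        { READ_ONLY, "replica1" },
        { READ_ONLY, "replica2" },
    };
    int rr = 0;
    printf("write -> %s\n", route(cluster, 3, true,  &rr)->name);
    printf("read  -> %s\n", route(cluster, 3, false, &rr)->name);
    printf("read  -> %s\n", route(cluster, 3, false, &rr)->name);
    return 0;
}
```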
In addition, PolarFS supports POLARDB with the following features:
1. Synchronizes file metadata modifications from the primary node to RO nodes.
2. Ensures that concurrent modifications to file metadata are serialized, so the file system stays consistent across database nodes.
3. In the event of a network partition, prevents data corruption by ensuring that only writes from the real primary node succeed (see the fencing sketch after this list).
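A common way to implement the third guarantee is write fencing with a monotonically increasing generation (epoch) number that the storage side checks on every write. The minimal sketch below assumes that technique for illustration; it is not necessarily PolarFS's exact mechanism.

```c
/* Minimal sketch of generation-based write fencing: a node that still
 * believes it is primary but holds a stale generation number is rejected,
 * so a partitioned old primary cannot corrupt shared data. This technique
 * is assumed for illustration and may differ from PolarFS's mechanism. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct storage_node {
    uint64_t current_gen;        /* highest generation seen so far */
};

static bool accept_write(struct storage_node *s, uint64_t writer_gen)
{
    if (writer_gen < s->current_gen)
        return false;            /* stale primary: reject the write  */
    s->current_gen = writer_gen; /* remember the newest generation   */
    return true;
}

int main(void)
{
    struct storage_node node = { .current_gen = 0 };
    printf("primary, gen 1:       %s\n", accept_write(&node, 1) ? "accepted" : "rejected");
    printf("new primary, gen 2:   %s\n", accept_write(&node, 2) ? "accepted" : "rejected");
    printf("stale primary, gen 1: %s\n", accept_write(&node, 1) ? "accepted" : "rejected");
    return 0;
}
```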
PolarFS and POLARDB provide significant benefits to cloud service vendors looking to decouple storage and compute. As detailed in this article, PolarFS provides:
· Ultra-low latency
· High I/O throughput
· Reduced potential for failures and other errors
For more information about PolarFS and its supporting systems, the full paper can be read here.
First-hand and in-depth information about Alibaba’s latest technology → Facebook: “Alibaba Tech”. Twitter: “AlibabaTech”.