Guest post originally published on PingCAP’s blog by Ke’ao Yang, Software Engineer at PingCAP

At the heart of modern software is layered abstraction. In abstraction, each layer hides details that are not relevant to other layers and provides a simple, functioning interface.

However, because abstraction brings the extra overhead of system calls, its simplicity comes at the cost of performance. Thus, for performance-sensitive software such as databases, abstraction might bring unwanted consequences. How can we boost performance for these applications?

In this article, I’ll talk about the pitfalls of abstraction in the modern storage structure and how we tackled the problem at TiDB Hackathon 2020 by introducing the Storage Performance Development Kit (SPDK) in TiKV.

The pitfalls of abstraction

In the magic world of abstraction, everything is neatly organized. The application doesn’t make its execution plan; it lets the database decide. The database doesn’t write data to disks; it leaves that to the operating system’s discretion. The OS, of course, never operates directly on memory cells; instead, it asks the disk’s control chip to do it. More often than not, the concrete tasks are handed over to the party with more knowledge so that everyone is happy and content.

When a database reads and writes data, it relies on the abstraction provided by the file system. Through the file system, the database creates, writes, and reads files without operating on the hard disks directly. When the database issues a system call, the request goes through several steps before the data is finally persisted to disk:

  1. The OS receives the system call.
  2. The OS operates on the virtual file system (VFS) and the page caching mechanism.
  3. The file system implements specific behaviors.
  4. Data is read from or written to the block device, where the I/O schedulers and device mappers take effect.
  5. The system sends instructions to storage devices via hardware drivers.
  6. The disk controller processes the instructions and operates on the storage medium.
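
As a rough illustration of where this path begins, the sketch below uses Python's os module, whose functions are thin wrappers over the corresponding Linux system calls. Each call enters the kernel at step 1, and os.fsync asks the kernel to push the buffered data through the remaining steps. The helper names persist and read_back and the file name are made up for this example; this is not how TiKV performs I/O.

```python
import os
import tempfile


def persist(path: str, payload: bytes) -> int:
    """Write payload to path and force it to stable storage.

    Every os.* call below is one system call, so this small function
    crosses the user/kernel boundary at least four times.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)  # open(2)
    try:
        written = 0
        while written < len(payload):
            # write(2) may accept only part of the buffer, so loop until done.
            written += os.write(fd, payload[written:])
        os.fsync(fd)  # fsync(2): flush this file's dirty pages to the device
    finally:
        os.close(fd)  # close(2)
    return written


def read_back(path: str, length: int) -> bytes:
    """Read up to length bytes from the start of the file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, 0)  # pread(2) at offset 0
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.dat")
        n = persist(path, b"key=value\n")
        print(n, read_back(path, n))
```

Even this minimal write path needs several kernel entries per file, which is the overhead the rest of this article tries to avoid.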

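This process may have four underlying problems: the overhead of system calls, file system data structures that are not designed for databases or NVMe disks, caching algorithms that are not designed for them either, and file system logging that is more complex than a database needs.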

The solution to all these problems is to push the application towards the lower level to reduce abstraction. The more chores the application does by itself, the greater performance boost it gets.

How we reduce abstraction in TiKV

TiKV is a distributed transactional key-value database. At TiDB Hackathon 2020, we tried to reduce abstraction in TiKV by pushing it down to step 5 in the list above, that is, by having it send instructions to NVMe disks directly. This way, we eliminated the overhead of the layers in between.

SPDK and BlobFS

How many layers of abstraction to remove is a trade-off. Because of NVMe's popularity and its new features (4 KB atomic writes, low latency, and highly concurrent queues), we believe that redesigning TiKV for NVMe disks is worth the effort.

Luckily, the open-source community thinks so, too. Intel provides SPDK, a set of open-source toolkits and libraries for writing high-performance storage applications, which includes user-mode drivers and packages for NVMe. Via methods like Virtual Function I/O (VFIO), user-mode drivers map the hardware I/O to memory accessible by the user mode so that the application can access the hardware without taking a detour to the OS. VFIO is widely used in virtual machines to allow them to directly access the graphics card or network card.
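
To give a feel for what "mapping hardware I/O to memory" means, here is a minimal sketch that uses an anonymous mmap region as a stand-in for a mapped device region. It is only an analogy: it does not use VFIO or SPDK, and DOORBELL_OFFSET, ring_doorbell, and read_doorbell are hypothetical names that do not reflect the real NVMe register layout. The point is that once a region is mapped, the program updates it with ordinary memory stores instead of read or write system calls.

```python
import mmap
import struct

REGION_SIZE = 4096       # size of the pretend device region
DOORBELL_OFFSET = 0x10   # hypothetical register offset, not a real NVMe layout


def ring_doorbell(region: mmap.mmap, value: int) -> None:
    """Store a 32-bit little-endian value into the mapped region.

    This is a plain memory store into the mapping; no write(2) is issued.
    """
    struct.pack_into("<I", region, DOORBELL_OFFSET, value)


def read_doorbell(region: mmap.mmap) -> int:
    """Load the 32-bit value back from the mapped region."""
    return struct.unpack_from("<I", region, DOORBELL_OFFSET)[0]


if __name__ == "__main__":
    # An anonymous mapping stands in for memory that a user-mode driver
    # would obtain by mapping a device region into the process.
    with mmap.mmap(-1, REGION_SIZE) as region:
        ring_doorbell(region, 7)
        print(read_doorbell(region))
```

With a real user-mode driver the mapped region belongs to the device, so a store like this one is seen by the hardware rather than by ordinary RAM.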

Beyond that, SPDK implements a file system called BlobFS, which provides a function interface similar to that of a POSIX file system. With BlobFS, the application performs I/O operations in three steps:

  1. The application calls functions such as blobfs_create and blobfs_read.
  2. The filesystem operations are mapped to storage device operations.
  3. BlobFS sends the instructions to NVMe disks. (Strictly speaking, the instructions are written into the corresponding memory space in NVMe devices.)
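
The three steps above can be sketched with a toy model. The code below is not BlobFS and does not use SPDK's API; ToyBlockDevice, ToyFs, and BLOCK_SIZE are names invented for this illustration, and the "device" is just a Python list. It shows only the shape of the idea: a file-level call made in user space is translated into block-level commands without going through the kernel.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4096  # assumed block size, echoing the 4 KB atomic writes mentioned above


class ToyBlockDevice:
    """In-memory stand-in for an NVMe device: an array of fixed-size blocks."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.commands = []  # log of (opcode, block index) pairs "sent" to the device

    def write_block(self, lba: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        self.commands.append(("write", lba))
        self.blocks[lba] = data

    def read_block(self, lba: int) -> bytes:
        self.commands.append(("read", lba))
        return self.blocks[lba]


@dataclass
class ToyFile:
    length: int = 0
    lbas: list = field(default_factory=list)  # blocks that hold the file, in order


class ToyFs:
    """Step 2 of the list: map file operations onto block operations."""

    def __init__(self, dev: ToyBlockDevice) -> None:
        self.dev = dev
        self.files = {}
        self.next_free = 0  # naive bump allocator; blocks are never reclaimed

    def create(self, name: str) -> None:
        self.files[name] = ToyFile()

    def write(self, name: str, data: bytes) -> None:
        """Replace the file's contents with data, one block at a time."""
        f = self.files[name]
        f.lbas.clear()
        for off in range(0, len(data), BLOCK_SIZE):
            if self.next_free >= len(self.dev.blocks):
                raise OSError("toy device is full")
            # Pad the final chunk with zero bytes up to a whole block.
            chunk = data[off:off + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            self.dev.write_block(self.next_free, chunk)  # step 3: command to the device
            f.lbas.append(self.next_free)
            self.next_free += 1
        f.length = len(data)

    def read(self, name: str) -> bytes:
        f = self.files[name]
        raw = b"".join(self.dev.read_block(lba) for lba in f.lbas)
        return raw[:f.length]  # drop the padding of the last block


if __name__ == "__main__":
    dev = ToyBlockDevice(num_blocks=8)
    fs = ToyFs(dev)
    fs.create("data.log")                 # step 1: file-level call from the application
    fs.write("data.log", b"x" * 5000)     # spans two 4 KB blocks
    assert fs.read("data.log") == b"x" * 5000
    print(dev.commands)
```

A real implementation also has to handle metadata persistence, space reclamation, caching, and asynchronous completion, all of which this sketch leaves out.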

Compared with the Linux I/O path, the SPDK I/O path is much simpler and more efficient.

This solution solves the four problems we mentioned earlier: it removes the syscall overhead, uses data structures and caching algorithms more suitable for databases and NVMe disks, and simplifies file system logging. If we integrate SPDK BlobFS into TiKV, we can expect to see a huge performance increase.

Measuring TiKV improvements

To benchmark our experiment, we used YCSB Workload A (an update-heavy workload) to test the final I/O performance. As shown in the figures below, the results were beyond our expectations. The labels on the data points refer to the number of client threads. At the same latency, SPDK-based TiKV delivers more operations per second (OPS):

Figure: YCSB Workload A read performance

Figure: YCSB Workload A write performance

Going forward

How far could this Hackathon project go? That depends. As NVMe disks become more popular and TiKV keeps striving for higher performance, we expect this project to deliver concrete results.

Moreover, as TiDB Cloud, the fully managed TiDB service, becomes publicly available, TiDB users can start a cluster in a few clicks. They can enjoy the benefits of SPDK without enduring the pain of configuring it by themselves.

In the industry, people have been exploring related topics for years:

These approaches are promising in their own ways. One day, we may see one or more of them integrated into TiKV, creating a database with ever higher storage performance.