Budget Fair Queueing (BFQ) Storage-I/O Scheduler

On this page we report a selection of our blk-mq benchmarks with NONE, MQ-DEADLINE, KYBER, and the development version BFQ-MQ-V10 (which coincides with the mainline BFQ that will probably be available from Linux 5.1 or 5.2). The benchmarks were run on Linux 4.19.0, on the following two devices:

PLEXTOR PX-256M5S SSD
HITACHI HTS72755 HDD

Results with previous versions of Linux and of BFQ and, for old-enough versions of Linux, with legacy blk and with many more devices, can be found here. In general, our results depend mostly on the BFQ version, and are essentially the same with any kernel version, and with either blk-mq or legacy blk. In addition, the relative performance of BFQ with respect to the other I/O schedulers is the same with any storage medium.

For each device, we report the results of our throughput, application-responsiveness (start-up time) and video-playing (frame-drop-rate) benchmarks. The last two benchmarks also measure total throughput during the test, but we do not report these throughput measurements, because they carry little meaning. In fact:

Starting applications and playing videos entail relatively short I/O, and we benchmark these tasks in hostile conditions, i.e., while a lot of extra I/O is being generated too.
blk-mq I/O schedulers are work-conserving, apart from BFQ, which may occasionally plug I/O dispatching to privilege critical I/O. However, plugging lasts at most a few milliseconds.
With a mostly work-conserving I/O scheduler, short I/O influences total throughput very little or not at all if a lot of extra I/O is in progress.

More precisely, these benchmarks are part of the S benchmark suite, and can be repeated with the following commands:

  git clone https://github.com/Algodev-github/S.git
  cd S/run_multiple_benchmarks
  sudo ./run_main_benchmarks.sh "throughput replayed-startup video-playing" "none mq-deadline kyber bfq"
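Before repeating the benchmarks, it may help to check which I/O schedulers the kernel offers for the device under test. The following is a minimal sketch, not part of the S suite: it reads the standard sysfs `scheduler` attribute, in which the active scheduler is shown in square brackets. The helper name `active_scheduler` and the default device name `sda` are assumptions made for illustration.

```shell
# Print the active scheduler from a sysfs "scheduler" line such as
# "mq-deadline kyber [bfq] none" (the active one is in brackets).
# Helper name is illustrative, not part of the S suite.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Hypothetical device name; override with DEV=... as needed.
DEV=${DEV:-sda}
SCHED_FILE="/sys/block/$DEV/queue/scheduler"

# Only report if the attribute exists and is readable on this system.
if [ -r "$SCHED_FILE" ]; then
    line=$(cat "$SCHED_FILE")
    echo "available: $line"
    echo "active:    $(active_scheduler "$line")"
fi
```

If a scheduler listed in the benchmark command line does not appear among the available ones, its kernel module is probably not loaded or not built.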

In what follows, we call a reader/writer a program (fio in the S suite) that just reads/writes a large file. In addition, we say that a reader/writer is sequential or random depending on whether it reads/writes the file sequentially or at random positions. For brevity, we report only our results with synthetic, heavy workloads. The goal is to show application start-up times in rather extreme conditions, i.e., with very heavy background workloads.
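To make the notion of sequential and random readers/writers concrete, here is a rough sketch of how such a background workload could be expressed as a fio command line. It is not the job definition used by the S suite: the function name `fio_workload_cmd`, the file size, the runtime and the target directory are all assumptions chosen for illustration. The function only prints the command, so it can be inspected before being run.

```shell
# Print (do not run) a fio command line for a given number of parallel
# readers and writers. PATTERN is "seq" or "rand"; DIR is a scratch
# directory that must already exist when the command is run.
# Size and runtime are illustrative assumptions, not the S suite's values.
fio_workload_cmd() {
    readers=$1; writers=$2; pattern=$3; dir=$4
    case $pattern in
        seq)  rd=read;     wr=write ;;
        rand) rd=randread; wr=randwrite ;;
        *) echo "pattern must be seq or rand" >&2; return 1 ;;
    esac
    # Options repeated for each job, so that each --name section is complete.
    common="--ioengine=sync --size=1G --runtime=30 --time_based --directory=$dir"
    cmd="fio"
    if [ "$readers" -gt 0 ]; then
        cmd="$cmd --name=readers $common --rw=$rd --numjobs=$readers"
    fi
    if [ "$writers" -gt 0 ]; then
        cmd="$cmd --name=writers $common --rw=$wr --numjobs=$writers"
    fi
    printf '%s\n' "$cmd"
}

# Example: ten parallel sequential readers on a hypothetical scratch directory.
fio_workload_cmd 10 0 seq /var/tmp/fio-scratch
```

For instance, `fio_workload_cmd 5 5 rand /var/tmp/fio-scratch` prints a command with five random readers and five random writers.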

PLEXTOR SSD

The next figure shows the throughput reached by each I/O scheduler while one of the following four heavy workloads is being executed: 10 parallel sequential or random sync readers (10r-seq, 10r-rand), or 5 parallel sequential or random sync readers plus 5 parallel sequential or random writers (5r5w-seq, 5r5w-rand). The symbol X means that, for that workload and with that scheduler, the benchmark script failed to terminate within 10 seconds from due termination time (which implies that the system, and thus the results, were not reliable).

**Figure 4**. Throughput on the Plextor SSD (higher is better).

BFQ reaches a ~3% lower throughput than the best-performing scheduler (NONE) with 10r-rand. This happens because some of the processes spawned by the benchmark script do occasional I/O during the test, and BFQ, in low-latency mode (default), is willing to trade throughput for the latency guaranteed to occasional, small I/O. If throughput is so critical that latency can be sacrificed, then just disable low-latency mode. On the other hand, BFQ's low-latency heuristics have virtually no effect on throughput with 10r-seq, and BFQ reaches basically the same throughput as the other schedulers.
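Low-latency mode is controlled by BFQ's `low_latency` tunable, exposed under `queue/iosched/` in sysfs while bfq is the active scheduler for the device. A minimal sketch of toggling it follows. The function name `set_bfq_low_latency` and the `SYSFS_ROOT` override are assumptions introduced here, the latter only to allow a dry run against a scratch directory.

```shell
# Write VALUE (0 = off, 1 = on) to BFQ's low_latency tunable for DEV.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
set_bfq_low_latency() {
    dev=$1; value=$2
    file="${SYSFS_ROOT:-/sys}/block/$dev/queue/iosched/low_latency"
    case $value in
        0|1) ;;
        *) echo "value must be 0 or 1" >&2; return 1 ;;
    esac
    # The file exists only while bfq is the active scheduler for the device,
    # and is normally writable only by root.
    if [ ! -w "$file" ]; then
        echo "cannot write $file (is bfq active on $dev? are you root?)" >&2
        return 1
    fi
    printf '%s\n' "$value" > "$file"
}
```

Run as root, `set_bfq_low_latency sda 0` would disable low-latency mode on a device named `sda` (a placeholder name), and `set_bfq_low_latency sda 1` would restore the default.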

BFQ gets a higher throughput than NONE and KYBER with 5r5w-rand and 5r5w-seq, because BFQ privileges reads over writes (as system-level latency mostly depends on reads), and random reads reach a higher throughput than random writes. MQ-DEADLINE reaches an even higher throughput for the opposite reason: it privileges writes to such an extent that it dispatches reads very rarely. This maximizes throughput, but evidently makes readers starve.

As for responsiveness, for gnome-terminal BFQ guarantees the lowest-possible start-up time with only reads in the background, and about twice the lowest-possible start-up time with reads and writes. The reason for the increase of the start-up time in the latter case is reported in the comments on lowriter start-up times in the extra result page.

**Figure 5**. *gnome-terminal* start-up time on the Plextor SSD (lower is better).

The other schedulers cause a much higher start-up time, in spite of the high speed of the device. One reason is that the drive prefetches I/O requests and, among internally-queued requests, privileges sequential ones. Problems get worse with writes. In contrast, BFQ prevents the drive from prefetching requests when that would hurt latency. Results with both smaller and larger applications can be found in this extra result page.

Finally, the next figure shows our video-playing results.

**Figure 6**. Video-playing frame-drop rate on the Plextor SSD (lower is better).

Results are essentially good with all schedulers. However, the figure does not show the fact that the player takes a lot of time to start up with all schedulers but BFQ.

HITACHI HDD

For each benchmark, we report our results for the same workloads as with the SSD.

For all workloads but 5r5w-rand, the benchmark simply fails with all schedulers but BFQ. For 5r5w-rand, BFQ outperforms the other schedulers.

The next figure shows the cold-cache start-up time of gnome-terminal, a medium-size application, while one of the two heavy sequential workloads above is being executed in the background. We consider only sequential workloads, because these are the nastiest background workloads for responsiveness. In fact, this is the I/O that both the kernel I/O stack and the storage-device firmware prefer, and thus privilege. The reason is that sequential I/O is the one that boosts throughput most, while sync reads are the most time-critical operations. The symbol X in the figure means that, for that workload and with that scheduler, the application failed to start in 60 seconds.

**Figure 2**. *gnome-terminal* start-up time on the HITACHI HDD (lower is better).

As can be seen, with any workload BFQ guarantees about the same start-up time as if the device were idle. With the other schedulers, the application in practice does not start at all. We ran tests with lighter background workloads too and, also in those cases, the responsiveness guaranteed by these schedulers was noticeably worse than that guaranteed by BFQ (results available on demand). Results with both smaller and larger applications can be found in this extra result page.

Finally, video-playing results are shown in the next figure. In this benchmark, the same background workloads as for the responsiveness tests are generated and, to make the background workload even more demanding for the time-sensitive application under test, a bash shell is also started and terminated repeatedly. This time the symbol X means that the playback of the video did not terminate within a 60-second timeout after its actual duration, and thus the test was aborted. In most of the failed cases, the playback of the video did not actually start at all.

**Figure 3**. Video-playing frame-drop rate on the Hitachi HDD (lower is better).

As can be seen, BFQ performs so much better than the other schedulers that the results are hardly comparable.