Budget Fair Queueing (BFQ) Storage-I/O Scheduler

In this page we report a selection of our test results with BFQ (v7r2 or v7, see below), CFQ, DEADLINE and NOOP, under Linux 3.13.0, and on the following devices: a Seagate HDD, a pair of Seagate HDDs in software RAID1, a PLEXTOR SSD, a 1.8-inch Toshiba HDD, a Transcend microSDHC Class 6 Card, and a SanDisk SEM16G eMMC.

We used BFQ-v7r2 in the tests on the first three devices, whereas BFQ-v7 was used in the tests on the last three. In particular, the latter results are a courtesy of Virtual Open Systems, who will probably repeat the tests also with BFQ-v7r2. We will replace the current BFQ-v7 results with the new ones as soon as the latter are available.
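For reference, the scheduler in use is selected per device through sysfs. The following minimal sketch illustrates this; it assumes a kernel patched with BFQ, a device named sda, and root privileges (all assumptions to adjust to your system):

```python
# Minimal sketch: switching the I/O scheduler of a block device via sysfs.
# Assumes a BFQ-patched kernel and a device named "sda"; requires root.
SCHED = "/sys/block/sda/queue/scheduler"

with open(SCHED) as f:
    print("available schedulers:", f.read().strip())  # active one is in brackets

with open(SCHED, "w") as f:
    f.write("bfq")  # "cfq", "deadline" and "noop" are selected the same way
```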

For each device we report only our throughput and application-responsiveness results. As for responsiveness, we report, for brevity, only the cold-cache start-up time of a light-weight application, xterm, and of a heavier application, oowriter (Open Office Writer). For the first three devices, responsiveness results with further applications, as well as frame-drop-rate results with a video player, can be found in this extra result page.

To get all our results we used our ad hoc benchmark suite. Both a more accurate description of most of our previous results with rotational disks and an overview of the benchmark suite can be found in this technical report.
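As an illustration of the kind of measurement involved, a cold-cache start-up time can be obtained along the following lines. This is a minimal sketch, not the code of the suite; it assumes root privileges (needed to drop the caches) and a running X session:

```python
import subprocess
import time

def cold_cache_startup_time(cmd):
    subprocess.run(["sync"], check=True)          # flush dirty pages first
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")                              # drop page, dentry and inode caches
    start = time.monotonic()
    subprocess.run(cmd, check=True)               # returns when the command exits
    return time.monotonic() - start

# xterm -e true exits right after the window is up and `true` has run
print(cold_cache_startup_time(["xterm", "-e", "true"]))
```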

In what follows we call reader/writer a program (we used either dd or fio) that just reads/writes a large file. In addition, we say that a reader/writer is sequential or random depending on whether it reads/writes the file sequentially or at random positions. For brevity we report only our results with synthetic, heavy workloads. In particular, we show application start-up times under rather extreme conditions, i.e., with very heavy background workloads. We consider lighter workloads only with the last three, slower devices.
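As a purely illustrative sketch of how such workloads can be generated (the actual suite is linked above, and all job parameters below are arbitrary example values), parallel readers can be spawned with fio:

```python
import subprocess

def readers(rw, numjobs, runtime_s=60):
    """Spawn `numjobs` parallel fio jobs; rw is 'read' or 'randread'."""
    subprocess.run([
        "fio", "--name=bg-readers", f"--rw={rw}",
        f"--numjobs={numjobs}", "--size=2G",
        f"--runtime={runtime_s}", "--time_based",
        "--group_reporting",
    ], check=True)

readers("read", 10)      # a 10r-seq-like workload
readers("randread", 10)  # a 10r-rand-like workload
# 5r5w workloads combine 5 readers with 5 writers (rw='write'/'randwrite')
```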

Seagate HDD

The next figure shows the throughput achieved by each of the schedulers while one of the following four heavy workloads is being executed: 10 parallel sequential or random readers (10r-seq, 10r-rand), 5 parallel sequential or random readers plus 5 parallel sequential or random writers (5r5w-seq, 5r5w-rand).

Seagate HDD throughput
Figure 1. Throughput on the Seagate HDD (higher is better).

BFQ outperforms the other schedulers, especially DEADLINE and NOOP, with the sequential workloads. All the schedulers have instead about the same performance with 5r5w-rand, whereas with 10r-rand BFQ achieves only slightly more than half the throughput of the other schedulers. The reason is that BFQ cares more about fairness than the other schedulers, also for processes doing random I/O. In particular, the fairness-related step of BFQ that causes this lower throughput with rotational disks is that BFQ always guarantees a minimum disk-idling time, even to processes doing random I/O. It would be relatively easy to relax this constraint for these processes, and get the same, still extremely low, throughput as the other schedulers. The problem is that it is hard, in general, to assess whether such a gain in terms of throughput would be worth the price to pay in terms of fairness. If throughput is the real issue, then the only actual solution for getting a non-negligible throughput with a random workload seems to be switching to a flash-based device.
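To make the trade-off concrete, the decision at stake can be sketched as follows. This is purely illustrative pseudo-logic in Python, not BFQ's kernel code; `does_random_io` is a hypothetical attribute standing in for the scheduler's seekiness detection, and the window length is an example value only:

```python
# Illustrative sketch only, NOT BFQ's actual code: when the in-service
# process momentarily runs out of pending requests, the scheduler can
# either keep the disk idle for a short window, waiting for that process's
# next request (fair, but costly with random I/O on rotational disks), or
# immediately switch to another queue (more throughput, less fairness).
IDLE_WINDOW_MS = 8  # example value

def on_queue_emptied(proc, relax_idling_for_random_io=False):
    if relax_idling_for_random_io and proc.does_random_io:
        return "dispatch from another queue"   # higher throughput
    return f"idle up to {IDLE_WINDOW_MS} ms"   # preserve fairness
```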

The next figure shows instead the cold-cache start-up time of xterm while one of the above heavy background workloads is being executed. The symbol X means that, under that workload and with that scheduler, the application failed to start within 60 seconds.

Seagate HDD xterm start-up time
Figure 2. xterm start-up time on the Seagate HDD (lower is better).

As can be seen, under any workload BFQ guarantees the same start-up time as if the disk were idle. With the other schedulers the application either takes a long time to start or is practically unresponsive. We also ran tests with lighter background workloads, and, in those cases too, the responsiveness guaranteed by these schedulers was noticeably worse than that guaranteed by BFQ (results available on demand).

Finally, here are our responsiveness results with oowriter.

Seagate HDD oowriter start-up time
Figure 3. oowriter start-up time on the Seagate HDD (lower is better).

Except in the 5r5w-seq case, BFQ again guarantees a start-up time comparable to that with an idle disk, whereas with all the other schedulers the application fails, under any background workload, to start within 60 seconds. As for 5r5w-seq, the higher start-up time of oowriter with BFQ is mainly due to the combination of the following tricky problems:

Pair of Seagate HDDs in software RAID1

The RAID results basically match the above results with one HDD. Hence all the comments made for that case apply also to the results reported in the following three figures.
Seagate RAID1 throughput
Figure 4. Throughput on the pair of Seagate HDDs in software RAID1 (higher is better).
Seagate RAID1 xterm start-up time
Figure 5. xterm start-up time on the pair of Seagate HDDs in software RAID1 (lower is better).
Seagate RAID1 oowriter start-up time
Figure 6. oowriter start-up time on the pair of Seagate HDDs in software RAID1 (lower is better).

PLEXTOR SSD

With the SSD we consider only raw readers, i.e., processes reading directly from the device, to avoid repeatedly writing large files, and hence wearing out a costly SSD :)
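A raw sequential reader can be as simple as the following sketch (/dev/sdb is a hypothetical device name, and reading a raw device requires root):

```python
import subprocess

# Read 2 GiB directly from the (hypothetical) device, discarding the data;
# nothing is written, so the SSD is not worn out.
subprocess.run(
    ["dd", "if=/dev/sdb", "of=/dev/null", "bs=1M", "count=2048"],
    check=True,
)
```

Random raw readers can be obtained similarly, e.g. with fio's randread mode pointed at the same device node.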

SSD throughput
Figure 7. Throughput on the Plextor SSD (higher is better).

With sequential readers, BFQ loses about 0.05% of the throughput with respect to the other schedulers. This is the price it pays to achieve the start-up times shown in the next two figures. With random readers the loss is instead at most 6%, for the following additional reason. With random readers the number of IOPS is much higher, and all CPUs spend all of their time either executing instructions or waiting for I/O (the total idle-time percentage is 0). The processing time of I/O requests therefore influences the maximum achievable throughput. As a consequence, the throughput slightly grows as the complexity, and hence the execution time, of the scheduler decreases.

As for responsiveness, with both applications BFQ achieves almost the lowest possible start-up time under both workloads.

SSD xterm start-up time
Figure 8. xterm start-up time on the Plextor SSD (lower is better).
SSD oowriter start-up time
Figure 9. oowriter start-up time on the Plextor SSD (lower is better).

The high start-up times with the other schedulers in the presence of sequential readers are also a consequence of the fact that, to maximize throughput, the device prefetches requests and, among internally-queued requests, privileges sequential ones. BFQ prevents the device from prefetching requests when that would hurt responsiveness. This behavior costs the 0.05% throughput loss with sequential readers reported above.

1.8-inch Toshiba HDD

The heavy workloads considered so far make little sense with this slower device, so we considered lighter workloads, made only of readers. In fact, with such a slow device, storms of writes cause anomalies that have little to do with the schedulers themselves. Unfortunately, tests with DEADLINE and NOOP could not be executed (the execution of these tests is a courtesy of the staff of Virtual Open Systems). Throughput results follow.
1.8-inch HD throughput
Figure 10. Throughput on the 1.8-inch Toshiba HDD (higher is better).

No results are available for CFQ with the random workloads because, under these workloads, the system became unresponsive and simply had to be restarted.

As for responsiveness, it is worth noting that this type of device is usually attached to an embedded system; in our tests it was indeed connected to an ARM embedded system. On most embedded systems, large applications such as Open Office Writer make little sense, or are not available at all. Hence we executed start-up tests only with xterm. The next figure shows our results.

1.8-inch HD xterm start-up time
Figure 11. xterm start-up time on the 1.8-inch Toshiba HDD (lower is better).

With 5r-rand, BFQ performs worse than with the other workloads. The main cause is the coarse granularity of kernel time ticks on the embedded system, which unavoidably reduces the precision of BFQ's low-latency heuristics.

Transcend microSDHC Class 6 Card

The information and considerations provided for the above 1.8-inch HDD apply also to this device. Unlike what happened with the 1.8-inch HDD, however, in the throughput tests the system now remained responsive with CFQ also under random workloads. Hence, as shown in the next figure, we have results for all workloads.

MicroSDHC Card throughput
Figure 12. Throughput on the Transcend microSDHC (higher is better).

Both schedulers achieve about the same performance with each workload. Finally, the responsiveness results follow the same pattern as with the 1.8-inch HDD.

MicroSDHC Card xterm start-up time
Figure 13. xterm start-up time on the Transcend microSDHC (lower is better).

SanDisk SEM16G eMMC

As with the microSDHC Card, all the information and considerations provided for the 1.8-inch HDD apply also to the eMMC. Finally, as shown in the following two figures, the results with the eMMC have basically the same pattern as those with the microSDHC. The scale of course differs, as the eMMC is definitely faster than the microSDHC Card.

eMMC throughput
Figure 14. Throughput on the SanDisk eMMC (higher is better).
eMMC xterm start-up time
Figure 15. xterm start-up time on the SanDisk eMMC (lower is better).