Budget Fair Queueing (BFQ) Storage-I/O Scheduler

This page contains a short description of the properties of BFQ, of how it works, of its interface, and of what still remains to do. The same information, plus guidance on how to use to properly configure BFQ can be found in the BFQ documentation file in the kernel source tree (Documentation/block/bfq-iosched.txt). Alternatively, and in addition to the demos mentioned in the homepage, in the following technical reports you can find many more technical details than those reported in this page:

This technical report is currently the most up-to-date and easy-to-read document on BFQ, more precisely on BFQ-v7r8.
This technical report, more detailed than the above one, contains a medium-depth description of BFQ, more precisely of BFQ-v1; it also contains both a description of the benchmark suite and various experimental results with the original version of BFQ, BFQ-v1 and the CFQ I/O scheduler.
All the details about the original algorithm can be found, instead, in this more theoretical technical report.

Both technical reports are full versions of two papers published on the IEEE Transactions on Computers and in the proceedings of SYSTOR'12; please cite these papers and not the above technical reports (you can find information about these papers in my homepage). In the second technical report you can also find a detailed comparison against other production and research schedulers. A shorter comparison can be found here. Finally, a short history of BFQ is available here.

Properties

BFQ (Budget Fair Queueing) is a Proportional Share, or equivalently Fair Queueing, I/O scheduler that allows each process/thread to be assigned a fraction of the I/O throughput. It has the following characteristics.

It distributes the throughput to I/O-bound processes as desired, even if it fluctuates, independently of the device parameters and with any workload. BFQ does not provide this sector-domain fairness to processes issuing random requests, as this would easily cause the throughput to drop on one hand, and other processes to experience very high latencies on the other hand. For these processes BFQ switches to a more appropriate time-domain fairness, in which it is the device time to be fairly distributed (basically the same service scheme of the CFQ I/O scheduler).
BFQ guarantees to each I/O request a tight delay with respect to the completion time that the request would enjoy in an ideal (unfeasible) perfectly-fair system. This is especially beneficial for soft real-time applications.
BFQ exports a low_latency tunable. If enabled (currently the default), BFQ executes a special heuristics that automatically gives to interactive and soft real-time applications more than their fair share of the I/O throughput, to reduce their latency. According to our results, for desktop or handheld usage, the system becomes virtually as responsive as if the disk was idle, whatever the actual disk load is. Soft real-time applications do not suffer from almost any glitch even in presence of heavy background workloads. As an example, here are some possible background workloads for a personal system:
- one or more large files are being read or written,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning the filesystems and updating their databases,
Low-latency guarantees are preserved also in presence of NCQ.
Optimized to reach a high throughput on SSDs without losing low-latency guarantees.
According to our results, BFQ achieves up to 100% higher aggregate I/O throughput than the other I/O schedulers, with some workloads on HDDs, and about the the same throughput with the rest of the workloads or on SSDs.
BFQ uses a unified mechanism (Early Queue Merge, EQM) to get a sequential read pattern, and hence a high throughput, with any set of processes performing interleaved I/O. EQM also preserves low latency. More details here.

Internals

BFQ is based on CFQ code, but it implements a more accurate scheduling policy. In brief, differently from CFQ, BFQ does not grant the disk to each process for a fixed time-slice, but until the process exhausts a previously assigned budget, measured in number of sectors. Budgets are scheduled with an internal budget scheduler, B-WF2Q+, a modified version of the WF2Q+ packet scheduler.

In more detail, in BFQ, as in CFQ, synchronous requests are collected in per-task queues, and asynchronous requests are collected in per-device (or, in case of hierarchical scheduling, per group) queues. When a new queue is created, it is assigned a budget, measured in number of sectors, to use when the queue itself is granted access to the disk. BFQ exclusively serves one queue at a time. When the underlying device driver asks for the next request to serve, BFQ selects the first request from the queue under service in C-LOOK order and returns it to the driver. The budget of the queue is decremented by the size of the request. As in CFQ, if a sync queue has no more requests to serve, but it has some budget left, the scheduler idles (i.e., it tells to the device driver that it has no requests to serve even if there are other backlogged queues) for a short period, waiting for a new request from the task associated to the queue. The queue under service goes on being served until one of the following conditions is met:

the idling timer expires while BFQ is idling to wait for the arrival of the next request from the task associated to the queue
the queue is spending too much time to consume its budget (budget timeout, see the timeout_* parameters below), or
the queue exhausts its budget, or
the queue is not sync and has no more requests to serve.

When the queue is deselected, it is assigned a new budget, computed as a function of which of the above events occurred. The next queue to serve is selected with B-WF2Q+, a modified version of the WF2Q+ packet scheduler. B-WF2Q+ schedules backlogged queues as a function of: their budgets, their weights and the service they have already received.

Working in the service domain and not in the time domain is what allows BFQ to distribute the throughput as desired, unconditionally and even if it fluctuates. In addition, the accurate service distribution of the internal B-WF2Q+ scheduler, guarantees tight delays with respect to the completion times that the requests would enjoy in an ideal perfectly-fair system.

If several processes are competing for the device at the same time, but all processes and groups have the same weight, then BFQ guarantees the expected throughput distribution without ever idling the device. It uses preemption instead. Throughput is then much higher in this common scenario.

The idea for reducing the latency for interactive and soft real-time processes (when low_latency is enabled) is just raising the weight of the associated queues. More details can be found here. An important point is that such an heuristics allows interactive and soft real-time queues to steal more throughput than their fair share to other queues. If the resulting loss of fairness is undesired (e.g., in a server system), it is certainly better to switch low_latency off.

When hierarchical scheduling is enabled, queues are collected in a tree of groups, and there is a distinct B-WFQ2+ scheduler on each non-leaf node. Leaf nodes are request queues as in the non-hierarchical case.

BFQ supports ioprio classes at each hierarchy level, enforcing a strict priority ordering among classes. This means that idle queues/groups are served only if there are no best effort queues/group in the same scheduler, and best effort queues/groups are served only if there are no real-time queues/groups.

Tunables

Here is a short description of the main parameters available to the user for configuring BFQ, exported through the canonical I/O scheduler sysfs interface are. See Documentation/block/bfq-iosched.txt for a detailed description of all the parameters.

per-process ioprio and weight: unless the cgroups interface is used, weights can be assigned to processes only indirectly, through I/O priorities, and according to the relation: weight = (IOPRIO_BE_NR - ioprio) * 10.
slice_idle: specifies how long BFQ should idle for next I/O request, when certain sync BFQ queues become empty. By default slice_idle is a non-zero value. Idling has a double purpose: boosting throughput with sequential workloads, especially on rotational devices, and making sure that the desired throughput distribution is respected.
strict_guarantees: if set (default: unset), then BFQ (a) always performs idling when the in-service queue becomes empty, (b) forces the device to serve one I/O request at a time, by dispatching a new request only if there is no outstanding request. In the presence of differentiated weights or I/O-request sizes, both the above conditions are needed to guarantee that every BFQ queue receives its allotted share of the bandwidth. Setting strict_guarantees may evidently affect throughput.
timeout_sync: Maximum amount of device time that can be given to a task (queue) once it has been selected for service. On devices with costly seeks, increasing this time usually increases maximum throughput. On the opposite end, increasing this time coarsens the granularity of the short-term bandwidth and latency guarantees, especially if the following parameter is set to zero.
max_budget: Maximum amount of service, measured in sectors, that can be provided to a BFQ queue once it is set in service (of course within the limits of the above timeout). According to what said in the description of the algorithm, larger values increase the throughput in proportion to the percentage of sequential I/O requests issued. The price of larger values is that they coarsen the granularity of short-term bandwidth and latency guarantees. The default value is 0, which enables auto-tuning: BFQ sets max_budget to the maximum number of sectors that can be served during timeout_sync, according to the estimated peak rate.
low_latency: used to enable/disable BFQ's low latency mode. By default, low latency mode is enabled. If enabled, interactive and soft real-time applications are privileged and experience a lower latency, as explained in more detail in the description of how BFQ works.

Group Scheduling

BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely blkio and io. In particular, BFQ supports weight-based proportional share.

Service guarantees provided

With BFQ, proportional share means true proportional share of the device bandwidth, according to group weights. For example, a group with weight 200 gets twice the bandwidth, and not just twice the time, of a group with weight 100.

BFQ supports hierarchies (group trees) of any depth. Bandwidth is distributed among groups and processes in the expected way: for each group, the children of the group share the whole bandwidth of the group in proportion to their weights. In particular, this implies that, for each leaf group, every process of the group receives the same share of the whole group bandwidth, unless the ioprio of the process is modified.

The resource-sharing guarantee for a group may partially or totally switch from bandwidth to time, if providing bandwidth guarantees to the group lowers the throughput too much. This switch occurs on a per-process basis: if a process of a leaf group causes throughput loss if served in such a way to receive its share of the bandwidth, then BFQ switches back to just time-based proportional share for that process.

Interface

To get proportional sharing of bandwidth with BFQ for a given device, BFQ must of course be the active scheduler for that device.

Within each group directory, the names of the files associated with BFQ-specific cgroup parameters and stats begin with the bfq. prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for BFQ-specific files is blkio.bfq. or io.bfq. For example, the group parameter to set the weight of a group with BFQ is blkio.bfq.weight or io.bfq.weight.

Parameters to set

For each group, there is only the following parameter to set.

weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the group inside its parent. Available values: 1..10000 (default 100). The linear mapping between ioprio and weights, described at the beginning of the tunable section, is still valid, but all weights higher than IOPRIO_BE_NR*10 are mapped to ioprio 0.

TODO

Find ways to address or at least mitigate occasional high latencies for soft real-time applications, caused by:

high in-device service time of unlucky, dispatched I/O requests
group hierarchies (e.g., critical task in one group, interfered by a task in another group)
I/O blocking above blk, caused by writeback pressure

Fix compilation of BFQ as a module (for some reason it prevents BFQ from being select as the default scheduler
Investigate ways to tackle even better the write-hog problem, which causes interactive applications to experience, occasionally, a quite higher latency than the average. For example, properly using the may_queue function may provide better results.
Carefully check at what extent BFQ locking policies and execution time may become an issue with high-speed SSDs.
Check whether it is really necessary to defer the __blk_run_queue on completions, or, in contrast, __blk_run_queue can be just invoked immediately.