Budget Fair Queueing (BFQ) Storage-I/O Scheduler

This page contains a short description of the properties of BFQ, of how it works, of its interface, and of what still remains to be done. The same information, plus guidance on how to properly configure BFQ, can be found in the BFQ documentation file in the kernel source tree (Documentation/block/bfq-iosched.txt). Alternatively, and in addition to the demos mentioned on the homepage, the following technical reports give many more technical details than this page:

Both technical reports are full versions of two papers published in the IEEE Transactions on Computers and in the proceedings of SYSTOR'12; please cite these papers and not the above technical reports (you can find information about these papers on my homepage). The second technical report also contains a detailed comparison against other production and research schedulers. A shorter comparison can be found here. Finally, a short history of BFQ is available here.

Properties

BFQ (Budget Fair Queueing) is a Proportional Share, or equivalently Fair Queueing, I/O scheduler that allows each process/thread to be assigned a fraction of the I/O throughput. It has the following characteristics.

Internals

BFQ is based on CFQ code, but it implements a more accurate scheduling policy. In brief, unlike CFQ, BFQ does not grant the disk to each process for a fixed time slice, but until the process exhausts a previously assigned budget, measured in number of sectors. Budgets are scheduled with an internal budget scheduler, B-WF2Q+, a modified version of the WF2Q+ packet scheduler.
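The budget mechanism described above can be illustrated with a minimal sketch. The helper name serve_queue is hypothetical, requests are served in plain FIFO order for simplicity, and the sketch assumes that service stops as soon as the next request no longer fits in the remaining budget; the real scheduler handles more cases than this.

```python
from collections import deque


def serve_queue(requests, budget):
    """Serve request sizes (in sectors) until the budget cannot cover the next one.

    Hypothetical helper: returns (served, remaining_budget, still_pending).
    """
    pending = deque(requests)
    served = []
    # Assumption: stop when the next request is larger than the remaining budget.
    while pending and pending[0] <= budget:
        size = pending.popleft()
        budget -= size          # the budget is charged by request size
        served.append(size)
    return served, budget, list(pending)


served, left, pending = serve_queue([8, 16, 8, 32], budget=32)
print(served, left, pending)
```

With a budget of 32 sectors, the first three requests are served and the 32-sector request waits for the next budget, however long the service took.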

In more detail, in BFQ, as in CFQ, synchronous requests are collected in per-task queues, and asynchronous requests are collected in per-device (or, in case of hierarchical scheduling, per-group) queues. When a new queue is created, it is assigned a budget, measured in number of sectors, to use when the queue itself is granted access to the disk. BFQ serves exactly one queue at a time. When the underlying device driver asks for the next request to serve, BFQ selects the first request from the queue under service in C-LOOK order and returns it to the driver. The budget of the queue is decremented by the size of the request. As in CFQ, if a sync queue has no more requests to serve, but it has some budget left, the scheduler idles (i.e., it tells the device driver that it has no requests to serve even if there are other backlogged queues) for a short period, waiting for a new request from the task associated with the queue. The queue under service goes on being served until one of the following conditions is met:

When the queue is deselected, it is assigned a new budget, computed as a function of which of the above events occurred. The next queue to serve is selected with B-WF2Q+, a modified version of the WF2Q+ packet scheduler. B-WF2Q+ schedules backlogged queues as a function of their budgets, their weights, and the service they have already received.
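To give a feel for how budgets, weights and received service can drive queue selection, here is a simplified sketch in the spirit of WF2Q+. It is not the B-WF2Q+ implementation: the class and function names are hypothetical, every queue is assumed to be always backlogged and to consume its whole budget, and budgets never change.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Queue:
    name: str
    weight: int
    budget: int                      # sectors granted per activation
    start: Fraction = Fraction(0)    # virtual start time
    finish: Fraction = Fraction(0)   # virtual finish time

    def stamp(self):
        # The finish time grows with the budget and shrinks with the weight.
        self.finish = self.start + Fraction(self.budget, self.weight)


def pick_next(queues, vtime):
    # If no queue is eligible, jump the virtual time to the smallest start time.
    vtime = max(vtime, min(q.start for q in queues))
    eligible = [q for q in queues if q.start <= vtime]
    # Among eligible queues, the smallest virtual finish time wins.
    return min(eligible, key=lambda q: q.finish), vtime


def simulate(queues, rounds):
    for q in queues:
        q.stamp()
    vtime = Fraction(0)
    total_weight = sum(q.weight for q in queues)
    service = {q.name: 0 for q in queues}
    for _ in range(rounds):
        q, vtime = pick_next(queues, vtime)
        service[q.name] += q.budget                 # assumption: budget fully used
        vtime += Fraction(q.budget, total_weight)   # advance the system virtual time
        q.start = q.finish                          # next activation starts here
        q.stamp()
    return service


queues = [Queue("A", weight=2, budget=100), Queue("B", weight=1, budget=100)]
print(simulate(queues, rounds=6))   # {'A': 400, 'B': 200}
```

In this toy run, queue A (weight 2) receives twice the sectors of queue B (weight 1), and the two queues are interleaved rather than served in long bursts.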

Working in the service domain and not in the time domain is what allows BFQ to distribute the throughput as desired, unconditionally and even if it fluctuates. In addition, the accurate service distribution of the internal B-WF2Q+ scheduler guarantees tight delays with respect to the completion times that the requests would enjoy in an ideal, perfectly fair system.
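The difference between the two domains can be made concrete with a little arithmetic. The sketch below uses hypothetical function names and made-up rates (a sequential reader at 100 MB/s and a random reader at 25 MB/s), and compares equal time slices with equal budgets for two equally weighted processes.

```python
from fractions import Fraction


def time_domain_throughput_shares(rates):
    """Equal time slices: each process moves data in proportion to its rate."""
    total = sum(rates.values())
    return {name: Fraction(rate, total) for name, rate in rates.items()}


def service_domain_time_shares(rates):
    """Equal budgets: each process holds the device for budget / rate."""
    times = {name: Fraction(1, rate) for name, rate in rates.items()}
    total = sum(times.values())
    return {name: t / total for name, t in times.items()}


rates = {"sequential": 100, "random": 25}    # hypothetical rates, in MB/s
print(time_domain_throughput_shares(rates))  # throughput split with equal time
print(service_domain_time_shares(rates))     # device time split with equal budgets
```

With equal time slices the sequential reader ends up with 4/5 of the throughput. With equal budgets both processes get half of the transferred data by construction, and the slower process simply holds the device for 4/5 of the time.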

If several processes are competing for the device at the same time, but all processes and groups have the same weight, then BFQ guarantees the expected throughput distribution without ever idling the device. It uses preemption instead. Throughput is then much higher in this common scenario.

The idea for reducing the latency of interactive and soft real-time processes (when low_latency is enabled) is simply to raise the weight of the associated queues. More details can be found here. An important point is that such a heuristic allows interactive and soft real-time queues to steal more throughput than their fair share from other queues. If the resulting loss of fairness is undesired (e.g., in a server system), it is certainly better to switch low_latency off.

When hierarchical scheduling is enabled, queues are collected in a tree of groups, and there is a distinct B-WF2Q+ scheduler on each non-leaf node. Leaf nodes are request queues as in the non-hierarchical case.

BFQ supports ioprio classes at each hierarchy level, enforcing a strict priority ordering among classes. This means that idle queues/groups are served only if there are no best-effort queues/groups in the same scheduler, and best-effort queues/groups are served only if there are no real-time queues/groups.
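The strict ordering among classes can be sketched as a simple filter. The names below are hypothetical; the numeric values follow the usual Linux ioprio class numbering (real-time 1, best-effort 2, idle 3), which is an assumption not stated on this page.

```python
from enum import IntEnum


class IoprioClass(IntEnum):
    RT = 1      # real-time
    BE = 2      # best-effort
    IDLE = 3    # idle


def serviceable(backlogged):
    """Return the entities of the highest-priority class that has backlog.

    `backlogged` is a list of (name, IoprioClass) pairs within one scheduler.
    """
    if not backlogged:
        return []
    best = min(cls for _, cls in backlogged)   # a lower value means a higher priority
    return [name for name, cls in backlogged if cls == best]


entities = [("backup", IoprioClass.IDLE), ("db", IoprioClass.BE), ("web", IoprioClass.BE)]
print(serviceable(entities))   # ['db', 'web']
```

As long as "db" or "web" is backlogged, "backup" is never selected; it is served only when the best-effort and real-time classes are both empty.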

Tunables

Here is a short description of the main parameters available to the user for configuring BFQ, exported through the canonical I/O scheduler sysfs interface. See Documentation/block/bfq-iosched.txt for a detailed description of all the parameters.

Group Scheduling

BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely blkio and io. In particular, BFQ supports weight-based proportional share.

Service guarantees provided

With BFQ, proportional share means true proportional share of the device bandwidth, according to group weights. For example, a group with weight 200 gets twice the bandwidth, and not just twice the time, of a group with weight 100.

BFQ supports hierarchies (group trees) of any depth. Bandwidth is distributed among groups and processes in the expected way: for each group, the children of the group share the whole bandwidth of the group in proportion to their weights. In particular, this implies that, for each leaf group, every process of the group receives the same share of the whole group bandwidth, unless the ioprio of the process is modified.
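As a worked example of this rule, the following sketch computes the fraction of the device bandwidth that reaches each leaf of a small group tree. The tree layout, the names and the helper leaf_shares are hypothetical, and the computation assumes that every node is continuously backlogged.

```python
from fractions import Fraction


def leaf_shares(node, share=Fraction(1)):
    """Split `share` among the children of `node` in proportion to their weights."""
    children = node.get("children", [])
    if not children:
        return {node["name"]: share}
    total = sum(child["weight"] for child in children)
    result = {}
    for child in children:
        result.update(leaf_shares(child, share * Fraction(child["weight"], total)))
    return result


tree = {
    "name": "root",
    "children": [
        {"name": "g1", "weight": 200, "children": [
            {"name": "p1", "weight": 100},
            {"name": "p2", "weight": 100},
        ]},
        {"name": "g2", "weight": 100, "children": [
            {"name": "p3", "weight": 100},
        ]},
    ],
}
print(leaf_shares(tree))
```

Group g1 has twice the weight of g2 and so gets 2/3 of the bandwidth, which its two equally weighted processes split evenly. In this particular tree all three processes therefore end up with 1/3 each.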

The resource-sharing guarantee for a group may partially or totally switch from bandwidth to time, if providing bandwidth guarantees to the group lowers the throughput too much. This switch occurs on a per-process basis: if serving a process of a leaf group in such a way that it receives its share of the bandwidth causes throughput loss, then BFQ switches back to just time-based proportional share for that process.

Interface

To get proportional sharing of bandwidth with BFQ for a given device, BFQ must of course be the active scheduler for that device.

Within each group directory, the names of the files associated with BFQ-specific cgroup parameters and stats begin with the bfq. prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for BFQ-specific files is blkio.bfq. or io.bfq., respectively. For example, the group parameter to set the weight of a group with BFQ is blkio.bfq.weight or io.bfq.weight.
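A small sketch of these naming rules follows. The helper names are hypothetical. The check for the active scheduler assumes the usual format of the per-device sysfs file (typically /sys/block/<device>/queue/scheduler), in which the active scheduler is enclosed in square brackets; neither the path nor the format is stated on this page.

```python
def bfq_cgroup_file(cgroup_version, parameter):
    """Build the name of a BFQ-specific cgroup file, e.g. blkio.bfq.weight."""
    prefixes = {1: "blkio.bfq.", 2: "io.bfq."}
    return prefixes[cgroup_version] + parameter


def active_scheduler(scheduler_file_text):
    """Extract the bracketed entry from text such as 'mq-deadline kyber [bfq] none'."""
    for token in scheduler_file_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


print(bfq_cgroup_file(1, "weight"))                      # blkio.bfq.weight
print(bfq_cgroup_file(2, "weight"))                      # io.bfq.weight
print(active_scheduler("mq-deadline kyber [bfq] none"))  # bfq
```

Before writing a weight to the file returned by bfq_cgroup_file, a script can use active_scheduler on the contents of the device's scheduler file to confirm that BFQ is in charge of that device.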

Parameters to set

For each group, there is only the following parameter to set.

weight (namely blkio.bfq.weight or io.bfq.weight): the weight of the group inside its parent. Available values: 1..10000 (default 100). The linear mapping between ioprio and weights, described at the beginning of the Tunables section, is still valid, but all weights higher than IOPRIO_BE_NR*10 are mapped to ioprio 0.
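The sketch below illustrates the mapping and the value range. It relies on two assumptions that are not spelled out on this page: that IOPRIO_BE_NR is 8, as in Linux, and that the linear mapping is weight = (IOPRIO_BE_NR - ioprio) * 10, as given in the kernel documentation file. The function names are hypothetical.

```python
IOPRIO_BE_NR = 8       # assumption: number of best-effort priority levels
WEIGHT_MIN = 1         # value range as given above
WEIGHT_MAX = 10000
WEIGHT_DEFAULT = 100


def ioprio_to_weight(ioprio):
    """Assumed linear mapping: weight = (IOPRIO_BE_NR - ioprio) * 10."""
    if not 0 <= ioprio < IOPRIO_BE_NR:
        raise ValueError(f"ioprio must be in 0..{IOPRIO_BE_NR - 1}")
    return (IOPRIO_BE_NR - ioprio) * 10


def is_valid_weight(weight):
    """Check a group weight against the documented range."""
    return WEIGHT_MIN <= weight <= WEIGHT_MAX


def exceeds_ioprio_range(weight):
    """True for weights above IOPRIO_BE_NR*10, which are all mapped to ioprio 0."""
    return weight > IOPRIO_BE_NR * 10


for ioprio in (0, 4, 7):
    print(ioprio, ioprio_to_weight(ioprio))
print(exceeds_ioprio_range(WEIGHT_DEFAULT))
```

Under these assumptions ioprio 0 corresponds to weight 80 and ioprio 7 to weight 10, so the default group weight of 100 already lies above the range that ioprio values can express.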

TODO