Budget Fair Queueing (BFQ) Storage-I/O Scheduler

This page contains a short description of the properties of BFQ and of how it works. Alternatively, and in addition to the demos mentioned in the homepage, in the following technical reports you can find many more details than those reported in this page:

Both technical reports are full versions of two papers published on the IEEE Transactions on Computers and in the proceedings of SYSTOR'12; please cite these papers and not the above technical reports (you can find information about these papers in my homepage). In the second technical report you can also find a detailed comparison against other production and research schedulers. A shorter comparison can be found here. Finally, a short history of BFQ is available here.

Properties

BFQ (Budget Fair Queueing) is a Proportional Share, or equivalently Fair Queueing, disk scheduler that allows each process/thread to be assigned a fraction of the disk throughput. It has the following characteristics.

Internals

BFQ is based on CFQ code, but it implements a more accurate scheduling policy. In brief, differently from CFQ, BFQ does not grant the disk to each process for a fixed time-slice, but until the process exhausts a previously assigned budget, measured in number of sectors. Budgets are scheduled with an internal budget scheduler, B-WF2Q+, a modified version of the WF2Q+ packet scheduler.

In more detail, in BFQ, as in CFQ, synchronous requests are collected in per-task queues, and asynchronous requests are collected in per-device (or, in case of hierarchical scheduling, per group) queues. When a new queue is created, it is assigned a budget, measured in number of sectors, to use when the queue itself is granted access to the disk. BFQ exclusively serves one queue at a time. When the underlying device driver asks for the next request to serve, BFQ selects the first request from the queue under service in C-LOOK order and returns it to the driver. The budget of the queue is decremented by the size of the request. As in CFQ, if a sync queue has no more requests to serve, but it has some budget left, the scheduler idles (i.e., it tells to the device driver that it has no requests to serve even if there are other backlogged queues) for a short period, waiting for a new request from the task associated to the queue. The queue under service goes on being served until one of the following conditions is met:

When the queue is deselected, it is assigned a new budget, computed as a function of which of the above events occurred. The next queue to serve is selected with B-WF2Q+, a modified version of the WF2Q+ packet scheduler. B-WF2Q+ schedules backlogged queues as a function of: their budgets, their weights and the service they have already received.

Working in the service domain and not in the time domain is what allows BFQ to distribute the disk throughput as desired, unconditionally and even if it fluctuates. In addition, the accurate service distribution of the internal B-WF2Q+ scheduler, guarantees tight delays with respect to the completion times that the requests would enjoy in an ideal perfectly-fair system.

The idea for reducing the latency for interactive and soft real-time processes (when low_latency is enabled) is just raising the weight of the associated queues. More details can be found here. An important point is that such an heuristics allows interactive and soft real-time queues to steal more throughput than their fair share to other queues. If the resulting loss of fairness is undesired (e.g., in a server system), it is certainly better to switch low_latency off.

When hierarchical scheduling is enabled, queues are collected in a tree of groups, and there is a distinct B-WFQ2+ scheduler on each non-leaf node. Leaf nodes are request queues as in the non-hierarchical case.

BFQ supports ioprio classes at each hierarchy level, enforcing a strict priority ordering among classes. This means that idle queues/groups are served only if there are no best effort queues/group in the same scheduler, and best effort queues/groups are served only if there are no real-time queues/groups.

Tunables

Some of the parameters available to the user for configuring BFQ, exported through the canonical I/O scheduler sysfs interface are:

All the other parameters have the same meaning as in CFQ.

Group Scheduling

BFQ supports full hierarchical scheduling; internally it controls all the devices independently from each other, but to simplify the user interface, by now, it exports only a single attribute set per-group.

Each group has the same I/O scheduling parameters that also tasks have:

The latter parameter has been inserted to preserve the same interface used to set task weights when groups are not used. Unfortunately, through this parameter only weight values in {1, 2, ..., IOPRIO_BE_NR} can be set. With groups one can however also set the weights directly, using the following additional parameter:

TODO

 
Last updated: October 27 2015.
Paolo Valente (paolo DOT valente AT unimore DOT it)