This page contains a short description of the properties of BFQ, of how it works, of its interface, and of what still remains to do. Alternatively, and in addition to the demos mentioned on the homepage, the following technical reports provide many more technical details than this page:
- This technical report is currently the most up-to-date and easy-to-read document on BFQ, more precisely on BFQ-v7r8.
- This technical report, more detailed than the above one, contains a medium-depth description of BFQ, more precisely of BFQ-v1; it also contains both a description of the benchmark suite and various experimental results with the original version of BFQ, BFQ-v1 and CFQ.
- All the details about the original algorithm can be found, instead, in this more theoretical technical report.
BFQ (Budget Fair Queueing) is a Proportional Share, or equivalently Fair Queueing, disk scheduler that allows each process/thread to be assigned a fraction of the disk throughput. It has the following characteristics.
- It distributes the throughput to I/O-bound processes as desired, even if the throughput fluctuates, independently of the device parameters and with any workload. BFQ does not provide this sector-domain fairness to processes issuing random requests, as doing so would easily cause the throughput to drop on one hand, and other processes to experience very high latencies on the other. For these processes BFQ switches to a more appropriate time-domain fairness, in which it is the device time that is distributed fairly (basically the same service scheme as CFQ).
- BFQ guarantees to each disk request a tight delay with respect to the completion time that the request would enjoy in an ideal (unfeasible) perfectly-fair system. This is especially beneficial for soft real-time applications.
- BFQ exports a low_latency tunable. If enabled (currently the default), BFQ applies a special heuristic that automatically gives interactive and soft real-time applications more than their fair share of the disk throughput, to reduce their latency. According to our results, for desktop or handheld usage, the system becomes virtually as responsive as if the disk were idle, whatever the actual disk load is. Soft real-time applications enjoy up to 3-times lower latencies than under CFQ and suffer almost no glitches even in the presence of heavy background workloads.
- Low-latency guarantees are preserved also in the presence of NCQ.
- Optimized (from v4) to achieve a high throughput on SSDs without losing low-latency guarantees.
- According to our results, BFQ achieves up to 30% higher aggregate disk throughput than CFQ with most of the workloads considered, or the same throughput with the others.
- Unlike CFQ, BFQ uses (from v6) a unified mechanism (Early Queue Merge, EQM) to get a sequential read pattern, and hence a high throughput, with any set of processes performing interleaved I/O. EQM also preserves low latency. More details here.
Internals
BFQ is based on CFQ code, but it implements a more accurate scheduling policy. In brief, unlike CFQ, BFQ does not grant the disk to each process for a fixed time slice, but until the process exhausts a previously assigned budget, measured in number of sectors. Budgets are scheduled with an internal budget scheduler, B-WF2Q+, a modified version of the WF2Q+ packet scheduler.
In more detail, in BFQ, as in CFQ, synchronous requests are collected in per-task queues, and asynchronous requests are collected in per-device (or, in case of hierarchical scheduling, per-group) queues. When a new queue is created, it is assigned a budget, measured in number of sectors, to use when the queue itself is granted access to the disk. BFQ exclusively serves one queue at a time. When the underlying device driver asks for the next request to serve, BFQ selects the first request from the queue under service in C-LOOK order and returns it to the driver. The budget of the queue is decremented by the size of the request. As in CFQ, if a sync queue has no more requests to serve, but it has some budget left, the scheduler idles (i.e., it tells the device driver that it has no requests to serve even if there are other backlogged queues) for a short period, waiting for a new request from the task associated with the queue. The queue under service goes on being served until one of the following conditions is met:
- the idling timer expires while BFQ is idling, waiting for the next request from the task associated with the queue,
- the queue is spending too much time to consume its budget (budget timeout, see the timeout_* parameters below), or
- the queue exhausts its budget, or
- the queue is not sync and has no more requests to serve.
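The budget-accounting logic just described can be sketched as follows. This is a hypothetical, user-space simplification (all names are invented for illustration); the real BFQ operates on kernel request structures:

```python
# Toy sketch of BFQ's budget accounting (hypothetical simplification,
# not kernel code): the in-service queue hands out requests in order,
# and its budget shrinks by the size of each dispatched request.

class ToyQueue:
    def __init__(self, budget_sectors):
        self.budget = budget_sectors      # remaining budget, in sectors
        self.requests = []                # pending request sizes, in sectors

    def dispatch(self):
        """Return the next request size, charging it to the budget, or
        None if the queue must be deselected (budget exhausted or
        nothing left to serve)."""
        if not self.requests or self.budget <= 0:
            return None
        req = self.requests.pop(0)
        self.budget -= req                # charge the request size
        return req

q = ToyQueue(budget_sectors=16)
q.requests = [8, 8, 8]
served = []
while (r := q.dispatch()) is not None:
    served.append(r)
# two 8-sector requests fit in the 16-sector budget; the third must
# wait until the queue is reselected with a fresh budget
```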
When the queue is deselected, it is assigned a new budget, computed as a function of which of the above events occurred. The next queue to serve is selected with B-WF2Q+, a modified version of the WF2Q+ packet scheduler. B-WF2Q+ schedules backlogged queues as a function of: their budgets, their weights and the service they have already received.
Working in the service domain and not in the time domain is what allows BFQ to distribute the throughput as desired, unconditionally and even if it fluctuates. In addition, the accurate service distribution of the internal B-WF2Q+ scheduler guarantees tight delays with respect to the completion times that the requests would enjoy in an ideal perfectly-fair system.
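The selection rule that B-WF2Q+ builds on can be illustrated with a deliberately simplified, user-space sketch of the WF2Q+ scheme. The names and the virtual-time bookkeeping below are simplifying assumptions for illustration, not the kernel implementation:

```python
# Simplified WF2Q+-style selection (hypothetical user-space sketch).
# Each queue carries virtual start/finish timestamps; the finish time
# advances by budget/weight, so a queue with twice the weight is
# selected twice as often.

def select_next(queues, vtime):
    """Among queues eligible at virtual time vtime (start <= vtime),
    pick the one with the smallest virtual finish time."""
    eligible = [q for q in queues if q["start"] <= vtime]
    return min(eligible, key=lambda q: q["finish"]) if eligible else None

def charge(q, budget):
    """Account one served budget to queue q."""
    q["start"] = q["finish"]
    q["finish"] = q["start"] + budget / q["weight"]

BUDGET = 12
a = {"name": "A", "weight": 2, "start": 0.0, "finish": BUDGET / 2}
b = {"name": "B", "weight": 1, "start": 0.0, "finish": BUDGET / 1}
total_weight = a["weight"] + b["weight"]

order, vtime = [], 0.0
for _ in range(6):
    q = select_next([a, b], vtime)
    order.append(q["name"])
    charge(q, BUDGET)
    vtime += BUDGET / total_weight    # crude system virtual-time advance

# order == ['A', 'B', 'A', 'A', 'B', 'A']: a 2:1 service ratio,
# matching the 2:1 weights
```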
If several processes are competing for the device at the same time, but all processes and groups have the same weight, then BFQ guarantees the expected throughput distribution without ever idling the device. It uses preemption instead. Throughput is then much higher in this common scenario.
The idea for reducing the latency of interactive and soft real-time processes (when low_latency is enabled) is simply to raise the weight of the associated queues. More details can be found here. An important point is that this heuristic allows interactive and soft real-time queues to steal more throughput than their fair share from other queues. If the resulting loss of fairness is undesired (e.g., in a server system), it is certainly better to switch low_latency off.
When hierarchical scheduling is enabled, queues are collected in a tree of groups, and there is a distinct B-WF2Q+ scheduler on each non-leaf node. Leaf nodes are request queues as in the non-hierarchical case.
BFQ supports ioprio classes at each hierarchy level, enforcing a strict priority ordering among classes. This means that idle queues/groups are served only if there are no best-effort queues/groups in the same scheduler, and best-effort queues/groups are served only if there are no real-time queues/groups.
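The strict class ordering can be sketched as follows (a hypothetical illustration; function and class names are invented, not BFQ's internal identifiers):

```python
# Hypothetical sketch of the strict ioprio-class ordering described
# above: within one scheduler, real-time (rt) queues are always served
# before best-effort (be) queues, which are served before idle queues.

CLASS_ORDER = ("rt", "be", "idle")   # highest priority first

def pick_class(backlogged):
    """Return the highest-priority class that has backlogged queues,
    or None if nothing is pending.  `backlogged` maps a class name to
    its list of pending queues."""
    for cls in CLASS_ORDER:
        if backlogged.get(cls):
            return cls
    return None

# best-effort wins here only because no real-time queue is backlogged:
chosen = pick_class({"rt": [], "be": ["q1"], "idle": ["q2"]})
```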
Here is a short description of the main parameters available to the user for configuring BFQ, exported through the canonical I/O scheduler sysfs interface. See Documentation/block/bfq-iosched.txt for a detailed description of all the parameters.
- per-process ioprio and weight: unless the cgroups interface is used, weights can be assigned to processes only indirectly, through I/O priorities, and according to the relation: weight = (IOPRIO_BE_NR - ioprio) * 10.
- slice_idle: specifies how long BFQ should idle for the next I/O request when certain sync BFQ queues become empty. By default slice_idle is non-zero. Idling has a double purpose: boosting throughput with sequential workloads, especially on rotational devices, and making sure that the desired throughput distribution is respected.
- strict_guarantees: if set (default: unset), then BFQ (a) always performs idling when the in-service queue becomes empty, (b) forces the device to serve one I/O request at a time, by dispatching a new request only if there is no outstanding request. In the presence of differentiated weights or I/O-request sizes, both the above conditions are needed to guarantee that every BFQ queue receives its allotted share of the bandwidth. Setting strict_guarantees may evidently affect throughput.
- timeout_sync: Maximum amount of device time that can be given to a task (queue) once it has been selected for service. On devices with costly seeks, increasing this time usually increases maximum throughput. On the opposite end, increasing this time coarsens the granularity of the short-term bandwidth and latency guarantees, especially if the following parameter is set to zero.
- max_budget: Maximum amount of service, measured in sectors, that can be provided to a BFQ queue once it is set in service (of course within the limits of the above timeout). As explained in the description of the algorithm, larger values increase the throughput in proportion to the percentage of sequential I/O requests issued. The price of larger values is that they coarsen the granularity of short-term bandwidth and latency guarantees. The default value is 0, which enables auto-tuning: BFQ sets max_budget to the maximum number of sectors that can be served during timeout_sync, according to the estimated peak rate.
- low_latency: used to enable/disable BFQ's low latency mode. By default, low latency mode is enabled. If enabled, interactive and soft real-time applications are privileged and experience a lower latency, as explained in more detail in the description of how BFQ works.
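Two of the relations above can be made concrete with a small sketch. The helper names are hypothetical; IOPRIO_BE_NR is 8 in the kernel's ioprio interface, and the rate/time units in the auto-tuning sketch are purely illustrative:

```python
# The ioprio -> weight relation and the max_budget auto-tuning rule
# described above, as a small sketch (hypothetical helper names).

IOPRIO_BE_NR = 8   # number of best-effort priority levels in the kernel

def ioprio_to_weight(ioprio):
    """Map a best-effort I/O priority (0 = highest, 7 = lowest) to a
    BFQ weight, per weight = (IOPRIO_BE_NR - ioprio) * 10."""
    assert 0 <= ioprio < IOPRIO_BE_NR
    return (IOPRIO_BE_NR - ioprio) * 10

def autotune_max_budget(peak_rate_sectors_per_ms, timeout_sync_ms):
    """With max_budget = 0, BFQ caps budgets at the number of sectors
    servable within timeout_sync at the estimated peak rate (units
    here are illustrative)."""
    return int(peak_rate_sectors_per_ms * timeout_sync_ms)

# ioprio 0 -> weight 80; ioprio 4 (the default) -> weight 40;
# e.g., ~200 sectors/ms (~100 MB/s with 512 B sectors) and a 125 ms
# timeout give a 25000-sector budget cap
```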
BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely blkio and io. In particular, BFQ supports weight-based proportional share.
Service guarantees provided
With BFQ, proportional share means true proportional share of the device bandwidth, according to group weights. For example, a group with weight 200 gets twice the bandwidth, and not just twice the time, of a group with weight 100.
BFQ supports hierarchies (group trees) of any depth. Bandwidth is distributed among groups and processes in the expected way: for each group, the children of the group share the whole bandwidth of the group in proportion to their weights. In particular, this implies that, for each leaf group, every process of the group receives the same share of the whole group bandwidth, unless the ioprio of the process is modified.
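The way shares compose down a hierarchy can be illustrated with a small computation on a hypothetical example tree (the group names and the helper are invented for illustration):

```python
# Illustrative computation of how bandwidth shares compose down a
# group hierarchy: each node's share is its weight fraction among its
# siblings, multiplied by its parent's share.

def shares(tree, parent_share=1.0):
    """tree: {name: (weight, subtree_or_None)}; return the absolute
    bandwidth share of every node in the hierarchy."""
    total = sum(w for w, _ in tree.values())
    out = {}
    for name, (w, sub) in tree.items():
        share = parent_share * w / total
        out[name] = share
        if sub:
            out.update(shares(sub, share))
    return out

# hypothetical hierarchy: grp-a (weight 200) holds two equal-weight
# processes; grp-b (weight 100) is a sibling group
root = {
    "grp-a": (200, {"p1": (100, None), "p2": (100, None)}),
    "grp-b": (100, None),
}
s = shares(root)
# grp-a gets 2/3 of the device bandwidth, so p1 and p2 get 1/3 each;
# grp-b gets the remaining 1/3
```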
The resource-sharing guarantee for a group may partially or totally switch from bandwidth to time, if providing bandwidth guarantees to the group lowers the throughput too much. This switch occurs on a per-process basis: if a process of a leaf group causes throughput loss if served in such a way to receive its share of the bandwidth, then BFQ switches back to just time-based proportional share for that process.
To get proportional sharing of bandwidth with BFQ for a given device, BFQ must of course be the active scheduler for that device.
Within each group directory, the names of the files associated with BFQ-specific cgroup parameters and stats begin with the bfq. prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for BFQ-specific files is blkio.bfq. or io.bfq. For example, the group parameter to set the weight of a group with BFQ is blkio.bfq.weight or io.bfq.weight.
Parameters to set
For each group, there is only the following parameter to set.
weight (namely blkio.bfq.weight or io.bfq.weight): the weight of the group inside its parent. Available values: 1..10000 (default 100). The linear mapping between ioprio and weights, described at the beginning of the tunable section, is still valid, but all weights higher than IOPRIO_BE_NR*10 are mapped to ioprio 0.
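The clamped inverse mapping stated above can be sketched as follows (a hypothetical helper; the floor division is an assumption about how intermediate weights round, the clamping to ioprio 0 is what the text states):

```python
# Sketch of the clamped weight -> ioprio mapping described above:
# weights above IOPRIO_BE_NR * 10 (i.e., above 80) all map to ioprio 0.

IOPRIO_BE_NR = 8

def weight_to_ioprio(weight):
    """Invert weight = (IOPRIO_BE_NR - ioprio) * 10, clamping any
    weight above IOPRIO_BE_NR * 10 to ioprio 0 (floor division is an
    illustrative assumption)."""
    return max(0, IOPRIO_BE_NR - weight // 10)

# weight 10 -> ioprio 7; weight 40 -> ioprio 4; weight 80 -> ioprio 0;
# weight 10000 -> ioprio 0 (clamped)
```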
What still remains to do
- Find ways to address or at least mitigate occasional high latencies for soft real-time applications, caused by:
- high in-device service time of unlucky, dispatched I/O requests
- group hierarchies (e.g., critical task in one group, interfered by a task in another group)
- I/O blocking above blk, caused by writeback pressure
- Fix compilation of BFQ as a module (for some reason it prevents BFQ from being selected as the default scheduler).
- Investigate ways to tackle the write-hog problem even better; this problem causes interactive applications to experience, occasionally, much higher latency than average. For example, properly using the may_queue function may provide better results.
- Carefully check to what extent BFQ locking policies and execution time may become an issue with high-speed SSDs.
- Check whether it is really necessary to defer __blk_run_queue on completions, or whether, in contrast, __blk_run_queue can just be invoked immediately.