This page contains a short description of the properties of BFQ and of how it works. Alternatively, and in addition to the demos mentioned in the homepage, in the following technical reports you can find many more details than those reported in this page:
- This technical report is currently the most up-to-date and easy-to-read document on BFQ, more precisely on BFQ-v7r8.
- This technical report, more detailed than the above one, contains a medium-depth description of BFQ, more precisely of BFQ-v1; it also contains both a description of the benchmark suite and various experimental results with the original version of BFQ, BFQ-v1 and CFQ.
- All the details about the original algorithm can be found, instead, in this more theoretical technical report.
BFQ (Budget Fair Queueing) is a Proportional Share, or equivalently Fair Queueing, disk scheduler that allows each process/thread to be assigned a fraction of the disk throughput. It has the following characteristics.
- It distributes the disk throughput to I/O-bound processes as desired, even if it fluctuates, independently of the disk parameters and with any workload. BFQ does not provide this sector-domain fairness to processes issuing random requests, as this would easily cause the disk throughput to drop on one hand, and other processes to experience very high latencies on the other hand. For these processes BFQ switches to a more appropriate time-domain fairness, in which it is the disk time to be fairly distributed (basically the same service scheme of CFQ).
- BFQ guarantees to each disk request a tight delay with respect to the completion time that the request would enjoy in an ideal (unfeasible) perfectly-fair system. This is especially beneficial for soft real-time applications.
- BFQ exports a low_latency tunable. If enabled (currently the default), BFQ executes a special heuristics that automatically gives to interactive and soft real-time applications more than their fair share of the disk throughput, to reduce their latency. According to our results, for desktop or handheld usage, the system becomes virtually as responsive as if the disk was idle, whatever the actual disk load is. Soft real-time applications enjoy up to 3-time lower latencies than under CFQ and do not suffer from almost any glitch even in presence of heavy background workloads.
- Low-latency guarantees are preserved also in presence of NCQ.
- Optimized (from v4) to achieve a high throughput on SSDs without losing low-latency guarantees.
- According to our results, BFQ achieves up to 30% higher aggregate disk throughput than CFQ with most of the workloads considered, or the same throughput with the others.
- Differently from CFQ, BFQ uses (from v6) a unified mechanism (Early Queue Merge, EQM) to get a sequential read pattern, and hence a high throughput, with any set of processes performing interleaved I/O. EQM also preserves low latency. More details here.
InternalsBFQ is based on CFQ code, but it implements a more accurate scheduling policy. In brief, differently from CFQ, BFQ does not grant the disk to each process for a fixed time-slice, but until the process exhausts a previously assigned budget, measured in number of sectors. Budgets are scheduled with an internal budget scheduler, B-WF2Q+, a modified version of the WF2Q+ packet scheduler.
In more detail, in BFQ, as in CFQ, synchronous requests are collected in per-task queues, and asynchronous requests are collected in per-device (or, in case of hierarchical scheduling, per group) queues. When a new queue is created, it is assigned a budget, measured in number of sectors, to use when the queue itself is granted access to the disk. BFQ exclusively serves one queue at a time. When the underlying device driver asks for the next request to serve, BFQ selects the first request from the queue under service in C-LOOK order and returns it to the driver. The budget of the queue is decremented by the size of the request. As in CFQ, if a sync queue has no more requests to serve, but it has some budget left, the scheduler idles (i.e., it tells to the device driver that it has no requests to serve even if there are other backlogged queues) for a short period, waiting for a new request from the task associated to the queue. The queue under service goes on being served until one of the following conditions is met:
- the idling timer expires while BFQ is idling to wait for the arrival of the next request from the task associated to the queue
- the queue is spending too much time to consume its budget (budget timeout, see the timeout_* parameters below), or
- the queue exhausts its budget, or
- the queue is not sync and has no more requests to serve.
When the queue is deselected, it is assigned a new budget, computed as a function of which of the above events occurred. The next queue to serve is selected with B-WF2Q+, a modified version of the WF2Q+ packet scheduler. B-WF2Q+ schedules backlogged queues as a function of: their budgets, their weights and the service they have already received.
Working in the service domain and not in the time domain is what allows BFQ to distribute the disk throughput as desired, unconditionally and even if it fluctuates. In addition, the accurate service distribution of the internal B-WF2Q+ scheduler, guarantees tight delays with respect to the completion times that the requests would enjoy in an ideal perfectly-fair system.
The idea for reducing the latency for interactive and soft real-time processes (when low_latency is enabled) is just raising the weight of the associated queues. More details can be found here. An important point is that such an heuristics allows interactive and soft real-time queues to steal more throughput than their fair share to other queues. If the resulting loss of fairness is undesired (e.g., in a server system), it is certainly better to switch low_latency off.
When hierarchical scheduling is enabled, queues are collected in a tree of groups, and there is a distinct B-WFQ2+ scheduler on each non-leaf node. Leaf nodes are request queues as in the non-hierarchical case.
BFQ supports ioprio classes at each hierarchy level, enforcing a strict priority ordering among classes. This means that idle queues/groups are served only if there are no best effort queues/group in the same scheduler, and best effort queues/groups are served only if there are no real-time queues/groups.
Some of the parameters available to the user for configuring BFQ, exported through the canonical I/O scheduler sysfs interface are:
- timeout_sync, timeout_async: the maximum amount of disk time that can be given to a task once it has been selected for service, respectively for synchronous and asynchronous queues. It allows the user to specify a maximum slice length to put an upper bound to the latencies imposed by the scheduler.
- max_budget: the maximum amount of service, measured in disk sectors, that can be provided to a queue once it is selected (of course within the limits of the above timeouts). According to what we said in the description of the algoritm, larger values increase the throughput for the single tasks and for the system, in proportion to the percentage of sequential requests issued. The price is increasing the maximum latency a request may incur in. The default value is 0, which enables auto-tuning: BFQ tries to estimate it as the maximum number of sectors that can be served during timeout_sync.
- max_budget_async_rq: in addition to the max_budget, limit, async queues are served for a maximum number of requests, after that a new queue is selected.
- low_latency: if equal to 1 (default value), interactive and soft real-time applications are privileged and experience a lower latency.
- A few parameters related to the low-latency heuristics are also exported, mainly because, as of now, I do not have the time to maintain a separate development branch for BFQ. Feel free to ask me, if you want to know more about them.
All the other parameters have the same meaning as in CFQ.
BFQ supports full hierarchical scheduling; internally it controls all the devices independently from each other, but to simplify the user interface, by now, it exports only a single attribute set per-group.
Each group has the same I/O scheduling parameters that also tasks have:
- bfqio.ioprio_class: the scheduling class the group belongs to.
- bfqio.ioprio: the priority of the group inside its parent. It is linearly mapped into a weight, with weight = IOPRIO_BE_NR - ioprio.
- bfqio.weight: the weight of the group inside its parent. Available values: 1..1000. The above linear mapping between ioprio and weights is still valid, except for mapping all the weights higher than IOPRIO_BE_NR to ioprio 0.
- Find ways to address or at least mitigate occasional high latencies for soft real-time applications, caused by:
- high in-device service time of unlucky, dispatched I/O requests
- group hierarchies (e.g., critical task in one group, interfered by a task in another group)
- I/O blocking above blk, caused by writeback pressure
- Fix compilation of BFQ as a module (for some reason it prevents BFQ from being select as the default scheduler
- Investigate ways to tackle even better the write-hog problem, which causes interactive applications to experience, occasionally, a quite higher latency than the average. For example, properly using the may_queue function may provide better results.
- Carefully check at what extent BFQ locking policies and execution time may become an issue with high-speed SSDs.
- Check whether it is really necessary to defer the __blk_run_queue on completions, or, in contrast, __blk_run_queue can be just invoked immediately.