BFQ and related stuff on disk scheduling

Here you can find a table showing the performance of bfq-v4 on an SSD, plus an annotated summary of the results for bfq-v1 under 2.6.32/33/34/35/36/... for several benchmarks (all the tests I have not yet run are pointed out too). This summary contains somehow also the story of how I got to the idea of the low-latency heuristic that enabled bfq to provide a definitely low latency to interactive applications. Newer versions of bfq have of course better performance, see the changelogs for details. Apart from the next table, the rest of the numbers are about rotational disks.

To get the results reported in this page I implemented an ad hoc benchmark suite. Both a more accurate description of most of the results with rotational disks and an overview of the benchmark suite can be found in this technical report.

Results of bfq-v4 for 3.5.0 on an Intel 320 160G SSD

The first two rows of the next table report the aggregate throughput achieved by noop, deadline, sio, cfq and bfq, under a workload made of 10 sequential or random readers running in parallel (readers are raw because we preferred to read directly from the device instead of from 10 large files, so as to not need to create the files in advance and hence to wear out our costly SSD :) ).

The remaing raws report the cold-cache start-up time experienced by various applications while 10 sequential or random readers are running at the same time. In particular, each run of these start-up-time benchmarks lasts 60 seconds, so, where you find '>60', it means that the application under test started only after the the whole benchmark finished, i.e., only when the background workload was turned off. Besides, STARTUP SEQ/RAND stands for "start-up time of the application against a sequential/random background workload"

--------------------------------------------------------------------------
|                      ELEVATOR                       |       Test       | 
-------------------------------------------------------------------------- 
|   NOOP   | DEADLINE |   SIO    |   CFQ    |   BFQ   |                  | 
-------------------------------------------------------------------------- 
|          |          |          |          |         |     AGG_THR      |
|          |          |          |          |         |      (MB/s)      |
|   265    |   264    |   263    |   268    |   263   |    10 raw seq    | 
|   142    |   142    |   142    |   138    |   140   |    10 raw rand   | 
--------------------------------------------------------------------------
|          |          |          |          |         |  STARTUP 10 seq  |
|          |          |          |          |         |       (s)        |
|   1.96   |   1.99   |   1.94   |   >60    |   0.20  |  gnome-terminal  | 
|  13.41   |  13.50   |  13.38   |   >60    |   0.94  |     konsole      | 
|   1.77   |   1.72   |   1.66   |   >60    |   0.30  |     xterm        | 
|  10.39   |  11.03   |  11.06   |   >60    |   1.25  |    oowriter      | 
--------------------------------------------------------------------------
|          |          |          |          |         |  STARTUP 10 rand |
|          |          |          |          |         |       (s)        |
|   0.13   |   0.14   |   0.13   |  11.43   |   0.14  |  gnome-terminal  | 
|   0.62   |   0.61   |   0.61   |  16.14   |   0.80  |     konsole      | 
|   0.11   |   0.10   |   0.11   |  10.09   |   0.25  |     xterm        | 
|   0.83   |   0.83   |   0.83   |   >60    |   1.00  |    oowriter      | 
--------------------------------------------------------------------------

Results of bfq-v1 on various systems with a rotational disk

The relative performance between bfq and cfq is quite similar under any of these kernels. Hence, for brevity, only the results with 2.6.34 are described at length. For 2.6.33 I point out only the cases where the results differ significantly from the ones with 2.6.34. In a similar vein, what happens with 2.6.32 and 2.6.35 is only shortly summarized at the end of each section. Finally, under 2.6.36 the relative performance of bfq with respect to cfq for any of the tests is better than under the other schedulers. So the results for 2.6.36 are not reported at all.

The tests have been run on the following three systems with 2.6.32/33/34/35/36 (the reported peak rates have been measured with hdparm):

DISK					CPU			DISTRIBUTION	FS
ATA MAXTOR 6L080L0 (82 GB, 55 MB/s)	Athlon 64 3200+		Slackware 13.0	ext3
ATA MAXTOR 7L250S0 (250 GB, 61 MB/s)	Pentium 4 3.2GHz	Ubuntu 8.04	ext3
ATA MAXTOR STM332061 (320 GB, 89 MB/s)	Athlon 64 3200+		Ubuntu 10.04	ext4

For short, I will call each of these systems just the slow, medium or fast disk. Still to do: I did not yet run tests on RAIDs.

In all the tests the low_latency tunable of cfq was on. Each test was repeated ten times, and the values shown in the sections below are averages computed over these repetitions. The statistics I collected are however much more detailed: min/max/avg/standard deviation/confidence interval for several quantities (latency, read/write/ total aggregated throughput, ...) have been computed within each test and across multiple repetitions of each test. All these numbers, for the fast disk, can be found in this archive.

The results with 2.6.33 and 2.6.34 are essentially similar. However, whereas bfq was quite stable, cfq got slightly better results with 2.6.34. In contrast, with 2.6.35 the latencies achieved by bfq are definitely lower than with all the other kernels. So, for brevity I report here only the results with 2.6.34, apart from where they differ significantly wioth respect to 2.6.33. In that case also the results with 2.6.33 are reported. As previously said, the results with 2.6.32 and 2.6.35 are only shortly summarized at the end of each section.

COLD-CACHE STARTUP LATENCY OF SHORT COMMANDS

Executing short commands from a shell should be a common task for
which one may desire low latency on a desktop or server. To get an
idea of the latency guaranteed by bfq to these commands in presence of
peak disk workloads, I measured the time to launch some of them with
cold caches and while one of the following sets of readers/writers
were being executed in parallel: ten sequential readers (hereafter
called 10r-seq for short), ten random readers (10r-rand), five
sequential readers plus five sequential writers (5r5w-seq), five
random readers plus five random writers (5r5w-rand).

One of the commands tested was "bash -c exit". On the fast disk it was
executed in 0.32 seconds with 10r-seq or 0.41 secs with 10r-rand,
against 0.55 or 0.33 secs under cfq. This time boiled down to 0.20
secs with 5r5w-seq or 0.26 secs with 5r5w-rand, against 0.57 or 0.66
under cfq. Hence, apart from the 10r-rand case, where cfq provides a
0.08 secs lower latency, the latency with bfq is from almost 2 to
almost 3 times lower than with cfq. Consider that bfq achieves these
results without using any additional mechanism/policy to favour
latency-sensitive applications, as cfq instead does.

In each of the ten repetitions of the test, "bash -c exit" was
executed ten times, spaced by 1 second from each other. Under these
conditions, the disk aggregated throughput with bfq was 72.2 MB/s with
10r-seq and 73.2 MB/s with 5r5w-seq, against 59.9 MB/s and 56.6 with
cfq. With regard to random workloads, with bfq I got 0.74 MB/s with
10r-rand and 1.04 MB/s with 5r5w-rand, against 0.64 MB/s and 0.85 with
cfq. In the end with bfq the throughput was from about 15% to about
30% higher.

This lower latency/higher throughput result is probably due to the
more accurate scheduling policy of bfq, which allows it get a low
latency even if assigning high budgets to the readers. More in detail,
the maximum possible budget that could be assigned was equal to the
estimated maximum amount of sectors that could be transferred in 125 ms
(the default budget timeout) on the target disk. And we verified that
bfq did assign such high budgets.

On each of the other two disks bfq got similar relative performance
against cfq as on the fast disk.  These latency results with short
commands should also confirm that bfq guarantees low latency to
periodic soft real-time applications, as we showed a while ago with
our experiments on video streaming.
See the performance comparison for those results on video streaming.
As for 2.6.32, bfq got similar results as with 2.6.33/34 kernels, whereas
cfq latencies grew by 3-4 times with 10r-seq and 5r5w-seq (with respect
to the latencies it achieved with the other kernels). In contrast, cfq
achieved approximately the same latencies as with the other kernels in case
of 10r-rand and 5r5w-rand. About the throughput, cfq provided a 14% higher
aggregated throughput than bfq with 10r-seq, but it lost the same percentage of
throughput with respect to bfq with 5r5w-seq. With 10r-rand and 5r5w-rand
cfq achieved approximately the same throughput as under the other kernels.

Also with 2.6.35 bfq achieved similar results as with the other
kernels.  cfq got similar results too in terms of aggregated
throughput. In contrast latencies were higher with most workloads,
especially 4.7 times higher with 5r5w-rand.


STARTUP LATENCY OF LONGER COMMANDS

What happens with longer commands? The main purpose of the scheduling
policy of bfq is fairness. As the size of the stuff that has to be
read to execute a command grows, fairness constraints become dominant
in determining how long it will take to complete the command on a
loaded disk. For example, if the disk is idle, it takes about 2
seconds to execute "konsole -e /bin/true" on the fast disk with cold
caches. And it takes around 19.9 secs with bfq if ten files are being
read at the same time, as fairness demands. This is still less than
the 26 secs measured with cfq, but it is definitely a long wait for a
user.

The appropriate, general solution would be at least assigning the
right weight to each application according to its requirements
(probably something more sophisticated is needed). Of course this
should be done somewhere above the block layer, which is definitely
out of the business of bfq :)
However, I tried to give the following low_latency tunable a try ...


EXPERIMENTAL LOW_LATENCY TUNABLE FOR BFQ, AND CAVEATS

When this tunable is on, the original weight of a just born queue is
multiplied by ten, then it is linearly decreased as the queue gets
service, until it becomes equal to the original weight. Especially,
with the current values of the parameters, the boosting is over, i.e.,
the original weight is reached, when the queue has got about 24MB
served. To prevent seeky processes from keeping a high weight for too
long, a queue has 6 seconds (real time) to consume these 24MB, after
which the boosting is unconditionally suspended. If an existing queue
becomes backlogged after being idle for 20 seconds or more, then it
enjoys this weight boosting again. Of course I got to these numbers
after some tuning or common sense (I hope :) ).

This mechanism is a source of unfairness against disk-bound long-lived
processes, which do not benefit from any weight boosting. Probably low
latency is almost always not worth this loss of fairness on a
server. In contrast, the user of a desktop may be willing to tolerate
a temporary drop of throughput of long-lived best-effort applications
(download, peer-to-peer exchange, ..) in return of a lower latency of
the task being started.

On the contrary, few users would be happy if their long-lived soft
real-time applications, as e.g. video or audio players, suffered from
increased latencies. To check this, I repeated several times the
following experiment with both bfq and cfq: 1) flushed caches, 2)
started playing a high-resolution film with mplayer, 3) started ten
sequential or random readers when the film reached a given point, 4)
measured the number of dropped frames after 10 seconds (shown on the
status line of the player). These are the results on the slow disk
(the one on which frame drops were of course higher). Apart from a few
unlucky runs for both schedulers, the results were more or less stable
around the following averages:

 Number of dropped frames - slow disk - BFQ vs CFQ
Avg with 10 seq readers | Avg with 10 rand readers |
    6 vs 10             |           5 vs 5         |

According to this test the boosting mechanism apparently causes no
harm with respect to what is guaranteed by cfq. This parity seems also
to suggest some caution in adding a similar mechanism to cfq.

One of the shortcomings of this test is the following. Since it was
better to give you numbers and not my impressions of how smoother the
video was played with bfq, I run this mplayer test on the three
systems remotely for practicality, with -ao null and -vo null. So, to
release bfq in this millennium, I did not yet perform also systematic
tests of loss of video/audio quality on real use cases (listening to
music, watching a movie, maybe an HD one). Iteratively launching
batches of file reads while playing back music is not my typical daily
activity :)

Now I can show you what happens after turning low_latency on in bfq.


COMMAND STARTUP LATENCY WITH LOW_LATENCY ON

Repeating the same test for "bash -c exit", the startup time falls to
at most 0.20 secs with any of the sequential workloads and at most
0.23 secs with any of the rand workloads. Hence it is always from
about 1.5 to about 3 times lower than cfq. Aggregated throughput
performance are substantially the same as with low_latency off.

Now the actual litmus test: "konsole -e /bin/true" :)
The results are summarized in the following table (it reports
latency/agg. throughput with bfq vs latency/agg. throughput with cfq):

	konsole startup - fast disk - BFQ vs CFQ
 Background workload |  Latency [sec]   | Agg. thr [MB/s]
---------------------------------------------------------
	10r-seq      |   2.88 vs 26.03  | 48 vs 58
	5r5w-seq     |   2.59 vs 14.18  | 71 vs 57
	10r-rand     |   2.87 vs 22.36  | 5.5 vs 1.5
	5r5w-rand    |   2.82 vs 13.03  | 5.8 vs 2.0

The latency drop is evident. The 9-times lower latency with 10r-seq is
however paid with a 20% loss of aggregated throughput, because with
this highly sequential workload the throughput fall is more pronounced
as the seeky pattern of konsole is favoured. In the other three cases
the aggregated throughput is always higher even if the latency is at
least 4.6 times lower.

I tested also "xterm /bin/true" and found that under bfq it is
executed in no more than 0.4 secs with any of the workloads, against
around 2.4 to 4.6 secs under cfq (depending on the
workload). Aggregated throughput performance is in the middle between
the results with konsole and with bash. On each of the other two disks
the relative performance of bfq against cfq was more or less the same
as on the fast disk, for any of the three commands.

As for 2.6.32, bfq got again approximately the same results as wih the
2.6.33/34 kernels. On the contrary, the latencies with cfq grew up to
46 seconds for konsole, and 5.4 seconds for xterm, with 10r-seq. The
relative performance with respect to bfq in terms of aggregated
throughput was similar to the one with the bash tests under 2.6.32.

Also with 2.6.35 bfq exhibited about the same performance as with the
other kernels. The same happened to cfq in terms of aggregated
throughput. The latencies of cfq were instead quite worse: up to 34.5
secs for konsole and 5.3 secs for xterm. In the end the konsole and
xterm execution times with bfq were from about 6 to about 14 times
lower than with cfq.


KERNEL DEVELOPMENT TASKS (MAKE, CHECKOUT, MERGE)

I measured how much a kernel make, git checkout or git merge
progresses while one of the usual four workloads considered so far is
being served. About the 5r5w-seq and 5r5w-rand workloads, I verified
that they let these tasks almost starve with both schedulers. In fact,
a significant part of the work done by these tasks is writing files,
but writes are system-wide and end up in common flush queues. So a
per-queue disk scheduler has apparently no opportunity to save the
writes generated by these tasks from being overwhelmed by the ones of
relentless greedy writers. Thus I report here only the results with
the two workloads comprised of only readers.

Let us start with make. As I verified through tracing, each cc1
invoked by make generates really few read requests, usually served in
about 100 ms under bfq. Then it spends the (definitely longer) rest of
its life either using the CPU or waiting for its writes to be handed
over to the vm subsystem. In the end, most of the (little) control of
a disk scheduler over the progress of a task like this is related to
how the scheduler proportions writes with respect to reads. And bfq
does it more or less as cfq does on both the slow and the fast
disks. On the contrary, under 2.6.33 bfq throttles writes more than
cfq on the medium disk, and this has negative consequences, as shown
last. First, here are the results for make on the fast disk:

	         make - fast disk - BFQ vs CFQ
 Background workload | Number of source files   | Agg. thr [MB/s]
	   	     | compiled during the test |
------------------------------------------------------------------
	10r-seq      |  219 vs 197              | 74 vs 57
	10r-rand     |  261 vs 270              | 0.7 vs 0.7

As can be seen, whereas the number of files compiled is about the same
with any of the two schedulers, the aggregated throughput with bfq is
higher with 10r-seq.

Things differ in case of git merge and git checkout. 

         git merge / git checkout - fast disk - BFQ vs CFQ
Background workload  | Progress of merge | Progress of checkout
	   	     | during the test   | during the test
------------------------------------------------------------------
	10r-seq      |  19% vs 1.3%      | 15.5% vs 2.5%
	10r-rand     |  25% vs 2.6%      | 40.7% vs 16.1%

The reason for these better results may be that the two tasks are more
read-intensive than a make; but we did not investigate this issue
further. With both tasks the relative performance of the schedulers in
terms of aggregated throughput is close to the one with make.

The relative performance of bfq with respect to cfq on the other two
disks are similar to the one on the fast disk. As anticipated, the
performance is instead worse on the medium disk with 2.6.33. On this
system bfq reduces the write throughput by 26% to 50% with respect to
cfq in presence of sequential readers. Here are the consequences:

   make / git merge / git checkout - medium disk 2.6.33 - BFQ vs CFQ
Backgr. workload  | Num. files compiled | Progr. of merge | Progr. of checkout
	   	  | during make test    | during the test | during the test
-----------------------------------------------------------------------------
	10r-seq   |       55 vs 134     |   5% vs 4.9%    | 3.2% vs 3%
	10r-rand  |      286 vs 131     |  13.9% vs 5.8%  | 17.2% vs 12.41%

The improvement of bfq drops with merge and checkout, and the
performance is even worse than cfq for make with sequential readers.
Yet bfq is definitely better than cfq for make with random
readers. These results with 2.6.33 should be further investigated.

As for 2.6.32, bfq achieved about the same results, whereas cfq got
quite poor results in terms of progress of each of the tasks. For
example, merge with 10r-seq progressed, under cfq, 77 times less than
under bfq. About the throughput, cfq got up to 40% higher aggregated
throughput with respect to bfq with 10r-seq, but lost up to 200% of
the throughput with 10r-rand.

Under 2.6.35 the relative performance of bfq with respect to cfq was quite
better than under 2.6.34.


AGGREGATED THROUGHPUT

As said several thousands lines ago, on one hand bfq quickly assigns
high budgets to non-seeky queues. On the other hand, the budget
timeout mechanism lets it naturally switch to a safe time slice scheme
for seeky queues, essentially the same as cfq. Here are the results of
these facts:

aggregated throughput - fast disk - BFQ vs CFQ
 Workload  | Agg. thr [MB/s]
---------------------------------------
10r-seq    |   82.4 vs 66.1
5r5w-seq   |   77.2 vs 62.4
10r-rand   |   0.60 vs 0.60
5r5w-rand  |   0.71  vs 0.83

Similar relative performance on the other two disks. I also
investigated a middle-ground scenario where two out of three
sequentially read files were interleaved on the disk surface (written
by running two dd in parallel). bfq achieved a higher throughput than
cfq also in this case. The analysis of these type of workloads has not
yet been deepened.

As for 2.6.32, the relative performance between bfq and cfq was essentially
the same as in the bash tests under 2.6.32.

Finally, these were the results under 2.6.35 with the fast disk:

aggregated throughput - fast disk - 2.6.35 - BFQ vs CFQ
 Workload  | Agg. thr [MB/s]
---------------------------------------
10r-seq    |   81.0 vs 63.6
5r5w-seq   |   74.8 vs 62.0
10r-rand   |   0.70 vs 1.1
5r5w-rand  |   1.0  vs 0.64


FAIRNESS

As you most certainly know, bfq works in the service, and not in the
time domain. This peculiar feature allows it to distribute the disk
throughput to non-seeky processes as desired (i.e., according to the
assigned weights). In particular, the desired throughput distribution
is guaranteed under any workload and independently of the disk
physical characteristics.

On the contrary, it is the disk time to be fairly distributed with
cfq. Due to ZBR, this results in unfairness in distributing the disk
throughput, in proportion to the distance between the files being
read/written.
We already showed this fact with a varying number of parallel reads,
see our old comparative tests.
Repeating some of those tests on the fast disk, I saw, for example,
that two parallel readers of two 500MB distant files run at 32.1 and
32.2 MB/s with bfq, against 34.1 and 30.7 MB/s with cfq. In practice
bfq equally redistributes the excess throughput given by cfq to the
process that reads the file in the faster zone of the disk. In
addition, under cfq the unlucky reader finishes later than any of the
two readers under bfq. Hence bfq also in this case enjoys a higher
average aggregated throughput.

As I saw from the traces, fairness and throughput are unsurprisingly
also related to when bfq treats a queue as seeky and when it does not.
And even sequential readers generate a seeky workload during part of
their life. Hence there may still be room for improvement, to increase
the throughput of each reader. In addition, further tuning might be
needed to get high throughput and/or fairness with workloads that I
did not yet consider. I did not yet look into these issues
further. Finally, thorough testing with differentiated weights is
still missing too.
 
Last updated: October 3 2012.
Paolo Valente (paolo DOT valente AT unimore DOT it)