GNU/Linux optimisation

Introduction

There has been lot of discussion about compiling GNU/Linux from scratch, and performance boost related to it. I have never been a big believer of this – except for applications such as multimedia encoding. Many GNU/Linux distributions come with executables, that can be run on virtually any i386-compatible processor out there, but which are optimised for certain CPU. Some distributions also come with executables that ignore compatibility with old processors, and assume Pentium 3 as oldest supported architecture for instance. How much does it make difference, really?

Attention! Sophisticated scientific methods and highly precise measurements ahead. You have been warned!

Test setup and environment

My test environment isn’t the best possible; I’m running AMD Athlon 64 X2 (cpuinfo) running at 2400 MHz (i.e. 4600+) with 2 GiB of dual-channel memory. My Debian GNU/Linux (unstable) installation is running in pure 64-bit. 32-bit executables are run in 32-bit chroot, also from Debian unstable. CPU clock frequency governor is in use, but minimum frequency is set manually to 2400 MHz. Thanks to two cores, combined tests are rather irrelevant, but included nevertheless.

Tests and measurements

I decoded and encoded approximately 48 minutes of CD-quality stereo audio with the latest oggdec and oggenc. Unfortunately I have read the data to decode/encode from disk instead of RAM-disk, but hopefully using /usr/bin/time have negated the effects. For reference, reading the uncompressed wave file, reported by /usr/bin/time, took “0.01user 0.21system” in worst case run. Sample I have used is full dump of Christopher Franke’s album Klemania.

See the scripts.

I have recompiled not only vorbis-tools, but also libc6, libogg0, libvorbis0a, and libvorbisenc2. The sources and default compiler options are taken from Debian unstable repository, ftp.fi.debian.org, at around 21 August 2007, 22:00:00 UTC+3. For optimised versions, switches -O3 -fomit-frame-pointer -march=athlon-xp and -O3 -fomit-frame-pointer -march=athlon64 are added for 32-bit and 64-bit executables, respectively. Self-compiled libraries are then loaded with LD_PRELOAD, except for ld-linux.so/ld-linux-x86-64.so and linux-gate.so.1. For reasons unknown to me, preloading ld-linux would cause segfault upon running any executable.

Finally, the tests were run with run.sh overnight, manually interrupted in the morning after completing 24 full rounds.

Results

Raw data.

Processing time

Performance for unoptimised/optimised files
Task Executable Average [s] Median [s] sdev [s^2] Minimum [s] Maximum [s]
Encoding 32-bit 163.61 162.83 0.7539 162.43 165.47
32-bit opt 137.52 136.25 2.4017 135.93 145.84
64-bit 109.80 109.47 1.1295 108.64 113.49
64-bit opt 107.47 107.17 0.5934 106.56 109.90
Decoding 32-bit 17.81 17.84 0.0761 17.69 18.02
32-bit opt 16.12 16.04 0.1311 16.01 16.62
64-bit 15.50 15.42 0.2577 15.33 16.53
64-bit opt 15.15 15.06 0.1989 15.00 16.08
Combined 32-bit 182.74 181.53 2.1005 180.05 190.35
32-bit opt 154.27 154.58 2.0737 151.69 159.60
64-bit 126.39 127.39 1.6426 124.30 130.94
64-bit opt 123.38 122.69 1.3600 121.85 128.79

Decoding, encoding speed
Encoding speed Decoding speed Combined speed

Impact

Performance difference
Task Arch Unoptimised [s] Optimized [s] Difference [s] Difference
Encoding 32-bit 163.61 137.52 26.10 -0.159
64-bit 109.80 107.47 2.33 -0.0213
Decoding 32-bit 17.81 16.12 1.69 -0.0950
64-bit 15.50 15.16 0.35 -0.0223
Combined 32-bit 182.74 154.27 28.46 -0.156
64-bit 126.39 123.38 3.02 -0.0239

Other results

Before running these “final tests” I did some initial testing by running the same scripts on shorter samples with only vorbis-tools optimised. With my infinite wisdom I already deleted the raw logfiles, so I cannot present any fancy figures or render pretty graphs. Fortunately what I did not delete, are the old graphs. They’re ugly, but carry most of the information the new ones do.

Unexpectedly, encoding actually takes longer! However it is clear that since most multimedia applications use libraries to do the work, just optimising the executable won’t help. It can actually make things worse.

Difference between 64-bit and 32-bit executables

While I wasn’t exactly looking to compare 64-bit and 32-bit executables, it seems that encoding Vorbis benefits from those extra bits and CPU registers. Encoding gains 30% performance boost, decoding 10%. This is unlikely to have huge impact on general usability of your computer, but again, when working with heavy multimedia contents, switching from 32-bits to 64-bits can reduce the risk of dying for old age while waiting for audio compressor to finish its task.

Analysis

Figures for 32-bit files clearly show that there is difference between 32-bit stock Debian files and gcc-optimised files. If you start optimising, you’ll have to go all the way – just optimising the executables won’t do the trick. However there is no significant difference between unoptimised and optimised 64-bit files. Also, it is unlikely switches other than -march (or -mcpu and friends) make much difference. Even Gentoo Compilation Optimization Guide acknowledges that gcc manual says so.

Since Debian thrives on compatibility, all 32-bit i386-compatible binaries are indeed compiled for i386, and not for i686. If I remember correctly, Mandrake, for instance, has been optimised for i686 or i586. Multimedia applications like audio encoding can take advantage of instructions unavailable on earlier CPUs, and as such optimising for i686 can indeed make difference. This is probably why there is so clear difference between optimised 32-bit files but next to no difference between 64-bit files. x86_64 files can take full advantage of x86_64’s instruction set as there are no legacy processors that need to be worried about. Unfortunately, in future this can change.

In case of Vorbis decoding/encoding, performance boost is 10–16% and 2% for 32-bit and 64-bit, respectively. I think it is fairly safe to say that compiling for x86_64 to get performance boost just isn’t worth it. On the other hand, if you’re working with heavy multimedia contents, and need to decode and encode large chunks of data often, optimisation does matter for 32-bit environments.

While optimised 32-bit executables perform better on multimedia content, how much difference does optimisation have on general tasks? Will you notice difference in web browsing, starting applications, or task switching? Is this difference significant, noticeable, or even relevant?

First of all, I think we can dismiss the impact optimisation has on listening to music. Decoding 48 minutes of Vorbis encoded music takes just 16 seconds on my computer. That is well below 1% of CPU load for real-time decoding, and hardly significant. Second, usually the user is the slowest part of computerised task. Computer usually responds to user faster than user responds to computer. Personally I’d consider anything less that 25% to be insignificant, but that is, of course, an open subject.

Conclusions

Optimisation on x86_64 has virtually no effect on performance. Processing multimedia content may speed up a little, but the difference is only about 2%. Simply spending a little more money on the hardware would improve the performance much more.

For 32-bit environments the answer is not as straightforward. There is clear difference between unoptimised files for i386 and optimised files for athlon-xp. Encoding Vorbis files gains 10–16% in encoding speed when optimised for the CPU. However in other tasks such as web browsing, text editing, and listening to music, multimedia enternsions for the CPUs instruction set are not that useful. Besides of less room for optimisation, the user also reacts slowly.

Some people claim that compiling everything by hand makes their computers run faster overall, from logging in time to web browsing and even running Bash. My claim is that none of these benefit from CPU-specific optimisations so much that the difference would be noticeable. I claim that what matters the most is the hardware – or number of bytes that need to be loaded from disk. Pre-built GNU/Linux distributions often come with everything user can imagine. Applications have been compiled against libraries the user will never need. Libraries may come with debug symbols. People who compile everything themself, often disable every feature they think they are not going to miss, thus reducing library dependencies.

In short, optimising for x86_64 is waste of time for time being. Optimizing for i686 instead of i386 may speed up processing multimedia content, but ordinary user never notices any difference.