Distributed Compile on AmigaOS

In my GNU Make article I covered a few topics relating to the use of the GNU make tools included with your AmigaOS 4.x SDK. Along with advice on how to eliminate those nasty recursive make build systems and how to generate file dependencies automatically, I also promised to explore the world of distributed compilation. Any software developer with more than one computer can potentially make use of distributed compilation. So get ready to fire up all your Macs, Linux boxes and Amigas and read on.

If you Google around you'll find there are quite a few different distributed make systems out there both commercial and non-commercial. One of the most popular is named distcc and you'll find it preinstalled in your SDK. "A fast, free distributed C/C++ compiler" is what you'll find scrawled on the top of the distcc home page which means it works only with C and C++. Another restriction is that distcc works only with the GCC tools. It is possible to modify distcc to work with other programming languages and compilers and patches have been created but for the purposes of this article, we'll stick to what distcc supports out of the box.

Distributed compilation with distcc works best on a high-speed LAN with as many boxes as possible, up to a point. One machine serves as the host, or client, machine and runs the distcc client. The host machine runs make, performs pre-compilation and links the executables. You also need one or more server machines which run the distccd daemon. These machines can run any operating system you want as long as an AmigaOS 4.x cross-compiler and the distccd daemon are available for it, which is the case for all modern operating systems. Joachim Birging (AKA zerohero) has an excellent collection of cross-compilers available for Linux, Windows and (sometimes) Mac OS X. The nice thing about using distcc is that you don't need the SDK installed on the server boxes at all. Just the compiler and the distccd daemon are required. The following diagram may help to explain what is needed:
[Diagram: a distcc client machine and its distccd server machines on the LAN]

As was hinted at before, distcc works best up to a point. That point is where the client machine gets saturated. The speed of the LAN is usually not the bottleneck, especially in our world of 100 Mbit/s and faster Ethernet. The distcc host performs the pre-compilation step, which involves running the GCC pre-processor on each source file. The great advantage of this method is consistency, because you only have one set of header files to worry about on one box. The great disadvantage is that all pre-processing runs on a single machine, and if that machine is not powerful enough it quickly becomes the bottleneck. Linking of all executables is also done on this same host machine, so distributed linking is not a possibility with the distcc tools. In practice, linking should only be of great concern if you have a poorly designed system with many cyclic dependencies, in which case you have much bigger problems.

The good people at Google noticed the pre-processing bottleneck and decided to try to work around that limit. The result is the so-called pump mode, a collection of Python scripts which pump out the headers to each server node and keep them in sync with the client automatically. Pump mode is currently not available in the AmigaOS distcc port but it is planned to be added. Feel free to contact me if you would like to see pump mode supported.

So you want to see some numbers? I knew you would. Here is a chart which displays the results of some testing I performed using distcc in various scenarios back in the AmigaOS 4.0 era:
                     1 Host   2 Hosts   3 Hosts   4 Hosts
clib2 (seconds)         376       360       353       350
% Difference                       -4        -6        -7
WW2 GUI (seconds)      1120       934       853       824
% Difference                      -20       -31       -36

The clib2 project contains hundreds of smaller C files while the WW2 project contains hundreds of large C++ files. In all the cases above I used the same distcc client machine which was running the latest AmigaOS 4.0 install. All times are in seconds and below them is the percentage difference compared to a single host. For example, with 3 hosts the time to compile some clib2 libraries decreased by 6%.
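
As a rough illustration of how such percentages can be derived, the following Python sketch recomputes them from the timings in the table. The text does not say which convention the "% Difference" rows use, so the sketch shows two common ones: time saved relative to the single-machine build, and speed gained relative to the distributed build. The function names are made up for this example.

```python
# Build times in seconds, copied from the table above (1 to 4 machines).
TIMES = {
    "clib2": [376, 360, 353, 350],
    "WW2 GUI": [1120, 934, 853, 824],
}


def time_saved_pct(baseline, t):
    """Percentage of the single-machine build time that was saved."""
    return (baseline - t) / baseline * 100.0


def speed_gain_pct(baseline, t):
    """Percentage increase in build speed, measured against the new time."""
    return (baseline - t) / t * 100.0


def summarize(times):
    """Return (machines, time saved %, speed gain %) for 2 or more machines."""
    baseline = times[0]
    return [
        (n, round(time_saved_pct(baseline, t), 1), round(speed_gain_pct(baseline, t), 1))
        for n, t in enumerate(times[1:], start=2)
    ]


if __name__ == "__main__":
    for project, times in TIMES.items():
        for machines, saved, gain in summarize(times):
            print(f"{project}: {machines} machines, {saved}% less time, {gain}% faster")
```

The two conventions drift further apart as the savings grow, so it is worth stating which one you mean when you publish your own measurements.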

Larger and more complex source files take longer to compile, so you'll find distcc makes the biggest difference with C++ projects and larger C files. Compiling with debugging information and optimization enabled also takes longer per file, so those builds benefit more from distcc as well. In all cases you will see some benefit when using distcc with parallel make.

The more adventurous among you might want to use a multi-CPU multi-core monster Linux box as the client with the SDK installed. Many configurations are possible and your AmigaOS 4.x machines can participate as either a client or a server along with the big boys. If you have a less powerful AmigaOS 4.x machine like a Sam440ep you could still use it to compile large projects by distributing the compiles to much larger machines.

I'll now run through the steps I followed to obtain my test results in the table above.

The easiest way to set up distcc is to first create a home directory and assign HOME: to that directory. Next, create a subdirectory called .distcc, which is the default location distcc uses to store all of its files. Finally, you need to tell distcc which server machines you have; this list is stored in the HOME:.distcc/hosts file. There are more ways to configure distcc, so it is worth going through the documentation to learn all of your options.
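
The layout described above can be sketched in a few lines of Python. This is only an illustration of the directory structure, not part of distcc: the helper name write_hosts_file is invented for this example, and a temporary directory stands in for the real home directory, so adapt the path to wherever HOME: points on your machine.

```python
import tempfile
from pathlib import Path


def write_hosts_file(home, hosts):
    """Create <home>/.distcc/hosts with the given host entries on one line."""
    distcc_dir = Path(home) / ".distcc"
    distcc_dir.mkdir(parents=True, exist_ok=True)
    hosts_path = distcc_dir / "hosts"
    hosts_path.write_text(" ".join(hosts) + "\n")
    return hosts_path


if __name__ == "__main__":
    # A temporary directory stands in for the real home directory here.
    with tempfile.TemporaryDirectory() as home:
        path = write_hosts_file(home, ["localhost"])
        print(path.read_text().strip())
```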

Here are the contents of the hosts file I used for some of the experiments:

The distcc client picks machines in left-to-right order, so the machines on the left have a higher priority than the ones on the right. You may use names or IP addresses depending on how you have things set up in your LAN. The ",lzo" option tells the distcc client to compress the files sent to and from the distccd servers, which can make a large difference at the cost of some CPU time. Experiment with the hosts file and the lzo option to find the balance that is right for you.
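
To make the ordering and the option syntax concrete, here is a small Python sketch that reads a hosts line the way it is described above. It is a simplified illustration, not distcc's own parser: it only understands names with comma-separated options and ignores the other fields the real format allows. The machine names linuxbox and macmini are hypothetical placeholders.

```python
def parse_hosts(line):
    """Split a hosts line into (name, options) pairs, highest priority first."""
    entries = []
    for spec in line.split():
        # Everything after the first comma is treated as an option such as "lzo".
        name, *options = spec.split(",")
        entries.append((name, options))
    return entries


if __name__ == "__main__":
    # Hypothetical machine names; substitute names or IP addresses from your LAN.
    example = "localhost linuxbox,lzo macmini,lzo"
    for rank, (name, options) in enumerate(parse_hosts(example), start=1):
        mode = "compressed" if "lzo" in options else "uncompressed"
        print(rank, name, mode)
```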

On each of your servers you need to install an AmigaOS 4.x compiler or cross-compiler and distccd and nothing more. No need for Amiga header files because the distcc client is sending each server the pre-processed output from the compiler. The distccd server can be run in three different modes but I recommend standalone mode which is the simplest to control and debug. Here is an example line which starts the distccd server:
distccd --daemon --allow --log-stderr

The --daemon option tells the server to run in standalone mode, which can run as any user with access to the compiler if you are on a multi-user capable operating system. The --allow option is a primitive security measure: it takes an IP address or address range as its argument, and only machines matching it may connect. You should set this as required for your particular LAN setup. The --log-stderr option sends all logging to stderr, which can be helpful when diagnosing problems. Many more options and complete details are in the distccd documentation.
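
Since the --allow value has to match your own network, it can be worth checking it before starting the daemon. The Python sketch below assembles the same distccd command line and validates the address range with the standard ipaddress module. The helper name distccd_command and the range 192.168.0.0/24 are assumptions chosen for illustration only.

```python
import ipaddress


def distccd_command(allowed_network):
    """Return a distccd argument list that only admits clients from allowed_network."""
    # ip_network raises ValueError for a malformed range or one with host bits set.
    network = ipaddress.ip_network(allowed_network, strict=True)
    return ["distccd", "--daemon", "--allow", str(network), "--log-stderr"]


if __name__ == "__main__":
    # 192.168.0.0/24 is a placeholder range, not a value from this article.
    print(" ".join(distccd_command("192.168.0.0/24")))
```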
Broken distccd on Mac OS X?
Mac OS X developers should be aware that the version of distccd included with Apple's SDK is hard-coded to work only with Apple's GCC compiler and nothing else. This behaviour is easily corrected by installing a more generic distccd. I used the distccd from Darwin Ports.

It is time to do some distributed compiling. If you followed the advice in my GNU make article, you now have a nice non-recursive makefile with a single place where the compiler is invoked. Instead of invoking the compiler directly, you put the distcc command in front of the gcc compiler call. For example,
CC := gcc
changes to
CC := distcc gcc

Fire up GNU make with the parallel make option and away you go like so,
make --jobs=3

This will start up to three distcc clients simultaneously and distribute the compile based on the hosts file you set up earlier. Voila!

The very next thing you'll want to do is monitor the situation. Thankfully, there is the text based distccmon-text tool and the AmigaOS GUI based distccmon-amiga. Here is a shot of the GUI based monitor tool in action:
[Screenshot: the distccmon-amiga monitor window]

Most of the columns should be fairly self-explanatory. The Slot column tells you how many jobs are running on a single host. Some hosts may be multi-core or multi-CPU enabled, and the distccd daemon is multi-threaded. The Tasks column gives you a running graph of what is going on at each host. The colours represent the various states: green is compiling time, purple is pre-compiling time, and yellow and red are network transfers. Just watch the monitor for a while and you'll figure out what each colour represents soon enough. Gaps are inserted between the various jobs so you can get a rough idea of how your cluster is doing at a glance.

That should be enough information to get you going with distributed compilation. Remember that you don't even have to distribute anything to use the distcc and distcc monitor tools, so you might find it useful to use them on a single machine just to monitor what is going on with the build. Set up a hosts file with just localhost in it and distcc will do as it is told.

The distcc tools are both simple and powerful at the same time which is probably why they enjoy such great success on all modern platforms. And now AmigaOS can join in on the fun as either a distcc server or client. Happy compiling!