Greetings from the famous room N101 at ANU, where Wayne Luk from Imperial College London is talking on "A Heterogeneous Cluster with FPGAs and GPUs". He started by apologising that the talk would not be polished, as the work is very new and they are only just starting to get results. He then gave us a quick tourist's guide to Imperial, which is near Kensington Palace and the Albert Hall. He argues that techniques for embedded systems could be applied to high performance computing. This is counter-intuitive, as embedded computing is usually used for low cost, small scale computing in consumer goods, whereas supercomputers have been made from high cost, high performance custom components.
The concept is that an application written in a conventional programming language would be compiled partly into code for a conventional processor and partly into configuration information for customisable chips. This could be used for applications from supercomputers to distributed applications using "smart dust".
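As a rough illustration of that idea (my sketch, not the actual Imperial compiler), a partitioner might profile an application and send hot, data-parallel kernels to the customisable hardware while the rest compiles for the conventional processor. All names and thresholds here are hypothetical:

```python
# Toy hardware/software partitioning sketch (illustrative only):
# kernels that dominate runtime and have regular, data-parallel
# structure are marked for the accelerator (FPGA/GPU configuration);
# everything else stays as conventional CPU code.

def partition(kernels, hot_threshold=0.2):
    """Split kernels into CPU code vs. accelerator configuration.

    kernels: list of (name, runtime_fraction, is_data_parallel) tuples.
    """
    cpu, accel = [], []
    for name, fraction, data_parallel in kernels:
        if fraction >= hot_threshold and data_parallel:
            accel.append(name)   # would become accelerator configuration
        else:
            cpu.append(name)     # would become conventional CPU code
    return cpu, accel

# Hypothetical profile of a small application.
profile = [
    ("parse_input",   0.05, False),
    ("matrix_kernel", 0.70, True),
    ("write_output",  0.25, False),
]
cpu_side, accel_side = partition(profile)
print(cpu_side, accel_side)
```

A real compiler would of course work on the program's dataflow rather than a flat profile, but the split into two compilation targets is the essence of the concept.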
The application would use field-programmable gate arrays (FPGAs). These are now used in consumer equipment, such as LCD TVs. FPGAs are very efficient in terms of cost and processing power per unit of energy used, but programming them is complex. FPGAs have high speed serial interfaces which allow them to be used together; examples are the Stratix III and Stratix IV. Imperial have produced an 8 x 8 "cube" of FPGAs for emulating processors ("MUMAlink" Interconnect Fabric), and for prototyping the entertainment system in a car.
Graphics processing units (GPUs) have multiple processors, a shared bus and memory on a chip. As a result they are less customisable and less power efficient than FPGAs, but they are easier to program. Ideally FPGAs and GPUs would be combined with conventional processors in the one system for maximum flexibility. This approach differs from the one investigated in "Comparison of GPU and FPGA hardware for HWIL scene generation and image processing" (by Eales and Swierkowski, DSTO Weapons Systems Division, 2009).
Imperial has a 16 node cluster, "Axel", with an AMD CPU, C1060 GPU and Vpf5 FPGA in each node. This is a "non-uniform node" architecture: each node contains a CPU, a GPU and an FPGA, connected on a common backbone of Gigabit Ethernet plus Infiniband (the Infiniband interface is on the FPGA). Initially a Single Program Multiple Data (SPMD) design was used for simplicity.
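SPMD means every node runs the same program, with each node's rank selecting which slice of the data it works on. A minimal sketch of the pattern (my illustration, simulating ranks in one Python process rather than launching real MPI processes as a cluster would):

```python
# SPMD sketch: every "node" runs the same code; only the rank differs.
# On a real cluster each rank would be a separate process under MPI;
# here the ranks are simulated in a single process for clarity.

def spmd_node(rank, nprocs, data):
    # Block-partition the data by rank, as an SPMD program would.
    chunk = len(data) // nprocs
    start = rank * chunk
    end = len(data) if rank == nprocs - 1 else start + chunk
    return sum(x * x for x in data[start:end])  # the node's local work

def run_cluster(nprocs, data):
    # "Reduce" step: combine the partial result from every node.
    return sum(spmd_node(r, nprocs, data) for r in range(nprocs))

data = list(range(16))
print(run_cluster(4, data))  # same answer as a single-node sum of squares
```

In Axel each rank would additionally dispatch its local work to the node's GPU or FPGA, but the program structure stays the same on every node.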
Linux runs on each node, using NFS. There is a custom resource manager and public domain cluster management software (OpenMP and Open MPI). There is a communications bottleneck, with data having to pass through the CPU on its way from the FPGA to the GPU. Direct communication would be desirable but difficult.
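The cost of that staging can be sketched with a simple model (my own, with made-up numbers): because FPGA-to-GPU data must be copied into CPU memory and out again, the payload crosses two links instead of one, roughly doubling the transfer time.

```python
# Toy model of the FPGA -> GPU communication bottleneck (illustrative
# only; bandwidths are assumptions, not Axel measurements).

def staged_transfer_time(nbytes, link_bw):
    # FPGA -> CPU memory, then CPU memory -> GPU: two copies.
    return 2 * nbytes / link_bw

def direct_transfer_time(nbytes, link_bw):
    # Hypothetical direct FPGA -> GPU path: one copy.
    return nbytes / link_bw

bw = 4e9        # assume a 4 GB/s link for illustration
payload = 1e9   # 1 GB of data
print(staged_transfer_time(payload, bw), direct_transfer_time(payload, bw))
```

The model ignores latency and copy overheads on the CPU itself, which would make the staged path look even worse in practice.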
The question then is what common patterns of parallelism the system could support. The "Berkeley Dwarfs" offer one such set of common patterns.
The new Intel Atom chip (codenamed "Pineview"), due in early 2010, is rumoured to have an integrated graphics core, which could be useful for low cost systems.
Iridium is planning a new generation of communication satellites with provision for an earth observation payload. It might be interesting to see how much processing could usefully be put on-board. The processing might be reprogrammable, switching between communications and observation processing as required, depending on where the satellite is in its orbit. The Iridium satellites can only carry out their primary function of communications during a small part of their orbit; the rest of the time the satellite could carry out observations and process data.