Putting the user first
For the relatively small fee of £9 an hour, researchers at Cambridge University can now access teraflop processing powers to help projects ranging from astronomical simulations to the interrogation of bioinformatics databases that could provide leads for new, life-saving drugs.
But Cambridge University has always been at the forefront of computing, from Charles Babbage’s mechanical ‘difference engine’, first exhibited in 1822, through Alan Turing’s theoretical framework of the ‘universal computer’, to the Darwin supercomputer that is currently helping high-profile researchers like Stephen Hawking to perform large-scale processing jobs that would have been impossible 10 years ago.
When the previous high-performance computing centre at Cambridge was ready for renewal, Paul Calleja, the new director of the service, decided to opt for a new, open strategy for implementing Darwin that involved a lot of input from potential users. Previously, the centre had taken a rather insular approach, with its operations managed by scientists for scientists, with little consideration for the arts and social sciences.
The university decided to opt for a commodity cluster, where each component was chosen independently, rather than a readymade, proprietary supercomputer. According to Calleja, this new, do-it-yourself strategy was considerably cheaper than the alternative, because each component came from a larger market, with a greater amount of competition that drove down the price.
‘It improved the price-to-performance ratio,’ says Calleja. ‘The previous machine operated at speeds of one teraflop/s and cost £6m. The new machine can operate at 18 teraflop/s and cost just £2m.’ He believes that this approach is becoming increasingly popular as more and more centres realise the cost benefits of building the systems themselves.
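As a rough check on the figures Calleja quotes, the short Python sketch below works out the purchase cost per teraflop/s for each machine. The function name is purely illustrative, and the only inputs are the numbers given in the quote; running costs are not included.

```python
def cost_per_teraflop(cost_gbp, teraflops):
    """Return purchase cost in pounds per teraflop/s of quoted speed."""
    return cost_gbp / teraflops


old = cost_per_teraflop(6_000_000, 1)    # previous machine: £6m, 1 teraflop/s
new = cost_per_teraflop(2_000_000, 18)   # Darwin: £2m, 18 teraflop/s
improvement = old / new                  # how many times cheaper per teraflop/s

print(f"Previous machine: £{old:,.0f} per teraflop/s")
print(f"Darwin:           £{new:,.0f} per teraflop/s")
print(f"Improvement:      about {improvement:.0f}x")
```

On these figures, each teraflop/s on the new cluster cost roughly one fifty-fourth of what it did on the previous machine.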
‘10 years ago, commodity clusters were practically non-existent. Now, 70 per cent of the Top 500 [list of the world’s fastest computers] are clusters. It’s a widely held trend.’ Calleja concedes that the disadvantage of this method is that it does require a higher level of skill to design, implement and manage, and it’s not always easy to obtain the optimum performance from such systems.
The university chose to use 2,340 processing cores from Dell, each of which takes a small share of the computations from jobs run on the cluster. InfiniBand interconnects from QLogic were chosen to provide a high-speed connection between the different cores. In addition to providing a high bandwidth of 900MB/s (compared to 120MB/s with Gigabit Ethernet), InfiniBand also reduces the delay that occurs before data packets are transferred.
This delay, known as latency, occurs every time data is transferred between cores, and it can actually have a more important effect on processing speeds than the bandwidth.
InfiniBand’s latency is just 2µs, compared to 80µs with Gigabit Ethernet. Combined with the high bandwidth, this prevents traffic jams building up in the cluster that could negate the benefits of using a large number of processors. These features helped the machine to make number 20 on the Top500 list when it debuted in November 2006, although it has since dropped to number 60 on the latest listing.
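To see why latency can matter more than bandwidth, consider a deliberately simple model in which the time to send a message is a fixed latency plus the message size divided by the bandwidth. The Python sketch below applies that model to the figures quoted above, taking the latencies to be in microseconds and the bandwidths in megabytes per second (an assumption about the units). The function name and the two message sizes are illustrative only, and real interconnects behave in more complicated ways.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_s):
    """Simple model: fixed latency plus message size divided by bandwidth."""
    # 1 MB/s is 1 byte per microsecond, so bytes / (MB/s) gives microseconds.
    return latency_us + message_bytes / bandwidth_mb_per_s


for size in (1_000, 1_000_000):  # a small and a large message, in bytes
    ib = transfer_time_us(size, latency_us=2, bandwidth_mb_per_s=900)
    ge = transfer_time_us(size, latency_us=80, bandwidth_mb_per_s=120)
    print(f"{size:>9} bytes: InfiniBand {ib:8.1f} us, Gigabit Ethernet {ge:8.1f} us")
```

In this model, the fixed latency makes up most of the Gigabit Ethernet time for the small message, which is why it can dominate when cores exchange many small messages.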
This computational power comes at a cost, and as an academic institution the university does not have an infinite supply of funding to upgrade the cluster every three years. Instead, the centre pays for itself, with a neutral-cost payment plan, through which it hopes to recover both the depreciation in the value of the equipment as time passes and the running costs of the system.
For this reason, the service operates on a two-tier basis, with a higher quality service for paying users, who buy credits for a guaranteed amount of time and resources on the cluster at a cost of 7p per core per hour.
Paying users include members of the university funded by grants, or researchers from commercial companies. Once their credits have run out, they must then rely on the reduced service for non-paying users.
At any one time the cluster has many different users who need to share the resources, with each user taking up a certain number of cores for part of the day. To schedule the different jobs so that each user receives the resources they have paid for, the centre relies on the Moab Cluster Suite from Cluster Resources.
The Gold Allocation Manager acts as a bank, keeping track of the credits, and communicates this information to the Moab Workload Manager to plan when the user can make use of these resources, and which cores within the cluster they will be working on.
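The ‘bank’ idea can be pictured with a toy ledger written in Python. This is a sketch only: the class, its methods and the user name are hypothetical and are not the Gold Allocation Manager or Moab interfaces. It simply shows a scheduler checking a user’s balance of core-hours before placing a job in the paid tier.

```python
class CreditBank:
    """Toy ledger of core-hour credits (hypothetical, not the Gold API)."""

    def __init__(self):
        self.balances = {}  # user name -> remaining credits in core-hours

    def deposit(self, user, core_hours):
        """Add purchased credits to a user's account."""
        self.balances[user] = self.balances.get(user, 0) + core_hours

    def can_run(self, user, cores, hours):
        """Check whether the user can afford a job of cores x hours."""
        return self.balances.get(user, 0) >= cores * hours

    def charge(self, user, cores, hours):
        """Deduct credits for a job; return False if the balance is too low."""
        if not self.can_run(user, cores, hours):
            return False
        self.balances[user] -= cores * hours
        return True


bank = CreditBank()
bank.deposit("alice", 1000)             # hypothetical user with 1,000 core-hours
paid = bank.charge("alice", 128, 6)     # needs 768 core-hours: accepted
unpaid = bank.charge("alice", 128, 6)   # only 232 left: refused
print(paid, unpaid, bank.balances["alice"])
```

In this sketch, a refused charge would correspond to the job falling back to the reduced service for non-paying users.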
One of the centre’s new endeavours is to encourage researchers outside of the physical sciences to take advantage of HPC but, historically, these users want to use different operating systems. Previously, it had only been possible to run Linux on Cambridge University’s supercomputer – a popular choice of operating system for computational scientists, but not for geographers, biologists and anthropologists, who may be more familiar with the Windows interface found on their desktop PCs.
Choosing either of these operating systems alone for the cluster would exclude different groups of researchers, and splitting the cluster into two portions, each dedicated to a different operating system, would limit its potential power should a job require a greater number of nodes than each portion could offer.
Instead, when scheduling jobs, Moab can change the operating system on each core, depending on a user’s preferences. ‘Users are not tied down – they can change the operating system to suit the application,’ says Chris Vaughan, a systems engineer with Cluster Resources, which helped with the installation of Darwin.
It’s an approach that looks set to become increasingly popular as centres try to encourage researchers from a broader range of disciplines to make use of their facilities. ‘It’s a very novel idea, and something that will proliferate in essential services,’ says Calleja.

In addition to the choice of operating system, there are other requirements that need to be taken into account when scheduling jobs. Some jobs require parallel processing to solve a single problem very quickly. In these situations, which could include large-scale astronomical simulations or climate models, the job is split up into smaller portions, which are spread out over the different cores. The results of these smaller problems often depend on one another, so a great deal of communication is needed between the cores, which can make the programming considerably more complicated.
On the other hand, some problems, such as protein sequencing or Monte Carlo simulations, require the same program to be performed many times, so the application can run on each core simultaneously with no interaction. This kind of problem is known as high-throughput computing.

University College London’s HPC centre has found a somewhat counter-intuitive solution to scheduling these two different kinds of jobs. Each server on the cluster contains four cores, and it may seem natural to assign just one HPC problem to each server. However, after performing tests to find the most effective solution, Jeremy Yates and his team found that there is a significant increase in the speed of processing if two cores on the server perform parallel processing, and two cores perform the serial processing from a different job. ‘You can spread the load in clever ways that allows an increase in speed for everyone,’ says Yates.
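The layout Yates describes can be pictured with a small Python sketch. It is an illustration only, not UCL’s scheduler: the function and task names are hypothetical, and the code shows just the placement of work on four-core servers, not the speed-up the team measured.

```python
def plan_servers(n_servers, serial_tasks, cores_per_server=4, parallel_share=2):
    """Reserve some cores on each server for one parallel job and give
    the remaining cores to independent serial tasks."""
    serial_queue = list(serial_tasks)
    plan = []
    for _ in range(n_servers):
        cores = ["parallel"] * parallel_share
        for _ in range(cores_per_server - parallel_share):
            # Fill spare cores with serial work, or leave them idle.
            cores.append(serial_queue.pop(0) if serial_queue else "idle")
        plan.append(cores)
    return plan, serial_queue


plan, leftover = plan_servers(2, ["s1", "s2", "s3", "s4", "s5"])
for server, cores in enumerate(plan):
    print(server, cores)
print("still queued:", leftover)
```

Here each server runs two cores of the parallel job alongside two serial tasks, and any serial work that does not fit simply waits in the queue.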
UCL’s new cluster, Legion, is currently being tested on 10 projects, and if everything goes to plan it should be a fully functioning service by mid-December. ‘The UCL procurement has been very user driven. The new centre is owned by the UCL community more than previous services were.’
If there is a common trend for these HPC services, it seems that an increasing amount of attention is now being paid to the user’s specific needs. As the demand for supercomputing increases in a variety of disciplines, it is no longer enough to simply increase the power of hardware with each upgrade.
Sometimes researchers need to use HPC services further afield than their own institution. This may be because the processing powers of the institution’s own computing facilities simply don’t match up to the requirements of the problem in hand, or it could be due to complications with funding, as was the case for Michiel Sprik’s team at Cambridge University.
Procuring academic funding is a lengthy and often frustrating process, and researchers often need to explore every avenue possible to obtain the necessary resources. Despite the Darwin HPC cluster sitting practically on his doorstep, Sprik found it easier to gain computing time through the DEISA supercomputing infrastructure, which is spread across Europe, than to find funding to pay for privileged resources on Darwin. Luckily for Sprik, the Géant2 network that connects these computers means that he can access the resources from his department in Cambridge.
The team will begin using the service in January 2008 to model the behaviour of a special kind of protein that performs a key step in respiration and photosynthesis. The proteins transport electrons during these processes, and to understand this mechanism Sprik needs to study both the small-scale electronic structure of the individual molecules, and the way the molecules move en masse when they are dissolved. Simulating the behaviour at both levels simultaneously is very intensive: ‘We need to recreate the electronic structure of the molecules at each stage in the solution,’ he says. ‘The whole thing adds up to a very big system.’ It’s a daunting task, but one that is ideally suited to high-performance computing.