When models outgrow hardware, turn to HPC
As engineers discover the power of computer simulation, they want to work on ever-larger models, some today having hundreds of millions of degrees of freedom (DOFs). This desire goes hand-in-hand with the evolution of HPC systems, whether multicore processors, multiprocessor servers or clusters. In the past we could count on higher CPU clock rates to speed up analyses, but power consumption and heat became a problem. The solution was to put multiple cores on one piece of silicon and step back clock speeds. However, until independent software vendors (ISVs) take into account the multicore nature of today's systems, the performance of some applications could even decrease due to competition for memory bandwidth and I/O.
Cheaper hardware, more software expenses
In the old days, hardware was the dominant cost of running CAE codes, but that economic model has shifted completely, says Greg Clifford, HPC manager for IBM's automotive segment. And while hardware remains a cost item not to ignore, software is starting to dominate. ISVs generally require a licence for each core, so for customers with large clusters the cost of applications exceeds that of hardware. Another issue is the cost of power and cooling, which Clifford says is a significant, if not the dominant, issue for all of his customers.
Nonetheless, hardware has been the driving force behind HPC. Quad-core chips are readily available, and Intel's Nehalem (see Figure 2 on page 35), scheduled for production in Q4 2008, scales from two to eight cores, each with simultaneous multithreading, giving a capability of four to 16 threads. There has also been a shift from one server running dozens of CPUs to cluster computing with many servers each containing fewer CPUs. Further, while Unix dominated desktops and servers for HPC through the 90s, in the past five or more years there has been a major shift towards Linux running on x86-class processors. Today, comments Dr Tony Kent, European marketing director at MSC Software, Linux is that company's biggest HPC platform.
When looking at HPC, it's also important to understand the two major memory architectures. Many older multicore and multiprocessor systems use SMP (Shared Memory Parallel), where each CPU shares memory with the others, but the trend is towards DMP (Distributed Memory Parallel), where each CPU or node has its own non-shared memory space. In terms of CAE, note that some solvers work with only certain memory architectures, or support others only in limited ways.
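The SMP/DMP distinction can be sketched in a few lines of Python. This is purely illustrative (a toy sum, not CAE code): threads share one address space and can write results directly into common memory, while separate processes own their memory and must pass results back as explicit messages.

```python
# Toy illustration of SMP vs DMP: summing a list two chunks at a time.
# The worker names and the two-way split are illustrative only.
from multiprocessing import Process, Queue
from threading import Thread

data = list(range(8))

# SMP style: threads share one address space, so partial sums
# can be written straight into a common list.
shared = [0, 0]
def smp_worker(idx, chunk):
    shared[idx] = sum(chunk)          # direct write to shared memory

threads = [Thread(target=smp_worker, args=(i, data[i*4:(i+1)*4]))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
smp_total = sum(shared)

# DMP style: each worker owns its memory and must send its result
# back explicitly (a Queue stands in here for the interconnect).
def dmp_worker(chunk, q):
    q.put(sum(chunk))                 # explicit message, no shared state

q = Queue()
procs = [Process(target=dmp_worker, args=(data[i*4:(i+1)*4], q))
         for i in range(2)]
for p in procs: p.start()
dmp_total = sum(q.get() for _ in range(2))
for p in procs: p.join()

print(smp_total, dmp_total)           # both reduce to the same answer: 28
```

Both styles compute the same result; the difference a solver cares about is whether intermediate data is visible for free (SMP) or must be shipped over an interconnect (DMP).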
Interconnect becomes key
The way cluster nodes are connected has a great influence on overall application performance. To achieve linear application scalability in multicore environments, it is essential to have interconnects that provide low latency for each core, no matter how many there are. Because every core sends data to and receives data from other nodes, the interconnect's bandwidth must be able to handle all of these data streams. Furthermore, interconnects should handle all the communications themselves, offloading network-related tasks from the CPU cores so those processor cycles remain available to applications.
Roughly three years ago, Gigabit Ethernet (GigE) was adequate for scaling CAE applications to 8- or 16-way parallelism, when CPUs had one core and each server had one or two CPUs. But now users want to scale to more, and faster, cores. In the meantime, InfiniBand has become one of the most popular interconnect options for some good reasons, explains Gilad Shainer, director of technical marketing at Mellanox, one of the top suppliers of InfiniBand hardware to server OEMs. This interconnect provides high bandwidth (20 Gb/s today, with 40 Gb/s products due in the coming months), low latency (1μs vs 50 to 100μs for GigE), and scalability up to thousands of nodes and multiple CPU cores per server platform.
Applications that take advantage of interconnects such as InfiniBand often include the MPI (Message Passing Interface) communications library; it helps code achieve the best latency, an important characteristic of CAE apps. MPI acts as the interface between the application and the cluster networking. Although MPI is transparent to the end user, it solves a huge problem for ISVs, who must deal with an almost infinite combination of possible cores, memory configurations and interconnect schemes, but can do so easily by incorporating this library. Several implementations are available, including Hewlett-Packard's HP-MPI, MPI Connect from Scali, and open-source versions such as Open MPI and MVAPICH.
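The pattern MPI standardises can be sketched without MPI itself: each "rank" below sends to its neighbour and receives from its predecessor in a ring, the way an MPI code calls MPI_Send and MPI_Recv without caring what interconnect lies underneath. This is an illustrative Python stand-in, with pipes playing the role of the network; a real CAE code would use an actual MPI library.

```python
# Sketch of the MPI-style send/recv pattern: four "ranks" pass their id
# around a ring. Pipes stand in for whatever interconnect (GigE,
# InfiniBand) sits underneath -- the application logic is unchanged.
from multiprocessing import Process, Pipe

NRANKS = 4

def rank_main(rank, send_conn, recv_conn, result_conn):
    # Mimic MPI_Send to the next rank and MPI_Recv from the previous one.
    send_conn.send(rank)
    neighbour = recv_conn.recv()
    result_conn.send((rank, neighbour))

# Wire up the ring: rank r writes to pipe r, reads from pipe (r - 1).
pipes = [Pipe() for _ in range(NRANKS)]
results = [Pipe() for _ in range(NRANKS)]
procs = []
for r in range(NRANKS):
    send_conn = pipes[r][0]                  # this rank's outgoing link
    recv_conn = pipes[(r - 1) % NRANKS][1]   # previous rank's outgoing link
    procs.append(Process(target=rank_main,
                         args=(r, send_conn, recv_conn, results[r][0])))
for p in procs: p.start()
received = dict(results[r][1].recv() for r in range(NRANKS))
for p in procs: p.join()

print(received)   # each rank ends up holding its predecessor's id
```

The point for ISVs is that only the transport layer changes between clusters; the send/recv calls in the application stay the same.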
Figure 2: Improvements at the silicon level will drive HPC to even higher performance levels. Here, Intel's Nehalem, scheduled for production in Q4 2008, scales from two to eight cores, each with simultaneous multithreading, giving a capability of four to 16 threads.
Not all codes are optimised for clusters, though. The CAE space is dominated by large ISV codes that are generally based on years of work, says IBM's Clifford, and many are still in Fortran. Rewriting them for scaling is a daunting challenge, and there is more potential to exploit scaling in emerging markets and new codes. And while ISVs are always looking at new algorithms and solvers, adding a new solver is a relatively slow process that can take several years. Five years ago, when ISVs were moving to MPI-based parallel operation, such improvements received close attention, but the vendors have since worked through that stage.
One interesting, relatively new solver is AMLS (Automated Multilevel Substructuring), a computational method for noise and vibration analysis developed at the University of Texas and now licensed by the German company CDH. Nearly all automobile companies use it with Nastran in noise/vibration/harshness (NVH) analysis. It allows them to use finite-element models of cars with 10 million or more DOFs, where the previous limit was about two or three million, and to run jobs on workstations rather than supercomputers.
In addition, several open-source solvers are optimised for parallel processing. They include SuperLU, whose MT version targets SMP machines and whose DIST version targets distributed-memory configurations using MPI. Another is MUMPS, a massively parallel sparse direct solver with an MPI-based implementation.
While Ansys has updated its Fluent CFD (computational fluid dynamics) software to optimise performance on multicore systems, for its structural analysis software the firm has introduced an add-on called Parallel Performance for Ansys. It consists of four solvers that let engineers exploit clusters; these are generally new implementations of classic solvers that support modern memory architectures. For example, compared with the classic PCG (preconditioned conjugate gradient) solver, the distributed DCG solver improves performance by a factor of six when running a structural model with 12.8 million DOFs on an 8-CPU cluster.
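The conjugate-gradient family behind these solvers is compact enough to show in full. Below is a minimal, unpreconditioned CG sketch in pure Python; the 3×3 symmetric positive-definite system is illustrative only, and a production PCG adds a preconditioner, sparse matrix storage and, in the distributed versions, message passing between domain partitions.

```python
# Minimal conjugate-gradient iteration for A x = b, with A symmetric
# positive definite. Illustrative only: no preconditioner, dense storage.
def cg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                # residual r = b - A x (x starts at 0)
    p = r[:]                                # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]   # step along p
        r = [r[i] - alpha * Ap[i] for i in range(n)]  # update residual
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:                    # converged
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

# Toy SPD system, purely for demonstration.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = cg(A, b)
print(x)
```

The matrix-vector product inside the loop is where nearly all the time goes at CAE scale, which is why it is the part the distributed solver implementations parallelise across cluster nodes.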
In some cases hardware and software vendors cooperate to optimise code. Hewlett-Packard has worked with MSC Software to improve that ISV's own maths library. HP hand-coded most of that library in assembly for the Itanium processor, and MSC now sells a special version of Nastran on request with the modified library linked into the executable. HP also ran benchmarks to evaluate the optimised kernel improvements in another version of Nastran, NX Nastran v5 from Siemens PLM Software, comparing it with v4 on an Itanium2-based HP Integrity rx2660 server running Linux; elapsed job time dropped by as much as 50 per cent. HP is cooperating with other ISVs on similar projects, says Knute Christensen, software marketing manager for the company's HPC division, and firms with especially strong relationships include Dassault, with its Simulia software, and Ansys, with Fluent. He also points out that HP has developed its own maths libraries (as have Microsoft, Intel and even some ISVs) and that the performance increase for CAE codes is generally in the range just described for Nastran.
Benchmarks answer critical questions
Such benchmarks are helpful to engineers who frequently ask: Which configuration best runs my CAE software? Which type of processor and with how many cores? How much memory per node? The type of interconnect? The answer depends largely on the type of analysis and the application software.
It can be difficult to find data on how well a given software package is designed for multithreaded operation, says Mellanox's Shainer, although every vendor will tell you that its codes are, of course, optimised for it. Indeed, running benchmarks for various codes on different system configurations is a large part of his job. One package that he says has proven to scale well is OpenCFD's OpenFOAM. He has also looked at Fluent: while two years ago it didn't scale well, the current version does, and he uses it for benchmarking to discover the pros and cons of various server/core configurations and interconnect schemes.
Figure 3: Benchmarks by Hewlett-Packard's HPC division compare Nastran running on three different dual-core processors and show that beyond four CPUs there are diminishing returns from adding cores.
The extent to which CAE software can scale up with servers depends not only on hardware settings but also on the discipline and the numerical approaches that have evolved within it. CFD codes, explains HP's Christensen, are not as dependent on a short list of maths libraries as structural mechanics codes are, and CFD codes scale much better. He adds that Nastran is dominated by these maths libraries and has trouble scaling beyond four or eight processors. To illustrate this point, and to show the differences that can arise just from the choice of core, consider Figure 3, above. It shows Nastran v5 scalability across systems with one to 16 CPUs running a car-body model with 0.6 million degrees of freedom, on three HP servers each using dual-core CPUs: the AMD Opteron, Intel Xeon or Intel Itanium2. Note that there is little benefit in moving from four to eight cores and even less in going from eight to 16.
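The flattening curve in Figure 3 follows the familiar Amdahl's-law pattern: if a fraction of the work cannot be parallelised, speedup on n cores is capped at 1 / (s + (1 − s)/n). The 20 per cent serial fraction used below is an illustrative assumption, not a measured Nastran figure, but it reproduces the same shape of diminishing returns.

```python
# Amdahl's law: speedup on n cores with serial fraction s.
# The 0.20 serial fraction is an assumed, illustrative value.
def speedup(n, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

for n in (1, 2, 4, 8, 16):
    print(n, round(speedup(n, 0.20), 2))
# With s = 0.20, going from 8 to 16 cores gains only 3.33x -> 4.0x,
# while 1 to 4 cores gains 1.0x -> 2.5x: most of the benefit comes early.
```

A code dominated by a few serial maths-library kernels has a comparatively large s, which is exactly the behaviour Christensen describes for Nastran.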
When scaling up an HPC system, users must consider both how many servers to add ('scaling out') and how many CPUs per server ('scaling up'). Mellanox ran some tests on LS-DYNA structural and fluid-analysis software from Livermore Software Technology on a single server in SMP and DMP modes (using Scali MPI Connect). With one core, the software running in DMP mode was only slightly more efficient, but with eight cores DMP provided a performance boost of roughly 45 per cent.
To see the difference between GigE and InfiniBand, Mellanox ran Fluent using Ansys’s Turbo_500k benchmark. Note (see Figure 4) in particular the knee in the curve for three servers where the ratings for GigE start to flatten out, whereas performance with InfiniBand scales up almost linearly. This, explains Shainer, is due to the fact that most messages being passed are between 16k and 64k bytes in size; with more servers there are more such messages, and the overhead of Ethernet transmissions becomes the bottleneck.
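A back-of-envelope model makes Shainer's point concrete: one message's transfer time is roughly latency plus size divided by bandwidth. Using the figures quoted earlier in this article (about 1μs latency at 20Gb/s for InfiniBand versus roughly 50μs at 1Gb/s for GigE), a 16–64KB message costs an order of magnitude more on Ethernet, and that cost is paid on every message as the server count grows. Real protocol overheads are ignored here; this is a first-order sketch only.

```python
# First-order message-cost model: time = latency + bits / bandwidth.
# Figures are the ones quoted in the article; protocol overhead ignored.
def msg_time_us(size_bytes, latency_us, bandwidth_gbps):
    # 1 Gb/s = 1e3 bits per microsecond
    return latency_us + (size_bytes * 8) / (bandwidth_gbps * 1e3)

for size in (16 * 1024, 64 * 1024):
    gige = msg_time_us(size, 50.0, 1.0)    # GigE: ~50 us latency, 1 Gb/s
    ib = msg_time_us(size, 1.0, 20.0)      # InfiniBand: ~1 us, 20 Gb/s
    print(size, round(gige, 1), round(ib, 1))
```

As the cluster grows, the number of such messages grows too, so the per-message gap compounds into the knee seen in Figure 4.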
Another benchmark from Mellanox shows that there can even be a loss of performance when adding cores. This time the test ran LS-DYNA on eight to 64 cores. Again, performance with InfiniBand scales up almost linearly, but GigE improves only up to 16 cores, after which performance drops – amazingly, the job runs more slowly on 64 cores than on eight, and InfiniBand's performance gain over GigE at 64 cores is 1,034 per cent.
Figure 4: In a benchmark run by Mellanox, you can see the knee in the curve at three servers, where the ratings for GigE start to flatten out while performance with InfiniBand continues to scale up almost linearly.
When modelling with languages
Much mathematical modelling is done with Very High Level Languages such as Matlab, Maple, IDL or Spice. Those working in such environments can parallelise their applications by turning to Star-P, a client-server parallel-computing platform from Interactive Supercomputing. The company's goal is to make parallel processing as accessible as possible to people who aren't computer scientists. In fact, adds VP of marketing David Rich, often the real speed issue isn't so much actual compute time as the work necessary to get the speed; most scientists don't want to get involved in complex programming projects to implement parallel processing.
In Matlab code, for instance, you use the *p operator to create a new datatype, the distributed data array, and the Star-P client running on the same machine takes ownership of that array. Given a matrix n, the code snippet n = n*p; creates a matrix for Star-P. That software then implements either task parallelism (where some function must run multiple times, with each run independent of the others) or data parallelism (such as where each column in a matrix holds data from a different image). The Star-P client sends the selected code to a Star-P server for parallel execution on multicore hardware, whether one machine with multiple cores or a cluster. The Star-P application is configured to support the available processors and can also work with the server's platform manager for resource allocation, all transparent to the end user. Depending on the application and available hardware, processing can sometimes run as much as 100 times faster.
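The two patterns Star-P distinguishes can be sketched in plain Python, with a process pool standing in for the Star-P server. The functions and the toy matrix below are illustrative only, not Star-P code.

```python
# Task vs data parallelism, shrunk to a toy example.
# multiprocessing.Pool stands in for the remote parallel server.
from multiprocessing import Pool

def simulate(seed):
    # Task parallelism: the same function runs many times,
    # each run independent of the others (e.g. parameter sweeps).
    return (seed * 2654435761) % 97

def column_mean(col):
    # Data parallelism: the same operation applied to each
    # column of one large array (e.g. one image per column).
    return sum(col) / len(col)

matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]          # toy stand-in for a distributed array
columns = list(zip(*matrix))  # split the "distributed" data by column

with Pool(2) as pool:
    task_results = pool.map(simulate, range(4))   # independent tasks
    data_results = pool.map(column_mean, columns) # one op, many columns

print(data_results)   # per-column means: [4.0, 5.0, 6.0]
```

In Star-P the split and the dispatch to workers happen behind the *p datatype rather than through explicit pool calls, which is precisely what makes it accessible to non-programmers.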
In addition, if you do have custom libraries that lend themselves to parallel operation, you can bind them into a Matlab process. Star-P allows you to connect libraries to the bottom of the software stack, and they become visible within the Matlab user interface. Finally, Interactive Supercomputing has recently introduced Star-P On-Demand, whereby scientists can test, benchmark, build and deploy parallelised applications. This is very interesting for companies that want to investigate the benefits of parallel processing but don't yet have a cluster, or for those that want to try their applications on a larger-scale platform.