China dominates HPC industry with home-grown technology
At the International Supercomputing Conference (ISC’16) held last week in Frankfurt, Germany, a new Chinese supercomputer, Sunway TaihuLight, took the top spot on the TOP500, a twice-yearly list of the fastest supercomputers ranked by the Linpack benchmark.
However, this new system represents more than just a 93-petaflop Chinese supercomputer. It demonstrates the Chinese commitment to developing home-grown HPC hardware that can compete at the very top of the worldwide supercomputing industry.
At ISC last year it was announced that China would deliver two 100-petaflop systems within the next year. One would be an upgrade to the Tianhe-2 system, featuring a new Chinese-developed accelerator. The second is a new system developed at the National Supercomputing Center in Wuxi, near Shanghai, which we now know as the number one TOP500 system, ‘Sunway TaihuLight’.
The Sunway TaihuLight System was developed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC) and installed at the National Supercomputing Center in Wuxi, in China's Jiangsu province. The centre is a joint effort of Tsinghua University, the City of Wuxi, and Jiangsu province.
The new system is based on home-grown processor technology, announced at ISC last year. This indicates not only that China is putting considerable resources into developing HPC technology, but also that it is far along the roadmap to eliminating its dependence on US-based processor companies such as Intel.
Chinese RISC processor
A recent report written by Jack Dongarra, co-founder of the TOP500 and professor at the University of Tennessee, sheds some light on the technology underpinning the new system, including the new Chinese processor powering Sunway TaihuLight.
The report states: ‘The complete system has a theoretical peak performance of 125.4 Pflop/s with 10,649,600 cores and 1.31 PB of primary memory. It is based on a processor, the SW26010 processor that was designed by the Shanghai High Performance IC Design Center.’
A compute node of this system is based on one SW26010 many-core processor chip. Each processor is composed of four management processing elements (MPE), four computing processing element (CPE) clusters of 64 cores each (a total of 260 cores), four memory controllers (MC) and a network on chip connected to the system interface. Each of the four groups of MPE, CPE cluster, and MC has access to 8 GB of DDR3 memory.
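The figures quoted above can be cross-checked with some simple arithmetic. The sketch below is illustrative only: the 64-core size of each CPE cluster is an assumption chosen because it makes the quoted 260-core total work out, the node count is derived by division rather than quoted, and the variable names are our own.

```python
# Figures quoted in the article.
TOTAL_CORES = 10_649_600
MPES_PER_CHIP = 4
CPE_CLUSTERS_PER_CHIP = 4
GROUPS_PER_CHIP = 4          # each group is one MPE, one CPE cluster, one MC
MEMORY_PER_GROUP_GB = 8

# Assumption, not stated as a figure in the original text: 64 cores per CPE cluster.
CORES_PER_CPE_CLUSTER = 64

# 4 management cores plus 4 clusters of 64 compute cores.
cores_per_chip = MPES_PER_CHIP + CPE_CLUSTERS_PER_CHIP * CORES_PER_CPE_CLUSTER

# One chip per node, so the node count follows from the total core count.
node_count = TOTAL_CORES // cores_per_chip

memory_per_node_gb = GROUPS_PER_CHIP * MEMORY_PER_GROUP_GB
total_memory_pb = node_count * memory_per_node_gb / 1_000_000  # decimal petabytes

print(cores_per_chip, node_count, memory_per_node_gb, round(total_memory_pb, 2))
# prints: 260 40960 32 1.31
```

Under these assumptions the per-chip core count, the total core count and the 1.31 PB of primary memory are mutually consistent.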
The use of ShenWei processors presents an element of historical irony, as some reports indicate the design of the chip appears to resemble very closely that of the ‘Alpha’ RISC chip developed by the Digital Equipment Corporation (DEC) and discontinued in 2007 by HP, which had inherited the technology through merger and acquisition.
Since the new ShenWei system uses its own CPU, China enters the 100-petaflop era with its own CPU and interconnect technologies. The upgrade to Tianhe-2, featuring a Chinese-developed accelerator, will complete the hardware set underlying China’s latest drive to develop home-grown supercomputing hardware.
It should also be noted that, although reports say the processor was based on an earlier RISC-based DEC Alpha processor, the Jiāngnán Computing Lab has been developing its own line of ShenWei processors since, at the very least, the launch of the ShenWei S-1 in 2006. The latest processor, used in the Sunway TaihuLight supercomputer, is the fourth generation of this line: the SW26010.
Dongarra’s report on the Sunway TaihuLight System states that a single node of the supercomputer contains a single ShenWei processor. A node delivers approximately three teraflops, an impressive raw performance roughly comparable to Xeon Phi.
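The per-node figure can be reproduced from numbers given elsewhere in this article. The following is a rough sketch, assuming one 260-core processor per node and using the rounded 93-petaflop Linpack result; the variable names are illustrative.

```python
# Figures quoted in the article (the Linpack result is the rounded 93 Pflop/s).
PEAK_PFLOPS = 125.4
LINPACK_PFLOPS = 93.0
TOTAL_CORES = 10_649_600
CORES_PER_NODE = 260  # one SW26010 processor per node

node_count = TOTAL_CORES // CORES_PER_NODE

# Theoretical peak per node, converted from Pflop/s to Tflop/s.
peak_per_node_tflops = PEAK_PFLOPS * 1000 / node_count

# Fraction of theoretical peak achieved on the Linpack benchmark.
linpack_efficiency = LINPACK_PFLOPS / PEAK_PFLOPS

print(f"{peak_per_node_tflops:.2f} Tflop/s per node, "
      f"{linpack_efficiency:.0%} of peak on Linpack")
# prints: 3.06 Tflop/s per node, 74% of peak on Linpack
```

The result is a theoretical peak, so it says nothing about how much of those three teraflops a given application will sustain.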
However, while this system delivers impressive performance on the computationally intensive Linpack benchmark, it is not as well suited to some data-intensive applications. The report notes that a lack of memory and a PCIe Gen 3 based interconnect system reduce performance on the High Performance Conjugate Gradients (HPCG) benchmark.
HPCG focuses on computational and data access patterns that more closely match a different and broad set of important applications. The aim is to incentivise system designers to invest in capabilities that will have an impact on the collective performance of these applications, as opposed to just relentlessly increasing FLOPS performance.
The report states that the ratio of floating point operations per byte of data from memory on the SW26010 is 22.4 Flops(DP)/Byte transfer, as opposed to 7.2 Flops(DP)/Byte for the Intel Knights Landing processor. The report also states that ‘the primary memory for this system is on low side at 1.3 PB (Tianhe-2 has 1.4 PB and Titan has 0.71 PB).’
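To see what these ratios imply, the sketch below back-calculates the memory bandwidth that would produce 22.4 flops per byte. It assumes the ratio is defined as a node's peak double-precision rate divided by its memory bandwidth, and it derives the node peak from figures quoted in this article; the function name is hypothetical.

```python
def implied_bandwidth_gb_s(peak_tflops, flops_per_byte):
    """Memory bandwidth (GB/s) implied by a peak rate and a flops-per-byte ratio.

    Assumes flops_per_byte = peak flop rate / memory bandwidth.
    """
    return peak_tflops * 1000 / flops_per_byte


# Node peak derived from the article: 125.4 Pflop/s across 10,649,600 / 260 nodes.
sw_node_peak_tflops = 125.4 * 1000 / (10_649_600 // 260)

# Ratios quoted from the report.
SW26010_FLOPS_PER_BYTE = 22.4
KNL_FLOPS_PER_BYTE = 7.2

sw_bandwidth = implied_bandwidth_gb_s(sw_node_peak_tflops, SW26010_FLOPS_PER_BYTE)
balance_gap = SW26010_FLOPS_PER_BYTE / KNL_FLOPS_PER_BYTE

print(f"Implied SW26010 memory bandwidth: {sw_bandwidth:.0f} GB/s")
print(f"Flops per byte relative to Knights Landing: {balance_gap:.1f}x")
```

On that reading, each byte fetched from memory on the SW26010 has to feed roughly three times as many floating-point operations as on Knights Landing to keep the chip busy, which is why bandwidth-bound benchmarks such as HPCG fare worse than Linpack.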
While the system is focused on computation, as opposed to the more data-centric computing strategies that we have begun to see implemented in the US and Europe, it is most certainly not just a Linpack supercomputer. The report explains that there are already three applications running on the Sunway TaihuLight system which are finalists for the Gordon Bell Award at SC16.
Each of these applications was scaled to around 8,000,000 cores and there are still several months to improve performance before the award is presented at SC16 later this year.
The Gordon Bell Prize is awarded each year to recognise an outstanding achievement in high-performance computing. The purpose of the award is to track the progress of parallel computing over time, with particular emphasis on rewarding innovation in applying HPC to applications in science, engineering, and large-scale data analytics.
Ultimately it will take time to develop a full suite of applications for this system, as there are very few RISC-based supercomputers on this scale. It therefore stands to reason that there are fewer applications, and fewer programmers with RISC experience, than there are for x86-based systems.
As the number of applications and expert programmers increases, we will likely see even more performance from this new breed of RISC-based Chinese supercomputer.
The report by Dongarra concludes: ‘The Sunway TaihuLight system, based on a home-grown processor, demonstrates the significant progress that China has made in the domain of designing and manufacturing large-scale computation systems.’
‘The fact that there are sizeable applications and Gordon Bell contender applications running on the system is impressive and shows that the system is capable of running real applications and not just a “stunt machine”.’