Programming Titan

Hybrid architecture points the way to the future of supercomputing

The lab's new Titan supercomputer promises research that is literally global in scale.

A group of researchers from Princeton University is proposing to use Titan—the beefier successor to the lab's Jaguar supercomputer—to create seismological simulations of the entire Earth. Not just a single fault line or continent. The world. "This group wants to use seismic data from around the globe to image the substructure of the planet—as if they were giving the Earth an ultrasound," explains Jack Wells, science director for ORNL's National Center for Computational Sciences (NCCS).

Titan will be capable of performing 20,000 trillion calculations every second, making it six times more powerful than ORNL's current Jaguar system. Image: Andy Sproles

"If they tried to run that simulation on Jaguar, it would have required the whole machine for half a year. However, if they optimize their code and win a big allocation of time on Titan, they just might be able to do it."

Titan comes online later this year and may be able to handle this feat of computational strength because of a fundamental difference in the way it approaches complex calculations. Rather than increasing the speed of its individual processor cores, Titan's computational prowess rests on tens of thousands of next-generation processors, each of which contains hundreds of cores designed to rip through calculations.

"They're not getting faster, they're getting wider," Wells says. These "wider" processors can handle hundreds of parallel threads of data and are changing the way programmers and software designers work in the high-performance computing arena.

Speed and efficiency

For decades, computers got faster by increasing their central processing unit's "clock rate" (how often a CPU, or "core," cycles through its assigned tasks). However, fast processors produce a lot of heat. Twice the speed results in a chip that's eight times as hot. So by about 2004, manufacturers decided it made more sense to simplify the circuitry on their chips and use the extra space for more, slower cores running in parallel, rather than finding new ways to cool progressively faster monolithic CPUs. This led to the dual-core and quad-core processors that dominate the personal computer market today. Even an early version of ORNL's Jaguar supercomputer was based on dual-core processors.
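
The "eight times as hot" figure is consistent with a common rule of thumb, which is assumed here and not spelled out in the article: a chip's dynamic power grows with voltage squared times frequency, and voltage has to rise roughly in step with frequency. Under that assumption power grows as the cube of the clock rate. The short Python sketch below works through the arithmetic; the function name and its parameters are illustrative only.

```python
def relative_power(freq_ratio: float, cores: int = 1) -> float:
    """Relative dynamic power of a chip versus one baseline core.

    Assumes P ~ C * V^2 * f with V scaling linearly with f, so each
    core's power grows as the cube of its clock ratio. This is a rule
    of thumb, not a measurement of any particular chip.
    """
    return cores * freq_ratio ** 3


# One core at twice the clock: 2^3 = 8 times the power (and heat).
print(relative_power(2.0))           # 8.0

# Two cores at the original clock: twice the power for, ideally,
# the same doubling of throughput.
print(relative_power(1.0, cores=2))  # 2.0
```

The second case only matches the first in throughput if the work divides cleanly between the two cores, which is the programming challenge the rest of the article describes.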

"Parallelism at the chip level increases with the number of cores on the chip," Wells explains. "We went to dual core, then quad core, then hex core. The traditional processors we have placed in Titan have 16 cores."

These multicore CPU chips are flexible, powerful and designed to orchestrate complex computational activities. What they're not is energy efficient. "If they were a building, the heat and lights would be on all the time," Wells says.

Jaguar, with its 300,000 cores, draws about 8 megawatts of power. Reaching the center's goal of creating an "exascale" computer (about 400 times faster than Jaguar) with the same technology would require 100 to 200 MW of power. "That's not a practical option," Wells says. "We need a way to increase performance with much higher energy efficiency."

It just so happens that there is a rapidly emerging technology that does exactly that: the graphics processing unit, or "accelerator." Rather than following Jaguar's lead and using two CPU chips on each of the nodes on its communication network, Titan will replace one of the CPUs with a GPU. While consuming slightly more power than two CPUs, this hybrid arrangement allows the central processor chip to hand off the most computationally intensive and time-consuming activities to the accelerator, which divides them among hundreds of streamlined GPU cores and returns the results to the CPU. By radically boosting throughput, the hybrid configuration provides a big increase in processing speed.
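
The hand-off described above can be sketched in a few lines of Python. This is only an analogy under stated assumptions: a thread pool stands in for the accelerator's many cores, no real GPU is involved, and the names `heavy_kernel` and `offload` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def heavy_kernel(chunk):
    """Stand-in for the compute-intensive work handed to the accelerator."""
    return sum(x * x for x in chunk)


def offload(data, workers=4):
    """Host-side routine: split the data, farm it out, gather the results."""
    # Ceiling division so every element lands in exactly one chunk.
    size = max(1, (len(data) + workers - 1) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(heavy_kernel, chunks))
    # The partial results come back to the "CPU" for the final step.
    return sum(partials)


total = offload(list(range(1000)))
print(total)  # 332833500
```

The host code stays in charge of orchestration and only the uniform, repetitive work is farmed out, which mirrors the division of labor between CPU and GPU on a hybrid node.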

As its name implies, the GPU was originally developed for accelerating graphics processing. The demand for increasingly realistic visual content in video games and other applications compelled GPU designers to endow their chips with the ability to apply complex physical equations to the task of producing convincing renderings of real-world phenomena like explosions, smoke and fire. Eventually game programmers and GPU manufacturers realized that such "game physics" could also have applications in the field of scientific computing and simulation.

"At that point, NVIDIA, a leading GPU manufacturer, began to develop a business strategy for producing GPUs for scientific computing," Wells says. "Our partnership with them on Titan is part of that strategy. NVIDIA recognized that, although scientific computing is a smaller market than video gaming, it has considerable impact, and it is growing. So now they have a line of products for scientific computing."

New challenges

The use of these highly parallel GPUs shifts some of high-performance computing's complexity from its hardware to its software, providing new challenges for software developers.

"To take full advantage of this hybrid architecture, a programmer needs to concentrate on revealing all of the available parallelism in the computer code to the processor," Wells explains. "Any tasks that can be done in parallel need to be made available to the hierarchy of GPUs, CPUs, the computer's communication network, and the memory structure that goes with it."

Wells says that, in anticipation of the move to Titan, the laboratory's Center for Accelerated Application Readiness has been retooling key computer codes in a range of research areas to exploit opportunities for parallel processing. As a result, most of them are now also twice as fast when they're running on traditional CPU-based systems.

One of these applications is LSMS, a code used for modeling magnetic materials that was initially developed at ORNL 15 years ago. It employs a computational technique called matrix-matrix multiplication to simulate the interactions between electrons and atoms in magnetic materials. Because the code was developed for early parallel computers, its programmers anticipated many of the nuances of "message passing," the ability to efficiently use parallel streams of data that will be critical to operating in Titan's highly parallel environment.

"The message-passing capabilities and data structure of LSMS allow the program to stride through memory in a regular fashion," Wells says. "That means the processor doesn't have to wait for data to arrive. Time isn't wasted waiting on resources."

As a result, while the program will require some work to bring it up to date, it is a good candidate for being adapted for use on highly parallel computers.
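
To make "striding through memory in a regular fashion" concrete, here is a minimal Python sketch of matrix-matrix multiplication on flat, row-major arrays. It is not the LSMS code, and the function name is an assumption made for illustration. Python lists also do not guarantee contiguous numeric storage, so the sketch shows the access pattern only, not the resulting speed.

```python
def matmul_row_major(a, b, n):
    """Multiply two n x n matrices stored as flat row-major lists.

    The i-k-j loop order makes the innermost loop walk along rows of
    b and c one element at a time (unit stride), instead of jumping
    n elements per step as a column-wise walk would.
    """
    c = [0.0] * (n * n)
    for i in range(n):
        row_c = i * n
        for k in range(n):
            a_ik = a[i * n + k]
            row_b = k * n
            for j in range(n):
                c[row_c + j] += a_ik * b[row_b + j]
    return c


a = [1.0, 2.0, 3.0, 4.0]  # [[1, 2], [3, 4]]
b = [5.0, 6.0, 7.0, 8.0]  # [[5, 6], [7, 8]]
print(matmul_row_major(a, b, 2))  # [19.0, 22.0, 43.0, 50.0]
```

On real hardware, a predictable walk like this lets data be fetched ahead of use, so the arithmetic units are not left idle waiting for memory.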

Wells explains that the accelerators give researchers new tools to apply to their research problems. "The question they need to consider is how they will structure their data in order to use these resources efficiently.

"Think of a GPU as a factory with materials moving around the factory floor," he suggests. "When you have big, expensive machines available, you want the necessary materials to be there when you need them, so the machine operators are running the machines, not waiting around drinking coffee.

"An important part of our Titan project is partnering with software vendors to make sure that appropriate programming tools such as compilers and debuggers are available, along with the necessary libraries, to help make this programming task more manageable.

"Again, restructuring the data to take advantage of all available parallelism is the basic programming task for all foreseeable architectures. Accomplishing this is the first thing on which the programmer must focus. We have observed that the performance of the restructured codes, in general, is twice as fast as the old code running on the same, traditional CPU-based hardware. That is before we offload work to the GPUs."

Next-generation GPUs

The application of this hybrid architecture to building Titan began in earnest earlier this year and will occur in two phases. Phase 1 involved replacing all the circuit boards in Jaguar with new ones that came with new CPUs, and installing twice as much memory, a slot for the yet-to-be-installed GPU, and a new network interconnect to speed communication among nodes. The upgraded interconnect is also "hot-swappable," so if something goes wrong with a node (if a processor fails, for example), the node can be replaced while the machine is still running.

Phase 2, which is scheduled to begin in October, will fill the empty accelerator slots of the new nodes with NVIDIA's next-generation "Kepler" GPU.

While they wait for Kepler to make its appearance, NCCS researchers have partitioned off 10 of Jaguar's 200 cabinets to try out the new hybrid architecture using NVIDIA's current-generation GPU, called Fermi.

Wells notes that this developmental mini-Titan has already yielded a good comparison of traditional CPU-only architecture, like that used in Jaguar, with the hybrid arrangement that will be used in Titan.

"Depending on the code we're running, with the Fermi chip, the hybrid node is usually a factor of 1.5 to 3 times faster than a node containing two multicore CPUs," he says. "On a few applications it has been 4 to 6 times faster. We made these comparisons after rewriting the code to take better advantage of the parallelism in the Fermi accelerator. So it's an apples-to-apples comparison."

Of course the big question is, once the Jaguar-to-Titan transformation is complete, how fast will Titan be? There are algorithms for extrapolating from Fermi's performance to Kepler's, and those suggest that Titan could be 10 times faster than Jaguar. The crucial variable, however, is how many of Jaguar's 200 cabinets will be upgraded to include Kepler GPUs. That has yet to be determined. So a reliable prediction of Titan's power is still just out of reach.
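
A rough sense of why the cabinet count matters so much comes from a deliberately simple model, which is an assumption of this sketch and not one of the extrapolation algorithms mentioned above: treat every cabinet as contributing equally, and give each upgraded cabinet a fixed per-node speedup. The function name, its parameters, and the speedup values passed to it are all hypothetical inputs.

```python
def relative_throughput(upgraded, node_speedup, total=200):
    """Machine throughput relative to an all-CPU baseline.

    Assumes ideal scaling: each of the `total` cabinets contributes
    one unit of work, and each upgraded cabinet contributes
    `node_speedup` units. Real applications, the network, and the
    Phase 1 CPU and memory upgrades are all ignored here.
    """
    if not 0 <= upgraded <= total:
        raise ValueError("upgraded must be between 0 and total")
    return (upgraded * node_speedup + (total - upgraded)) / total


# Same hypothetical per-node speedup, different numbers of cabinets.
for cabinets in (10, 100, 200):
    print(cabinets, relative_throughput(cabinets, node_speedup=3.0))
# 10 1.1
# 100 2.0
# 200 3.0
```

Even under these generous assumptions, the overall gain only approaches the per-node speedup when nearly every cabinet carries accelerators.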

Opportunity for innovation

When Titan goes online, researchers will be able to create some of the simulations they have been dreaming about but that were just too big or too detailed for other systems.

For climate scientists that might mean generating 10 times as much detail in their climate models. For example, climate simulations often assume that air pressure is relatively constant—the same at the ground as it is at the jet stream. They make this compromise with reality because calculating variable air pressure would add a lot of computation time, and researchers have a limited number of hours on the computer to get their calculations done.

"There's a lot of interesting stuff that happens at different altitudes," Wells says. "Titan will now be able to explore that." Other users plan to apply Titan to studying the fundamentals of combustion—how materials burn—in greater detail than ever before. These studies will give researchers a better understanding of the basic processes that underpin the use of both fossil fuels and complex renewable fuels, such as ethanol or biodiesel, for transportation and electricity generation.

"We expect this research to result in the development of more energy-efficient combustion processes for these fuels," Wells says. Another group will use highly accurate molecular dynamics techniques to simulate the fusion of biological membranes. This process is fundamental to cell division and is related to disease processes, such as cancer.

"This basic biophysical phenomenon is poorly understood," Wells says. "Titan's computing power will enable highly accurate simulations and new insights.

Expanding the base

"These are the kinds of problems we want scientists to consider addressing with Titan. We're asking the scientists, if you had a big allocation of time on Titan a year from now, what would you do with it?"

The response to Wells' question has been enthusiastic—and maybe a little surprising. "As expected, most of the scientists and engineers we have traditionally worked with are interested in working with Titan's hybrid programming model," Wells says. "What we didn't anticipate was the number of users working in the GPU computing space—researchers doing accelerated computing on their workstations, rather than on supercomputers—who have been inspired to think of bigger problems that could be done on Titan. We were worried that Titan's hybrid architecture would alienate some of our traditional users; instead, it is actually attracting a new community of users to Titan.

"This is a very healthy development. A big part of our mission is reaching out to new users and new communities and encouraging them to take advantage of our unique resources for scientific computing." —Jim Pearce