System operations

Feiyi Wang — leader of the OLCF’s Analytics and AI Methods at Scale, or AAIMS, group — has been spending much of his time pondering an elusive goal: how to operate a supercomputer so that it uses less energy. Tackling this problem first required the assembly of a massive amount of HPC operational data. 

Long before Frontier was built, he and the AAIMS group collected over one year’s worth of power profiling data from Summit, the OLCF’s 200-petaflop supercomputer launched in 2018. Summit’s 4,608 nodes each have over 100 sensors that report metrics at 1 hertz, meaning that for every second, the system reports over 460,000 metrics. 

Using this 10-terabyte dataset, Wang’s team analyzed Summit’s entire system from end to end, including its central energy plant, which contains all its cooling machinery. They overlaid the system’s job allocation history on the telemetry data to construct per-job, fine-grained power-consumption profiles for over 840,000 jobs. This work earned them the Best Paper Award at the 2021 International Conference for High Performance Computing, Networking, Storage, and Analysis, or SC21.
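In outline, that overlay is a join between the scheduler’s job records and the node-level power telemetry. The sketch below shows one way such a join could be expressed; the file names, column layout, and simple summation are illustrative assumptions rather than the AAIMS group’s actual pipeline.

```python
# Illustrative sketch: overlaying a job-allocation history on node-level power
# telemetry to build per-job power profiles. File names, column names, and the
# 1 Hz sampling assumption are hypothetical, not the real OLCF data layout.
import pandas as pd

# Telemetry: one row per node per second, columns [timestamp, node, power_w]
telemetry = pd.read_csv("summit_power_telemetry.csv", parse_dates=["timestamp"])

# Scheduler history: one row per job, columns [job_id, node_list, start, end]
jobs = pd.read_csv("job_allocation_history.csv", parse_dates=["start", "end"])

profiles = {}
for job in jobs.itertuples():
    nodes = job.node_list.split()                    # nodes assigned to this job
    mask = (
        telemetry["node"].isin(nodes)
        & (telemetry["timestamp"] >= job.start)
        & (telemetry["timestamp"] < job.end)
    )
    # Sum node power into a per-second, per-job power trace (watts)
    trace = telemetry.loc[mask].groupby("timestamp")["power_w"].sum()
    profiles[job.job_id] = {
        "power_trace_w": trace,
        # With 1 Hz samples, summing watts over seconds approximates joules
        "energy_kwh": trace.sum() / 3.6e6,
    }
```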

The effort also led Wang to come up with a few ideas about how such data can be used to make informed operational decisions for better energy efficiency.

Using the energy-profile datasets from Summit, Wang and his team kicked off the Smart Facility for Science project to provide ongoing production insight into HPC systems and give system operators “data-driven operational intelligence,” as Wang puts it.

“I want to take this continuous monitoring one step further to ‘continuous integration,’ meaning that we want to take the computer’s ongoing metrics and integrate them into a system so that the user can observe how their energy usage is going to be for their particular job application. Taking this further, we also want to implement ‘continuous optimization,’ going from just monitoring and integration to actually optimizing the work on the fly,” Wang said.

Another one of Wang’s ideas may assist in that goal. At SC23, Wang and lead author Wes Brewer, a senior research scientist in the AAIMS group, delivered a presentation, “Toward the Development of a Comprehensive Digital Twin of an Exascale Supercomputer.” They proposed a framework called ExaDIGIT that uses augmented reality, or AR, and virtual reality, or VR, to provide holistic insights into how a facility operates to improve its overall energy efficiency. Now, ExaDIGIT has evolved into a collaborative project of 10 international and industry partners, and Brewer will present the team’s newest paper at SC24 in Atlanta, Georgia.

At ORNL, the AAIMS group launched the Digital Twin for Frontier project to construct a simulation of the Frontier supercomputer. This virtual Frontier will enable operators to experiment with “What if we tried this?” energy-saving scenarios before attempting them on the real Frontier machine. What if you raised the incoming water temperature of Frontier’s cooling system — would that increase its efficiency? Or would it put the machine at risk of not being cooled enough, thereby driving up its failure rate?

“Frontier is a system so valuable that you can’t just say, ‘Let’s try it out. Let’s experiment on the system,’ because the consequences may be destructive if you get it wrong,” Wang said. “But with this digital twin idea, we can take all that telemetry data into a system where, if we have enough fidelity modeled for the power and cooling aspects of the system, we can experiment. What if I change this setting — does it have a positive effect on the system or not?” 
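The appeal of a digital twin is that questions like these can be swept cheaply and safely in simulation. The toy sketch below illustrates the pattern; its PUE and risk curves are invented stand-ins, not the twin’s actual power and cooling models.

```python
# Hypothetical "what if" sweep of the kind a digital twin makes safe to run.
# Both curves below are invented placeholders; only the question being asked
# (how warm can the supply water get?) comes from the article.
def estimated_pue(supply_temp_f: float) -> float:
    # Toy relationship: warmer supply water means less refrigeration overhead
    return 1.0 + max(0.01, 0.10 - 0.001 * (supply_temp_f - 60.0))

def cooling_risk(supply_temp_f: float) -> float:
    # Toy risk curve: risk of insufficient cooling grows past ~95 F supply water
    return max(0.0, (supply_temp_f - 95.0) / 20.0)

for temp_f in range(70, 106, 5):  # candidate supply-water temperatures, deg F
    print(f"{temp_f} F  PUE~{estimated_pue(temp_f):.3f}  "
          f"risk~{cooling_risk(temp_f):.2f}")
```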

Frontier’s digital twin can be run on a desktop computer, and using VR and AR allows operators to examine the system telemetry in a more interactive and intuitive way as they adjust parameters. The AAIMS group also created a virtual scheduling system to track how the digital twin’s power consumption progresses over time as it runs jobs.
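A stripped-down version of that virtual scheduling idea is sketched below: it replays a few hypothetical jobs and tracks the machine’s aggregate power draw over time. The node count and per-node wattages are assumptions chosen only to land near the idle and peak figures cited elsewhere in this article.

```python
# Minimal sketch of a virtual scheduler replaying jobs and tracking aggregate
# power; the job list, node count, and per-node wattages are all assumed.
import numpy as np

NODES, IDLE_W, BUSY_W = 9408, 800, 3200   # assumed node count and wattages
horizon = 3600                            # simulate one hour at 1-second steps
busy_nodes = np.zeros(horizon, dtype=int)

# (start_s, duration_s, nodes) for a handful of hypothetical jobs
jobs = [(0, 1800, 4000), (600, 2400, 3000), (1200, 900, 2000)]
for start, duration, n in jobs:
    busy_nodes[start:start + duration] += n

busy_nodes = np.minimum(busy_nodes, NODES)
power_mw = (busy_nodes * BUSY_W + (NODES - busy_nodes) * IDLE_W) / 1e6
print(f"peak draw ~{power_mw.max():.1f} MW, mean ~{power_mw.mean():.1f} MW")
```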

Although the virtual Frontier is still being developed, it is already yielding insights into how workloads can affect its cooling system and what happens with the power losses that occur during rectification, which is the process of converting alternating current to direct current. The system is also being used to predict the future power and cooling needs of Discovery. 

“We can and will tailor our development as well as the system to address any current and future pressing challenges faced by the OLCF,” Wang said. 

Facility infrastructure

Powering a supercomputer doesn’t just mean turning it on — it also means powering the entire facility that supports it. Most critical is the cooling system that must remove the heat generated by all the computer’s cabinets in its data center.

“From a 10,000-foot viewpoint, a supercomputer is really just a giant heater — I take electricity from the grid, I run it into this big box, and it gets hot because it’s using electricity. Now I have to run more electricity into an air conditioner to cool it back off again so that I can keep it running and it doesn’t melt,” Geist said. “Inside the data center there is a lot of work that goes into cooling these big machines more efficiently. From 2009 to 2022, we have reduced the energy needed for cooling by 10 times, and our team will continue to make cooling optimizations going forward.”

Much of the planning for those cooling optimizations is led by David Grant, the lead HPC mechanical engineer in ORNL’s Laboratory Modernization Division. Grant oversees the design and construction of new mechanical facilities and is primarily responsible for ensuring that every new supercomputer system installed at the OLCF has the cooling it requires to reliably operate 24-7. He started at ORNL in 2009 and worked on operations for the Jaguar supercomputer. Then, he became involved in its transition into Titan in 2012, led Summit’s infrastructure design for its launch in 2018, and most recently oversaw all the engineering to support Frontier. 

In that span of time, the OLCF’s cooling systems have substantially evolved alongside the chip technology, going from loud fans and chiller-based air-conditioning in Jaguar to fan-free liquid cooling in Frontier. Furthermore, the water temperatures required to cool down the compute nodes have risen from 42 degrees Fahrenheit for Titan to Frontier’s 90 degrees Fahrenheit — a target set by the FastForward program. That extra warmth spurs huge energy savings because the circulating water no longer needs to be refrigerated and can be sufficiently cooled by evaporative towers instead. 

“We are trying to get the warmest water possible back from the cabinets while serving them the warmest water-supply temperatures — the higher the supply temperatures, the better,” Grant said. “Warmer water coming back to us allows us to minimize the flow that we have to circulate on the facility side of the system, which saves pumping energy. And then the warmer temperatures allow us to be more efficient with our cooling towers to be able to reject that heat to our environment.”
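The reasoning follows from a basic heat balance: the heat carried away equals the flow rate times the water’s specific heat times the temperature rise, so for a fixed heat load, a larger rise between supply and return water means proportionally less water to circulate. A back-of-the-envelope illustration, assuming a round 25-megawatt heat load:

```python
# Back-of-the-envelope version of Grant's point: for a fixed heat load, a
# larger supply-to-return temperature rise means less water to pump. The
# 25 MW heat load and the temperature rises below are assumed round numbers.
CP = 4186.0            # specific heat of water, J/(kg*K)
HEAT_LOAD_W = 25e6     # assumed heat to remove from the cabinets, in watts

def flow_kg_per_s(delta_t_c: float) -> float:
    """Mass flow needed to carry HEAT_LOAD_W at a given temperature rise."""
    return HEAT_LOAD_W / (CP * delta_t_c)

for delta_t_c in (5.0, 10.0, 15.0):   # return minus supply temperature, deg C
    print(f"dT = {delta_t_c:>4.1f} C -> {flow_kg_per_s(delta_t_c):,.0f} kg/s")
# Doubling the temperature rise halves the required flow, which is where the
# pumping-energy savings come from.
```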

Frontier’s power usage effectiveness, or PUE — the ratio of the total power used by a data center facility to the power delivered to its computing equipment — comes in at 1.03 at peak usage. This essentially means that for every 1,000 watts used to power the system, just 30 watts of additional electrical power are needed to maintain the system’s appropriate thermal envelope. The global, industry-wide average for data centers is a PUE of around 1.47, according to the Uptime Institute.
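The arithmetic behind those figures is a one-line calculation; the sketch below simply restates the numbers from the article.

```python
# Worked PUE arithmetic: PUE is total facility power divided by the power
# delivered to the computing equipment. Figures are taken from the article.
it_power_w = 1000.0                  # power drawn by the computing equipment
pue = 1.03                           # Frontier's reported peak-usage PUE
overhead_w = it_power_w * (pue - 1)  # facility power beyond the IT load
print(f"{overhead_w:.0f} W of overhead per {it_power_w:.0f} W of IT power")
# -> 30 W, versus roughly 470 W at the industry-average PUE of about 1.47
```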

Making further reductions in power usage for a faster system such as Discovery will require even more innovative approaches, which Grant is investigating.

First, the concept of recovering and reusing some of Discovery’s excess heat may hold promise. The facility is well situated to reuse waste heat if it can be moved from the cooling system to the heating system. But this task is challenging because of the elevated temperatures required by the heating system, the low-grade heat available from the cooling system, and the highly dynamic nature of the heat generated by the HPC systems.

Second, the incoming Discovery system will share Frontier’s cooling system. Additional operational efficiencies are expected from this combined-use configuration.

“Right now, Frontier gets to sit on its own cooling system, and we’ve optimized it for that type of operation. But if you have Frontier demanding up to 30 megawatts and then another system demanding perhaps that much again, what does that do to our cooling system? It is designed to be able to do that, but we’re going to be operating at a different place in its operational envelope that we haven’t seen before. So, there’ll be new opportunities that present themselves once we get there,” Grant said. 

Third, Grant is examining how construction and equipment choices may benefit the facility’s overall energy efficiency. For example, Frontier’s cooling system has 20 individual cooling towers that require a process called passivation to help protect their internal metal surfaces, and this process involves a lot of pumping over time. That step could be eliminated with newer towers that no longer require passivation.

Fourth, idle time on a supercomputer can use up a great deal of electricity — Frontier’s idle loads are 7 to 8 megawatts. What if that idle load could be greatly reduced or eliminated?
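Even the low end of that range adds up quickly over a year of continuous operation; a rough estimate assuming a sustained 7-megawatt idle draw:

```python
# Rough scale of the idle load cited above, assuming a sustained 7 MW draw.
idle_mw = 7.0
hours_per_year = 24 * 365
idle_gwh_per_year = idle_mw * hours_per_year / 1000.0
print(f"~{idle_gwh_per_year:.0f} GWh per year")   # about 61 GWh
```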

“When we interact with the customers who have influence on the software side, we try to communicate to them how their decisions will translate through the cooling system and to the facility energy use,” Grant said. “I think there’s a lot of potential on the software side to try to reduce the idle load requirement and make their models run as efficiently as possible and increase the utilization of the system. In return, they will get higher production on their side for the data that they’re trying to produce.”
