The computational power of the Frontier supercomputer — currently ranked No. 1 on the TOP500 list of the world’s fastest supercomputers — comes in part from the unique design of its more than 37,000 AMD Instinct™ MI250X GPUs.
Using approximately 16,000 of Frontier’s AMD GPUs, AxoNN clocked an unprecedented 1.38 exaflops when running at reduced precision. One exaflop is a quintillion, or a billion-billion, calculations per second.
Details about the team’s achievement can be found in their preprint publication, “Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers.”
Frontier’s computing power also allowed the team to study LLM memorization behavior at a larger scale than ever before. They found that privacy risks associated with problematic memorization tend to occur in models with greater than 70 billion parameters.
“Generally, the more training parameters there are, the better the LLM will be,” Bhatele said. “However, introducing more training parameters also increases privacy issues and copyright risks caused by memorization of training data that we don’t want the LLMs to regurgitate. My colleague and co-author, Tom Goldstein, has coined a term for it: catastrophic memorization.”
To mitigate the problem, they used a technique called Goldfish Loss that randomly omits certain bits of information during training and prevents the model from memorizing entire sequences that could contain sensitive or proprietary data. Frontier’s scalability allowed the team to test the mitigation strategy efficiently in large experiments with models of different sizes up to 405 billion parameters.
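For readers curious what a token-dropping training loss of this kind might look like, below is a minimal illustrative sketch in PyTorch. It is not the code used in the study: the function name goldfish_loss_sketch, the drop_frequency parameter, and the purely random choice of dropped positions are assumptions made here for clarity, and the published technique may select positions differently.

```python
import torch
import torch.nn.functional as F

def goldfish_loss_sketch(logits: torch.Tensor,
                         labels: torch.Tensor,
                         drop_frequency: int = 4) -> torch.Tensor:
    """Next-token cross-entropy that ignores roughly 1 out of every
    `drop_frequency` token positions, so no training sequence is ever
    fully supervised and verbatim memorization becomes harder.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    """
    # Shift so that position t predicts token t + 1.
    shifted_logits = logits[:, :-1, :]
    shifted_labels = labels[:, 1:].clone()

    # Randomly pick the positions whose loss will be dropped.
    # (Illustrative only: the real method's selection rule may differ.)
    drop = torch.rand(shifted_labels.shape,
                      device=shifted_labels.device) < (1.0 / drop_frequency)
    shifted_labels[drop] = -100  # -100 is PyTorch's ignore_index

    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
        ignore_index=-100,
    )
```

In a training loop, a loss like this would simply replace the standard language-modeling loss; everything else stays the same, and only the gradient signal at the dropped positions disappears.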
“We are extremely pleased with AxoNN’s performance on Frontier and the other leadership systems we used in our experiments,” Bhatele said. “Being recognized for this achievement is extra special, and everyone on the team is excited about our presentation at the upcoming supercomputing conference.”
The study also used the Alps supercomputer at the Swiss National Supercomputing Centre in Lugano, Switzerland, and the Perlmutter supercomputer at DOE’s National Energy Research Scientific Computing Center, an Office of Science user facility located at Lawrence Berkeley National Laboratory.
The Gordon Bell effort on AxoNN was led by Siddharth Singh, a senior doctoral student in Bhatele’s group, which also includes Prajwal Singhania and Aditya Ranjan. Other collaborators include Tom Goldstein, John Kirchenbauer, Yuxin Wen, Neel Jain, Abhimanyu Hans, and Manli Shu (University of Maryland); Jonas Geiping (Max Planck Institute for Intelligent Systems); and Aditya Tomar (UC Berkeley).
Access to high-performance computing resources at ORNL was awarded through DOE’s Innovative and Novel Computational Impact on Theory and Experiment program.
UT-Battelle manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit energy.gov/science.
This Oak Ridge National Laboratory news article, “Gordon Bell Prize nomination recognizes efforts to train extreme-scale large language models using Frontier,” was originally published at https://www.ornl.gov/news.