Scaling the Depths: Oceananigans Achieves Record-Breaking Climate Simulation Performance

by @oceanography


Too Long; Didn't Read

Oceananigans.jl achieves scalable, energy-efficient ocean simulations with breakthrough resolution, leveraging GPU clusters for climate modeling advancements.


Authors:

(1) Simone Silvestri, Massachusetts Institute of Technology, Cambridge, MA, USA;

(2) Gregory Wagner, Massachusetts Institute of Technology, Cambridge, MA, USA;

(3) Christopher Hill, Massachusetts Institute of Technology, Cambridge, MA, USA;

(4) Matin Raayai Ardakani, Northeastern University, Boston, MA, USA;

(5) Johannes Blaschke, Lawrence Berkeley National Laboratory, Berkeley, CA, USA;

(6) Valentin Churavy, Massachusetts Institute of Technology, Cambridge, MA, USA;

(7) Jean-Michel Campin, Massachusetts Institute of Technology, Cambridge, MA, USA;

(8) Navid Constantinou, Australian National University, Canberra, ACT, Australia;

(9) Alan Edelman, Massachusetts Institute of Technology, Cambridge, MA, USA;

(10) John Marshall, Massachusetts Institute of Technology, Cambridge, MA, USA;

(11) Ali Ramadhan, Massachusetts Institute of Technology, Cambridge, MA, USA;

(12) Andre Souza, Massachusetts Institute of Technology, Cambridge, MA, USA;

(13) Raffaele Ferrari, Massachusetts Institute of Technology, Cambridge, MA, USA.

Abstract and 1 Justification

2 Performance Attributes

3 Overview of the Problem

4 Current State of the Art

5 Innovations

5.1 Starting from scratch with Julia

5.2 New numerical methods for finite volume fluid dynamics on the sphere

5.3 Optimization of ocean free surface dynamics for unprecedented GPU scalability

6 How performance was measured

7 Performance Results and 7.1 Scaling Results

7.2 Energy efficiency

8 Implications

9 Acknowledgments and References

6 How performance was measured

The Oceananigans model performance is estimated for two near-global ocean simulations with different domains: a realistic (R) domain and an aqua planet (AP) domain. Both domains span the entire longitudinal extent of the sphere and cover a latitude range of 75◦S to 75◦N.


The realistic (R) domain has realistic bathymetry and is forced by realistic surface momentum, heat, and salinity fluxes derived from the ECCO2 state estimate [31]. We simulate it at three resolutions:


OceananigansR12 with 1/12th degree horizontal resolution (∼7 km) and 48 vertical levels


OceananigansR24 with 1/24th degree horizontal resolution (∼3.4 km) and 100 vertical levels


OceananigansR48 with 1/48th degree horizontal resolution (∼1.7 km) and 100 vertical levels


Figure 3 shows surface vertical vorticity after a one-year integration of OceananigansR12 and OceananigansR48, both over the global ocean and in selected regions shown in greater detail. Both OceananigansR48 and OceananigansR12 exhibit macroscale turbulent ocean features that are currently unresolved by most IPCC-class models. The OceananigansR48 solution exhibits fronts, filaments, and other “submesoscale” vorticity features realized only a handful of times in global simulations.


The idealized OceananigansAP suite of simulations [13], which has idealized bathymetry and surface forcing that does not require interpolation to different resolutions, is used for weak scaling experiments. All OceananigansAP experiments have 100 vertical levels and two latitudinal ridges that divide the world ocean into two basins. We vary the horizontal resolution of OceananigansAP from 1/6th of a degree (∼14 km) to 1/196th of a degree (∼488 m).


None of our simulations require explicit horizontal diffusion of momentum or tracers, owing to the adaptive WENO advection scheme described in section 5.2. All simulations use a Richardson-number-based parameterization for vertical mixing due to unresolved shear and convective turbulence at 1–100 m scales.
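The quantity behind such a criterion is the gradient Richardson number, the ratio of stabilizing stratification to destabilizing vertical shear. The sketch below illustrates the general form only; the input values and critical threshold are textbook-style placeholders, not the coefficients of the parameterization used in Oceananigans.

```python
def richardson_number(N_squared, shear_squared, eps=1e-14):
    """Gradient Richardson number Ri = N^2 / (du/dz)^2: buoyancy frequency
    squared over vertical shear squared. `eps` guards against zero shear."""
    return N_squared / max(shear_squared, eps)

def is_shear_unstable(Ri, critical_Ri=0.25):
    """Shear-driven mixing is classically expected when Ri falls below a
    critical value of about 1/4 (illustrative threshold)."""
    return Ri < critical_Ri

# Placeholder values: weak stratification over strong shear gives Ri = 0.1.
Ri = richardson_number(N_squared=1e-5, shear_squared=1e-4)
print(Ri, is_shear_unstable(Ri))
```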


To assess the time-to-solution for each experiment in simulated years per day (SYPD), we measure the average wall clock time per time step. Wall clock time is sampled with NVIDIA Nsight Systems and recorded by the NVIDIA Tools Extension Library via the NVTX.jl Julia package.
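The conversion from measured wall clock time per time step to SYPD is a short calculation; the model time step in the example below is a hypothetical value, not one reported in the text.

```python
DAYS_PER_YEAR = 365.0

def sypd(dt_model_seconds, wall_seconds_per_step):
    """Simulated years per wall clock day (SYPD) from the model time step
    and the measured average wall clock time per time step."""
    # Simulated seconds advanced per wall clock second; since both use the
    # same day length, this also equals simulated days per wall clock day.
    speedup = dt_model_seconds / wall_seconds_per_step
    return speedup / DAYS_PER_YEAR

# Hypothetical example: a 180 s model time step advanced in 0.55 s of wall time.
print(round(sypd(180.0, 0.55), 2))  # -> 0.9
```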


To assess the efficiency of each solution in simulated years per megawatt-hour (SYPMWh), we combine SYPD with an estimate of the mean power draw over the duration of an experiment. On MIT Satori [2], which has 256 Nvidia V100s, we have access to precise, billing-grade power metering. For all simulations with Nvidia A100s, we estimate the power consumption P with equation (1), where D is the number of A100s and N is the number of nodes.


We further note that power estimates are available for LICOM3 and COSMO, but not for LLC4320 or Veros. To estimate the power consumption of LLC4320, we assume that each of its 1000 dual-CPU nodes draws 500 W. We estimate the power consumption of iHESP CESM [51] and HadGCM3 [37] as a percentage of the peak power consumption of their respective clusters. We use equation (1) to estimate Veros’ power consumption on 1 node with 16 A100s.
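Combining SYPD with a mean power estimate gives the energy-to-solution metric directly. The sketch below reproduces the LLC4320 power assumption stated above (1000 nodes at 500 W each); the SYPD value passed in is a placeholder, not a reported result.

```python
def estimated_power_watts(n_nodes, watts_per_node):
    """Cluster power estimate as node count times an assumed per-node draw."""
    return n_nodes * watts_per_node

def sypmwh(sypd, power_watts):
    """Simulated years per megawatt-hour: SYPD divided by the energy
    (in MWh) consumed during one wall clock day at the given power."""
    mwh_per_day = power_watts * 24.0 / 1e6  # watts * hours -> MWh
    return sypd / mwh_per_day

# LLC4320 assumption from the text: 1000 dual-CPU nodes at 500 W each.
power = estimated_power_watts(1000, 500.0)
print(power)  # -> 500000.0 W, i.e. 500 kW, i.e. 12 MWh per wall clock day
print(sypmwh(sypd=0.05, power_watts=power))  # placeholder SYPD value
```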


Figure 3: Vertical vorticity as simulated by OceananigansR12 (top left) and OceananigansR48 (bottom left) after a one-year integration, on September 1st. To the right, insets zoom in on particularly energetic current systems: the Agulhas and East Australian Currents. While major ocean currents with widths of 10–100 km are resolved in both simulations, the sharp density fronts and associated currents that develop at the ocean surface in winter at scales of 1–10 km (the ocean weather) are only resolved by OceananigansR48. On September 1 — spring in the southern hemisphere, fall in the northern hemisphere — such sharp frontal features populate the Southern Ocean but are suppressed in the north.


7 Performance Results

We report both scaling results via time-to-solution in SYPD and efficiency results via energy-to-solution in SYPMWh.

7.1 Scaling Results


Figure 4: Strong scaling tests for the realistic setups OceananigansR12 (1/12◦), OceananigansR24 (1/24◦), and OceananigansR48 (1/48◦). The left plot reports simulated years per wall clock day (SYPD), while the right plot reports wall clock milliseconds per time step. All results are averaged over 1500 time steps.



Realistic ocean simulations (Satori and Engaging clusters). We report strong scaling tests using the realistic global setup shown in figure 3 on two clusters: (i) the MIT Satori cluster [2], a high-performance Power9 system composed of 64 Power9 nodes, each hosting four Nvidia V100 GPUs with 32 GB of memory, and (ii) the MIT Engaging cluster, using 8 nodes that each host four NVLink-connected A100s with 80 GB of memory. The resulting wall clock time per time step, averaged over 1500 time steps, is presented in figure 4 for both single precision (FP32) and double precision (FP64) computations. On a single node, OceananigansR12 attains 0.9 SYPD in double precision and 1.4 SYPD in single precision, with a wall clock time per time step ranging from 330 to 550 ms. When increasing the number of nodes up to 16 (64 GPUs), the communication overhead grows, resulting in 12.4 SYPD in single precision and 7.75 SYPD in double precision. We measure a strong scaling efficiency of 52% in single precision and 55% in double precision over 64 GPUs: the computational workload (40 ms of wall clock time per time step) eventually becomes too short to completely mask the communication overhead.
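Strong scaling efficiency here is the fraction of ideal speedup retained when scaling out. Plugging in the rounded single precision SYPD figures quoted above (1.4 SYPD on 1 node, 12.4 SYPD on 16 nodes) gives a value close to the reported one; the small gap is attributable to the rounding of the SYPD numbers.

```python
def strong_scaling_efficiency(sypd_base, sypd_scaled, node_ratio):
    """Fraction of ideal speedup retained: ideal scaling would multiply
    SYPD by the node ratio, so efficiency = actual / ideal."""
    return sypd_scaled / (sypd_base * node_ratio)

# Rounded single precision figures from the text: 1 node -> 16 nodes.
eff = strong_scaling_efficiency(1.4, 12.4, 16)
print(round(100 * eff))  # -> 55, near the reported 52% (SYPD rounding)
```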


For higher-resolution, ocean weather-permitting simulations, the scaling is almost ideal across the range we investigate. For OceananigansR24 (FP64-V100) and OceananigansR48 (FP32-V100), we measure better-than-ideal scaling. This counterintuitive result is a product of improved load balance as the number of GPUs increases. In summary, we attain 1.94 SYPD on 120 V100 GPUs with a kilometer-scale resolution (OceananigansR24) and 0.33 SYPD with an ocean weather-resolving simulation (OceananigansR48). Finally, we have tested the OceananigansR48 setup on 144 Perlmutter nodes (576 A100 GPUs), reaching 0.95 SYPD. This is the first instance of a kilometer-scale ocean simulation achieving ∼1 SYPD. We have also tested the OceananigansR12 setup on 17 nodes, obtaining 9.9 SYPD (see fig. 5).


Aqua-planet simulation (Perlmutter cluster). We report weak scaling tests on the NERSC supercomputer Perlmutter, an HPE (Hewlett Packard Enterprise) Cray EX supercomputer whose GPU nodes each host four A100 GPUs with 40 GB of memory, linked through an NVLink3 interconnect. All weak scaling tests are performed using the OceananigansAP setup in double precision. We use two different per-GPU workloads (grids equivalent to global 1/12 and 1/6 of a degree resolution), progressively refining the global resolution as the number of GPUs increases while maintaining 100 vertical levels. As shown in figure 5, we obtain 100% weak scaling efficiency over the whole investigated range (1 to 196 nodes, 4 to 768 A100s).


Figure 5: Weak scaling tests performed in double precision with the OceananigansAP setup. Each GPU handles a grid equivalent to a global 1/6◦ resolution with 100 vertical layers. The weak scaling is performed up to a horizontal resolution of 1/168th of a degree (∼488 m), where we achieve 15 simulated days per wall clock day (1 year in roughly 25 days). The star marks the performance of OceananigansR48 (figure 3) on 144 Perlmutter GPU nodes. All results are averaged over 500 time steps.


This paper is available on arxiv under CC BY 4.0 DEED license.