The Proceedings of the PASC Conference (PASC21) are published in the Association for Computing Machinery's (ACM's) Digital Library. In recognition of the high quality of the PASC Conference papers track, the ACM continues to provide the proceedings as an Open Table of Contents (OpenTOC). This means that the definitive versions of PASC Conference papers are available to everyone at no charge to the author and without any pay-wall constraints for readers.

The OpenTOC for the PASC Conference is hosted on the ACM’s SIGHPC website. PASC papers can be accessed for free at: www.sighpc.org/for-our-community/acm-open-tocs.

The following papers will be presented as talks at PASC21.

A Smoothed Particle Hydrodynamics Mini-App for Exascale

Aurélien Cavelan, Rubén M. Cabezón, Michal Grabarczyk, and Florina M. Ciorba

Smoothed Particle Hydrodynamics (SPH) is a particle-based, meshfree, Lagrangian method used to simulate multidimensional fluids with arbitrary geometries, most commonly employed in astrophysics, cosmology, and computational fluid dynamics (CFD). It is expected that these computationally demanding numerical simulations will significantly benefit from the up-and-coming Exascale computing infrastructures, which will perform 10^18 FLOP/s. In this work, we review the status of a novel SPH-EXA mini-app, which is the result of an interdisciplinary co-design project between the fields of astrophysics, fluid dynamics and computer science, whose goal is to enable SPH simulations to run on Exascale systems. The SPH-EXA mini-app merges the main characteristics of three state-of-the-art parent SPH codes (namely ChaNGa, SPH-flow, SPHYNX) with state-of-the-art (parallel) programming, optimization, and parallelization methods. The proposed SPH-EXA mini-app is a C++14 lightweight and flexible header-only code with no external software dependencies. Parallelism is expressed via multiple programming models, which can be chosen at compilation time with or without accelerator support, for a hybrid process+thread+accelerator configuration. Strong- and weak-scaling experiments on a production supercomputer show that the SPH-EXA mini-app can be efficiently executed with up to 267 million particles and up to 65 billion particles in total on 2,048 hybrid CPU-GPU nodes.

Aurélien Cavelan, Rubén M. Cabezón, Michal Grabarczyk, and Florina M. Ciorba. 2020. A Smoothed Particle Hydrodynamics Mini-App for Exascale. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 11, 1–11. DOI:https://doi.org/10.1145/3394277.3401855

A Task-Based Distributed Parallel Sparsified Nested Dissection Algorithm

Leopold Cambier and Eric Darve

Sparsified nested dissection (spaND) is a fast scalable linear solver for sparse linear systems. It combines nested dissection and separator sparsification, leading to an algorithm with an O(N log N) complexity on many problems. In this work, we study the parallelization of spaND using TaskTorrent, a lightweight, distributed, task-based runtime in C++. This leads to a distributed version of spaND using a task-based runtime system. We explain how to adapt spaND's partitioning for parallel execution, how to increase concurrency using a simultaneous sparsification algorithm, and how to express the DAG using TaskTorrent. We then benchmark spaND on a few large problems. spaND exhibits good strong and weak scaling, efficiently using up to 9,000 cores when ranks grow slowly with the problem size.

Accelerating Parallel CFD Codes on Modern Vector Processors Using Blockettes

Anil Yildirim, Charles Mader, and Joaquim Martins

The performance and scalability of computational fluid dynamics (CFD) solvers are essential for many applications, including multidisciplinary design optimization. With the evolution of high-performance computing resources such as Intel's Knights Landing and Skylake architectures in the Stampede2 cluster, CFD solver performance can be improved by modifying how the core computations are performed while keeping the mathematical formulation unchanged. In this work, we introduce a cache-blocking method to improve memory-bound CFD codes that use structured grids. The overall idea is to split computational blocks into smaller, fixed-sized blockettes that are sufficiently small to completely fit into the available cache size per-processor on each architecture. We can fully take advantage of modern vector instruction sets such as AVX2 and AVX512 on these modern architectures with this approach. Using this method, we have achieved up to 3.27 times speedup in the core routines of the open-source CFD solver, ADflow.
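The loop structure behind cache blocking can be sketched as follows. This is a minimal illustration under stated assumptions, not code from ADflow: the five-point averaging stencil, the function names `smooth` and `smooth_blocked`, and the tile sizes are all hypothetical. Pure Python will not show the speedup; the sketch only shows how a structured-grid sweep is split into small fixed-size tiles while producing the same result as the plain sweep.

```python
def smooth(grid):
    """Reference sweep: five-point average over all interior cells."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return out


def smooth_blocked(grid, bi, bj):
    """Same sweep, but visiting the interior in tiles of at most bi x bj cells.

    bi and bj are hypothetical tile sizes; in a compiled code they would be
    chosen so that one tile's working set fits in cache.
    """
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i0 in range(1, n - 1, bi):            # loop over tile rows
        for j0 in range(1, m - 1, bj):        # loop over tile columns
            for i in range(i0, min(i0 + bi, n - 1)):
                for j in range(j0, min(j0 + bj, m - 1)):
                    out[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return out


# Small test grid whose dimensions are not multiples of the tile sizes.
grid = [[float((i * j) % 7) for j in range(13)] for i in range(11)]
reference = smooth(grid)
blocked = smooth_blocked(grid, 4, 8)
```

Because each cell is computed with the same arithmetic in both versions, only the traversal order changes, which is the sense in which the mathematical formulation stays unchanged.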

Accurate and Efficient Jones-Worland Spectral Transforms for Planetary Applications

Philippe Marti and Andrew Jackson

Spectral transforms between physical space and spectral space are needed for fluid dynamical calculations in the whole sphere, representative of a planetary core. In order to construct a representation that is everywhere smooth, regular and differentiable, special polynomials called Jones-Worland polynomials, based on a type of Jacobi polynomials, are used for the radial expansion, coupled to spherical harmonics in angular variables. We present an exact, efficient transform that is partly based on the FFT and which remains accurate in finite precision. Application is to high-resolution solutions of the Navier-Stokes equation, possibly coupled to the heat transfer and induction equations. Expected implementations would be in simulations with P^3 degrees of freedom, where P may be greater than 10^3. Memory use remains modest at high spatial resolution, indeed typically P times lower than competing algorithms based on quadrature.

Algorithm-Hardware Co-design of a Discontinuous Galerkin Shallow-Water Model for a Dataflow Architecture on FPGA

Tobias Kenter, Adesh Shambhu, Sara Faghih-Naini, and Vadym Aizinger

We present the first FPGA implementation of the full simulation pipeline of a shallow water code based on the discontinuous Galerkin method. Using OpenCL and following an algorithm-hardware co-design approach, the software reference is transformed into a dataflow architecture that can process a full mesh element per clock cycle. The novel projection approach on the algorithmic level complements the pipeline and memory optimizations in the hardware design. With this, the FPGA kernels for different polynomial orders outperform the CPU reference by 43x to 144x in a strong scaling benchmark scenario. A performance model can explain the measured FPGA performance of up to 717 GFLOPS accurately.

Aphros: High Performance Software for Multiphase Flows with Large Scale Bubble and Drop Clusters

Petr Karnakov, Fabian Wermelinger, Sergey Litvinov, and Petros Koumoutsakos

We present the high performance implementation of a new algorithm for simulating multiphase flows with bubbles and drops that do not coalesce. The algorithm is more efficient than the standard multi-marker volume-of-fluid method since the number of required fields does not depend on the number of bubbles. The capabilities of our methods are demonstrated on simulations of a foaming waterfall where we analyze the effects of coalescence prevention on the bubble size distribution and show how rising bubbles cluster up as foam on the water surface. Our open-source implementation enables high throughput simulations of multiphase flow, supports distributed as well as hybrid execution modes and scales efficiently on large compute systems.

Petr Karnakov, Fabian Wermelinger, Sergey Litvinov, and Petros Koumoutsakos. 2020. Aphros: High Performance Software for Multiphase Flows with Large Scale Bubble and Drop Clusters. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 12, 1–10. DOI:https://doi.org/10.1145/3394277.3401856

Automatic Generation of Efficient Linear Algebra Programs

Henrik Barthels, Christos Psarras, and Paolo Bientinesi

The level of abstraction at which application experts reason about linear algebra computations and the level of abstraction used by developers of high-performance numerical linear algebra libraries do not match. The former is conveniently captured by high-level languages and libraries such as Matlab and Eigen, while the latter expresses the kernels included in the BLAS and LAPACK libraries. Unfortunately, the translation from a high-level computation to an efficient sequence of kernels is a task, far from trivial, that requires extensive knowledge of both linear algebra and high-performance computing. Internally, almost all high-level languages and libraries use efficient kernels; however, the translation algorithms are too simplistic and thus lead to a suboptimal use of said kernels, with significant performance losses. In order to both achieve the productivity that comes with high-level languages, and make use of the efficiency of low level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem and produces as output an efficient sequence of calls to high-performance kernels. In 25 application problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10×.

Henrik Barthels, Christos Psarras, and Paolo Bientinesi. 2020. Automatic Generation of Efficient Linear Algebra Programs. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 1, 1–11. DOI:https://doi.org/10.1145/3394277.3401836

Benchmarking of state-of-the-art HPC Clusters with a Production CFD Code

Fabio Banchelli, Marta Garcia-Gasulla, Guillaume Houzeaux, and Filippo Mantovani

Computing technologies populating high-performance computing (HPC) clusters are getting more and more diverse, offering a wide range of architectural features. As a consequence, efficient programming of such platforms becomes a complex task. In this paper we provide a micro-benchmarking of three HPC clusters based on different CPU architectures, predominant in the Top500 ranking: x86, Armv8 and IBM Power9. On these platforms we study a production fluid-dynamics application leveraging different compiler technologies and micro-architectural features. We finally provide a scalability study on state-of-the-art HPC clusters. The two most relevant conclusions of our study are: i) Compiler development is critical for squeezing performance out of most recent technologies; ii) Micro-architectural features such as Single Instruction Multiple Data (SIMD) units and Simultaneous Multi-Threading (SMT) can impact the overall performance. However, a closer look shows that while SIMD is improving the performance of compute bound regions, SMT does not show a clear benefit on HPC workloads.

Fabio Banchelli, Marta Garcia-Gasulla, Guillaume Houzeaux, and Filippo Mantovani. 2020. Benchmarking of state-of-the-art HPC Clusters with a Production CFD Code. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 3, 1–11. DOI:https://doi.org/10.1145/3394277.3401847

Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers

David Brayford and Sofia Vallecorsa

There is an ever-increasing need for computational power to train complex artificial intelligence (AI) and machine learning (ML) models to tackle large scientific problems. High performance computing (HPC) resources are required to efficiently compute and scale complex models across tens of thousands of compute nodes. In this paper, we discuss the issues associated with the deployment of machine learning frameworks on large scale secure HPC systems and how we successfully deployed a standard machine learning framework on a secure large scale HPC production system, to train a complex three-dimensional convolutional GAN (3DGAN), with petaflop performance. 3DGAN is an example from the high energy physics domain, designed to simulate the energy pattern produced by showers of secondary particles inside a particle detector on various HPC systems.

David Brayford and Sofia Vallecorsa. 2020. Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 6, 1–8. DOI:https://doi.org/10.1145/3394277.3401850

Ensuring Statistical Reproducibility of Ocean Model Simulations in the Age of Hybrid Computing

Salil Mahajan

Novel high performance computing systems that feature hybrid architectures require large scale code refactoring to unravel underlying exploitable parallelism. Such redesign can often be accompanied by machine-precision changes as the order of computation cannot always be maintained. For chaotic systems like climate models, these round-off level differences can grow rapidly. Systematic errors may also manifest initially as machine-precision differences. Isolating genuine round-off level differences from such errors remains a challenge. Here, we apply two-sample equality of distribution tests to evaluate statistical reproducibility of the ocean model component of the US Department of Energy’s Energy Exascale Earth System Model (E3SM). A 2-year control simulation ensemble is compared to a modified ensemble as a test case, after a known non-bit-for-bit change in a model component is introduced, to evaluate the null hypothesis that the two ensembles are statistically indistinguishable. To quantify the false negative rates of these tests, we conduct a formal power analysis using a targeted suite of short simulation ensembles. The ensemble suite contains several perturbed ensembles, each with a progressively different climate than the baseline ensemble, obtained by perturbing the magnitude of a single model tuning parameter, the Gent and McWilliams κ, in a controlled manner. The null hypothesis is evaluated for each of the perturbed ensembles using these tests. The power analysis informs on the detection limits of the tests for a given ensemble size, allowing model developers to evaluate the impact of an introduced non-bit-for-bit change to the model.

Evaluating the Influence of Hemorheological Parameters on Circulating Tumor Cell Trajectory and Simulation Time

Sayan Roychowdhury, John Gounley, and Amanda Randles

Extravasation of circulating tumor cells (CTCs) occurs primarily in the microvasculature, where flow and cell interactions significantly affect the blood rheology. Capturing cell trajectory at this scale requires the coupling of several interaction models, leading to increased computational cost that scales as more cells are added or the domain size is increased. In this work, we focus on micro-scale vessels and study the influence of certain hemorheological factors, including the presence of red blood cell aggregation, hematocrit level, microvessel size, and shear rate, on the trajectory of a circulating tumor cell. We determine which of the aforementioned factors significantly affect CTC motion and identify those which can potentially be disregarded, thus reducing simulation time. We measure the effect of these elements by studying the radial CTC movement and runtime at various combinations of these hemorheological parameters. To accurately capture blood flow dynamics and single cell movement, we perform high-fidelity hemodynamic simulations at a sub-micron resolution using our in-house fluid dynamics solver, HARVEY. We find that increasing hematocrit increases the likelihood of tumor cell margination, which is exacerbated by the presence of red blood cell aggregation. As microvessel diameter increases, there is no major CTC movement towards the wall; however, including aggregation causes the CTC to marginate quicker as the vessel size increases. Finally, as the shear rate is increased, the presence of aggregation has a diminished effect on tumor cell margination.

Sayan Roychowdhury, John Gounley, and Amanda Randles. 2020. Evaluating the Influence of Hemorheological Parameters on Circulating Tumor Cell Trajectory and Simulation Time. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 4, 1–10. DOI:https://doi.org/10.1145/3394277.3401848

Eventify: Event-Based Task Parallelism for Strong Scaling

David Haensel, Laura Morgenstern, Andreas Beckmann, Ivo Kabadshow, and Holger Dachsel

Today's processors become fatter, not faster. However, the exploitation of these massively parallel compute resources remains a challenge for many traditional HPC applications regarding scalability, portability and programmability. To tackle this challenge, several parallel programming approaches such as loop parallelism and task parallelism are researched in the form of languages, libraries and frameworks. Task parallelism as provided by OpenMP, HPX, StarPU, Charm++ and Kokkos is the most promising approach to overcome the challenges of ever increasing parallelism. The aforementioned parallel programming technologies enable scalability for a broad range of algorithms with coarse-grained tasks, e.g. in linear algebra and classical N-body simulation. However, they do not fully address the performance bottlenecks of algorithms with fine-grained tasks and the resultant large task graphs. Additionally, we experienced the description of large task graphs to be cumbersome with the common approach of providing in-, out- and inout-dependencies. We introduce event-based task parallelism to solve the performance and programmability issues for algorithms that exhibit fine-grained task parallelism and contain repetitive task patterns. With user-defined event lists, the approach provides a more convenient and compact way to describe large task graphs. Furthermore, we show how these event lists are processed by a task engine that reuses user-defined, algorithmic data structures. As a use case, we describe the implementation of a fast multipole method for molecular dynamics with event-based task parallelism. The performance analysis reveals that the event-based implementation is 52% faster than a classical loop-parallel implementation with OpenMP.

David Haensel, Laura Morgenstern, Andreas Beckmann, Ivo Kabadshow, and Holger Dachsel. 2020. Eventify: Event-Based Task Parallelism for Strong Scaling. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 14, 1–10. DOI:https://doi.org/10.1145/3394277.3401858

Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications

Qinglei Cao, Yu Pei, Kadir Akbudak, Aleksandr Mikhalev, George Bosilca, Hatem Ltaief, David Keyes, and Jack Dongarra

Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems, which involves a Cholesky factorization of a symmetric positive-definite covariance matrix, a demanding dense factorization in terms of memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational requirement by exploiting the data sparsity structure of the matrix off-diagonal tiles by means of low-rank approximations; and, at the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions of environmental applications.

Qinglei Cao, Yu Pei, Kadir Akbudak, Aleksandr Mikhalev, George Bosilca, Hatem Ltaief, David Keyes, and Jack Dongarra. 2020. Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 2, 1–11. DOI:https://doi.org/10.1145/3394277.3401846

Fast Scalable Implicit Solver with Convergence of Equation-Based Modeling and Data-Driven Learning: Earthquake City Simulation on Low-Order Unstructured Finite Element

Tsuyoshi Ichimura, Kohei Fujita, Kentaro Koyama, Yuma Kikuchi, Ryota Kusakabe, Kazuo Minami, Hikaru Inoue, Seiya Nishizawa, Miwako Tsuji, Tatsuo Nishiki, Muneo Hori, Lalith Maddegedara, and Naonori Ueda

We developed a new approach in converging equation-based modeling and data-driven learning on high-performance computing resources to accelerate physics-based earthquake simulations. Here, data-driven learning based on data generated while conducting equation-based modeling was used to accelerate the convergence process of an implicit low-order unstructured finite-element solver. This process involved a suitable combination of data-driven learning for estimating high-frequency components and coarsened equation-based models for estimating low-frequency components of the problem. The developed solver achieved a 12.8-fold speedup over the state-of-the-art solver with a 96.4% size-up scalability up to 24,576 nodes (98,304 MPI processes x 12 OpenMP threads = 1,179,648 CPU cores) of Fugaku with 126,581,788,413 degrees-of-freedom, leading to solving a huge city earthquake shaking analysis in a 10.1-fold shorter time than the previous state-of-the-art solver. Furthermore, to show that the developed method attains high performance on a variety of systems with small implementation costs, we ported the developed method to recent GPU systems by use of directive-based methods (OpenACC). Equation-based modeling and data-driven learning have utterly different characteristics, and hence they are rarely combined. The developed approach of combining them is effective, and the remarkable results mentioned above are achieved.

In-Situ Assessment of Device-Side Compute Work for Dynamic Load Balancing in a GPU-Accelerated PIC Code

Michael Rowan, Axel Huebl, Kevin Gott, Jack Deslippe, Maxence Thévenet, Rémi Lehe, and Jean-Luc Vay

Maintaining computational load balance is important to the performant behavior of codes which operate under a distributed computing model. This is especially true for GPU architectures, which can suffer from memory oversubscription if improperly load balanced. We present enhancements to traditional load balancing approaches and explicitly target GPU architectures, exploring the resulting performance. A key component of our enhancements is the introduction of several GPU-amenable strategies for assessing compute work. These strategies are implemented and benchmarked to find the best data collection methodology for in-situ assessment of GPU compute work. For the fully kinetic particle-in-cell code WarpX, which supports MPI+CUDA parallelism, we investigate the performance of the improved dynamic load balancing via a strong scaling-based performance model and show that, for a laser-ion acceleration test problem run with up to 6144 GPUs on Summit, the enhanced dynamic load balancing achieves 62% to 74% (88% when running on 6 GPUs) of the theoretically predicted maximum speedup; for the 96-GPU case, we find that dynamic load balancing improves performance relative to baselines without load balancing (3.8x speedup) and with static load balancing (1.2x speedup). Our results provide important insights into dynamic load balancing and performance assessment, and are particularly relevant in the context of distributed memory applications run on GPUs.

k-Dispatch: A Workflow Management System for the Automated Execution of Biomedical Ultrasound Simulations on Remote Computing Resources

Marta Jaros, Bradley E. Treeby, Panayiotis Georgiou, and Jiri Jaros

Therapeutic ultrasound is increasingly being used for applications in oncology, drug delivery, and neurostimulation. In order to adapt the treatment procedures to patient needs, complex physical models have to be evaluated prior to the treatment. These models, however, require intensive computations that can only be satisfied by cloud and HPC facilities. Unfortunately, employing these facilities and executing the required computations is not straightforward even for experienced developers.

k-Dispatch is a novel workflow management system aimed at modelling biomedical ultrasound procedures using the open-source k-Wave acoustic toolbox. It allows ultrasound procedures to be uploaded with a single click and provides a notification when the result is ready for download. Inside k-Dispatch, there is a complex workflow management system which decodes the workflow graph, optimizes the workflow execution parameters, submits jobs to remote computing facilities, monitors their progress, and logs the consumed core hours. In this paper, the architecture and deployment of k-Dispatch are discussed, including the approach used for workflow optimization. A key innovation is the use of previous performance data to automatically select the utilised hardware and execution parameters. A review of related work is also given, including workflow management systems, batch schedulers, and cluster simulators.

Marta Jaros, Bradley E. Treeby, Panayiotis Georgiou, and Jiri Jaros. 2020. K-Dispatch: A Workflow Management System for the Automated Execution of Biomedical Ultrasound Simulations on Remote Computing Resources. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 10, 1–10. DOI:https://doi.org/10.1145/3394277.3401854

Load Balancing in Large Scale Bayesian Inference

Daniel Wälchli, Sergio M. Martin, Athena Economides, Lucas Amoudruz, George Arampatzis, Xin Bian, and Petros Koumoutsakos

We present a novel strategy to improve load balancing for large scale Bayesian inference problems. Load imbalance can be particularly destructive in generation based uncertainty quantification (UQ) methods since all compute nodes in a large-scale allocation have to synchronize after every generation and therefore remain in an idle state until the longest model evaluation finishes. Our strategy relies on the concurrent scheduling of independent Bayesian inference experiments while sharing a group of worker nodes, reducing the destructive effects of workload imbalance in population-based sampling methods.

To demonstrate the efficiency of our method, we infer parameters of a red blood cell (RBC) model. We perform a data-driven calibration of the RBC's membrane viscosity by applying hierarchical Bayesian inference methods. To this end, we employ a computational model to simulate the relaxation of an initially stretched RBC towards its equilibrium state. The results of this work advance upon the current state of the art towards realistic blood flow simulations by providing inferred parameters for the RBC membrane viscosity.

We show that our strategy achieves a notable reduction in imbalance and significantly improves effective node usage on 512 nodes of the CSCS Piz Daint supercomputer. Our results show that, by enabling multiple independent sampling experiments to run concurrently on a given allocation of supercomputer nodes, our method sustains a high computational efficiency in a large-scale supercomputing setting.

Daniel Wälchli, Sergio M. Martin, Athena Economides, Lucas Amoudruz, George Arampatzis, Xin Bian, and Petros Koumoutsakos. 2020. Load Balancing in Large Scale Bayesian Inference. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 5, 1–12. DOI:https://doi.org/10.1145/3394277.3401849

Massive Scaling of MASSIF: Algorithm Development and Analysis for Simulation on GPUs

Anuva Kulkarni, Jelena Kovačević, and Franz Franchetti

Micromechanical Analysis of Stress-Strain Inhomogeneities with Fourier transforms (MASSIF) is a large-scale Fortran-based differential equation solver used to study local stresses and strains in materials. Due to its prohibitive memory requirements, it is extremely difficult to port the code to GPUs with small on-device memory. In this work, we present an algorithm design that uses domain decomposition with approximate convolution, which reduces memory footprint to make the MASSIF simulation feasible on distributed GPU systems. A first-order performance model of our method estimates that compression and multi-resolution sampling strategies can enable domain computation within GPU memory constraints for 3D grids larger than those simulated by the current state-of-the-art Fortran MPI implementation. The model analysis also provides an insight into design requirements for further scalability. Lastly, we discuss the extension of our method to irregular domain decomposition and challenges to be tackled in the future.

Anuva Kulkarni, Jelena Kovačević, and Franz Franchetti. 2020. Massive Scaling of MASSIF: Algorithm Development and Analysis for Simulation on GPUs. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 13, 1–10. DOI:https://doi.org/10.1145/3394277.3401857

Memory Reduction Using a Ring Abstraction over GPU RDMA for Distributed Quantum Monte Carlo Solver

Weile Wei, Eduardo D’Azevedo, Kevin Huck, Arghya Chatterjee, Oscar Hernandez, and Hartmut Kaiser

Scientific applications that run on leadership computing facilities often face the challenge of being unable to fit leading science cases onto accelerator devices due to memory constraints (memory-bound applications). In this work, the authors studied one such US Department of Energy mission-critical condensed matter physics application, Dynamical Cluster Approximation (DCA++), and this paper discusses how device memory-bound challenges were successfully reduced by proposing an effective “all-to-all” communication method—a ring communication algorithm. This implementation takes advantage of acceleration on GPUs and remote direct memory access for fast data exchange between GPUs. Additionally, the ring algorithm was optimized with sub-ring communicators and multi-threaded support to further reduce communication overhead and expose more concurrency, respectively. The computation and communication were also profiled by using the Autonomic Performance Environment for Exascale (APEX) profiling tool, and this paper discusses the performance trade-off for the ring algorithm implementation. The memory analysis on the ring algorithm shows that the allocation size for the authors’ most memory-intensive data structure per GPU is now reduced to 1/p of the original size, where p is the number of GPUs in the ring communicator. The communication analysis suggests that the distributed Quantum Monte Carlo execution time grows linearly as sub-ring size increases, and the cost of messages passing through the network interface connector could be a limiting factor.
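The ring idea described above can be sketched in a few lines. This is a minimal single-process simulation under stated assumptions, not the DCA++ implementation: the function name `ring_accumulate` and the per-rank sum (a stand-in for whatever each GPU computes from a chunk) are hypothetical, and list rotation stands in for the GPU-to-GPU transfers. It shows that each of the p ranks keeps only one chunk (1/p of the data) resident at a time, yet sees every chunk after p - 1 forwarding steps.

```python
def ring_accumulate(chunks):
    """Simulate p ranks in a ring, each owning one chunk of the full data.

    At every step each rank forwards the chunk it currently holds to its
    right-hand neighbour and processes the chunk it receives from the left.
    A real implementation would also need a receive buffer per rank.
    """
    p = len(chunks)
    held = list(chunks)                 # chunk currently resident on each rank
    totals = [sum(c) for c in held]     # each rank starts with its own chunk
    for _ in range(p - 1):
        # Rank r receives the chunk previously held by rank (r - 1) % p.
        held = [held[(r - 1) % p] for r in range(p)]
        for r in range(p):
            totals[r] += sum(held[r])   # stand-in for the per-chunk computation
    return totals


# Four hypothetical ranks, each owning a quarter of the data.
chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
totals = ring_accumulate(chunks)
```

In this sketch the number of forwarding steps grows with the ring size, which is consistent with the abstract's observation that execution time grows as the sub-ring size increases.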

Performance Evaluation of a Two-Dimensional Flood Model on Heterogeneous High-Performance Computing Architectures (Best Paper Award for PASC20)

Md Bulbul Sharif, Sheikh K. Ghafoor, Thomas M. Hines, Mario Morales-Hernández, Katherine J. Evans, Shih-Chieh Kao, Alfred J. Kalyanapu, Tigstu T. Dullo, and Sudershan Gangrade

This paper describes the implementation of a two-dimensional hydrodynamic flood model with two different numerical schemes on heterogeneous high-performance computing architectures. Both schemes were able to solve the nonlinear hyperbolic shallow water equations using an explicit upwind first-order approach on finite differences and finite volumes, respectively, and were implemented using MPI and CUDA. Four different test cases were simulated on the Summit supercomputer at Oak Ridge National Laboratory. Both numerical schemes scaled up to 128 nodes (768 GPUs) with a maximum speedup of 98.2x over 1 GPU. The lowest run time for the 10-day Hurricane Harvey event simulation at 5-meter resolution (272 million grid cells) was 50 minutes. GPUDirect communication proved to be more convenient than the standard communication strategy. Both strong and weak scaling are shown.

Md Bulbul Sharif, Sheikh K. Ghafoor, Thomas M. Hines, Mario Morales-Hernández, Katherine J. Evans, Shih-Chieh Kao, Alfred J. Kalyanapu, Tigstu T. Dullo, and Sudershan Gangrade. 2020. Performance Evaluation of a Two-Dimensional Flood Model on Heterogeneous High-Performance Computing Architectures. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 8, 1–9. DOI:https://doi.org/10.1145/3394277.3401852

Performance Optimization and Load-Balancing Modeling for Superparametrization by 3D LES

Gijs van den Oord, Maria Chertova, Fredrik Jansson, Inti Pelupessy, Anne Pier Siebesma, and Daan Crommelin

In order to eliminate climate uncertainty with respect to cloud and convection parametrizations, superparametrization (SP) has emerged as one of the possible ways forward. We have implemented (regional) superparametrization of the ECMWF weather model OpenIFS by cloud-resolving, three-dimensional large-eddy simulations. This setup contains a two-way coupling between a global meteorological model that resolves large-scale dynamics and many local instances of the Dutch Atmospheric Large Eddy Simulation (DALES), which resolve cloud and boundary layer physics. The model is currently prohibitively expensive to run over climate or even seasonal time scales, and a global SP requires the allocation of millions of cores. In this paper, we study the performance and scaling behavior of the LES models and the coupling code and present our implemented optimizations. We mimic the observed load imbalance with a simple performance model and present strategies to improve hardware utilization in order to assess the feasibility of a world-covering superparametrization. We conclude that (quasi-)dynamical load-balancing can significantly reduce the runtime for such large-scale systems with wide variability in LES time-stepping speeds.
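The effect of load imbalance among LES instances can be illustrated with a toy model. This is a minimal sketch under stated assumptions and is not the authors' performance model: it assumes each LES instance has a known cost per coupling step, that a group of instances shares one worker, and that the global model waits for the slowest worker. The function names and cost values are hypothetical, and the greedy longest-first packing stands in for a (quasi-)dynamic strategy.

```python
import heapq


def makespan_static(costs, workers):
    """Round-robin assignment that ignores the measured costs."""
    loads = [0.0] * workers
    for index, cost in enumerate(costs):
        loads[index % workers] += cost
    # The coupled step finishes when the most loaded worker finishes.
    return max(loads)


def makespan_greedy(costs, workers):
    """Assign the most expensive instances first to the least loaded worker."""
    heap = [(0.0, worker) for worker in range(workers)]
    heapq.heapify(heap)
    for cost in sorted(costs, reverse=True):
        load, worker = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, worker))
    return max(load for load, _ in heap)


# Hypothetical per-step costs: two slow (cloudy) columns among fast ones.
costs = [8.0, 1.0, 1.0, 1.0, 7.0, 1.0, 1.0, 1.0]
static_time = makespan_static(costs, workers=4)  # 15.0
greedy_time = makespan_greedy(costs, workers=4)  # 8.0
```

In this example the static layout places both slow instances on the same worker, so the step takes 15 cost units, whereas the cost-aware packing is limited only by the single slowest instance at 8 units.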

Predictive, Reactive and Replication-based Load Balancing of Tasks in Chameleon and sam(oa)²

Philipp Samfass, Jannis Klinkenberg, Minh Thanh Chung, and Michael Bader

Increasingly complex hardware architectures as well as numerical algorithms make balancing load in parallel numerical software for adaptive mesh refinement an inherently difficult task, especially if variability of system components and unpredictability of execution time come into play. Yet, traditional predictive load balancing strategies are largely based on cost models that aim to predict the execution time of computational tasks. To address this fundamental weakness, we present a novel reactive load balancing approach in distributed memory for MPI+OpenMP parallel applications that is based on keeping tasks speculatively replicated on multiple MPI processes. Replicated tasks are scheduled fully reactively without the need for a predictive cost model. Task cancelation mechanisms help to keep the overhead of replication minimal by avoiding redundant computation of replicated tasks. We implemented our approach in the Chameleon library for reactive load balancing, building upon previous work on reactive task migration. Our experiments in the parallel dynamic adaptive mesh refinement software sam(oa)² demonstrate performance improvements in the presence of wrong cost models and artificially introduced noise to simulate imbalances coming from hardware variability.

Progress Towards Accelerating the Unified Model on Hybrid Multi-Core Systems

Wei Zhang, Min Xu, Katherine Evans, Matthew Norman, Mario Morales-Hernandez, Salil Mahajan, Adrian Hill, James Manners, Ben Shipway, and Christopher Maynard

The cloud microphysics scheme, CASIM, and the radiation scheme, SOCRATES, are the two computationally intensive parts within the Met Office’s Unified Model (UM). This study enables CASIM and SOCRATES to use accelerated multi-core systems for optimal computational performance of the UM. Using profiling to guide our efforts, we refactored the code for optimal threading and kernel arrangement and implemented OpenACC directives manually or through the CLAW source-to-source translator. Initial porting results achieved 10.02x and 9.25x speedup in CASIM and SOCRATES, respectively, on 1 GPU compared with 1 CPU core. A granular performance analysis of the strategy and its bottlenecks is discussed. These improvements will enable the UM to run on heterogeneous computers, and a path forward for further improvements is provided.

Refactoring the MPS/University of Chicago Radiative MHD (MURaM) Model for GPU/CPU Performance Portability Using OpenACC Directives

Eric Wright, Damien Przybylski, Cena Miller, Supreeth Suresh, Matthias Rempel, Shiquan Su, Richard Loft, and Sunita Chandrasekaran

The MURaM (Max Planck University of Chicago Radiative MHD) code is a solar atmosphere radiative MHD model that has been broadly applied to solar phenomena ranging from quiet to active sun, including eruptive events such as flares and coronal mass ejections. The treatment of physics is sufficiently realistic to allow for the synthesis of emission from visible light to extreme UV and X-rays, which is critical for a detailed comparison with available and future multi-wavelength observations. This component relies critically on the radiation transport solver (RTS) of MURaM, the most computationally intensive component of the code. The benefits of accelerating RTS are manifold: a faster RTS allows for the regular use of the more expensive multi-band radiation transport needed for comparison with observations, and this will pave the way for the acceleration of ongoing improvements in RTS that are critical for simulations of the solar chromosphere. We present challenges and strategies to accelerate a multi-physics, multi-band MURaM using a directive-based programming model, OpenACC, in order to maintain a single source code across CPUs and GPUs. Results for a $288^3$ test problem show that MURaM with the optimized RTS routine achieves 1.73x speedup using a single NVIDIA V100 GPU over a fully subscribed 40-core Intel Skylake CPU node, and with respect to the number of simulation points (in millions) per second, a single NVIDIA V100 GPU is equivalent to 69 Skylake cores. We also measure parallel performance on up to 96 GPUs and present weak and strong scaling results.
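The two performance figures in this abstract appear consistent with each other if throughput is assumed to scale linearly with the number of CPU cores. The short check below is a sketch under that assumption only; the helper name `equivalent_cores` is hypothetical, and the abstract does not state how the 69-core figure was derived.

```python
def equivalent_cores(speedup_over_node, cores_per_node):
    """CPU cores matching one GPU, assuming linear scaling across cores."""
    return speedup_over_node * cores_per_node


# 1.73x over a fully subscribed 40-core node, as quoted in the abstract.
cores = equivalent_cores(1.73, 40)
print(round(cores))  # prints 69
```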

Scalable HPC & AI Infrastructure for COVID-19 Therapeutics

Shantenu Jha

COVID-19 has claimed more than 2.4M lives and resulted in over 125M infections. There is an urgent need to identify drugs that can inhibit SARS-CoV-2. We discuss innovations in computational infrastructure and methods that are accelerating and advancing drug design. Specifically, we describe several methods that integrate artificial intelligence and simulation-based approaches, and the design of computational infrastructure to support these methods at scale. We discuss their implementation, characterize their performance, and highlight science advances that these capabilities have enabled.

Simulation of Droplet Dispersion in COVID-19 Type Pandemics on Fugaku

Rahul Bale, ChungGang Li, Masashi Yamakawa, Akiyoshi Iida, Ryoichi Kurose, and Makoto Tsubokura

Transmission of infectious respiratory diseases through airborne dispersion of viruses poses a great risk to public health. In several major diseases, one of the main modes of transmission is through respiratory droplets. Virus-laden respiratory droplets and aerosols can be generated during coughing, sneezing and speaking. These droplets and aerosols can remain suspended in air and be transported by airflow, posing a risk of infection to individuals who might come in contact with them. With this background, in this work, we present a numerical framework for simulation of dispersion of respiratory sputum droplets using implicit large-eddy simulations. A combination of a discrete Lagrangian droplet model and a fully compressible Navier-Stokes flow solver is employed in this study. The method is applied to analyze cases such as droplet dispersion during speech and cough under different environmental settings. Furthermore, the performance of the numerical framework is evaluated through strong and weak scaling analysis.

Solving DWF Dirac Equation Using Multi-splitting Preconditioned Conjugate Gradient with Tensor Cores on NVIDIA GPUs

Jiqun Tu, M. A. Clark, Chulwoo Jung, and Robert Mawhinney

We show that using the multi-splitting algorithm as a preconditioner for the domain wall Dirac linear operator, arising in lattice QCD, effectively reduces the inter-node communication cost, at the expense of performing more on-node floating point and memory operations. Correctly including the boundary snake terms, the preconditioner is implemented in the QUDA framework, where it is found that utilizing kernel fusion and the tensor cores on NVIDIA GPUs is necessary to achieve a sufficiently performant preconditioner. A reduced-dimension (reduced-$L_s$) strategy is also proposed and tested for the preconditioner. We find the method achieves lower time to solution than regular CG at high node count despite the additional local computational requirements from the preconditioner. This method could be useful for supercomputers with more on-node flops and memory bandwidth than inter-node communication bandwidth.

Stream-AI-MD: Streaming AI-driven Adaptive Molecular Simulations for Heterogeneous Computing Platforms

Alexander Brace, Michael Salim, Vishal Subbiah, Heng Ma, Murali Emani, Anda Trifan, Austin Clyde, Corey Adams, Thomas Uram, Hyunseung Yoo, Andrew Hock, Jessica Liu, Venkatram Vishwanath, and Arvind Ramanathan

Emerging hardware tailored for artificial intelligence (AI) and machine learning (ML) methods provides novel means to couple them with traditional high performance computing (HPC) workflows involving multi-scale molecular dynamics (MD) simulations. We propose Stream-AI-MD, a novel instance of applying deep learning methods to drive adaptive MD simulation campaigns in a streaming manner. We leverage the ability to run ensemble MD simulations on GPU clusters, while the data from atomistic MD simulations are streamed continuously to AI/ML approaches to guide the conformational search in a biophysically meaningful manner on a wafer-scale AI accelerator. We demonstrate the efficacy of Stream-AI-MD simulations for two scientific use-cases: (1) folding a small prototypical protein, namely ββα-fold (BBA) FSD-EY and (2) understanding protein-protein interaction (PPI) within the SARS-CoV-2 proteome between two proteins, nsp16 and nsp10. We show that Stream-AI-MD simulations can improve time-to-solution by ~50x for BBA protein folding. Further, we also discuss performance trade-offs involved in implementing AI-coupled HPC workflows on heterogeneous computing architectures.

Urgent Supercomputing of Earthquakes: Use Case for Civil Protection

Josep de la Puente, Juan Esteban Rodriguez, Marisol Monterrubio-Velasco, Otilio Rojas, and Arnau Folch

Deadly earthquakes are events that are unpredictable, relatively rare and have a huge impact upon the lives of those who suffer their consequences. Furthermore, each earthquake has specific characteristics (location, magnitude, directivity) which, combined with local amplification and de-amplification effects, make its outcome very singular. Empirical relations are the main methodology used to make an early assessment of an earthquake's impact. Nevertheless, the lack of sufficient data records for large events makes such approaches very imprecise. Physics-based simulators, on the other hand, are powerful tools that provide highly accurate shaking information. However, physical simulations require considerable computational resources, a detailed geological model, and accurate earthquake source information.

A better early assessment of the impact of earthquakes implies both technical and scientific challenges. We propose a novel HPC-based urgent seismic simulation workflow, hereafter referred to as Urgent Computing Integrated Services for EarthQuakes (UCIS4EQ), which can deliver, potentially, much more accurate short-time reports of the consequences of moderate to large earthquakes. UCIS4EQ is composed of four subsystems that are deployed as services and connected by means of a workflow manager. This paper describes those components and their functionality. The main objective of UCIS4EQ is to produce ground-shaking maps and other potentially useful information to civil protection agencies. The first demonstrator will be deployed in the framework of the Center of Excellence for Exascale in Solid Earth (ChEESE, cheese.coe.eu, last access: 12 Feb. 2020).

Josep de la Puente, Juan Esteban Rodriguez, Marisol Monterrubio-Velasco, Otilio Rojas, and Arnau Folch. 2020. Urgent Supercomputing of Earthquakes: Use Case for Civil Protection. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 9, 1–8. DOI:https://doi.org/10.1145/3394277.3401853

X-Composer: Enabling Cross-Environments In-Situ Workflows between HPC and Cloud

Feng Li, Dali Wang, Yan Feng, and Fengguang Song

As large-scale scientific simulations and big data analyses become more popular, it is increasingly expensive to store huge amounts of raw simulation results to perform post-analysis. To minimize the expensive data I/O, “in-situ” analysis is a promising approach, where data analysis applications analyze the simulation-generated data on the fly without storing it first. However, it is challenging to organize, transform, and transport data at scale between two semantically different ecosystems due to their distinct software and hardware differences. To tackle these challenges, we design and implement the X-Composer framework that bridges cross-ecosystem applications to form an “in-situ” scientific workflow by performing data filtering, aggregation, and format conversions. X-Composer reorganizes simulation data as continuous data streams and feeds them seamlessly into the Cloud-based stream processing services to minimize I/O overheads. For evaluation, we use X-Composer to set up and execute a cross-ecosystem workflow, which consists of a parallel Computational Fluid Dynamics simulation running on HPC, and a distributed Dynamic Mode Decomposition analysis application running on Cloud. Our experimental results show that X-Composer can seamlessly couple HPC and Big Data jobs in their own native environments, achieve good scalability, and provide high-fidelity analytics for ongoing simulations in real-time.