PASC'19

PASC '19: Proceedings of the Platform for Advanced Scientific Computing Conference

  • Sam Hatfield
  • Matthew Chantry
  • Peter Düben
  • Tim Palmer

The next generation of weather and climate models will have an unprecedented level of resolution and model complexity, and running these models efficiently will require taking advantage of future supercomputers and heterogeneous hardware.

In this paper, we investigate the use of mixed-precision hardware that supports floating-point operations at double-, single- and half-precision. In particular, we investigate the potential use of the NVIDIA Tensor Core, a mixed-precision matrix-matrix multiplier mainly developed for use in deep learning, to accelerate the calculation of the Legendre transforms in the Integrated Forecasting System (IFS), one of the leading global weather forecast models. In the IFS, the Legendre transform is one of the most expensive model components and dominates the computational cost for simulations at a very high resolution.

We investigate the impact of mixed-precision arithmetic in IFS simulations of operational complexity through software emulation. Through a targeted but minimal use of double-precision arithmetic we are able to use either half-precision arithmetic or mixed half/single-precision arithmetic for almost all of the calculations in the Legendre transform without affecting forecast skill.
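
As a rough illustration of the kind of arithmetic discussed above, the following sketch emulates a matrix product whose inputs are rounded to half precision while products and sums are accumulated in single precision, which is the usual description of how a Tensor Core operates. It is not the emulator used in the paper; the matrix sizes, the random test data and the function name `mixed_precision_matmul` are assumptions made for this example.

```python
import numpy as np

def mixed_precision_matmul(a, b):
    """Emulate a product with half-precision inputs and single-precision accumulation."""
    # Round the inputs to half precision (assumed storage format).
    a16 = a.astype(np.float16)
    b16 = b.astype(np.float16)
    # Promote back to float32 so that products and sums are carried in single precision.
    return a16.astype(np.float32) @ b16.astype(np.float32)

# Hypothetical test data standing in for a transform matrix and a field.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128))
b = rng.standard_normal((128, 32))

reference = a @ b  # double-precision reference
mixed = mixed_precision_matmul(a, b)
single = a.astype(np.float32) @ b.astype(np.float32)

def rel_err(x):
    # Relative error in the Frobenius norm against the double-precision result.
    return np.linalg.norm(x - reference) / np.linalg.norm(reference)

err_mixed = rel_err(mixed)
err_single = rel_err(single)
print(f"mixed half/single relative error: {err_mixed:.2e}")
print(f"pure single relative error:       {err_single:.2e}")
```

Comparing the two errors gives a feel for how much accuracy is lost by rounding the inputs to half precision, which is the quantity a forecast-skill study would then need to weigh against the speed-up.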

  • Alp Dener
  • Adam Denchfield
  • Todd Munson

Nonlinear conjugate gradient (NCG) methods can generate search directions using only first-order information and a few dot products, making them attractive algorithms for solving large-scale optimization problems. However, even the most modern NCG methods can require large numbers of iterations and, therefore, many function evaluations to converge to a solution. This poses a challenge for simulation-constrained problems where the function evaluation entails expensive partial or ordinary differential equation solutions. Preconditioning can accelerate convergence and help compute a solution in fewer function evaluations. However, general-purpose preconditioners for nonlinear problems are challenging to construct. In this paper, we review a selection of classical and modern NCG methods, introduce their preconditioned variants, and propose a preconditioner based on the diagonalization of the BFGS formula. As with the NCG methods, this preconditioner utilizes only first-order information and requires only a small number of dot products. Our numerical experiments using CUTEst problems indicate that the proposed preconditioner successfully reduces the number of function evaluations at negligible additional cost for its update and application.
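
To make the structure of a preconditioned NCG iteration concrete, here is a minimal sketch of the Polak-Ribière+ variant with a fixed diagonal preconditioner. The preconditioner used here is simply the inverse diagonal of a quadratic test problem, chosen as a stand-in; it is not the BFGS-based preconditioner proposed in the paper, and the test problem, the exact line search and all names are assumptions of this example.

```python
import numpy as np

def preconditioned_ncg(grad, line_search, x0, inv_diag, tol=1e-8, max_iter=500):
    """Polak-Ribiere+ nonlinear CG with a fixed diagonal preconditioner.

    inv_diag holds the inverse of the diagonal preconditioner (an assumption
    of this sketch, not the BFGS-based preconditioner from the paper).
    Returns the final iterate and the number of iterations used.
    """
    x = x0.copy()
    g = grad(x)
    z = inv_diag * g          # preconditioned gradient
    d = -z                    # first direction: preconditioned steepest descent
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            return x, k
        alpha = line_search(x, d, g)
        x = x + alpha * d
        g_new = grad(x)
        z_new = inv_diag * g_new
        # PR+ update: only dot products of first-order quantities are needed.
        beta = max(0.0, g_new @ (z_new - z) / (g @ z))
        d = -z_new + beta * d
        g, z = g_new, z_new
    return x, max_iter

# Hypothetical test problem: an ill-conditioned convex quadratic
# f(x) = 0.5 x^T A x - b^T x, whose gradient is A x - b.
n = 50
rng = np.random.default_rng(0)
A = np.diag(np.logspace(0, 4, n)) + 0.5 * np.ones((n, n)) / n
b = rng.standard_normal(n)

grad = lambda x: A @ x - b
# Exact line search, available in closed form for a quadratic objective.
exact_step = lambda x, d, g: -(g @ d) / (d @ (A @ d))

x0 = np.zeros(n)
x_plain, iters_plain = preconditioned_ncg(grad, exact_step, x0, np.ones(n))
x_prec, iters_prec = preconditioned_ncg(grad, exact_step, x0, 1.0 / np.diag(A))
print(f"iterations without preconditioner: {iters_plain}")
print(f"iterations with diagonal preconditioner: {iters_prec}")
```

For a general nonlinear objective the closed-form step would be replaced by an inexact line search (for example one enforcing Wolfe conditions), and each gradient evaluation would correspond to one of the expensive function evaluations the abstract refers to.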

  • Wouter Edeling
  • Daan Crommelin

Coarse graining of (geophysical) flow problems is a necessity brought upon us by the wide range of spatial and temporal scales present in these problems, which cannot be all represented on a numerical grid without an inordinate amount of computational resources. Traditionally, the effect of the unresolved eddies is approximated by deterministic closure models, i.e. so-called parameterizations. The effect of the unresolved eddy field enters the resolved-scale equations as a forcing term, denoted as the 'eddy forcing'. Instead of creating a deterministic parameterization, our goal is to infer a stochastic, data-driven surrogate model for the eddy forcing from a (limited) set of reference data, with the goal of accurately capturing the long-term flow statistics. Our surrogate modelling approach essentially builds on a resampling strategy, where we create a probability density function of the reference data that is conditional on (time-lagged) resolved-scale variables. The choice of resolved-scale variables, as well as the employed time lag, is essential to the performance of the surrogate. We will demonstrate the effect of different modelling choices on a simplified ocean model of two-dimensional turbulence in a doubly periodic square domain.
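
The resampling idea can be sketched in a few lines: the reference forcing values are grouped into bins of a conditioning variable, and the surrogate draws at random from the bin that matches the current state. This is only a simplified illustration with a single scalar conditioning variable and equal-width bins; the class name `BinnedResampler`, the synthetic reference data and the bin count are assumptions, and in a time-stepping model the conditioning value would come from an earlier (time-lagged) step.

```python
import numpy as np

class BinnedResampler:
    """Resample reference forcing values conditional on a resolved-scale variable."""

    def __init__(self, conditioning, forcing, n_bins=10):
        # Equal-width bins over the range seen in the reference data.
        self.edges = np.linspace(conditioning.min(), conditioning.max(), n_bins + 1)
        idx = self._bin_index(conditioning)
        self.pools = [forcing[idx == b] for b in range(n_bins)]
        self.nonempty = [b for b, pool in enumerate(self.pools) if pool.size > 0]

    def _bin_index(self, c):
        # Digitizing against the interior edges gives indices 0..n_bins-1 and
        # places out-of-range values in the first or last bin.
        return np.digitize(c, self.edges[1:-1])

    def sample(self, c, rng):
        b = int(self._bin_index(c))
        if self.pools[b].size == 0:
            # Fall back to the nearest bin that contains reference data.
            b = min(self.nonempty, key=lambda j: abs(j - b))
        return rng.choice(self.pools[b])

# Synthetic reference data (hypothetical): forcing depends on the
# conditioning variable through a sine curve plus noise.
rng = np.random.default_rng(1)
c_ref = rng.uniform(-1.0, 1.0, size=2000)
r_ref = np.sin(np.pi * c_ref) + 0.1 * rng.standard_normal(2000)

surrogate = BinnedResampler(c_ref, r_ref)
draws = np.array([surrogate.sample(0.5, rng) for _ in range(500)])
print(f"mean of surrogate forcing at c = 0.5: {draws.mean():.3f}")
```

Because the surrogate only ever returns values that occurred in the reference data, its output distribution within each bin matches the empirical conditional distribution by construction; the modelling choices lie in which variables and time lags define the bins.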

  • Shashank Jaiswal
  • Jingwei Hu
  • Julien K. Brillon
  • Alina A. Alexeenko

When the molecules of a gaseous system are far apart, for example in microscale gas flows where the surface-to-volume ratio is high and surface forces therefore dominate, molecule-surface interactions lead to the formation of a local thermodynamic non-equilibrium region extending a few mean free paths from the surface. The dynamics of such systems are accurately described by the Boltzmann equation. However, the multi-dimensional nature of the Boltzmann equation presents a huge computational challenge. With recent mathematical developments and the advent of petascale computing, the dynamics of the full Boltzmann equation are now tractable. We present an implementation of the recently introduced multi-species discontinuous Galerkin fast spectral (DGFS) method for solving the full Boltzmann equation on streaming multi-processors. The present implementation solves the inhomogeneous Boltzmann equation in a few minutes, making it at least two orders of magnitude faster than the present state-of-the-art stochastic method, direct simulation Monte Carlo, which is widely used for solving the Boltzmann equation. Various performance metrics, such as weak and strong scaling, are presented. A parallel efficiency of 0.96 to 0.99 is demonstrated on 36 NVIDIA Tesla P100 GPUs.

  • Michael Riesch
  • Nikola Tchipev
  • Hans-Joachim Bungartz
  • Christian Jirauschek

Over the last decades, quantum cascade lasers (QCLs) have become established sources of mid-infrared and terahertz light. For their anticipated applications, e.g., in spectroscopy, their dynamical behavior is particularly interesting. Numerical simulations constitute an essential tool for investigating QCL dynamics but entail a considerable computational workload. In order to accelerate the simulations and thereby aid the design process of QCLs, we present efficient parallel implementations of an established numerical method using OpenMP. Performance measurements on a 28-core CPU confirm their efficiency.

  • Yoshiyuki Sakai
  • Sandra Mendez
  • Håkon Strandenes
  • Martin Ohlerich
  • Igor Pasichnyk
  • Momme Allalen
  • Michael Manhart

This paper presents the optimisation techniques implemented to run the finite-volume computational fluid dynamics (CFD) code MGLET (Multi Grid Large Eddy Turbulence) across four HPC platforms. We analysed and refactored the parallel communication routines and significantly reduced the memory footprint, resulting in a substantial improvement of the parallel-scaling capability and in an increase of the maximum number of degrees of freedom for applications. Data structures and file layout were redesigned for implementing parallel I/O in HDF5. The new parallel I/O strategy results in a considerable increase in the average data transfer rate compared with the former serial implementation. An I/O pattern analysis and detailed I/O profiling of the new implementation were then conducted, and further performance improvement was achieved by increasing the size of I/O requests and reducing the number of I/O processes. We compare the improved parallel-scaling capability of MGLET on different architectures using representative CFD application test cases.

  • Paul F. Baumeister
  • Shigeru Tsukamoto

Large scale electronic structure calculations require modern high performance computing (HPC) resources and, equally important, mature HPC applications that can make efficient use of them. Real-space grid-based applications of Density Functional Theory (DFT) using the Projector Augmented Wave method (PAW) can give the same accuracy as DFT codes relying on a plane wave basis set but exhibit improved scalability on distributed memory machines. The projection operations of the PAW Hamiltonian are known to be the performance critical part because they are limited by the available memory bandwidth. We investigate the utility of a 3D factorizable basis of Hermite functions for the localized PAW projector functions, which allows us to reduce the bandwidth requirements for the grid representation of the projector functions in projection operations. Additional on-the-fly sampling of the 1D basis functions eliminates the memory transfer almost entirely. For a quantitative assessment of the expected memory bandwidth savings, we show performance results of a first implementation on GPUs. Finally, we suggest a PAW generation scheme adjusted to the analytically given projector functions.

  • Uri Nahum
  • Carlo Seppi
  • Philippe C. Cattin

Background: Ever since endoscopes were invented, surgeons have tried to widen their field of usage by developing novel surgical approaches. An obvious advantage of the endoscope is its minimal invasiveness, but this comes at the price of reduced dexterity, loss of tactile feedback and difficulty in orientation. One of the challenges is to acquire data about the surroundings in order to find the position of the endoscope relative to the surrounding tissues.

Methods & Results: In this paper, we present a mathematical approach to reconstruct the position(s) of unknown source(s) (e.g. an endoscope, which produces a signal at different frequencies) and a medium (e.g. the tissue surrounding the endoscope). We solve the joint Inverse Medium and Optimal Control problem for the Helmholtz equation, where both the source(s) and the medium velocity are unknown. The use of Adaptive Eigenspace Inversion (AEI) in combination with frequency stepping proves to be a good solution. We support our claim with two-dimensional numerical experiments.

Conclusion: The application of this method, together with its promising results, can potentially aid in navigating an endoscope through the body while collecting information on the surrounding tissue. These results may also find application in geophysics.

  • Hartwig Anzt
  • Yen-Chen Chen
  • Terry Cojean
  • Jack Dongarra
  • Goran Flegar
  • Pratik Nayak
  • Enrique S. Quintana-Ortí
  • Yuhsiang M. Tsai
  • Weichung Wang

We present an automated performance evaluation framework that enables an automated workflow for testing and performance evaluation of software libraries. Integrating this component into an ecosystem enables sustainable software development, as a community effort, via a web application for interactively evaluating the performance of individual software components. The performance evaluation tool is based exclusively on web technologies, which removes the burden of downloading performance data or installing additional software. We employ this framework for the Ginkgo software ecosystem, but the framework can be used with essentially any software project, including the comparison between different software libraries. The Continuous Integration (CI) framework of Ginkgo is also extended to automatically run a benchmark suite on predetermined HPC systems, store the state of the machine and the environment along with the compiled binaries, and collect results in a publicly accessible performance data repository based on Git. The Ginkgo performance explorer (GPE) can be used to retrieve the performance data from the repository and visualize it in a web browser. GPE also implements an interface that allows users to write scripts, archived in a Git repository, to extract particular data, compute particular metrics, and visualize them in many different formats (as specified by the script). The combination of these approaches creates a workflow which enables performance reproducibility and software sustainability of scientific software. In this paper, we present example scripts that extract and visualize performance data for Ginkgo's SpMV kernels, allowing users to identify the optimal kernel for specific problem characteristics.

  • Federico Pittino
  • Pietro Bonfà
  • Andrea Bartolini
  • Fabio Affinito
  • Luca Benini
  • Carlo Cavazzoni

Predicting the time to solution for massively parallel scientific codes is a complex task. The reason for this is the presence of multiple, strongly interconnected algorithms that possibly react differently to the changes in compute power, vectorization length, memory and network bandwidth and latency and I/O throughput. A reliable prediction of execution time is however of great importance to the user who wants to plan on large scale simulations or virtual screening procedures characteristic of high throughput computing. In this article we present a practical approach based on machine learning techniques to achieve very accurate predictions of the time to solution for a DFT-based material science code. We compare our results with the predictions provided by a parametrized analytical performance model showing that deep learning solutions allow for a greater accuracy without the need of domain knowledge to introduce an explicit description of the algorithms implemented in the code.

  • Sandra Macià
  • Pedro J. Martínez-Ferrer
  • Sergi Mateo
  • Vicenç Beltran
  • Eduard Ayguadé

As we move towards exascale computing, an abstraction for effective parallel computation is increasingly needed to overcome the maintainability and portability challenges of scientific applications while ensuring the efficient and full exploitation of high-performance systems. These circumstances require computer and domain scientists to work jointly toward a productive working environment. Domain specific languages address this challenge by abstracting the high-level application layer from the final, complex parallel low-level code. Saiph is an innovative domain specific language designed to reduce the work of computational fluid dynamics domain experts to an unambiguous and straightforward transcription of their problem equations. The high-level language, domain-specific compiler and underlying library are enhanced to make applications developed by scientists intuitive. Additions and improvements are presented, designed for the significant advantage of running computational fluid dynamics applications on different machines with no porting or maintenance issues. Numerical methods and parallel strategies are independently added at the library level covering the explicit finite differences resolution of a vast range of problems. Depending on the application, a specific parallel resolution is automatically derived and applied within Saiph, freeing the user from decisions related to numerical methods or parallel executions while ensuring suitable computations. Through a list of benchmarks, we demonstrate the utility and productivity of the Saiph high-level language together with the correctness and performance of the underlying parallel numerical algorithms.

  • Adrian Jackson
  • Andrew Turner
  • Michèle Weiland
  • Nick Johnson
  • Olly Perks
  • Mark Parsons

In recent years, Arm-based processors have arrived on the HPC scene, offering an alternative to the existing status quo, which was largely dominated by x86 processors. In this paper, we evaluate the Arm ecosystem, both the hardware offering and the software stack that is available to users, by benchmarking a production HPC platform that uses Marvell's ThunderX2 processors. We investigate the performance of complex scientific applications across multiple nodes, and we also assess the maturity of the software stack and the ease of use from a user's perspective. This paper finds that the performance across our benchmarking applications is generally as good as, or better than, that of well-established platforms, and we can conclude from our experience that there are no major hurdles that might hinder wider adoption of this ecosystem within the HPC community.

  • Felix Thaler
  • Stefan Moosbrugger
  • Carlos Osuna
  • Mauro Bianco
  • Hannes Vogt
  • Anton Afanasyev
  • Lukas Mosimann
  • Oliver Fuhrer
  • Thomas C. Schulthess
  • Torsten Hoefler

Weather and climate simulations are a major application driver in high-performance computing (HPC). With the end of Dennard scaling and Moore's law, the HPC industry increasingly employs specialized computation accelerators to increase computational throughput. Manycore architectures, such as Intel's Knights Landing (KNL), are a representative example of future processing devices. However, software has to be modified to use these devices efficiently. In this work, we demonstrate how an existing domain-specific language that has been designed for CPUs and GPUs can be extended to manycore architectures such as KNL. We achieve comparable performance to the NVIDIA Tesla P100 GPU architecture on hand-tuned representative stencils of the dynamical core of the COSMO weather model and its radiation code. Further, we present performance within a factor of two of the P100 for the full DSL-based GPU-optimized COSMO dycore code. We find that optimizing code to full performance on modern manycore architectures requires similar effort and hardware knowledge as for GPUs. Further, we show limitations of the present approaches, and outline our lessons learned and possible principles for the design of future DSLs for accelerators in the weather and climate domain.

  • Salil Mahajan
  • Katherine J. Evans
  • Joseph H. Kennedy
  • Min Xu
  • Matthew R. Norman

Effective utilization of novel hybrid architectures of pre-exascale and exascale machines requires transformations to global climate modeling systems that may not reproduce the original model solution bit-for-bit. Round-off level differences grow rapidly in these non-linear and chaotic systems. This makes it difficult to isolate bugs/errors from the innocuous growth expected from round-off level differences. Here, we apply two modern multivariate two-sample equality-of-distribution tests to evaluate the statistical reproducibility of global climate model simulations using standard monthly output of short (~1-year) simulation ensembles of a control model and a modified model of the US Department of Energy's Energy Exascale Earth System Model (E3SM). Both tests are able to identify changes induced by modifications to some model tuning parameters. We also conduct formal power analyses of the tests by applying them to designed suites of short simulation ensembles, each with an increasingly different climate from the control ensemble. The results are compared against those from another such test. These power analyses provide a framework to quantify the degree of difference that can be detected confidently by the tests for a given ensemble size (sample size). This will allow model developers using the tests to make an informed decision when accepting/rejecting an unintentional non-bit-for-bit change to the model solution.
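
The abstract does not name the tests it uses. As a generic illustration of a multivariate two-sample equality-of-distribution test, the sketch below computes the energy distance between two ensembles and obtains a p-value by permutation. The ensemble sizes, the number of variables, the synthetic data and the function names are all assumptions of this example and are not taken from the paper.

```python
import numpy as np

def energy_statistic(x, y):
    """Energy distance between two samples (rows are ensemble members)."""
    def mean_dist(a, b):
        # Mean Euclidean distance over all pairs of rows from a and b.
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()
    return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def permutation_test(x, y, n_perm=499, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = energy_statistic(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        # Reassign members to the two groups at random.
        perm = rng.permutation(len(pooled))
        stat = energy_statistic(pooled[perm[:n]], pooled[perm[n:]])
        count += stat >= observed
    return observed, (count + 1) / (n_perm + 1)

# Hypothetical ensembles: 30 members, 4 output variables each.
rng = np.random.default_rng(42)
control = rng.standard_normal((30, 4))
modified = rng.standard_normal((30, 4)) + 1.5  # deliberately shifted "climate"

stat_shift, p_shift = permutation_test(control, modified)
print(f"energy statistic: {stat_shift:.3f}, p-value: {p_shift:.3f}")
```

A power analysis in the sense of the abstract would repeat such a test over many ensembles with progressively larger shifts and record how often the null hypothesis is rejected for a given ensemble size.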

  • Georgios Arampatzis
  • Daniel Wälchli
  • Pascal Weber
  • Henri Rästas
  • Petros Koumoutsakos

We present the algorithm CCMA-ES, an extension to CMA-ES, an evolution strategy that has been shown to perform well in a broad range of black-box optimization problems. The (μ, λ)-CMA-ES effectively handles nonlinear nonconvex functions but faces difficulties in constrained optimization problems. We introduce viability boundaries to improve the search for an initial point in the valid domain and adapt the covariance matrix using normal approximations to maintain the inequality constraints. Using benchmark problems from the 2006 CEC, we compare the performance of CCMA-ES with a state-of-the-art optimization algorithm (mViE), showing favorable results. Finally, CCMA-ES is applied to a pharmacodynamics problem describing tumor growth, and we demonstrate that CCMA-ES outperforms mViE in terms of the objective function value and total function evaluations.

  • Simon D. Smart
  • Tiago Quintino
  • Baudouin Raoult

Numerical Weather Prediction (NWP) and Climate simulations sit at the intersection between classically understood High Performance Computing (HPC) and Big Data / High Performance Data Analytics (HPDA). Driven by ever more ambitious scientific goals, both the size and number of output data-elements generated as part of NWP operations have grown by several orders of magnitude, and are expected to continue growing exponentially in the future. Over the last 30 years this increase has been approximately 40% per year. To cope with this projected growth, ECMWF has been actively exploring novel hardware and software approaches to workflow and data management. ECMWF's meteorological object store acts as a hot-cache for meteorological objects within the forecast pipeline, and supports multiple backends to enable the use of different storage technologies. This paper presents extensions to this object store to allow it to operate in a distributed fashion on a wider range of hardware without assuming the presence of high-performance, parallel, globally namespaced storage systems. The improvements include a flexible, configurable front end which gives control of where data is to be stored without requiring code changes in the calling application.
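
As a quick check of the growth figures quoted above, the following snippet compounds a 40% annual increase over 30 years. It assumes a constant rate over the whole period, which is a simplification made for this example.

```python
import math

annual_growth = 0.40   # 40% per year, as stated in the abstract
years = 30

# Compound growth under the assumption of a constant annual rate.
factor = (1.0 + annual_growth) ** years
orders_of_magnitude = math.log10(factor)

print(f"total growth factor: {factor:.3g}")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

The result is a factor of roughly 2.4 × 10^4, a little over four orders of magnitude, which is consistent with the "several orders of magnitude" mentioned in the abstract.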

  • Petr Karnakov
  • Fabian Wermelinger
  • Michail Chatzimanolakis
  • Sergey Litvinov
  • Petros Koumoutsakos

We present a high performance computing framework for multiphase, turbulent flows on structured grids. The computational methods are validated on a number of benchmark problems such as the Taylor-Green vortex that are extended by the inclusion of bubbles in the flow field. We examine the effect of bubbles on the turbulent kinetic energy dissipation rate and provide extensive data for bubble trajectories and velocities that may assist the development of engineering models. The implementation of the present solver on massively parallel, GPU enhanced architectures allows for large scale and high throughput simulations of multiphase flows.