# Center for Computing Research (CCR)

# Center for Computing Research

### Projects

Search in Projects: |
|||

The CCR enterprise is closely tied to the laboratories' broader set of missions and strategies. We share responsibility within Sandia as stewards of important capabilities for the nation in high-strain-rate physics, scientific visualization, optimization, uncertainty quantification, scalable solvers, inverse methods, and computational materials. We also leverage our core technologies to execute projects through various partnerships, such as Cooperative Research and Development Agreements (CRADAs) and Strategic Partnerships, as well as partnerships with universities.

### Aeras: A Next Generation Global Atmosphere Model

Aeras (Greek for "air") is a next generation global atmosphere model. It is built on top of Albany, a code developed to enable rapid prototyping of scientific application codes, that leverages all of Sandia's advanced scientific computing infrastructure. Codes built on Albany are, from inception, parallel, either explicit or implicit (with implicit solves supported by automatically generated Jacobians), capable of performing sensitivity analysis and uncertainty quantification, with a wide range of software quality controls from unit and regression testing to an advanced build system and version control. Albany was recently upgraded to become performance portable: the abilty to run efficiently on current and future advanced computing platforms, all using a single code base.

Two of these features -- uncertainty quantification and performance portability -- are what make Aeras (along with the FELIX land ice model) truly next generation, and we hope a template for future climate model components, such as ocean and sea ice models. In fact, Aeras currently serves as the template for the current ACME atmosphere model refactor project.

Aeras numerics are similar to the Community Atmosphere Model - Spectral Elements (CAM-SE), using spectral elements on a cubed sphere for horizontal discretization, a hybrid vertical coordinate, and hyperviscosity for stabilization. Albany provides Aeras with a diverse suite of time-stepping options, from explicit leap frog and Runge-Kutta to fully implicit. We are also implemented a simplified Kessler cloud physics package, as a demonstration of coupling to a physics parameterization.

**Associated Software:** E3SM Kokkos Trilinos

**Contact:** Spotz, William (Bill), wfspotz@sandia.gov

SAND2014-16640 W

### Albany

Albany is an implicit, unstructured grid, finite element code for the solution and analysis of partial differential equations. Albany is the main demonstration application of the AgileComponents software development strategy at Sandia. It is a PDE code that strives to be built almost entirely from functionality contained within reusable libraries (such as Trilinos/STK/Dakota/PUMI). Albany plays a large role in demonstrating and maturing functionality of new libraries, and also in the interfaces and interoperability between these libraries. It also serves to expose gaps in our coverage of capabilities and interface design.

In addition to the component-based code design strategy, Albany also attempts to showcase the concept of Analysis Beyond Simulation, where an application code is developed up from for a design and analysis mission. All Albany applications are born with the ability to perform sensitivity analysis, stability analysis, optimization, and uncertainty quantification, with a clean interfaces for exposing design parameters for manipulation by analysis algorithms.

Albany also attempts to be a model for software engineering tools and processes, so that new research codes can adopt the Albany infrastructure as a good starting point. This effort involves a close collaboration with the 1400 SEMS team.

The Albany code base is host to several application projects, notably:

- LCM (Laboratory for Computational Mechanics) [PI J. Ostien]: A platform for research in finite deformation mechanics, including algorithms for failure and fracture modeling, discretizations, implicit solution algorithms, advanced material models, coupling to particle-based methods, and efficient implementations on new architectures.
- QCAD (Quantum Computer Aided Design) [PI Nielsen]: A code to aid in the design of quantum dots from built in doped silicon devices. QCAD solves the coupled Schoedinger-Poisson system. When wrapped in Dakota, optimal operating conditions can be found.
- FELIX (Finite Element for Land Ice eXperiments) [PI Salinger]: This application solves variants of a nonlinear Stokes flow for simulating the evolution of Ice Sheets. In particular we are modeling the Greenland and Antarctic Ice Sheets for modeling effects of Climate change, in particularly their influence on Sea-Level Rise. Will be linked into ACME.
- Aeras [PI Spotz]: A component-based approach to atmospheric modeling, where advanced analysis algorithms and design for efficient code on new architectures are built into the code.

In addition, Albany is used as a platform for algorithmic research:

- FASTMath SciDAC project: We are developing a capability for adaptive mesh refinement within an unstructured grid application, in collaboration with Mark Shephard's group at the SCOREC center at RPI.
- Embedded UQ: Research into embedded UQ algorithms led by Eric Phipps often uses Albany as a demonstration platform.
- Performance Portable Kernels for new architectures: Albany is serving as a research vehicle for programming finite element assembly kernels using the Trilinos/Kokkos programming model and library.

**Associated Software:** Dakota E3SM Kokkos Omega_h Peridigm Trilinos

**Contact:** Hansen, Glen, gahanse@sandia.gov

2014-15772W

### ALEGRA

The ALEGRA application is targeted at the simulation of high strain rate, magnetohydrodynamic, electromechanic and high energy density physics phenomena for the U.S. defense and energy programs. Research and development in advanced methods, including code frameworks, large scale inline meshing, multiscale lagrangian hydrodynamics, resistive magnetohydrodynamic methods, material interface reconstruction, and code verification and validation, keeps the software on the cutting edge of high performance computing.

**
**

**Associated Software:** Dakota

**Contact:** Hansen, Glen, gahanse@sandia.gov

2014-15771W

### CASL

The Consortium for Advanced Simulation of Light Water Reactors (CASL) was established by the U.S. Department of Energy as the nation’s Energy Innovation Hub for nuclear energy advanced modeling and simulation. This hub, led by Oak Ridge National Laboratory and funded in 2010 at $122M over five years, is developing a “virtual” nuclear reactor that will address three critical areas of performance for nuclear power plants: (1) reducing costs for nuclear-generated electricity by enabling power up-rates and lifetime extensions for existing and future plants, (2) enabling higher fuel burn-up to reduce the volume of nuclear waste, and (3) enhancing nuclear safety. In 2015, DOE announced the renewal of CASL for an additional five years, which will focus on extending the modeling and simulation tools built during its first phase to include additional nuclear reactor designs, including small modular reactors.

Sandia is one of ten partner institutions in CASL whose scientists and engineers conduct research activities with a very high degree of technical interaction and collaboration using an advanced “telepresence” videoconferencing system and monthly physical “collocation weeks” at ORNL. Sandia scientists played leading technical roles in the development of VERA, the Virtual Environment for Reactor Applications, which will address the challenging operational and safety problems mentioned above. Sandia contributed key computational technologies, including advanced simulation codes, multiphysics coupling methods, solver libraries, and tools for verification and uncertainty quantification, that comprise the foundation of VERA. Additionally, Sandia led efforts to develop and integrate these capabilities with other codes in VERA, which was recently released by CASL for use by its nuclear industry partners. VERA was announced as a winner of an R&D 100 Award in November 2016. It is expected that CASL and support for VERA will be merged with DOE's Nuclear Energy Advanced Modeling and Simulation program in FY2019.

**Associated Software:** Dakota

**Contact:** Summers, Randall M. (Randy), rmsumme@sandia.gov

SAND2014-15428W

### Combinatorial Scientific Computing

For much of my career, I have been drawn to problems in scientific computing with strong combinatorial undertones. Examples include sparse matrix orderings, load balancing, mesh generation and many others. In any standard taxonomy of scientific computing, these disciplines would be widely scattered, but researchers working in them share a common vocabulary, toolset and aesthetic. By the late nineties, several of us recognized the value of building a community to allow these researchers to interact and to bring visibility to the importance of discrete algorithms in scientific computing. The result was the formation of Combinatorial Scientific Computing community. CSC is concerned with the formulation, application and analysis of discrete methods in scientific applications.

A number of SIAM minisymposia were followed by a series successful international workshops and a special issue of ETNA. The next CSC workshop will be held in conjunction with the SIAM Conference on Optimization in 2011. In 2006, DOE funded a SciDAC institute on Combinatorial Scientific Computing and Petascale Simulations, or CSCAPES (pronounced "sea-scapes"). The following paper and talk are an attempt to convey my view of the essence of CSC.

**Papers:**

- Combinatorial Scientific Computing: The Enabling Power of Discrete Algorithms in Computational Science, Bruce Hendrickson and Alex Pothen. In Lecture Notes in Computer Science 4395:260-280, 2007.

- Combinatorial Scientific Computing: The Role of Discrete Algorithms in Computational Science and Engineering, Bruce Hendrickson. Plenary talk at SIAM CSE'03.

Powerpoint version of overheads

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2719 S

### Data Intensive Science in the Department of Energy

In this white paper we provide a perspective on the opportunities and needs for data intensive science within the Department of Energy. In particular, we focus on two areas in which DOE’s landscape is different from those of other organizations. First, DOE is a leader in the use of high performance computing for modeling and simulation, and these computations generate huge amounts of data to manage and analyze. Second, DOE maintains leading experimental facilities and these also produce prodigious quantities of data. In both of these areas, DOE’s needs for data intensive science are distinct from those of other agencies, and we believe that these needs will likely not be adequately addressed without DOE investments.

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2362 J

### Decaf - High-Performance Decoupling of Tightly Coupled Data Flows

Exponential increases in data size and the need to distill enormous amounts of information into usable knowledge are pushing the limits of data processing in science applications. Connecting simulations with various data analyses either through storage (loose coupling) or directly using local or remote memory (tight coupling) is the way that scientists process data, but neither approach is optimal for extreme-scale science.

We are loosening the grip of tight coupling, in essence decoupling tightly coupled data flows while keeping their favorable high performance and low power characteristics. Our use of optional short- and long- term storage in the dataflow gives the best of both worlds: tight coupling whenever possible but loose coupling when usage patterns require persistent data. Our research, called Decaf, targets in situ methods and workflows, and we are evaluating our research in full and proxy applications in order to improve performance, reduce power, add fault tolerance, and enhance usability.

We are generating (1) a library of dataflow primitives, (2) a method for automatically constructing broadly applicable dataflows from the same set of primitives, designed as (3) a generic and reusable solution that other workflow and coupling tools can use.

The SmartBlock library is the outcome from this project.

**Associated Software:** SmartBlock reusable workflow components

**Contact:** Lofstead, Gerald Fredrick (Jay), gflofst@sandia.gov

2015-8888

### E3SM - Energy Exascale Earth System Model

E3SM is an unprecedented collaboration among seven National Laboratories, the National Center for Atmospheric Research, four academic institutions and one private-sector company to develop and apply the most complete leading-edge climate and Earth system models to the most challenging and demanding climate change research imperatives. It is the only major national modeling project designed to address U.S. Department of Energy (DOE) mission needs and specifically targets DOE Leadership Computing Facility resources now and in the future, because DOE researchers do not have access to other major climate computing centers. A major motivation for the E3SM project is the coming paradigm shift in computing architectures and their related programming models as capability moves into the Exascale era. DOE, through its science programs and early adoption of new computing architectures, traditionally leads many scientific communities, including climate and Earth system simulation, through these disruptive changes in computing.

**Associated Software:** E3SM Kokkos

**Contact:** Taylor, Mark A., mataylo@sandia.gov

2014-15455W

### EQUINOX -- Environment for Quantifying Uncertainty: Integrated aNd Optimized at the eXtreme scale

The goal of the EQUINOX project is to establish a modern mathematical and statistical foundation that will enable next-generation, complex, stochastic predictive simulations. Such a foundation is critical to realizing the potential of future computing platforms, including exascale, and will ultimately enable scientists to address a fundamental question, namely "how do the uncertainties ubiquitous in all modeling efforts affect our predictions and understanding of complex phenomena?" Our collaborative approach to uncertainty quantification (UQ) combines novel paradigms in applied mathematics, statistics and computational science into a unified framework, which we call an Environment for Quantifying Uncertainty: Integrated aNd Optimized at the eXtreme-scale (EQUINOX).

**Contact:** Phipps, Eric T., etphipp@sandia.gov

SAND2014-16283W

### FASTMath

The FASTMath SciDAC Institute develops and deploys scalable mathematical algorithms and software tools for reliable simulation of complex physical phenomena and collaborates with application scientists to ensure the usefulness and applicability of FASTMath technologies.

FASTMath is a collaboration between Argonne National Laboratory, Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory, Massachusetts Institute of Technology, Rensselaer Polytechnic Institute, Sandia National Laboratories, Southern Methodist University, and University of Southern California. Dr. Lori Diachin, LLNL, leads the FASTMath project.

**Associated Software:** Dakota E3SM Mesquite Omega_h Trilinos Zoltan

**Contact:** Devine, Karen D., kddevin@sandia.gov

2014-3861W

### Graph Algorithms on Non-Traditional Architectures

Graph algorithms and applications have been central to my research for many years. But until relatively recently, I paid no attention to computer architecture. Many of the graph problems that arise in scientific computing have special structure associated with the underlying 2- or 3-dimensional world the science models. These graphs often have good separators due to geometric locality, and bounded degrees. But as my interests have broadened to include informatics and data mining, I am increasingly working with graphs that that have very little exploitable structure. Algorithms on these highly unstructured graphs are quite challenging to parallelize on traditional distributed memory machines. They typically lack good partitions, thereby requiring lots of communication and also very large data structures for ghost nodes. They may have very high degree nodes which lead to load balancing challenges, and their lack of locality leads to terrible performance at all levels of the memory hierarchy. Instead, I have become a fan of massive multithreading as a means to tolerate latency and thereby achieve high performance even on these challenging applications. These issues are addressed in the following papers.

**Papers**:

- Challenges in Parallel Graph Processing, Andrew Lumsdaine, Douglas Gregor, Jonathan Berry and Bruce Hendrickson. In Parallel Processing Letters 17(1):5-20,2007.

- Software and Algorithms for Graph Queries on Multithreaded Architectures Jonathan Berry, Bruce Hendrickson, Simon Kahan and Petr Konecny. In Proc. IEEE Workshop on Multithreaded Architectures and Applications, 2007.

- Analyzing the Scalability of Graph Algorithms on Eldorado Keith Underwood, Megan Vance, Jonathan Berry and Bruce Hendrickson. In Proc. IEEE Workshop on Multithreaded Architectures and Applications, 2007

- Realizing Parallelism in Database Operations: Insights from a Massively Multithreaded Architecture, John Cieslewicz, Jonathan Berry, Bruce Hendrickson and Kenneth Ross. In Proc. 2nd International Workshop on Data Management on New Hardware (DaMoN'06) (Recipient of best paper award).

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2642 S

### Graph Partitioning and Load Balancing

Graph partitioning is an NP hard problem with numerous applications. It appears in various forms in parallel computing, sparse matrix reordering, circuit placement and other important disciplines. I worked with Rob Leland for several years on heuristic methods for partitioning graphs, with a particular focus on parallel computing applications. Our contributions include:

- Development of multilevel graph partitioning. This widely imitated approach has become the premiere algorithm combining very high quality with short calculation times.
- Extension of spectral partitioning to enable the use of 2 or 3 Laplacian eigenvectors to quadrisect of octasect a graph.
- Generalization of the Kernighan-Lin/Fiduccia-Mattheyses algorithm to handle weighted graphs, arbitrary number of sets and lazy initiation.
- Development of terminal propagation to improve the mapping of a graph onto a target parallel architecture.
- The widely used Chaco partitioning tool which includes multilevel, spectral, geometric and other algorithms.

More recently, I have worked with Tammy Kolda on alternatives to the standard graph partitioning model. A critique of the standard approach and a discussion of alternatives can be found below. I have also been working with Karen Devine and a number of other researchers on dynamic load balancing. Dynamic load balancing is fundamentally harder than static partitioning. Algorithms and implementations must run quickly in parallel without consuming much memory. Interfaces need to be subroutine calls instead of file transfers. And since the data is already distributed, the new partition should be similar to the current one to minimize the amount of data that needs to be redistributed. We are working to build a tool called Zoltan to address this plethora of challenges. Zoltan is designed to be extensible to different applications, algorithms and parallel architectures. For example, it is data-structure neutral - an object oriented interface separates the application data structures from those of Zoltan. The tool will also support a very general model of a parallel computer, facilitating the use of heterogeneous machines.

Closely related to Zoltan, I have been interested in the software challenges inherent in parallelizing unstructured computations. I feel that well-designed tools and libraries are the key to addressing this problem. I have also worked with Ali Pinar and Steve Plimpton on algorithms for some of the kernel operations which arise in this setting, and relevant papers can be found here.

Graph Partitioning Algorithms and Tools

- The Chaco User's Guide: Version 2.0, Bruce Hendrickson and Robert Leland. Sandia Tech Report SAND94--2692, 1994.

- Dynamic Load Balancing in Computational Mechanics, Bruce Hendrickson and Karen Devine. Comp. Meth. Applied Mechanics & Engineering. 184(2-4):485-500, 2000.

- Partitioning for Complex Objectives, Ali Pinar and Bruce Hendrickson. In Proc. Irregular '01.

- An Improved Spectral Graph Partitioning Algorithm for Mapping Parallel Computations, Bruce Hendrickson and Robert Leland. SIAM J. Sci. Stat. Comput., 16(2):452-469, 1995.

- Multidimensional Spectral Load Balancing, Bruce Hendrickson and Robert Leland. In Proc. 6th SIAM Conf. Parallel Proc., 953-961, 1993.

- Skewed Graph Partitioning, Bruce Hendrickson, Robert Leland and Rafael Van Driessche. In Proc. Eighth SIAM Conf. Parallel Processing for Scientific Computing.

- Enhancing Data Locality by Using Terminal Propagation, Bruce Hendrickson, Robert Leland and Rafael Van Driessche. In Proc. 29th Hawaii Intl. Conf. System Science, 1996.

- Parallel Algorithms for Dynamically Partitioning Unstructured Grids, Pedro Diniz, Steve Plimpton, Bruce Hendrickson and Robert Leland. In Proc. 7th SIAM Conf. Parallel Proc., 1995.

Alternatives to Standard Graph Partitioning More recently, I have become disenchanted with the graph partitioning abstraction as it is applied to parallel computing. Some of my concerns and an effort to address them can be found in the following papers and talks.

- Load Balancing Fictions, Falsehoods and Fallacies, Bruce Hendrickson. Appl. Math. Modelling. 25:99-108, 2000. powerpoint, and HTML version of overheads from Plenary talk at the 3rd DRAMA Steering Workshop, September, 1999.

- Graph Partitioning and Parallel Solvers: Has the Emperor No Clothes?, Bruce Hendrickson. Lecture Notes in Computer Science, 1457, pp. 218-225, 1998. Copyright Springer-Verlag.

- Graph Partitioning Models for Parallel Computing, Bruce Hendrickson and Tamara G. Kolda. Parallel Computing. 26:1519-1534, 2000.

- Algorithms for Unstructured Applications . Many unstructured computations can be built from simpler, common components. Methods for doing so, along with efficient algorithms for some of these kernels can be found in the following papers.

· Tinkertoy Parallel Programming: A Case Study with Zoltan Karen Devine and Bruce Hendrickson. International Journal of Computational Science and Engineering 1:64-72, 2005.

- Tinkertoy Parallel Programming: Complicated Applications from Simple Tools, Bruce Hendrickson and Steve Plimpton. In Proc. 10th SIAM Conf. Parallel Processing for Scientific Computing, March '01.

- Improving Load Balance with Flexibly Assignable Tasks, Ali Pinar and Bruce Hendrickson. IEEE Transactions on Parallel & Distributed Systems 16(10):956-965, 2005. (Earlier version in SPAA'02)

- Interprocessor Communication with Limited Memory, Ali Pinar and Bruce Hendrickson. In IEEE Transactions on Parallel & Distributed Systems 15:606-616, 2004 (Earlier version in Proc. SPAA'00).

- Communication Support for Adaptive Computation, Ali Pinar and Bruce Hendrickson. In Proc. 10th SIAM Conf. Parallel Processing for Scientific Computing, March '01.

Chaco Rob Leland and I have developed a widely used graph partitioning tool called Chaco, which is available under a GNU open source license. The code is used at over 300 institutions for parallel computing, sparse matrix reordering, circuit placement and a range of other applications.

**Associated Software:** Chaco

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2718 S

### Graph Rigidity and Molecular Conformation

Graph rigidity is a somewhat obscure topic that turns out to have a surprising number of applications in the physical sciences. Imagine placing all the vertices of a graph at random points in space. Can you now move the vertices without changing the lengths of any of the graph's edges? (Trivial motions don't count.) If not, the graph is rigid, otherwise it is flexible. Not surprisingly the answer depends on the dimensionality of the space in which you place the vertices. In my thesis work, I established a connection between graph rigidity and the determination of protein conformation from NMR data. The graph algorithms and properties I developed are described in the first paper below, while the application to molecular conformation is discussed in the second. More recently, Don Jacobs and Mike Thorpe, physicists at Michigan State, have recognized a connection with a discipline known as rigidity percolation. I have worked with them to adapt and implement my algorithms towards this end as discussed in the third paper.

**Papers:**

- The Molecule Problem: Exploiting Structure in Global Optimization, Bruce Hendrickson. SIAM J. Opt., 5(4):835-857, 1995.

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2637 S

### Hardware/Software Codesign for Exascale Computing

The requirements to achieve Exascale computing place great demands on existing hardware and software solutions including the need to be significantly more energy efficient, to utilize much higher degrees of parallelism, to be resilient to hardware failures, to be tolerant to higher latencies to memory and remote communication partners as well as many other factors. The codesign process seeks to bring application and domain specialists together with computer scientists, hardware architectures and machine designers to ensure that an iterative optimization process can be established seeking to balance the many trade offs and benefits of new processor, memory or network features with novel approaches in system software, programming models, numerical methods and application/algorithm design. Sandia is one of the leading Department of Energy laboratories conducting codesign activities linking production engineering and physics specialists with experts in linear algebra, system software, computer scientists and leading industry vendors.

Codesign resources:

- Mantevo Mini-Application Suite
- ASC Advanced Architecture Test Bed Program / Access Information
- Structural Simulation Toolkit for Hardware Simulation (SST)
- ASC Co-Design Strategy

**Contact:** Ang, James A. (Jim), jaang@sandia.gov

SAND2014-16955W

### Hobbes - Extreme-Scale Operating Systems Project

Hobbes is a Sandia-led collaboration between four national laboratories and eight universities supported by the DOE Office of Science Advanced Scientific Computing Research program office. The goal of this three-year project is to deliver an operating system for future extreme-scale parallel computing platforms that will address the major technical challenges of energy efficiency, managing massive parallelism and deep memory hierarchies, and providing resilience in the presence of increasing failures. Our approach is to enable application composition through lightweight virtualization. Application composition is a critical capability that will be the foundation of the way extreme-scale systems must be used in the future. The tighter integration of modeling and simulation capability with analysis and the increasing complexity of application workflows demand more sophisticated machine usage models and new system-level services. Ensemble calculations for uncertainty quantification, large graph analytics, multi-materials and multi-physics applications are just a few examples that are driving the need for these new system software interfaces and mechanisms for managing memory, network, and computational resources. Rather than providing a single unified operating system and runtime system that supports several parallel programming models, Hobbes is leveraging lightweight virtualization to provide the flexibility to construct and efficiently execute custom OS/R environments. Hobbes extends our existing work on the Kitten lightweight operating system and the Palacios lightweight virtual machine monitor.

**Contact:** Brightwell, Ronald B. (Ron), rbbrigh@sandia.gov

2013-8641W

### HPC Resource Allocation

HPC resource allocation consists of a pipeline of methods by which distributed-memory work is assigned to distributed-memory resources to complete that work. This pipeline spans both system and application level software. At the system level, it consists of scheduling and allocation. At the application level, broadly speaking, it consists of discretization (meshing), partitioning, and task mapping. Scheduling, given requests for resources and available resources, decides which request will be assigned resources next or when a request will be assigned resources. When a request is granted, allocation decides which specific resources will be assigned to that request. For the application, HPC resource allocation begins with the discretization and partitioning of the work into a distributed-memory model and ends with the task mapping that matches the allocated resources to the partitioned work. Additionally, network architecture and routing have a strong impact on HPC resource allocation. Each of these problems is solved independently and makes assumptions about how the other problems are solved. We have worked in all of these areas and have recently begun work to combine some of them, in particular allocation and task mapping. We have used analysis, simulation, and real system experiments in this work. Techniques specific to any particular application have not been considered in this work.

**Associated Software:** Trilinos Zoltan

**Contact:** Leung, Vitus J., vjleung@sandia.gov

2016-0639 W

### IDEAS

The IDEAS Project is intent on improving scientific productivity by qualitatively changing scientific software developer productivity, enabling a fundamentally different attitude to creating and supporting computational science and engineering (CSE) applications.

We are creating an extreme-scale scientific software development ecosystem composed of high-quality, reusable CSE software components and libraries; a collection of best practices, processes and tools; and substantial outreach mechanisms for promoting and disseminating productivity improvements. We intend to improve CSE productivity by enabling better, faster *and* cheaper CSE application capabilities for extreme-scale computing.

**Associated Software:** Trilinos

**Contact:** Michael Heroux, Michael Heroux

SAND2015-6329 W

### Institute for the Design of Advanced Energy Systems (IDAES)

The Institute for the Design of Advanced Energy Systems (IDAES) will be the world’s premier resource for the development and analysis of innovative advanced energy systems through the use of process systems engineering tools and approaches. IDAES and its capabilities will be applicable to the development of the full range of advanced fossil energy systems, including chemical looping and other transformational CO2 capture technologies, as well as integration with other new technologies such as supercritical CO2. In addition, the tools and capabilities will be applicable to renewable energy development, such as biofuels, green chemistry, Nuclear and Environmental Management, such as the design of complex, integrated waste treatment facilities.

IDAES is led by the National Energy Technology Laboratory (NETL) with participants from Carnegie-Mellon University, Lawrence Berkeley National Laboratory, Sandia National Laboratories, and West Virginia University.

The IDAES leadership team is:

- David C. Miller, Technical Director and NETL PI
- Deb Agarwal, LBNL PI
- John Siirola, SNL PI

**Associated Software:** Pyomo

**Contact:** Siirola, John Daniel, jdsiiro@sandia.gov

SAND2017-5015 W

### Kitten Lightweight Kernel

Kitten is a current-generation lightweight kernel (LWK) compute node operating system designed for large-scale parallel computing systems. Kitten is the latest in a long line of successful LWKs, including SUNMOS, Puma, Cougar, and Catamount. Kitten distinguishes itself from these prior LWKs by providing a Linux-compatible user environment, a more modern and extendable open-source codebase, and a virtual machine monitor capability via Palacios that allows full-featured guest operating systems to be loaded on-demand.

**Associated Software:** Kitten Lightweight Kernel

**Contact:** Pedretti, Kevin, ktpedre@sandia.gov

2008-8033W

### Kokkos

Modern high performance computing (HPC) nodes have diverse and heterogeneous types of cores and memory. For applications and domain-specific libraries/languages to scale, port, and perform well on these next generation architectures, their on-node algorithms must be re-engineered for thread scalability and performance portability. The Kokkos programming model and its C++ library implementation helps HPC applications and domain libraries implement intra-node thread-scalable algorithms that are performance portable across diverse manycore architectures such as multicore CPUs, Intel Xeon Phi, NVIDIA GPU, and AMD GPU.

This research, development, and deployment project advances the Kokkos programming model with new intra-node parallel algorithm abstractions, implements these abstractions in the Kokkos library, and supports applications' and domain libraries' effective use of Kokkos through consulting and tutorials. The project fosters numerous internal and external collaborations, especially with the ISO/C++ language standard committee to promote Kokkos abstractions into future ISO/C++ language standards.

**Associated Software:** E3SM Kokkos MultiThreaded Graph Library (MTGL) Omega_h Qthreads Trilinos

**Contact:** Edwards, Harold C. (Carter), hcedwar@sandia.gov

SAND2016-3948 W

### Mantevo

Mantevo is a multi-faceted application performance project. It provides application performance proxies known as * miniapps. *Miniapps combine some or all of the dominant numerical kernels contained in an actual stand-alone application. Miniapps include libraries wrapped in a test driver providing representative inputs. They may also be hard-coded to solve a particular test case so as to simplify the need for parsing input files and mesh descriptions. Mini apps range in scale from partial, performance-coupled components of the application to a simplified representation of a complete execution path through the application.

**Associated Software:** Kokkos

**Contact:** Michael Heroux, maherou@sandia.gov

SAND2015-0578

### MapReduce-MPI

MR-MPI is an open-source implementation of MapReduce written for distributed-memory parallel machines on top of standard MPI message passing.

MapReduce is the programming paradigm, popularized by Google, which is widely used for processing large data sets in parallel. Its salient feature is that if a task can be formulated as a MapReduce, the user can perform it in parallel without writing any parallel code. Instead the user writes serial functions (maps and reduces) which operate on portions of the data set independently. The data-movement and other necessary parallel operations can be performed in an application-independent fashion, in this case by the MR-MPI library.

The MR-MPI library was developed to solve informatics problems on traditional distributed-memory parallel computers. It includes C++ and C interfaces callable from most hi-level languages, and also a Python wrapper and our own OINK scripting wrapper, which can be used to develop and chain MapReduce operations together. MR-MPI and OINK are open-source codes, distributed freely under the terms of the modified Berkeley Software Distribution (BSD) license.

**Contact:** Plimpton, Steven J., sjplimp@sandia.gov

SAND2017-0887 W

### Mesquite

**MESQUITE** is a linkable software library that applies a variety of node-movement algorithms to improve the quality and/or adapt a given mesh. Mesquite uses advanced smoothing and optimization to:

- Untangle meshes,
- Provide local size control,
- Improve angles, orthogonality, and skew,
- Increase minimum edge-lengths for increased time-steps,
- Improve mesh smoothness,
- Perform anisotropic smoothing,
- Improve surface meshes, adapt to surface curvature,
- Improve hybrid meshes (including pyramids & wedges),
- Smooth meshes with hanging nodes,
- Maintain quality of moving and/or deforming meshes,
- Perform ALE rezoning,
- Improve mesh quality on and near boundaries,
- Improve transitions across internal boundaries,
- Align meshes with vector fields, and
- R-adapt meshes to solutions using error estimates.

Mesquite improves surface or volume meshes which are structured, unstructured, hybrid, or non-comformal. A variety of element types are permitted. Mesquite is designed to be as efficient as possible so that large meshes can be improved.

**Associated Software:** Mesquite Trilinos

**Contact:** Mitchell, Scott A., samitch@sandia.gov

2006-1986 W

### Parallel Linear Algebra

Linear algebra calculations are at the core of most scientific computations. For this reason, efficient parallel algorithms for linear algebra operations are critical to the effective use of parallel machines. I have worked on the parallelization of direct linear- and eigen-solvers as well iterative methods. In addition, I have worked on parallel algorithms for a variety of scientific computing problems which don't involve linear algebra as described here .

** Papers:**

- Towards an Efficient Parallel Eigensolver for Dense Symmetric Matrices, Bruce Hendrickson, Elizabeth Jessup and Christopher Smith. SIAM J. Sci. Comput. 20(1):1132-1154, 1999.

- An Efficient Parallel Algorithm for Matrix-Vector Multiplication, Bruce Hendrickson, Robert Leland and Steve Plimpton. Intl. J. High Speed Comput., 7(1):73-88, 1995.

- The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers, Bruce Hendrickson and David Womble. SIAM J. Sci. Stat. Comput., 15(5):1201-1226, 1994.

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2638 S

### Parallel Scientific Computing

With a number of different collaborators, I have devised new parallel algorithms for a variety of problems in scientific computing. These include many-body calculations, solid dynamics with contacts, and some radiation transport methods. These projects are sketched below. I have also done a fair amount of work on parallel linear algebra.

Parallel Radiation Transport

Discrete ordinates methods for the simulation of radiation transport lead to a kernel computation involving a set of "sweeps" through the mesh, one for each angle in the ordinate set. These sweeps are a challenge to parallelize effectively for unstructured grids. The first paper below describes our work on this problem. To perform these sweeps (and also in many other situations), it is necessary to decompose a directed graph into strongly connected components. The classic serial algorithm for this problem is not amenable to parallelism. In the second paper we propose and analyze an alternative approach which is more parallelizable. The third paper describes an implementation of this algorithm and its performance on two parallel computers.

Papers:

- Parallel Sn Sweeps on Unstructured Grids: Algorithms for Prioritization, Grid Partitioning and Cycle Detection Steve Plimpton, Bruce Hendrickson, Shawn Burns, Will McLendon III and Lawrence Rauchwerger. In Nuclear Science & Engineering 150:267-283, 2005 (Previous version in Proc. SC'00).

- On Identifying Strongly Connected Components in Parallel Lisa Fleischer, Bruce Hendrickson and Ali Pinar. In Proc. Irregular'2000, Lecture Notes in Computer Science.

- Finding Strongly Connected Components in Distributed Graphs, William C. McLendon III, Bruce Hendrickson, Steve Plimpton and Lawrence Rauchwerger. J. Parallel & Distributed Computing 65(8):901-910, 2005 (Previous version in Proc. 10th SIAM Conf. Parallel Processing for Scientific Computing, March '01).

Paper

Grid Transfer

A number of computational procedures employ multiple grids on which solutions are computed. For example, in multi-physics simulations a primary grid may be used to compute mechanical deformation of an object while a secondary grid is used for thermal conduction calculations. When modeling coupled thermo-mechanical effects, solution data must be interpolated back and forth between the grids each timestep. On a parallel machine, this grid transfer operation can be challenging if the two grids are decomposed to processors differently for reasons of computational efficiency. If the grids move or adapt separately, the complexity of the operation is compounded. With colleagues, I have developed a grid transfer algorithm suitable for massively parallel codes which use multiple grids. It uses a rendezvous technique wherein a third decomposition is used to search for elements in one grid that contain nodal points of the other. This has the advantage of enabling the grid transfer to be load-balanced separately from the remainder of the computations.

Papers:

- A Parallel Rendezvous Algorithm for Interpolation Between Multiple Grids, Steve Plimpton, Bruce Hendrickson and James Stewart. J. Parallel & Distributed Computing 64:266-276 (2004) (Previous version in Proc. SC'98).

Paper, Abstract

Dynamic Load Balancing

In many applications, computational requirements vary during the course of a calculation. To run such calculations effectively on a parallel computer, tasks must be periodically redistributed among processors to keep the workload balanced. A number of different algorithms have been proposed for this dynamic load balancing problem. The first paper below is a critical review of the algorithmic options and the software challenges. The second paper, somewhat older, describes an attempt to parallelize a graph partition approach.

Dynamic load balancing is fundamentally harder than static partitioning. Algorithms and implementations must run quickly in parallel without consuming much memory. Interfaces need to be subroutine calls instead of file transfers. And since the data is already distributed, the new partition should be similar to the current one to minimize the amount of data that needs to be redistributed. With Karen Devine and a number of other collaborators, we are working to build a tool called Zoltan to address this plethora of challenges. Zoltan is designed to be extensible to different applications, algorithms and parallel architectures. For example, it is data-structure neutral - an object oriented interface separates the application data structures from those of Zoltan. The tool will also support a very general model of a parallel computer, facilitating the use of heterogeneous machines.

Papers:

- Dynamic Load Balancing in Computational Mechanics, Bruce Hendrickson and Karen Devine. Comp. Meth. Applied Mechanics & Engineering. 184(2-4):485-500, 2000.

Paper, Abstract

- Parallel Algorithms for Dynamically Partitioning Unstructured Grids, Pedro Diniz, Steve Plimpton, Bruce Hendrickson and Robert Leland. In Proc. 7th SIAM Conf. Parallel Proc., 1995.

Paper, Abstract

Many Body Calculations

My work on parallel algorithms for many body calculations has focused on problems with short-range forces in which the computational effort scales linearly with the number of particles (eg. molecular dynamics). Although the techniques I developed with Steve Plimpton apply to any direct simulation, problems with long-range forces are more efficiently addressed by any of several approximate methods.

My primary contribution was the force decomposition algorithm in which we divide the computation among processors in an unusual way to reduce the communication cost. A nice survey of of parallel approaches for molecular dynamics calculations can be found in the first paper by Steve Plimpton. The others introduce and extend the force decomposition idea.

Papers:

- Fast Parallel Algorithms for Short-Range Molecular Dynamics, Steve J. Plimpton, J. Comp. Phys., 117:1-19, March 1995.

Paper

- Parallel Many-Body Simulations Without All-to-All Communication, Bruce Hendrickson and Steve J. Plimpton. J. Par. Dist. Comput., 27(1):15-25, 1995.

Paper, Abstract

- Parallel Molecular Dynamics With the Embedded Atom Method, Steve J. Plimpton and Bruce Hendrickson. In Materials Theory and Modeling, edited by J. Broughton, P. Bristowe, and J. Newsam, (MRS Proceedings 291, Pittsburgh, PA, 1993), at fall MRS meeting, Boston, MA, November 1992.

Paper Abstract

- A New Parallel Method for Molecular Dynamics Simulation of Macromolecular Systems, Steve J. Plimpton and Bruce Hendrickson. J. Comp. Chemistry, 17(3):326-337, 1996.

Paper Abstract

Parallel Contact Detection

Imagine simulating a car crash. You would probably choose to use Lagrangian grids which deform with the physics as the car distorts. When the fender impacts the radiator, there will be new forces that need to be modeled. In the simulation, this will correspond to the mesh intersecting itself. Detecting these contacts is an essential component of solid dynamics simulations, but it has proved difficult to parallelize effectively. Previous parallelization efforts have generally used a single decomposition of the mesh which is used for both the finite element analysis and the contact detection. However, this can lead to severe load imbalance and large volumes of communication.

We have instead adopted an approach in which we use two different decompositions for the two computational steps. We use a static, graph-based decomposition for the finite element calculation, and a dynamic, geometric decomposition for the contact detection. Although there is some cost associated with communicating between the two decompositions, a careful implementation keeps this small. And once we've paid this cost, we get nearly perfect load balance allowing for the first algorithms which scales to hundreds and even thousands of processors. This algorithm is implemented in Sandia's PRONTO code. The parallel algorithms in PRONTO are described in greater detail in the first two papers below. The capabilities of the code are presented in the third paper, which has earned parallel PRONTO a status as finalist for the 1997 Gordon Bell Prize.

Papers:

- Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer, Steve Attaway, Ted Barragy, Kevin Brown, David Gardner, Bruce Hendrickson, Steve Plimpton and Courtenay Vaughan. In Proc. SC'97. (Finalist for the Gordon Bell Prize.)

Postscript , HTML, Abstract

- Transient Dynamics Simulations: Parallel Algorithms for Contact Detection and Smoothed Particle Hydrodynamics, Steve Plimpton, Steve Attaway, Bruce Hendrickson, Jeff Swegle, Courtenay Vaughan and David Gardner. J. Parallel Distrib. Comput. 50:104-122, 1998. (Earlier version in Proc. Supercomputing'96).

Paper, Abstract

- A New Parallel Algorithm for Contact Detection in Finite Element Methods, Bruce Hendrickson, Steve Plimpton, Steve Attaway, Courtenay Vaughan and David Gardner. In Proc. High Performance Computing '96.

Paper, Abstract

Smoothed Particle Hydrodynamics

When simulating phenomena involving high strains, meshes can become highly distorted and break down. To avoid these problems, one can use a gridless methodology known as smoothed particle hydrodynamics . Despite its name, this approach can model solid objects as well as fluids. Building on the ideas we devised for parallel contact detection, we have developed a parallelization of smoothed particle hydrodynamics.

Papers:

- Transient Dynamics Simulations: Parallel Algorithms for Contact Detection and Smoothed Particle Hydrodynamics, Steve Plimpton, Steve Attaway, Bruce Hendrickson, Jeff Swegle, Courtenay Vaughan and David Gardner. J. Parallel Distrib. Comput. 50:104-122, 1998. (Earlier version in Proc. Supercomputing'96).

Paper, Abstract

**Associated Software:** Zoltan

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2717 S

### Portals Interconnect API

Portals is an interconnect API intended to allow scalable, high-performance network communication between nodes of a large-scale parallel computing system. Portals is based on a building blocks approach that enables multiple upper-level protocols, such as MPI and SHMEM, to be used simultaneously within a process. This approach also encapsulates important semantics that can be offloaded to a network interface controller (NIC) to optimize performance-critical functionality. Previous generations of Portals have been deployed on large-scale production systems, including the Intel ASCI Red machine and Cray's SeaStar interconnect for their XT product line. The current generation API is being used to enable advanced NIC architecture research for future extreme-scale systems.

**Contact:** Brightwell, Ronald B. (Ron), rbbrigh@sandia.gov

2012-6934W

### Power API

Achieving practical exascale supercomputing will require massive increases in energy efficiency. The bulk of this improvement will likely be derived from hardware advances such as improved semiconductor device technologies and tighter integration, hopefully resulting in more energy efficient computer architectures. Still, software will have an important role to play. With every generation of new hardware, more power measurement and control capabilities are exposed. Many of these features require software involvement to maximize feature benefits. This trend will allow algorithm designers to add power and energy efficiency to their optimization criteria. Similarly, at the system level, opportunities now exist for energy-aware scheduling to meet external utility constraints such as time of day cost charging and power ramp rate limitations. Finally, future architectures might not be able to operate all components at full capability for a range of reasons including temperature considerations or power delivery limitations. Software will need to make appropriate choices about how to allocate the available power budget given many, sometimes conflicting considerations.** For these reasons, we have developed a portable API for power measurement and control. **

**Contact:** Laros, James H., jhlaros@sandia.gov

SAND2014-17083 W

### SIRIUS: Science-driven Data Management for Multi-tiered Storage

Scientific discovery at the exascale will not be possible without significant new research in the manage- ment, storage and retrieval over the long lifespan of the extreme amounts of data that will be produced. Our thesis in this project is that adding application level knowledge about data to guide the actions of the storage system provides substantial benefits to the organization, storage, and access to extreme scale data, resulting in improved productivity for computational science. In this project we will demonstrate novel techniques to facilitate efficient mapping of data objects, even partitioning individual variables, from the user space onto multiple storage tiers, and enable application-guided data reductions and transformations to address capacity and bandwidth bottlenecks. Our goal here is to address the associated Input/Output (I/O) and storage challenges in the context of current and emerging storage landscapes, and expedite insights into mission critical scientific processes.

We aim to reduce the time to insight, not just for a single application, but for the entire workload in a multi-user environment, where the storage is shared among users. We achieve this goal by allowing selectable data quality, by trading its accuracy and error in order to meet the time or resource constraint. We are exploring beyond checkpoint/restart I/O, and are addressing the challenges posed by key data access patterns in the knowledge gathering process. Ultimately, we will take the knowledge from the storage system to provide vital feedback to the middleware so that the best possible decisions can be autonomically made between the user intentions and the available system resources.

The EMPRESS metadata system published at PDSW-DISCS @ SC17 is an outcome of this project.

**Contact:** Lofstead, Gerald Fredrick (Jay), gflofst@sandia.gov

2015-8887

### Sparse Matrix Reorderings

The number of numerical operations required to factor a sparse matrix depends critically on the ordering of the rows and columns. For Cholesky factorization, it is possible to reorder the matrix at will without causing numerical instability. I have worked on two reordering problems. With Ed Rothberg I have looked at the general problem of minimizing the fill in the matrix. We have used a nested dissection approach building upon my work in Graph Partitioning. We obtain orderings much superior to those produced by minimum degree algorithms on a range of 2D, 3D and linear programming matrices. I have also worked with Erik Boman on multilevel algorithms for producing small envelope orderings. These orderings are quite popular in commercial codes since they require simpler data structures with less indirect addressing.

** Papers:**

- Improving the Runtime and Quality of Nested Dissection Ordering, Bruce Hendrickson and Ed Rothberg. SIAM J. Sci. Comput. 20(2):468-489, 1998.

Paper, Abstract

- Sparse Matrix Ordering Methods for Interior Point Linear Programming, Ed Rothberg and Bruce Hendrickson. INFORMS J. Comput. 10(1):107-113, 1998.

Paper, Abstract

- Effective Sparse Matrix Ordering: Just Around the BEND, Bruce Hendrickson and Ed Rothberg. In Proc. Eighth SIAM Conf. Parallel Processing for Scientific Computing.

Paper, Abstract

- A Multilevel Algorithm for Reducing the Envelope of Sparse Matrices, Erik G. Boman and Bruce Hendrickson, Tech Report SCCM-96-14, Stanford University

Paper, Abstract

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2641 S

### Spectral Graph Algorithms

Of all the deep connections between combinatorics and linear algebra, those in the field known as spectral graph theory are among the most mysterious. Upon first encounter, it seems quite bizarre that eigenvalues and eigenvectors of matrices should have anything at all to say about graph properties. But when looked at in the right way, this connection turns out to be quite natural and powerful. I have used eigenvectors of the Laplacian matrix in several different application domains including partitioning graphs, organizing databases and assembling gene fragments.

** Papers:**

- Latent Semantic Analysis and Fiedler Retrieval, Bruce Hendrickson. In Linear Algebra and its Applications, 421:345-355, 2007.

Paper, Abstract

- An Improved Spectral Graph Partitioning Algorithm for Mapping Parallel Computations, Bruce Hendrickson and Robert Leland. SIAM J. Sci. Stat. Comput., 16(2):452-469, 1995.

Paper, Abstract

- Multidimensional Spectral Load Balancing, Bruce Hendrickson and Robert Leland. In Proc. 6th SIAM Conf. Parallel Proc., 953-961, 1993.

Paper, Abstract

- A Spectral Algorithm for Seriation and the Consecutive Ones Problem, Jon E. Atkins, Erik G. Boman and Bruce Hendrickson. SIAM J. Comput. 28(1):297-310, 1998.

Paper, Abstract

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2640 S

### Structural Simulation Toolkit (SST)

High performance computing architectures are undergoing a marked transformation. Increasing performance of the largest parallel machines at the same exponential rate will require that applications expose more parallelism at an accelerated pace due to the advent of multi-core processors at relatively flat clock rates. The extreme number of hardware components in this machines along with I/O bottlenecks will necessitate looking beyond the traditional checkpoint/restart mechanism for dealing with machine failures. Additionally, as a result of the high power requirements of these machines the energy required to obtain a result will become as important as the time to solution. These changes mean that a new approach to the development of extreme-scale hardware and software is needed relying on the simulaneous exploration of both the hardware and software design space, a process referred to as co-design.

The Structural Simulation Toolkit (SST) enables co-design of extreme-scale architectures by allowing simulation of diverse aspects of hardware and software relevant to such environments. Innovations in instruction set architecture, memory systems, the network interface, and full system network can be explored in the context of design choices for the programming model and algorithms. The package provides two novel capabilities. The first is a fully modular design that enables extensive exploration of an individual system parameter without the need for intrusive changes to the simulator. The second is a parallel simulation environment based on MPI. This provides a high level of performance and the ability to look at large systems. The framework has been successfully used to model concepts ranging from processing in memory to conventional processors connected by conventional network interfaces and running MPI.

**Contact:** Arun Rodrigues, Arun Rodrigues

SAND2014-15754W

### Support Theory for Preconditioning

When using an iterative method to solve a linear system of equations, a good choice of preconditioner can have a dramatic impact on runtime and robustness. Instead of solving a system Ax=b, you solve a preconditioned system M-1Ax=M-1b, where M is the preconditioner. A good preconditioner must have a variety of properties. First, the preconditioned system should converge quickly. This generally means that M-1A has a small condition number. Second, it should be easy to solve systems of the form My=z. And obviously the construction of the preconditioner should be efficient in both time and space.

The generation of good preconditioners involves as much art as science. The best preconditioners tend to be application-specific, exploiting insight into the precise problem being solved. A number of general purpose preconditioner have been developed which often work well in practice. The most widely used of these are variants on incomplete factorizations and approximate inverses. Unfortunately, these general purpose approaches tend to be poorly understood theoretically (except perhaps on model problems), and they sometimes perform badly. New ways of thinking about preconditioning are urgently needed.

Support theory is a fairly recent methodology for bounding condition numbers of preconditioned systems. More specifically, it is a set of tools and techniques for bounding extremal eigenvalues. For some iterative methods (conjugate gradients in particular), the ratio of largest to smallest eigenvalues provides an upper bound on the number of iterations.

Support theory began with the remarkable work of Pravin Vaidya in the early nineties in which he proposed and analyzed maximum weight spanning tree preconditioners for Laplacian matrices. Vaidya chose not to publish his work, but to produce commercial software instead. His software has had a significant impact in the structural analysis community. Fortunately, John Gilbert and Gary Miller recognized the significance of Vaidya's ideas and kept them from being lost. Miller and two of his students (Keith Gremban and Steve Guattery) formalized and greatly extended the techniques that Vaidya had used. In the process, they devised a new multi-level preconditioner and also developed a new analysis of incomplete Cholesky preconditioning. Unfortunately, none of this work has yet appeared in a journal, and so the ideas and techniques are largely unknown.

Shortly after Gremban finished his thesis, Nhat Nguyen and I found an application of support theory ideas to provide a new analysis of modified incomplete Cholesky preconditioning. This work was eventually combined with the concurrent efforts of Sivan Toledo, John Gilbert and Marshall Bern to become the first paper below.

To this point, all of the techniques being used were limited to very limited classes of matrices. Specifically, most of the results were for symmetric, diagonally dominant M-matrices, with a few results extending to all symmetric diagonally dominant matrices. The limitations of applicability and the challenges of the combinatorics frustrated all of us, and led to the near demise of the field.

The next big advance was made by Erik Boman. He realized that all the support-graph methods were merely a special case of a much larger and richer algebraic structure. The algebraic generalization encompasses the much more significant class of all symmetric positive-definite matrices. This realization has reinvigorated the field, enabling a wide variety of new results. These observations grew into the second paper below which introduces this algebraic support theory, and provides an extensive set of tools for bounding condition numbers. The first application of this expanded theory is the third paper, which extends Vaidya's techniques to address all diagonally dominant matrices. This extension is mentioned in passing in Vaidya's original manuscript, but no details are given. We are also working on the application of these ideas to systems coming from the finite element solution of elliptic PDEs, and also to provide better dropping strategies for incomplete factorization preconditioners.

The past several years have witnessed continued a continual broadening and deepening of both theoretical developments and applications of support theory by a number of researchers. Although I include pointers to some of this work below, this page is not a comprehensive bibliography.

**Papers:**

- Support-Graph Preconditioners, Marshall Bern, John R. Gilbert, Bruce Hendrickson, Nhat Nguyen and Sivan Toledo. SIAM J. Matrix Anal. Appl. 27(4):930-951 (2006).

Paper, Abstract

- Support Theory for Preconditioning, Erik Boman and Bruce Hendrickson. SIAM J. Matrix Anal. & Appl. 25(3):694-717 (2003).

Paper, Abstract

- Maximum-Weight-Basis Preconditioners, Erik Boman, Doron Chen, Bruce Hendrickson and Sivan Toledo. In Numerical Linear Algebra and Applications 11:695-721, 2004.

Paper, Abstract

- Solving Elliptic Finite Element Systems in Near-Linear Time with Support Preconditioners, Erik Boman, Bruce Hendrickson and Steve Vavasis. Submitted to SIAM J. Matrix Anal. and Appl.

Paper, Abstract

- Optimal Embeddings and Eigenvalues in Support Theory Erik Boman, Stephen Guattery and Bruce Hendrickson. SIAM Journal on Matrix Analysis and Applications 22(2):596-605 (2007).

Paper, Abstract

- Algebraic Tools for Analyzing Preconditioners, Bruce Hendrickson (with Erik Boman). Invited talk at Preconditioning'03.

Abstract, powerpoint version of overheads

Although Vaidya never published his work, his PhD student Anil Joshi wrote a dissertation that includes discussion, analysis and extensions to Vaidya's original techniques. Related work that may be of interest includes several papers by Steve Guattery and others by Sivan Toledo as well as Keith Gremban's thesis. Sivan and his student Doron Chen have an implementation of Vaidya's preconditioners as part of their TAUCS solver package. They have an interesting paper in ETNA describing their empirical experiments with Vaidya's methods. In addition, Dan Spielman and Shanghua Teng have done some very interesting algorithmic work that has significantly improved the theoretical performance of support theory preconditioners.

Other related work includes a theoretical paper by John Reif (and its errata), and extensions to complex arithmetic of Gremban's multilevel preconditioner by Vicki Howle and Steve Vavasis.

This brief synopsis is undoubtedly subject to personal bias and is certainly incomplete. Please send corrections or additions to me at bahendr@sandia.gov

**Contact:** Hendrickson, Bruce A., bahendr@sandia.gov

SAND2015-2639 S

### Trilinos

The Trilinos Project is an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems. A unique design feature of Trilinos is its focus on packages.

**Associated Software:** Dakota Kokkos Mesquite Omega_h Rapid Optimization Library (ROL) Trilinos

**Contact:** Heroux, Michael A., maherou@sandia.gov

SAND2003-2927

### XPRESS - eXascale Programming Environment and System Software

The XPRESS Project is one of four major projects of the DOE Office of Science Advanced Scientific Computing Research X-stack Program initiated in September, 2012. The purpose of XPRESS is to devise an innovative system software stack to enable practical and useful exascale computing around the end of the decade with near-term contributions to efficient and scalable operation of trans-Petaflops performance systems in the next two to three years; both for DOE mission-critical applications. To this end, XPRESS directly addresses critical challenges in computing of efficiency, scalability, and programmability through introspective methods of dynamic adaptive resource management and task scheduling.

**Contact:** Brightwell, Ronald B. (Ron), rbbrigh@sandia.gov

2012-9494W

### XVis

The XVis project brings together the key elements of research to enable scientific discovery at extreme scale. Scientific computing will no longer be purely about how fast computations can be performed. Energy constraints, processor changes, and I/O limitations necessitate significant changes in both the software applications used in scientific computation and the ways in which scientists use them. Components for modeling, simulation, analysis, and visualization must work together in a computational ecosystem, rather than working independently as they have in the past. This project provides the necessary research and infrastructure for scientific discovery in this new computational ecosystem by addressing four interlocking challenges: emerging processor technology, in situ integration, usability, and proxy analysis.

**Associated Software:** ParaView VTK-m

**Contact:** Moreland, Kenneth D., kmorel@sandia.gov

2016-0151 W

### Zoltan

The Zoltan project focuses on parallel algorithms for parallel combinatorial scientific computing, including partitioning, load balancing, task placement, graph coloring, matrix ordering, distributed data directories, and unstructured communication plans.

The Zoltan toolkit is an open-source library of MPI-based distributed memory algorithms. It includes geometric and hypergraph partitioners, global graph coloring, distributed data directories using rendezvous algorithms, primitives to simplify data movement and unstructured communication, and interfaces to the ParMETIS, Scotch and PaToH partitioning libraries. It is written in C and can be used as a stand-alone library.

The Zoltan2 toolkit is the next-generation toolkit for multicore architectures. It includes MPI+OpenMP algorithms for geometric partitioning, architecture-aware task placement, and local matrix ordering. It is written in templated C++ and is tightly integrated with the Trilinos toolkit.

**Associated Software:** Trilinos Zoltan

**Contact:** Devine, Karen D., kddevin@sandia.gov

SAND2017-0888 W