Center for Computing Research (CCR)

Scalable System Software, 01423

The Scalable System Software department traces its roots back to the early days of distributed memory massively parallel processing (MPP) systems in the late 1980s. During this time, Sandia established the viability of MPP systems, such as the nCUBE-10 and the Intel Paragon, in solving mission-critical applications using modeling and simulation. The department grew out of the need to design, develop, and deploy more efficient system software focused on meeting the performance and scalability demands of these applications running on the largest and fastest computing systems in the world. The group became firmly established in the early 1990s when researchers at Sandia partnered with the University of New Mexico to develop a customized system software environment based on a lightweight compute node operating system designed specifically for large-scale, distributed memory, message-passing machines. This initial lightweight kernel environment was successfully deployed on several large production systems at Sandia and eventually evolved into the operating system that ran on the compute nodes of the world's first general-purpose parallel computer to achieve a teraFLOPS, the Intel ASCI Red system. As parallel computing architectures and applications have continued to evolve, the department has expanded into several other system software areas around operating systems, but has continued to focus on addressing the needs of extreme-scale systems and applications.

People

Projects

The Scalable System Software department supports the design, implementation, and evaluation of system software for extreme-scale parallel computing platforms with a focus on maximizing the performance, scalability, robustness, and efficiency of key scientific, engineering, and analysis applications. Areas of active research include: lightweight compute node operating systems, dynamic adaptive runtime systems, low-level high-performance networking software, application and system resiliency, power/energy optimization, application performance analysis, RAS software infrastructure, support for integrated analysis, and parallel file systems and I/O middleware.

Software

The Scalable System Software Department contributes to many open-source software projects, including the following:

News

  • CCR Researcher Kurt Ferreira Co-Authors Best Paper at APDCM Workshop

    CCR Researcher Kurt Ferreira and his co-authors have been awarded Best Paper at the upcoming Workshop on Advances in Parallel and Distributed Computational Models (APDCM) at the International Parallel and Distributed Processing Symposium. Their paper, "Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms," proposes a cooperative checkpoint scheduling policy that combines optimal checkpointing periods with I/O scheduling to minimize overheads in the presence of bursty, competing I/O. This work provides crucial analysis and direct guidance on maximizing throughput on current and future extreme-scale platforms. This year marks the 20th APDCM Workshop, which intends “to provide a timely forum for the exchange and dissemination of new ideas, techniques and research in the field of the parallel and distributed computational models.”

    Contact: Ferreira, Kurt Brian
    May 2018
    2018-4849E

  • Power API and LAMMPS Named R&D100 Award Finalists

    Two CCR technologies have been named as finalists for the 2018 R&D100 Awards. Each year, R&D Magazine names the 100 most technologically significant products and advancements, recognizing the winners and their organizations. Winners are selected from submissions from universities, corporations, and government labs throughout the world. This year’s finalists include the Power API and LAMMPS. The Power API is a portable programming interface for developing applications and tools that can be used to control and monitor the power use of high-performance computing systems in order to improve energy efficiency. LAMMPS is a molecular dynamics modeling and simulation application designed to run on large-scale high-performance computing systems. The final award winners will be announced at a ceremony at the R&D 100 Conference in mid-November.

    Contact: Brightwell, Ronald B. (Ron)
    October 2018
    2018-11733E

  • Sandia Joins the Linaro HPC Special Interest Group

    Sandia National Laboratories has joined Linaro’s High Performance Compute (HPC) Special Interest Group as an advanced end user of mission-critical HPC systems. Linaro Ltd. is the open source collaborative engineering organization developing software for the Arm ecosystem. Sandia recently announced Astra, one of the first supercomputers to use processors based on the Arm architecture in a large-scale high-performance computing platform. This system requires a complete vertically integrated software stack for Arm: from the operating system through compilers and math libraries. Sandia and Linaro will work together with the other members of the HPC SIG to jointly address hardware and software challenges, expand the HPC ecosystem by developing and proving new technologies, and increase technology and vendor choices for future platforms.

    Contact: Younge, Andrew J
    August 2018
    2018-9504L

  • The Next Platform Highlights CCR Work on Memory-Centric Programming

    The Next Platform, an online publication that offers in-depth coverage of high-end computing, recently featured an article entitled “New Memory Challenges Legacy Approaches to HPC Code.” The article discusses a paper co-authored by CCR researcher Ron Brightwell that was published last November as part of the Workshop on Memory Centric Programming for HPC at the SC’17 conference. In the article, Brightwell and one of his co-authors, Yonghong Yan from the University of South Carolina, discuss the programming challenges created by recent advances in memory technology and the deepening memory hierarchy. The article examines the notion of memory-centric programming and how programming systems need to evolve to provide better abstractions that help insulate application developers from the complexities associated with current and future advances in memory technology for high-performance computing systems.

    Contact: Brightwell, Ronald B. (Ron)
    February 2018
    2018-1257E

UNCLASSIFIED UNLIMITED RELEASE DOCUMENTS ONLY