Center for Computing Research (CCR)

Scalable System Software, 01423

The Scalable System Software department traces its roots back to the early days of distributed memory massively parallel processing (MPP) systems in the late 1980s. During this time, Sandia established the viability of MPP systems, such as the nCUBE-10 and the Intel Paragon, in solving mission-critical applications using modeling and simulation. The department grew out of the need to design, develop, and deploy more efficient system software focused on meeting the performance and scalability demands of these applications running on the largest and fastest computing systems in the world. The group became firmly established in the early 1990s when researchers at Sandia partnered with the University of New Mexico to develop a customized system software environment based on a lightweight compute node operating system designed specifically for large-scale, distributed memory, message-passing machines. This initial lightweight kernel environment was successfully deployed on several large production systems at Sandia and eventually evolved into the operating system that ran on the compute nodes of the world's first general-purpose parallel computer to achieve a teraFLOPS, the Intel ASCI Red system. As parallel computing architectures and applications have continued to evolve, the department has expanded into several other system software areas around operating systems, but has continued to focus on addressing the needs of extreme-scale systems and applications.

People

Projects

The Scalable System Software department supports the design, implementation, and evaluation of system software for extreme-scale parallel computing platforms with a focus on maximizing the performance, scalability, robustness, and efficiency of key scientific, engineering, and analysis applications. Areas of active research include: lightweight compute node operating systems, dynamic adaptive runtime systems, low-level high-performance networking software, application and system resiliency, power/energy optimization, application performance analysis, reliability, availability, and serviceability (RAS) software infrastructure, support for integrated analysis, and parallel file systems and I/O middleware.

Software

The Scalable System Software Department contributes to many open-source software projects, including the following:

News

  • CCR Researcher Discusses Ceph Storage on Next Platform TV

    CCR system software researcher Matthew Curry appeared on the June 22nd episode of “Next Platform TV” to discuss the increased use of the Ceph storage system in high-performance computing (HPC). Matthew’s interview with Nicole Hemsoth of the Next Platform starts at the 18:40 mark of the video. In the interview, Matthew describes the Stria system, an unclassified version of Astra, the first petascale HPC system based on the Arm processor. He also describes how the Ceph storage system is being used on Stria and some of the important aspects being tested and evaluated there. More details and the entire episode are here.

    Contact: Curry, Matthew Leon
    July 2020
    2020-6988E

  • CCR Researcher Discusses IO500 on Next Platform TV

    CCR system software researcher Jay Lofstead appeared on the September 3rd episode of “Next Platform TV” to discuss the IO500 benchmark, including how it is used for evaluating large-scale storage systems in high-performance computing (HPC) and the future of the benchmark. Jay’s discussion with Nicole Hemsoth of the Next Platform starts at the 32:04 mark of the video. In the interview, Jay describes the origins of the IO500 benchmark and the desire to provide a standard method for understanding how well an HPC storage system performs for different workloads and different storage and file system configurations. Jay also describes how the benchmark has evolved since its inception, its influence, and the ancillary impacts of ranking I/O systems. More details and the entire episode are here:

    https://www.nextplatform.com/2020/09/03/next-platform-tv-for-september-3-2020/

    Contact: Lofstead, Gerald Fredrick (Jay)
    September 2020
    2020-9390E

  • Sandia Researchers Collaborate with Red Hat on Container Technology

    Sandia researchers in the Center for Computing Research collaborated with engineers from Red Hat, the world’s leading provider of open source solutions for enterprise computing, to enable more robust production container capabilities for high-performance computing (HPC) systems. CCR researchers demonstrated the use of Podman, which allows ordinary users to build and run containers without needing the elevated security privileges of an administrator, on the Stria machine at Sandia. Stria is an unclassified version of Astra, which was the first petascale HPC system based on an Arm processor. While Arm processors have been shown to be very capable for HPC workloads, they are not as prevalent in laptops and workstations as other processors, so developers often have no local Arm machine on which to build containers. To address this limitation, Podman provides the ability to build containers directly on machines like Stria and Astra without requiring root-level access. This capability is a critical advancement in container functionality for the HPC application development environment. The CCR team is continuing to work with Red Hat on improving Podman for traditional HPC applications as well as machine learning and deep learning workloads. More details on this collaboration can be found here:

    Contact: Younge, Andrew J
    July 2020
    2020-6891E

  • Sandia-led Supercontainers Project Featured in ECP Podcast

    As the US Department of Energy’s (DOE) Exascale Computing Project (ECP) has evolved since its inception in 2016, container technology and how it fits into the wider scheme of exascale computing and high-performance computing (HPC) has been an area of ongoing interest in its own right within the HPC community.

    Container technology has revolutionized software development and deployment for many industries and enterprises because it provides greater software flexibility, reliability, ease of deployment, and portability for users. But several challenges must be addressed to get containers ready for exascale computing.

    The Supercontainers project, one of ECP’s newest efforts, aims to deliver containers and virtualization technologies for productivity, portability, and performance on the first exascale computing machines, which are planned for 2021.

    ECP’s Let’s Talk Exascale podcast features as a guest Supercontainers project team member Andrew Younge of Sandia National Laboratories. The interview was recorded this past November in Denver at SC19: The International Conference for High Performance Computing, Networking, Storage, and Analysis.

    Contact: Younge, Andrew J
    April 2020
    2020-3907E

UNCLASSIFIED UNLIMITED RELEASE DOCUMENTS ONLY