Center for Computing Research
Scalable System Software, 01423
The Scalable System Software department traces its roots back to the early days of distributed memory massively parallel processing (MPP) systems in the late 1980s. During this time, Sandia established the viability of MPP systems, such as the nCUBE-10 and the Intel Paragon, in solving mission-critical applications using modeling and simulation. The department grew out of the need to design, develop, and deploy more efficient system software focused on meeting the performance and scalability demands of these applications running on the largest and fastest computing systems in the world. The group became firmly established in the early 1990s when researchers at Sandia partnered with the University of New Mexico to develop a customized system software environment based on a lightweight compute node operating system designed specifically for large-scale, distributed memory, message-passing machines. This initial lightweight kernel environment was successfully deployed on several large production systems at Sandia and eventually evolved into the operating system that ran on the compute nodes of the world's first general-purpose parallel computer to achieve a teraFLOPS, the Intel ASCI Red system. As parallel computing architectures and applications have continued to evolve, the department has expanded into several other system software areas around operating systems, but has continued to focus on addressing the needs of extreme-scale systems and applications.
Ronald B. Brightwell (Ron)
Manager, Scalable System Software
Sandia National Laboratories
P.O. Box 5800, MS 1319
Matthew Leon Curry
Kurt Brian Ferreira
DeVonna Skye Flanery
Michael J. Levenhagen
Scott Larson Nicoll Levy
Gerald Fredrick Lofstead (Jay)
Stephen Lecler Olivier
William Whitney Schonbein
John P. Vandyke
Courtenay T. Vaughan
Harry Lee Ward
Andrew J Younge
The Scalable System Software department supports the design, implementation, and evaluation of system software for extreme-scale parallel computing platforms with a focus on maximizing the performance, scalability, robustness, and efficiency of key scientific, engineering, and analysis applications. Areas of active research include: lightweight compute node operating systems, dynamic adaptive runtime systems, low-level high-performance networking software, application and system resiliency, power/energy optimization, application performance analysis, RAS software infrastructure, support for integrated analysis, and parallel file systems and I/O middleware.
The Scalable System Software Department contributes to many open-source software projects, including the following:
- CCR Researcher Kurt Ferreira Co-Authors Best Paper at APDCM Workshop
CCR Researcher Kurt Ferreira and his co-authors have been awarded Best Paper at the upcoming Workshop on Advances in Parallel and Distributed Computational Models (APDCM) at the International Parallel and Distributed Processing Symposium. Their paper, entitled "Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms," proposes a cooperative checkpoint scheduling policy that combines optimal checkpointing periods with I/O scheduling to minimize overheads in the presence of bursty, competing I/O. This work provides crucial analysis and direct guidance on maximizing throughput on current and future extreme-scale platforms. This year marks the 20th APDCM Workshop, which intends “to provide a timely forum for the exchange and dissemination of new ideas, techniques and research in the field of the parallel and distributed computational models.”
Contact: Ferreira, Kurt Brian
- The Next Platform Highlights CCR Work on Memory-Centric Programming
The Next Platform, an online publication that offers in-depth coverage of high-end computing, recently featured an article entitled “New Memory Challenges Legacy Approaches to HPC Code.” The article discusses a paper co-authored by CCR researcher Ron Brightwell that was published last November as part of the Workshop on Memory Centric Programming for HPC at the SC’17 conference. In the article, Brightwell and one of his co-authors, Yonghong Yan from the University of South Carolina, discuss the programming challenges created by recent advances in memory technology and the deepening memory hierarchy. The article examines the notion of memory-centric programming and how programming systems need to evolve to provide better abstractions that insulate application developers from the complexities of current and future memory technology for high-performance computing systems.
Contact: Brightwell, Ronald B. (Ron)