Center for Computing Research (CCR)

Kurt Brian Ferreira

Scalable System Software
Email: kbferre@sandia.gov
Phone: 505/844-0433
Fax: 505/845-7442

Mailing address:
Sandia National Laboratories
P.O. Box 5800, MS 1319
Albuquerque, NM 87185-1320

Principal Member of Technical Staff 

My area of expertise is system software and resilience/fault-tolerance methods for large-scale, massively parallel, distributed-memory, scientific computing systems. I have designed and developed a number of innovative, high-performance, and resilient implementations of low-level system software for several HPC platforms including the Cray Red Storm (XT3) machine at Sandia National Laboratories. My research interests include the design and construction of operating systems for massively parallel processing machines and innovative application and system-level fault-tolerance mechanisms for HPC.

 

Education/Background

I received a BS in mathematics and a BS in computer science from New Mexico Tech in 2000, and an MS (2008) and a PhD (2011) in computer science from the University of New Mexico.

Selected Publications & Presentations

2013
2012
  • Barrett, Brian W, Richard F Barrett, James M Brandt, Ron B Brightwell, Matthew L Curry, Nathan D Fabian, Kurt B Ferreira, Ann C Gentile, K Scott Hemmert, Suzanne M Kelly, Ruth A Klundt, James H Laros, III, Vitus J Leung, Michael J Levenhagen, Gerald F Lofstead, Kenneth D Moreland, Ron A Oldfield, Kevin T Pedretti, Arun F Rodrigues, David Thompson, Tom Tucker, Lee H Ward, John P Van Dyke, Courtenay T Vaughan, Kyle B Wheeler, "Report of Experiments and Evidence for ASC L2 Milestone 4467 - Demonstration of a Legacy Application's Path to Exascale," SAND Report, March 2012.
  • Ferreira, Kurt, Kevin T. Pedretti, Ron Brightwell, Patrick Bridges, David Fiala, Frank Mueller, "An Operating System Resilient to DRAM Failures," Presentation, Workshop on Exascale Operating Systems and Runtime Software, October 2012.
  • Ferreira, Kurt, Kevin Pedretti, Patrick Bridges, Ron Brightwell, David Fiala, "Evaluating Operating System Vulnerability to Memory Errors," Workshop Paper, Workshop on Runtime and Operating Systems for Supercomputers, June 2012.
  • Ferreira, Kurt Brian, David Fiala, Frank Mueller, Christian Engelmann, Ron Brightwell, Rolf Riesen, "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing," Conference Paper, International Conference for High-Performance Computing, Networking, Storage, and Analysis (SC '12), November 2012.
  • Fiala, David, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, Ron Brightwell, "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing," Conference Paper, ACM/IEEE International Conference on High-Performance Computing, Networking, Storage, and Analysis, November 2012.
  • Ibtesham, Dewan, Dorian Arnold, Patrick G Bridges, Kurt B Ferreira, Ron Brightwell, "On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-based Fault Tolerance," Conference Paper, International Conference on Parallel Processing, September 2012.
  • Kharbas, Kishor, David A. Fiala, Frank Mueller, Kurt B. Ferreira, Christian Engelmann, "Combining Partial Redundancy and Checkpointing for HPC," Conference Paper, The 32nd International Conference on Distributed Computing Systems (ICDCS2012), June 2012.
  • Laros, James H, III, Kevin T Pedretti, Suzanne M Kelly, Wei Shu, Kurt B Ferreira, John P Van Dyke, Courtenay T Vaughan, "Energy-Efficient High Performance Computing - Measurement and Tuning," Book, October 2012.
  • Levy, Scott, Kurt B. Ferreira, Patrick G. Bridges, Dorian Arnold, David Fiala, "Exploiting Content Similarity to Improve Memory Performance in Exascale Systems," Workshop Paper, Workshop on Exascale Operating Systems and Runtime Software, October 2012.
  • Riesen, Rolf, Kurt Brian Ferreira, Dilma Da Silva, Pierre Lemarinier, Dorian Arnold, Patrick G. Bridges, "Alleviating Scalability Issues of Checkpointing Protocols," Conference Paper, International Conference for High-Performance Computing, Networking, Storage, and Analysis (SC '12), November 2012.
  • Rodrigues, Arun F, Keren Bergman, David P Bunde, Elliot Cooper-Balis, Kurt B Ferreira, K Scott Hemmert, Brian W Barrett, Cassandra Versaggi, Robert Hendry, Bruce Jacob, Hyesoon Kim, Vitus J Leung, Michael J Levenhagen, Mitchelle Rasquinha, Rolf Riesen, Paul Rosenfeld, Maria del Carmen Ruiz Varela, Sudhakar Yalamanchili, "Improvements to the Structural Simulation Toolkit," Conference Paper, SIMUTools, March 2012.
  • Stearley, Jon R, Kurt Ferreira, David R Robinson, Jim Laros, Kevin Pedretti, Dorian Arnold, Patrick Bridges, Rolf Riesen, "Does Partial Replication Pay Off?," Workshop Paper, Workshop on Fault Tolerance at Extreme Scale (FTXS12), June 2012.
2011
  • Bridges, Patrick G., Mark Hoemmen, Kurt B. Ferreira, Michel A. Heroux, Philip Soltero, Ron Brightwell, "Cooperative Application/OS DRAM Fault Recovery," Workshop Paper, 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011), September 2011.
  • Ferreira, Kurt B., Rolf Riesen, Patrick G. Bridges, Dorian C. Arnold, James H. Laros, III, Ron A. Oldfield, Kevin T. Pedretti, Ron Brightwell, Jon R Stearley, "Evaluating the Viability of Process Replication Reliability for Exascale Systems," Conference Paper, International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11), November 2011.
  • Ferreira, Kurt B., Patrick G. Bridges, Ron Brightwell, Kevin T. Pedretti, "Impact of System Design Parameters on Application Noise Sensitivity," Journal Article, Journal of Cluster Computing, Accepted/Published September 2011.
  • Ferreira, Kurt B., "Keeping Checkpoint/Restart Viable for Exascale Systems," Thesis/Dissertation, University of New Mexico, July 2011.
  • Ferreira, Kurt B., Rolf Riesen, Ron Brightwell, Patrick G. Bridges, Dorian C. Arnold, "Libhashckpt: Hash-based Incremental Checkpointing Using GPUs," Conference Paper, EuroMPI 2011, September 2011.
  • Fiala, David, Kurt B. Ferreira, Frank Mueller, Christian Engelmann, "A Tunable, Software-based DRAM Error Detection and Correction Library for HPC," Workshop Paper, 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011), January 2011.
  • Ibtesham, Dewan, Dorian C. Arnold, Kurt B. Ferreira, Patrick G. Bridges, "On the Viability of Checkpoint Compression for Extreme-Scale Fault Tolerance," Workshop Paper, 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011), September 2011.
  • Leon, Edgar A., Rolf Riesen, Kurt B. Ferreira, Arthur B. Maccabe, "Cache Injection for Parallel Applications," Conference Paper, The 20th International ACM Symposium on High-Performance Parallel and Distributed Computing, June 2011.
  • Riesen, Rolf, Kurt B. Ferreira, Maria Ruiz Varela, Michela Taufer, Arun Rodrigues, "Simulating Application Resilience at Exascale," Workshop Paper, 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011), September 2011.
2010
  • Brightwell, Ron, Kurt B. Ferreira, Rolf Riesen, "Transparent Redundant Computing with MPI," Conference Paper, EuroMPI 2010, September 2010.
  • Ferreira, Kurt, Patrick Bridges, Ron Brightwell, Kevin Pedretti, "The Impact of System Design Parameters on Application Noise Sensitivity," Conference Paper, IEEE International Conference on Cluster Computing, September 2010.
  • Ferreira, Kurt B., Patrick G. Bridges, Ron Brightwell, Kevin T. Pedretti, "The Impact of System Design Parameters on Application Noise Sensitivity," Conference Paper, IEEE International Conference on Cluster Computing, September 2010.
  • Oldfield, Ron A, Ron Brightwell, Kevin Pedretti, Kurt Ferreira, Rolf Riesen, James Laros, Suzanne Kelly, Todd Kordenbrock, "System Software Research for Extreme-Scale Computing," Presentation, LCF Seminar, March 2010.
  • Riesen, Rolf, Kurt Ferreira, Jon Stearley, "See Applications Run and Throughput Jump: The Case for Redundant Computing in HPC," Workshop Paper, 1st International Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2010) in conjunction with The 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2010), June 2010.
  • Stearley, Jon R, Kurt Ferreira, Rolf Riesen, David Robinson, "Reliability Modeling of Redundancy for HPC Systems," Conference Paper, Supercomputing 2010 (submitted), November 2010.
2009
2008
  • Brightwell, Ron, Kevin Pedretti, Kurt Ferreira, "Instrumentation and Analysis of MPI Queue Times on the SeaStar High-Performance Network," Workshop Paper, Workshop on Advanced Networking and Communication, August 2008.
  • Ferreira, Kurt, Kevin T. Pedretti, Michael Levenhagen, Ron Brightwell, "Exploring Memory Management Strategies in Catamount," Conference Paper, Cray User Group (CUG), May 2008.
  • Ferreira, Kurt B, Ron Brightwell, Patrick Bridges, "Characterizing Application Sensitivity to OS Interference Using Kernel-Level Noise Injection," Conference Paper, International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'08), November 2008.
  • Ferreira, Kurt B, Kevin Pedretti, Ron Brightwell, "Virtual Memory Mapping and its Impact on Performance: A Case Study," Presentation, 9th LCI International Conference on High-Performance Clustered Computing, April 2008.

Awards & Recognition

2010
  • Pedretti, Kevin, Ron Brightwell, Kurt Ferreira, Suzanne Kelly, Michael Levenhagen, Courtenay Vaughan, Employee Recognition Award, Kitten Operating System Virtualization Team, Sandia National Laboratories, March 23, 2010.
  • Pedretti, Kevin, Ron Brightwell, Kurt Ferreira, Suzanne Kelly, Michael Levenhagen, Courtenay Vaughan, Employee Recognition Award, Kitten Virtualization Team, Sandia National Laboratories, "For demonstrating the viability of virtual machine technology for HPC applications at a scale two orders of magnitude beyond any previous study," July 10, 2010.