USRC’s resilience research is largely composed of fault injection with the FSEFI/PFSEFI tool, data analytics, and machine learning. The analytics and machine learning work is focused on learning from field data from supercomputers (telemetry, failures, job logs, etc.).
USRC’s resilience research is led by Dr. Nathan DeBardeleben.
Below are the staff considered to be in the “Resilience” group.
Dr. Nathan DeBardeleben
USRC Director and Senior Research Scientist
Nathan conducts resilience and fault-tolerance research with research scientists, students, and visiting professors. His interests include hardware reliability, algorithmic design, and computing on (perhaps extremely) unreliable hardware. Nathan is project lead of the Fine-Grained Soft Error Fault Injection (F-SEFI) framework. F-SEFI is a tool for exploring how real applications running on real systems tolerate emulated soft errors. The tool injects soft errors with extreme precision at specific points in a running application on real hardware, with real OS kernels, and real middleware (not PIN-based, not LLVM-based, not source code modification). F-SEFI builds on an open-source virtual machine and processor emulator to emulate faulty hardware but does so only in ways that affect the application of interest, thereby making it more tractable to study how applications respond to specific types of soft errors. Additionally, Nathan conducts field studies on Department of Energy supercomputers to study memory and processor resilience, including correctable faults and uncorrectable errors.
Sean works on kernel-level support of systems in research and production at Los Alamos. At USRC he researches soft-error resilience and scalable system software.
Dr. Laura Monroe
Laura is a researcher in resilience and novel computing techniques, especially probabilistic computing. Her current interest is the design of algorithms and systems to address expected increasing fault rates in hardware in a probabilistic manner. Another interest is the application of discrete mathematics to the design and understanding of computing systems. She also led the production visualization effort at LANL for many years, and was the originator and project leader of the recent redesign and redeployment of the LANL visualization corridor, encompassing the computing systems, networking, and display systems used for LANL ASC large-scale visualization. She served on the design teams for the Cielo and Trinity supercomputers and was one of the designers of the Viewmaster visualization compute cluster. She has published in the areas of probabilistic computing and algorithms, resilience, error-correcting codes, virtual reality, and visualization. She received her Ph.D. in Mathematics and Computer Science in the field of Error-Correcting Codes, working with Dr. Vera Pless.
Lissa is an applied machine learning researcher and data scientist working on the resilience and fault-tolerance team. At USRC, her work spans using statistical relational models for fault characterization and mitigation as well as developing anomaly detection techniques for large-scale monitoring of supercomputing facilities. Before joining USRC, Lissa contributed to quantum algorithms for machine learning at LANL’s Center for Nonlinear Studies. Her background, including work on social network analysis with the Human Language Technology group at MIT Lincoln Laboratory and a short time at a startup back in Massachusetts, is primarily in the development and application of probabilistic graphical models to new relational and/or temporal domains. Lissa received her MS in Computer Science from the University of Massachusetts Amherst and her BA, also in Computer Science, from Amherst College.
Dr. Qiang Guan
Dr. Qiang Guan has been a computer scientist at Los Alamos National Laboratory and the Ultra-scale System Research Center (USRC) since November 2015. He obtained his Ph.D. degree in Computer Science and Engineering from the University of North Texas, Denton, Texas, in 2014 (Ph.D. advisor: Dr. Song Fu). He received his M.S. degree in Information Engineering from Myongji University, Seoul, South Korea, in 2008 and his B.S. degree in Communication Engineering from Northeastern University, Shenyang, China, in 2005. His research interests include soft error fault injection, data visualization, virtualization, resilience, cloud performance modeling and optimization, cloud dependability and reliability, power management and green computing, resource management, data mining and machine learning, signal processing, and image processing.
Dr. Paolo Rech
Associate Professor, UFRGS
Paolo Rech received his master's and Ph.D. degrees from Padova University, Padova, Italy, in 2006 and 2009, respectively. He is currently an associate professor at the Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil. His main research interests include the evaluation and mitigation of radiation-induced effects in large-scale HPC centers and safety-critical applications. Paolo led the group that performed the first radiation experiment on GPUs in 2011. Since then, he has been studying the effects of radiation in parallel HPC devices. He is now collaborating with NVIDIA and AMD to evaluate and enhance the reliability of modern architectures. In collaboration with LANL and USRC, he is designing experimentally tuned selective hardening strategies to detect critical SDCs without unnecessary overhead. Lately, Paolo has been working on the reliability of automotive applications, understanding error propagation in neural networks, and designing novel hardening solutions for embedded safety-critical applications.
Dr. William M Jones
Associate Professor and Chair, Coastal Carolina University
Will is an associate professor and chair of the Department of Computing Sciences at Coastal Carolina University (CCU). He attended Clemson University, where he obtained a BS ('99), MS ('00), and PhD ('05), each in Computer Engineering. Before accepting a position at CCU, Will was an assistant professor in the Department of Electrical and Computer Engineering at the United States Naval Academy, as well as an adjunct professor at Clemson University in the ECE department and at Tri-County Technical College in the Department of Mathematics. His research interests include parallel computing, parallel file systems, computational grids, job scheduling, resilience, fault injection, performance evaluation and modeling, and discrete event simulation. In addition to traditional computer science courses, he also enjoys teaching computer architecture, digital logic design, FPGA programming, and AC/DC circuit analysis. Will has been investigating the behavior of ABFT algorithms in the presence of hardware and memory faults through the use of F-SEFI, a soft error fault injector. This work has been in collaboration with Claude Davis, a Clemson University master's student; Scott Lavigne, a CCU CS undergraduate; and several members of the HPC-5 group, including Nathan DeBardeleben, Laura Monroe, Sean Blanchard, and Qiang Guan.
Dr. Song Fu
Assistant Professor in Computer Science and Engineering, University of North Texas
Song Fu is an Assistant Professor in the Department of Computer Science and Engineering at the University of North Texas. His research focuses on reliability and energy efficiency of parallel and distributed systems. Song works with Nathan DeBardeleben, Mike Lang, and the USRC Systems Group on resilience, fault tolerance, and power management of ultra-scale computers. The goal is to reduce the vulnerability of HPC applications and systems to soft errors and failures and to improve power utilization to maximize machine room throughput.
Dr. Satyajayant Misra
Assistant Professor, Computer Science Dept, New Mexico State University
Dr. Satyajayant Misra’s research interests are anonymity, security, and survivability in wireless sensor networks, wireless ad hoc networks, and vehicular networks. He is also interested in the design of algorithms for energy-harvesting wireless sensor networks and for supporting real-time and multimedia communication in wireless networks. Dr. Misra works with USRC on resilience, fault tolerance, and load balancing in ultra-scale supercomputing architectures. He concentrated on optimization of service placement in supercomputing networks under various operating constraints. His primary goal was to improve system utilization, resilience, and fault tolerance and to reduce system bottlenecks.
Dr. Dorian Arnold
Professor, UNM Dept of Computer Science, University of New Mexico
Dorian is an assistant professor in the Department of Computer Science at the University of New Mexico. His research focuses on the performance and reliability of extremely large-scale systems with tens of thousands, hundreds of thousands, or even millions of processing elements.
Dorian is working with Mike Lang, Hugh Greenberg, and the USRC Systems Group on the Redfish Project. This group investigates the basic, general computation, communication, and storage primitives that underlie HPC system services and provides a library of building blocks that offers a flexible, resilient, and scalable implementation of these primitives.
PhD Student, University of California, Merced
Kai is currently a Computer Science PhD student in EECS at the University of California, Merced. Before coming to UC Merced, he received his master's degree in Computer Science and Engineering from Michigan State University in 2016. His research falls broadly within High Performance Computing (Large-Scale Parallel Systems). Specifically, he focuses on the following areas: (i) Parallel programming models and runtime; (ii) Performance optimization and modeling; (iii) Resilience and Consistency; (iv) Non-volatile memory; (v) Fault Tolerance in Extreme-Scale Parallel Systems. At USRC, Kai is working on building fault models for serial codes and predicting faults in parallel codes.
Dr. Li Tan
Postdoctoral Researcher, Los Alamos National Laboratory
Li Tan graduated with a Ph.D. degree in Computer Science from the University of California, Riverside (UCR) in 2015. His chief research interest is High Performance Computing (HPC), in particular improving resilience/reliability and energy/power efficiency for high performance scientific algorithms and applications, and software debugging in large-scale HPC environments. At USRC, he works on fine-grained resilience and low-power modeling and provisioning for HPC applications, using fault injection and near-threshold voltage reduction techniques. He has served as a reviewer for prestigious conferences and journals on high performance parallel and distributed computing, such as SC, IPDPS, PACT, CCGrid, IEEE TPDS, IJHPCA, and JSS. He received the Dean's Distinguished Fellowship from UCR in 2010. He is a Member of the IEEE and a Member of the ACM.
Graduate Student, Ohio State University
Scott is a graduate student studying network errors on LANL's Trinity supercomputer. While obtaining his BS in Computer Science with a minor in Applied Mathematics from Coastal Carolina University, Scott worked on various projects for the USRC, ranging from fault injection studies with F-SEFI to analyzing ECC of interest to the team. In the fall, Scott will begin the direct PhD track at The Ohio State University.
Rusty H Davis
Graduate Student, Clemson University
Rusty graduated with his B.S. in Computer Science from the School of Computing at Clemson University in May 2016. He will begin pursuing his master's in Computer Science at Clemson University in Fall 2016. Rusty has been working with the USRC since the summer of 2014. His initial work was with Dr. Nathan DeBardeleben and Dr. William Jones concerning Algorithmic-Based Fault Tolerant Matrix Multiplication. His current work is focused on quantifying the resiliency of Algorithmic-Based Fault Tolerant Fast Fourier Transforms and creating an interface for the F-SEFI fault injector. His research interests include High Performance Computing, Operating Systems, and Resilience/Fault Tolerance.
Post Bachelor, Los Alamos National Laboratory
Heather graduated from the University of Georgia with a B.S. in Computer Science. At USRC, she will study faults that occur in computer memory.