USRC’s LANL collaborators drive the research and development in a number of computer science areas. Below is the current list of LANL collaborators.
Dr. Nathan DeBardeleben
USRC Director and Senior Research Scientist
Nathan conducts resilience and fault-tolerance research with research scientists, students, and visiting professors. In addition to his interests in hardware reliability, algorithmic design, and computing on (perhaps extremely) unreliable hardware, Nathan is project lead of the Fine-Grained Soft Error Fault Injection (F-SEFI) framework. F-SEFI is a tool for exploring how real applications running on real systems tolerate emulated soft errors. The tool injects soft errors with extreme precision at specific points in a running application on real hardware, with real OS kernels and real middleware (not PIN-based, not LLVM-based, not source code modification). F-SEFI builds on an open source virtual machine and processor emulator to emulate faulty hardware, but does so only in ways that affect the application of interest, thereby making it more tractable to study how applications respond to specific types of soft errors. Additionally, Nathan conducts field studies on Department of Energy supercomputers to study memory and processor resilience, including correctable faults and uncorrectable errors.
Michael has been working with UNIX systems for over twenty years. From 1999 to 2010 he was a member of LANL's Performance and Architecture Lab (PAL, http://www.c3.lanl.gov/pal/), focusing on the performance of large-scale systems. Currently he is the team leader for Ultrascale Systems Research, focusing on resilient, scalable systems software for large-scale systems. He received his MS in Electrical Engineering and his BS in Computer Engineering from the University of New Mexico.
Dr. Howard Pritchard
Howard Pritchard is a researcher in HPC network software. He is actively involved in the Open MPI project and the Open Fabrics Interfaces Working Group. He is also involved in the OpenSHMEM community, and leads a project to combine this programming model with the Habanero asynchronous task-based runtime. Before joining USRC and LANL, Howard was a Principal Engineer at Cray Inc., where he worked on the design and implementation of various components of the Cray XE and XC network software stack.
Dave works at the intersection of applications and architectures. HPC software environments encompass what is needed by users, developers, and systems staff. Workflow characterization and quantification are used to map those needs to captured performance metrics, which in turn help set the direction for that community as well as for vendor architecture efforts. Dave is also involved in cross-lab programming environment open-source projects, monitoring efforts, and university projects.
Sean works on kernel-level support of systems in research and production at Los Alamos. At USRC he researches soft-error resilience and scalable system software.
Dr. Bradley Settlemyer
Brad Settlemyer is a storage systems researcher and systems programmer specializing in high performance computing. He received his Ph.D. in computer engineering from Clemson University in 2009 and works as a research scientist in Los Alamos National Laboratory's HPC Design group. He has published papers on emerging storage systems, long-distance data movement, network modeling, and storage system algorithms.
Dr. Laura Monroe
Laura is a researcher in resilience and novel computing techniques, especially probabilistic computing. Her current interest is the design of algorithms and systems to address expected increasing fault rates in hardware in a probabilistic manner. Another interest is the application of discrete mathematics to the design and understanding of computing systems. She also led the production visualization effort at LANL for many years, and was the originator and project leader of the recent redesign and redeployment of the LANL visualization corridor, encompassing the computing systems, networking, and display systems used for LANL ASC large-scale visualization. She served on the design teams for the Cielo and Trinity supercomputers and was one of the designers of the Viewmaster visualization compute cluster. She has published in the areas of probabilistic computing and algorithms, resilience, error-correcting codes, virtual reality, and visualization. She received her Ph.D. in Mathematics and Computer Science in the field of error-correcting codes, working with Dr. Vera Pless.
Hugh participated in the design and implementation of the Linux Noise Detective. The Linux Noise Detective is a Linux kernel module and a GUI that collect process data directly from the kernel (on multiple cluster nodes simultaneously) and analyze the data to determine the sources of system noise. He also participated in the design and development of the XGet file transfer software. XGet scalably transfers files to nodes within a cluster by building a tree of participants and delegating serving duties to optimal slave nodes. He participated in the development of the XCPU cluster management system. XCPU keeps the state of the cluster distributed across all nodes, allowing easy configuration of hot-spare management nodes and graceful failover that doesn't require canceling the running jobs in case of head node failure.
Lissa is an applied machine learning researcher and data scientist working on the resilience and fault-tolerance team. At USRC, her work spans using statistical relational models for fault characterization and mitigation as well as developing anomaly detection techniques for large-scale monitoring of supercomputing facilities. Before joining USRC, Lissa contributed to quantum algorithms for machine learning at LANL’s Center for Nonlinear Studies. Her background, including work on social network analysis with the Human Language Technology group at MIT Lincoln Laboratory and a short time at a startup back in Massachusetts, is primarily in the development and application of probabilistic graphical models to new relational and/or temporal domains. Lissa received her MS in Computer Science from the University of Massachusetts Amherst and her BA, also in Computer Science, from Amherst College.
Lucho co-developed the v9fs filesystem, which is now a standard part of the Linux kernel distribution. His previous work includes the CellFS programming model and the XCPU and XCPU2 process-management systems, which addressed issues of large-scale system complexity, resiliency, and manageability. At USRC, Lucho works on scalable system software and accelerated access to application data.
Dr. Qiang Guan
Dr. Qiang Guan has been a computer scientist at Los Alamos National Laboratory and the Ultrascale Systems Research Center (USRC) since November 2015. He obtained his Ph.D. degree in Computer Science and Engineering from the University of North Texas, Denton, Texas, in 2014 (Ph.D. advisor: Dr. Song Fu). He received his M.S. degree in Information Engineering from Myongji University, Seoul, South Korea, in 2008 and his B.S. degree in Communication Engineering from Northeastern University, Shenyang, China, in 2005. His research interests include soft error fault injection, data visualization, virtualization, resilience, cloud performance modeling and optimization, cloud dependability and reliability, power management and green computing, resource management, data mining and machine learning, and signal and image processing.