Lissa Baseman presented the paper Automating DRAM Fault Mitigation By Learning From Experience (slides) in Industry Track III: Dependability Data and Security. USRC intern Olena Tkachenko provided much of the analysis for this work, and the paper is a collaboration with AMD and Sandia National Laboratories.
Dr. Tan presented at RADIANCE (International Workshop on Recent Advances in the DependabIlity AssessmeNt of Complex systEms). His presentation was entitled RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage using Register Vulnerability (slides) and was co-authored by other USRC members.
This week is the 26th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC). Opening the conference was the 7th Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop. FTXS is a workshop co-created by USRC’s Dr. Nathan DeBardeleben, who has run it ever since.
In the International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2017) on Tuesday, the paper UNITY: Unified Memory and File Space will be presented. This work includes contributions by USRC’s Mike Lang, Latchesar Ionkov, and Doug Otstott.
USRC’s Dr. Qiang Guan and Dr. Nathan DeBardeleben have a paper in the main conference (19% acceptance rate) primarily authored by USRC alumnus Bo Fang, entitled LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures (slides).
The poster session included work by USRC alumnus Song Huang and work in progress by current USRC PhD student, Zongze Li.
Summer 2017 is here and USRC has a great group of new and returning interns.
Graduate Student, Carnegie Mellon University
Qing is a 4th-year Ph.D. student in the Carnegie Mellon University Computer Science Department. At Carnegie Mellon, Qing works with Professor Garth Gibson, researchers at the Carnegie Mellon Parallel Data Lab, and scientists at Los Alamos National Lab (LANL) on file system metadata designs (IndexFS and DeltaFS) for massive-scale science applications. Their IndexFS paper won the Best Paper Award at the Supercomputing Conference (SC) 2014. At NMC Ultra System Research Center (USRC), Qing works with Brad Settlemyer and other USRC and LANL scientists on VPIC and DeltaFS integration, and on high-performance metadata implementation and demonstration.
Kai is currently a Computer Science PhD student in EECS at the University of California, Merced. Before coming to UC Merced, he earned his master's degree in Computer Science and Engineering from Michigan State University in 2016. His research falls broadly within High Performance Computing (Large-Scale Parallel Systems). Specifically, he focuses on the following areas: (i) Parallel programming models and runtime; (ii) Performance optimization and modeling; (iii) Resilience and Consistency; (iv) Non-volatile memory; (v) Fault Tolerance in Extreme-Scale Parallel Systems. At USRC, Kai is working on building fault models for serial codes and predicting faults in parallel codes.
Olena graduated with a B.S. in Computer Science from FIU's School of Computing and Information Sciences in Miami. At FIU she did research at the VISA lab as a URA working on masquerading network traffic for Mission Critical Cloud Computing, and isolation benchmarking of containers. While working at LANL as a PostBac she designed an application model (IMCSim) of the implicit Monte Carlo particle code IMC using the Performance Prediction Toolkit (PPT), a discrete-event simulation-based modeling framework for predicting code performance on a large range of parallel platforms. At USRC she is currently working on predicting DRAM fault locations in HPC systems using structured learning and various ML techniques. Her research interests include HPC, ML, and fault prediction/mitigation.
Michael evaluates and designs distributed file system metadata management systems. His publications prototype ideas on CephFS, the file system that uses the Ceph distributed object store. His lab also has a special interest in storage system programmability and reproducibility in systems research. At USRC, Michael is working with Brad Settlemyer on load balancing policies for HXHIM, an HPC key-value store.
PhD Student, Computer Science, Illinois Institute of Technology
I'm a second-year PhD student majoring in Computer Science at Illinois Institute of Technology (IIT), advised by Dr. Ioan Raicu. My research interest is HPC. Currently at USRC I'm working on Burst Buffer simulation in Dragonfly networks. The goal is to develop a simulator that models supercomputers with a dragonfly network and burst buffer storage architecture. With such a simulator, we will be able to carry out more research on problems such as system bottlenecks and burst-buffer-related scheduling.
Scott is a graduate student studying network errors on LANL's Trinity supercomputer. While obtaining his BS in Computer Science with a minor in Applied Mathematics from Coastal Carolina University, Scott worked on various projects for the USRC, ranging from fault injection studies with F-SEFI to analyzing ECC of interest to the team. In the fall, Scott will begin the direct PhD track at The Ohio State University.
PhD Student, Computer Science, North Carolina State University
Abida will be a PhD student in computer science at North Carolina State University. She has a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in computer science from Georgia Tech. During her time at USRC, Abida will help with the project Latent Anomaly Detection for Supercomputing System Performance.
Alexandra is working on a bachelor's degree in Computer Science and Mathematics at Rollins College. At the USRC she is working on creating a model of system logs from high performance computers. The model will later be used in anomaly detection.
Rusty graduated with his B.S. in computer science from the School of Computing at Clemson University in May 2016. He will begin pursuing his master's in Computer Science at Clemson University in Fall 2016. Rusty has been working with the USRC since the summer of 2014. His initial work was with Dr. Nathan DeBardeleben and Dr. William Jones concerning Algorithmic-Based Fault Tolerant Matrix Multiplication. His current work is focused on quantifying the resiliency of Algorithmic-Based Fault Tolerant Fast Fourier Transforms and creating an interface for the F-SEFI fault injector. His research interests include High Performance Computing, Operating Systems, and Resilience/Fault Tolerance.
Ashley is currently a PhD student studying Computer Science at New Mexico State University. She has a BS and MS in Electrical Engineering, also from NMSU. Her research is on multivariate time series prediction and segmentation. At USRC she is working on an anomaly detection project focused on detecting anomalies in energy data.
In case you are wondering, USRC has been around since roughly 2010. While we will try to put some of that older content onto this page over time, we will generally focus on USRC from this point forward.