All times are in Pacific Daylight Time (PDT).
7:00 - 7:10 Welcome from the organizers
7:10 - 7:50 Dynamic and Intelligent Workflows with eFlows4HPC
Invited Talk by Prof. Rosa Badia, Barcelona Supercomputing Center, Spain
Distributed computing infrastructures are evolving from traditional setups to environments that involve sensors, edge devices, and instruments, as well as high-end computing systems such as clouds and HPC clusters. A key aspect is how to describe and develop the applications to be executed on such platforms.
What is more, data analytics and artificial intelligence in general are in high demand in current HPC applications. However, the methodologies to develop workflows that integrate HPC simulations with data analytics are not well established. The eFlows4HPC project has recently started with the goal of providing a workflow software stack and an additional set of services to enable the integration of HPC simulation and modelling with big data analytics and machine learning in scientific and industrial applications. The project will demonstrate its advances through three application Pillars with high industrial and social relevance: manufacturing, climate, and urgent computing for natural hazards; these applications will help to prove how forthcoming efficient HPC and data-centric applications can be developed with new workflow technologies. The talk will present the motivation, challenges and project work plan.
Rosa M. Badia holds a PhD in Computer Science (1994) from the Technical University of Catalonia (UPC). She is the manager of the Workflows and Distributed Computing research group at the Barcelona Supercomputing Center (BSC). She is considered one of the key researchers in parallel programming models for multicore and distributed computing due to her contributions to task-based programming models over the last 15 years. The research group focuses on PyCOMPSs/COMPSs, a parallel task-based programming model for distributed computing, and its application to the development of large heterogeneous workflows that combine HPC, Big Data and Machine Learning. The group also does research on dislib, a parallel machine learning library built on PyCOMPSs. Dr. Badia has published nearly 200 papers in international conferences and journals on the topics of her research. She has been very active in projects funded by the European Commission and in contracts with industry. She has been actively contributing to the BDEC international initiative and is a member of the HiPEAC Network of Excellence. She received the Euro-Par Achievement Award 2019 for her contributions to parallel processing, and the DonaTIC award, category Academia/Researcher, in 2019. Rosa Badia is the PI of eFlows4HPC.
7:50 - 8:15 A Distributed Multi-GPU System for Large-Scale Node Embedding at Tencent
Wanjing Wei, Yangzihao Wang, Pin Gao, Shijie Sun, Donghai Yu, Tencent Ltd., China
Abstract: Real-world node embedding applications often contain hundreds of billions of edges with high-dimension node features. Scaling node embedding systems to efficiently support these applications remains a challenging problem. In this paper, we present a high-performance multi-GPU node embedding system. It uses model parallelism to split node embeddings onto each GPU's local parameter server, and data parallelism to train these embeddings on different edge samples in parallel. We propose a hierarchical data partitioning strategy and an embedding training pipeline to optimize both communication and memory usage on a GPU cluster. With the decoupled design of CPU tasks (random walk) and GPU tasks (embedding training), our system is highly flexible and can fully utilize all computing resources on a GPU cluster. Compared with the current state-of-the-art multi-GPU single-node embedding system, our system achieves 5.9x-14.4x speedup on average with competitive or better accuracy on open datasets. Using 40 NVIDIA V100 GPUs on a network with almost three hundred billion edges and more than one billion nodes, our implementation requires only 3 minutes to finish one training epoch.
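The combination the abstract describes, splitting the embedding table across per-GPU parameter servers (model parallelism) while each worker trains on its own edge samples (data parallelism), can be illustrated with a minimal sketch. This is a hypothetical toy in plain Python, not the paper's implementation; all names, the hash-based partitioning, and the lazy initialization are illustrative assumptions.

```python
import random

NUM_GPUS = 4
EMBED_DIM = 8

class ShardedParameterServer:
    """One shard of the embedding table, as held by a single GPU."""
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.table = {}  # node_id -> embedding vector

    def lookup(self, node_id):
        # Lazily initialize embeddings for unseen nodes.
        if node_id not in self.table:
            self.table[node_id] = [random.uniform(-0.1, 0.1)
                                   for _ in range(EMBED_DIM)]
        return self.table[node_id]

def owner(node_id):
    # Hash partitioning: each node's embedding lives on exactly one shard.
    return node_id % NUM_GPUS

shards = [ShardedParameterServer(i) for i in range(NUM_GPUS)]

def get_embedding(node_id):
    return shards[owner(node_id)].lookup(node_id)

# Data parallelism: each worker trains on its own edge samples, fetching
# each endpoint's embedding from whichever shard owns it.
edge_batch = [(0, 5), (3, 7), (2, 9)]
for src, dst in edge_batch:
    e_src, e_dst = get_embedding(src), get_embedding(dst)
```

In a real multi-GPU system the `lookup` calls would be batched remote fetches, which is why the paper's hierarchical partitioning and pipelining of communication matter.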
8:15 - 8:55 The Three Pillars of Large-scale Deep Learning
Invited Talk by Prof. Torsten Hoefler, ETH Zurich, Switzerland
Abstract: High-performance deep learning on supercomputers drives the push to train larger and larger models to achieve surprising results. The recent trend to scale natural language models to more parameters continues to provide better results. Gigantic language models such as GPT-3 require large-scale machines to be trained, however, even smaller models such as ResNet-50 can benefit from HPC-style training infrastructures. In this talk, we discuss the three pillars for such infrastructures: high-performance I/O, compute, and communication. All three are necessary to train large-scale models in practice and we discuss the newest developments and ideas in those three aspects.
Bio: Torsten Hoefler directs the Scalable Parallel Computing Laboratory (SPCL) at D-INFK ETH Zurich. He received his PhD degree in 2007 at Indiana University and started his first professor appointment in 2011 at the University of Illinois at Urbana-Champaign.
8:55 - 9:20 Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences
Quentin Anthony, Lang Xu, Hari Subramoni, Dhabaleswar Panda, Ohio State University, USA
Abstract: Deep Learning (DL) models for super-resolution (DLSR) are an emerging trend in response to the growth of ML/DL applications requiring high-resolution images. DLSR methods have also shown promise in domains such as medical imaging, surveillance, and microscopy. However, DLSR models are extremely computationally demanding, and require unreasonably long training times on modern Volta GPUs. In our experiments, we observed only 10.3 images/second on a single Volta GPU for training EDSR, a state-of-the-art DLSR model for single-image super-resolution. In comparison, a Volta GPU can process 360 images/second while training ResNet-50, a state-of-the-art model for image classification. Therefore, we believe supercomputers provide a good candidate to speed up DLSR model training. In this paper, we select EDSR as the representative DLSR PyTorch model. Further, we introduce Horovod-based distributed EDSR training. However, we observed poor default EDSR scaling performance on the Lassen HPC system at Lawrence Livermore National Laboratory. To investigate the performance degradation, we perform exhaustive communication profiling. These profiling insights are then used to optimize CUDA-Aware MPI for DLSR models by ensuring advanced MPI designs involving CUDA IPC and registration caching are properly applied by DL frameworks. We present a comprehensive scaling study of EDSR with MVAPICH2-GDR and NCCL up to 512 GPUs on Lassen. We demonstrate an improvement in scaling efficiency of 15.6% over default Horovod training, which translates to a 1.26× speedup in training performance.
9:20 - 9:40 Break
9:40 - 10:20 AI for Social Impact: Results from Multi-agent Reasoning and Learning in the Real World
by Prof. Milind Tambe, Harvard University, USA and Google Research, India
10:20 - 10:45 Distributed Deep Learning Using Volunteer Computing-Like Paradigm
Medha Atre, Birendra Jha, Ashwini Rao, Eydle Inc., USA
Abstract: Use of Deep Learning (DL) in commercial applications such as image classification, sentiment analysis and speech recognition is increasing. When training DL models with a large number of parameters and/or large datasets, the cost and speed of training can become prohibitive. Distributed DL training solutions that split a training job into subtasks and execute them over multiple nodes can decrease training time. However, the cost of current solutions, built predominantly for cluster computing systems, can still be an issue. In contrast to cluster computing systems, Volunteer Computing (VC) systems can lower the cost of computing, but applications running on VC systems have to handle fault tolerance, variable network latency and heterogeneity of compute nodes, and the current solutions are not designed to do so. We design a distributed solution that can run DL training on a VC system by using a data parallel approach. We implement a novel asynchronous SGD scheme called VC-ASGD suited for VC systems. In contrast to traditional VC systems that lower cost by using untrustworthy volunteer devices, we lower cost by leveraging preemptible computing instances on commercial cloud platforms. By using preemptible instances that require applications to be fault tolerant, we lower cost by 70-90% and improve data security.
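The asynchronous-SGD pattern the abstract builds on can be sketched in a few lines: workers pull the current weights, compute a gradient, and push it back, with the server damping stale updates. This is a generic minimal sketch, not VC-ASGD itself; the staleness-damping rule, objective, and all names are illustrative assumptions.

```python
import copy

class ParameterServer:
    """Central weight store that applies asynchronous gradient pushes."""
    def __init__(self, dim):
        self.weights = [0.0] * dim
        self.version = 0  # incremented on every applied update

    def pull(self):
        return copy.copy(self.weights), self.version

    def push(self, grad, worker_version, lr=0.1):
        # Damp gradients computed against old weights -- a common
        # staleness-aware trick in async SGD (an assumption here, not
        # necessarily VC-ASGD's exact rule).
        staleness = self.version - worker_version
        scale = lr / (1 + staleness)
        self.weights = [w - scale * g for w, g in zip(self.weights, grad)]
        self.version += 1

def worker_step(ps, target):
    weights, version = ps.pull()
    # Toy per-coordinate least-squares gradient: d/dw (w - x)^2 = 2(w - x).
    grad = [2 * (w - x) for w, x in zip(weights, target)]
    ps.push(grad, version)

ps = ParameterServer(dim=2)
for _ in range(50):          # workers would run this loop concurrently
    worker_step(ps, [1.0, 2.0])
```

On a preemptible instance, the pull/push protocol is what makes fault tolerance cheap: a preempted worker loses only its in-flight gradient, and a restarted worker simply pulls fresh weights and continues.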
10:45 - 11:25 Riding the Composable Systems Wave to Improve DNN Distributed Training Performance
by Prof. Christopher Carothers, RPI, USA
Abstract: Composable systems provide hardware and software mechanisms to decouple physical compute and storage resources from their physical locations. When provisioned, the allocated set of resources more precisely meets application and workload demands, yielding a much more efficient utilization of data center resources. The Compute Express Link (CXL) protocol, which rides on top of the current PCIe bus protocol, is an emerging industry standard that could be used to create high-performance composable systems. For distributed training of large-scale neural networks as well as other HPC workloads, this protocol offers the potential of significantly improving network bandwidth and overall end-to-end packet latency compared to other HPC network technologies like InfiniBand. In this talk, we examine how a modern deep learning training algorithm performs on a future CXL-enabled composable system for the ResNet-50 benchmark across a range of architecture configurations.
Bio: Professor Chris Carothers is a faculty member in the Computer Science Department at Rensselaer Polytechnic Institute. He received the Ph.D., M.S., and B.S. degrees from the Georgia Institute of Technology in 1997, 1996, and 1991, respectively. Prior to joining RPI in 1998, he was a research scientist at the Georgia Institute of Technology. His research interests are focused on massively parallel computing, involving the creation of high-fidelity models of extreme-scale networks and computer systems. These models have been executed using nearly 2,000,000 processing cores on the largest leadership-class supercomputers in the world. Additionally, Professor Carothers serves as the Director of the Rensselaer Center for Computational Innovations (CCI). CCI is a partnership between Rensselaer and IBM. The center provides computational and storage resources to a diverse network of researchers, faculty, and students from Rensselaer, government laboratories, and companies across a number of science and engineering disciplines. The flagship supercomputer is an 8 petaflop (PF) IBM AC922 hybrid CPU/GPU supercomputer named "AiMOS" which serves as the primary testbed for the IBM AI Hardware Center.
11:25 - 11:40 Ex-NNQMD: Extreme-Scale Neural Network Quantum Molecular Dynamics
Pankaj Rajak (Argonne National Laboratory, USA), Thomas Linker, Ken-ichi Nomura, Anikeya Aditya, Kuang Liu (University of Southern California, USA), Kohei Shimamura, Shogo Fukushima (Kumamoto University, Japan), Ye Luo (Argonne National Laboratory, USA), Fuyuki Shimojo (Kumamoto University, Japan), Rajiv K. Kalia, Aiichiro Nakano and Priya Vashishta (University of Southern California, USA).
Abstract: Deep learning is revolutionizing countless scientific and engineering fields. In particular, the SC20 Gordon Bell award represented a breakthrough in molecular simulation, i.e., a 100-million-atom simulation with quantum-mechanical accuracy on the Summit supercomputer at ORNL using deep potential molecular dynamics (MD). Moving forward, while these simulations were performed only in gentle equilibrium conditions, far-from-equilibrium MD simulation involving light-induced electronic excited states finds numerous scientific and engineering applications. However, it remains a challenge to perform such far-from-equilibrium simulations at larger spatiotemporal scales, where a growing number of unphysical predictions of interatomic forces prohibits simulations involving larger numbers of atoms for longer times. In this paper, we propose a physically-based inductive bias, maximally-preserved Maxwell-Boltzmann (MPMB), to overcome this fidelity-scaling problem. Along with hybrid divide-and-conquer parallelization and single-node optimization using multithreading and data-parallel SIMD, the resulting Ex-NNQMD (extreme-scale neural network quantum molecular dynamics) algorithm has achieved unprecedented scales of far-from-equilibrium simulations: 1) a 5.1-billion-atom system with a parallel efficiency of 0.94, and 2) a sustained performance of 6.4 nanoseconds/day for a 10-million-atom system, both on 262,144 cores of the Theta supercomputer at the Argonne Leadership Computing Facility. Extended fidelity scaling and efficient parallelization have allowed us, for the first time, to study light-induced ferroelectric switching under extreme electronic excitation at experimentally relevant spatiotemporal scales with accuracy.
11:40 - 12:00 Break
12:00 - 12:40 Innovating across the AI stack for scale and speed
by Dr. Rania Khalaf, IBM Research AI, USA
12:40 - 13:20 Ray as the Unified Compute Substrate for Machine Learning Applications
Invited Talk by Dr. Zhe Zhang, Anyscale Inc. USA
Abstract: Today's applications increasingly involve Machine Learning, and in an increasingly integrated way. But using ML successfully is difficult. Each successful model requires complex data engineering, expensive training (in both hardware and human resources), and flexible inference with low latency. Developing ML infrastructure software to support these compute workloads is a huge challenge -- especially given the rapid evolution of ML theory and algorithms, and the explosion of data.
Since its inception a few years ago, Ray has been trying to address the problem by finding the right layer of abstraction, and building a solid architecture to support it. In this talk, I will introduce Ray's basic abstractions -- distributed Class / Method / Object, followed by Ray's main architectural components. Then I'll discuss how ML infrastructure teams have been using Ray to achieve groundbreaking improvements to productivity, scalability, and performance.
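The distributed Class / Method / Object abstractions the talk introduces can be illustrated without Ray itself. Below is a deliberately toy, single-process stand-in built on a thread pool: a `remote` decorator turns functions into tasks that return futures ("object references"), a `get()` call resolves them, and a stateful class plays the role of an actor. None of this is Ray's actual API or implementation; it is only a sketch of the programming model.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def remote(fn):
    # Attach a .remote() launcher that returns a future (an "object ref").
    fn.remote = lambda *args, **kwargs: _pool.submit(fn, *args, **kwargs)
    return fn

def get(ref):
    # Resolve an object reference by blocking on the future, like a get().
    return ref.result()

@remote
def square(x):
    return x * x

class Counter:
    """A stateful 'actor': concurrent method calls mutate shared state."""
    def __init__(self):
        self._lock = threading.Lock()
        self.n = 0

    def incr(self):
        with self._lock:
            self.n += 1
            return self.n

counter = Counter()
refs = [_pool.submit(counter.incr) for _ in range(3)]
results = sorted(get(r) for r in refs)
sq = get(square.remote(5))
```

The point of the real system is that the same three concepts scale past one process: tasks and actor method calls are scheduled across a cluster, and object references name values that may live in another machine's memory.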
Bio: Zhe currently leads open source engineering at Anyscale. Before that, he was at LinkedIn managing the Big Data / AI Compute team (providing Hadoop/Spark/TensorFlow as services). Since 2014, Zhe's work has been closely related to open source. He is a committer and PMC member of Apache Hadoop, and a member of the Apache Software Foundation.
13:20 - 13:35 Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour
Arissa Wongpanich (UC Berkeley, USA), Hieu Pham (Google Research, USA), James Demmel (UC Berkeley, USA), Mingxing Tan, Quoc Le, Yang You and Sameer Kumar (Google Research, USA)
Abstract: EfficientNets are a family of state-of-the-art image classification models based on efficiently scaled convolutional neural networks. Currently, EfficientNets can take on the order of days to train; for example, training an EfficientNet-B0 model takes 23 hours on a Cloud TPU v2-8 node. In this paper, we explore techniques to scale up the training of EfficientNets on TPU-v3 Pods with 2048 cores, motivated by speedups that can be achieved when training at such scales. We discuss optimizations required to scale training to a batch size of 65536 on 1024 TPU-v3 cores, such as selecting large batch optimizers and learning rate schedules as well as utilizing distributed evaluation and batch normalization techniques. Additionally, we present timing and performance benchmarks for EfficientNet models trained on the ImageNet dataset in order to analyze the behavior of EfficientNets at scale. With our optimizations, we are able to train EfficientNet on ImageNet to an accuracy of 83% in 1 hour and 4 minutes.
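The large-batch recipe the abstract mentions (large-batch optimizers plus learning rate schedules) typically pairs a batch-size-proportional peak learning rate with warmup and decay. The sketch below shows the widely used linear-scaling rule with linear warmup and cosine decay; the base values, epoch counts, and the choice of cosine decay are illustrative assumptions, not the paper's exact schedule.

```python
import math

BASE_LR = 0.016      # assumed reference LR at the reference batch size
BASE_BATCH = 256     # assumed reference batch size

def scaled_lr(batch_size):
    # Linear scaling rule: the peak LR grows in proportion to batch size.
    return BASE_LR * batch_size / BASE_BATCH

def lr_at_epoch(epoch, batch_size, warmup_epochs=5, total_epochs=90):
    peak = scaled_lr(batch_size)
    if epoch < warmup_epochs:
        # Linear warmup toward the peak LR stabilizes the first epochs.
        return peak * (epoch + 1) / warmup_epochs
    # Cosine decay after warmup, one common large-batch schedule choice.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1 + math.cos(math.pi * progress))
```

At the paper's batch size of 65536 the linear rule yields a peak LR 256x the base value, which is exactly the regime where specialized large-batch optimizers and the distributed batch normalization the authors discuss become necessary.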
13:35 - 13:50 Performance Analysis of Deep Learning Workloads on a Composable System
Kaoutar El Maghraoui, Lorraine Herger, Chekuri Choudary, Kim Tran, Todd Deshane, David Hanson, IBM, USA
Abstract: A composable infrastructure is defined as resources, such as compute, storage, accelerators and networking, that are shared in a pool and that can be grouped in various configurations to meet application requirements. This freedom to 'mix and match' resources dynamically allows for experimentation early in the design cycle, prior to the final architectural design or hardware implementation of a system. We describe the design of an enterprise composable infrastructure that we have implemented and evaluate the impact of resource disaggregation on representative deep learning benchmarks.