All times are in Pacific Daylight Time (PDT).

7:00 - 7:10 Welcome from the organizers

7:10 - 7:50 Dynamic and Intelligent Workflows with eFlows4HPC

Invited Talk by Prof. Rosa Badia, Barcelona Supercomputing Center, Spain


Abstract: Distributed computing infrastructures are evolving from traditional models to environments that involve sensors, edge devices, and instruments as well as high-end computing systems such as clouds and HPC clusters. A key aspect is how to describe and develop the applications to be executed on such platforms.

What is more, the use of data analytics and artificial intelligence in general is in high demand in current HPC applications. However, the methodologies for developing workflows that integrate HPC simulations and data analytics are not yet well integrated. The eFlows4HPC project has recently started with the goal of providing a workflow software stack and an additional set of services to enable the integration of HPC simulations and modelling with big data analytics and machine learning in scientific and industrial applications. The project will demonstrate its advances through three application Pillars with high industrial and social relevance: manufacturing, climate, and urgent computing for natural hazards; these applications will help to prove how forthcoming efficient HPC and data-centric applications can be developed with new workflow technologies. The talk will present the motivation, challenges, and project workplan.


Bio: Rosa M. Badia holds a PhD in Computer Science (1994) from the Technical University of Catalonia (UPC). She is the manager of the Workflows and Distributed Computing research group at the Barcelona Supercomputing Center (BSC). She is considered one of the key researchers in parallel programming models for multicore and distributed computing due to her contributions to task-based programming models during the last 15 years. The research group focuses on PyCOMPSs/COMPSs, a parallel task-based programming model for distributed computing, and its application to the development of large heterogeneous workflows that combine HPC, Big Data and Machine Learning. The group is also doing research around dislib, a parallel machine learning library parallelized with PyCOMPSs. Dr Badia has published nearly 200 papers in international conferences and journals on the topics of her research. She has been very active in projects funded by the European Commission and in contracts with industry. She has been actively contributing to the BDEC international initiative and is a member of the HiPEAC Network of Excellence. She received the Euro-Par Achievement Award 2019 for her contributions to parallel processing and the DonaTIC award, category Academia/Researcher, in 2019. Rosa Badia is the principal investigator (PI) of eFlows4HPC.

7:50 - 8:15 A Distributed Multi-GPU System for Large-Scale Node Embedding at Tencent

Wanjing Wei, Yangzihao Wang, Pin Gao, Shijie Sun, Donghai Yu, Tencent Ltd., China

Abstract: Real-world node embedding applications often contain hundreds of billions of edges with high-dimension node features. Scaling node embedding systems to efficiently support these applications remains a challenging problem. In this paper we present a high-performance multi-GPU node embedding system. It uses model parallelism to split node embeddings onto each GPU's local parameter server, and data parallelism to train these embeddings on different edge samples in parallel. We propose a hierarchical data partitioning strategy and an embedding training pipeline to optimize both communication and memory usage on a GPU cluster. With the decoupled design of CPU tasks (random walk) and GPU tasks (embedding training), our system is highly flexible and can fully utilize all computing resources on a GPU cluster. Compared with the current state-of-the-art multi-GPU single-node embedding system, our system achieves 5.9x-14.4x speedup on average with competitive or better accuracy on open datasets. Using 40 NVIDIA V100 GPUs on a network with almost three hundred billion edges and more than one billion nodes, our implementation requires only 3 minutes to finish one training epoch.
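
The combination of model parallelism and data parallelism described in this abstract can be illustrated with a small single-process sketch. It is not the authors' system: the modulo placement rule, the names `shard_of`, `ShardedEmbeddings` and `split_edges`, and the toy sizes are assumptions chosen for illustration only, and plain Python dictionaries stand in for GPU memory.

```python
import random


def shard_of(node_id, num_shards):
    """Assumed placement rule: node embeddings are assigned round-robin."""
    return node_id % num_shards


class ShardedEmbeddings:
    """Model parallelism: each shard (one per GPU in a real system) owns a slice of the table."""

    def __init__(self, num_nodes, dim, num_shards, seed=0):
        rng = random.Random(seed)
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]
        for node in range(num_nodes):
            vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            self.shards[shard_of(node, num_shards)][node] = vec

    def lookup(self, node_id):
        """Fetch an embedding from whichever shard owns it."""
        return self.shards[shard_of(node_id, self.num_shards)][node_id]


def split_edges(edges, num_workers):
    """Data parallelism: each worker trains on a disjoint slice of the edge samples."""
    return [edges[w::num_workers] for w in range(num_workers)]


NUM_SHARDS = 4
table = ShardedEmbeddings(num_nodes=10, dim=4, num_shards=NUM_SHARDS)
edges = [(u, (u + 1) % 10) for u in range(10)]  # toy ring graph
batches = split_edges(edges, num_workers=NUM_SHARDS)

# Count how many embedding lookups in each worker's batch would hit a shard
# owned by another worker. This is the communication that a partitioning
# strategy tries to reduce.
remote_lookups = []
for worker, batch in enumerate(batches):
    remote = sum(
        1 for u, v in batch for n in (u, v) if shard_of(n, NUM_SHARDS) != worker
    )
    remote_lookups.append(remote)

print("remote lookups per worker:", remote_lookups)
```

In this toy setup the edge slices and the embedding shards are chosen independently, so many lookups are remote; partitioning edges and embeddings together is one way such traffic could be lowered.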

8:15 - 8:55 The Three Pillars of Large-scale Deep Learning

Invited Talk by Prof. Torsten Hoefler, ETH Zurich, Switzerland

Abstract: High-performance deep learning on supercomputers drives the push to train larger and larger models to achieve surprising results. The recent trend to scale natural language models to more parameters continues to provide better results. Gigantic language models such as GPT-3 require large-scale machines to be trained; however, even smaller models such as ResNet-50 can benefit from HPC-style training infrastructures. In this talk, we discuss the three pillars for such infrastructures: high-performance I/O, compute, and communication. All three are necessary to train large-scale models in practice and we discuss the newest developments and ideas in those three aspects.

Bio: Torsten Hoefler directs the Scalable Parallel Computing Laboratory (SPCL) at D-INFK ETH Zurich. He received his PhD degree in 2007 at Indiana University and started his first professor appointment in 2011 at the University of Illinois at Urbana-Champaign.

Torsten has served as the lead for performance modeling and analysis in the US NSF Blue Waters project at NCSA/UIUC. Since 2013, he has been a professor of computer science at ETH Zurich and has held visiting positions at Argonne National Laboratory, Sandia National Laboratories, and Microsoft Research Redmond (Station Q).

Dr. Hoefler's research aims at understanding the performance of parallel computing systems ranging from parallel computer architecture through parallel programming to parallel algorithms. He is also active in the application areas of Weather and Climate simulations as well as Machine Learning with a focus on Distributed Deep Learning. In those areas, he has coordinated tens of funded projects and an ERC Starting Grant on Data-Centric Parallel Programming.

He has been chair of the Hot Interconnects conference and technical program chair of the Supercomputing and ACM PASC conferences. He is associate editor of the IEEE Transactions on Parallel and Distributed Systems (TPDS) and the Parallel Computing Journal (PARCO) and a key member of the Message Passing Interface (MPI) Forum.

He has published more than 200 papers in peer-reviewed international conferences and journals and co-authored the latest versions of the MPI specification. He has received best paper awards at the ACM/IEEE Supercomputing Conference in 2010, 2013, and 2014 (SC10, SC13, SC14), EuroMPI 2013, IPDPS'15, ACM HPDC'15 and HPDC'16, ACM OOPSLA'16, and other conferences. Torsten received ETH Zurich's Latsis Prize in 2015, the SIAM SIAG/Supercomputing Junior Scientist Prize in 2012, the IEEE TCSC Young Achievers in Scalable Computing Award in 2013, the Young Alumni Award 2014 from Indiana University, and the best student award 2005 of the Chemnitz University of Technology. Torsten was elected into the first steering committee of ACM's SIGHPC in 2013 and he was re-elected in 2016. His Erdős number is two (via Amnon Barak) and he is an academic descendant of Hermann von Helmholtz.

8:55 - 9:20 Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences

Quentin Anthony, Lang Xu, Hari Subramoni, Dhabaleswar Panda, Ohio State University, USA

Abstract: Deep Learning (DL) models for super-resolution (DLSR) are an emerging trend in response to the growth of ML/DL applications requiring high-resolution images. DLSR methods have also shown promise in domains such as medical imaging, surveillance, and microscopy. However, DLSR models are extremely computationally demanding, and require unreasonably long training times on modern Volta GPUs. In our experiments, we observed only 10.3 images/second on a single Volta GPU for training EDSR, a state-of-the-art DLSR model for single-image super-resolution. In comparison, a Volta GPU can process 360 images/second while training ResNet-50, a state-of-the-art model for image classification. Therefore, we believe supercomputers provide a good candidate to speed up DLSR model training. In this paper, we select EDSR as the representative DLSR PyTorch model. Further, we introduce Horovod-based distributed EDSR training. However, we observed poor default EDSR scaling performance on the Lassen HPC system at Lawrence Livermore National Laboratory. To investigate the performance degradations, we perform exhaustive communication profiling. These profiling insights are then used to optimize CUDA-Aware MPI for DLSR models by ensuring advanced MPI designs involving CUDA IPC and registration caching are properly applied by DL frameworks. We present a comprehensive scaling study of EDSR with MVAPICH2-GDR and NCCL up to 512 GPUs on Lassen. We demonstrate an improvement in scaling efficiency by 15.6% over default Horovod training, which translates to a 1.26× speedup in training performance.
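
For readers unfamiliar with the metric, scaling efficiency is commonly computed as the measured multi-GPU throughput divided by the ideal throughput (GPU count times single-GPU throughput). The sketch below uses that common definition, which the abstract does not spell out. Only the 10.3 images/second single-GPU figure comes from the abstract; the 512-GPU throughput is a made-up placeholder and not a reported result.

```python
def scaling_efficiency(throughput_n, num_gpus, throughput_1):
    """Fraction of ideal linear speedup achieved on num_gpus GPUs."""
    if num_gpus <= 0 or throughput_1 <= 0:
        raise ValueError("num_gpus and throughput_1 must be positive")
    return throughput_n / (num_gpus * throughput_1)


SINGLE_GPU_IMAGES_PER_SEC = 10.3    # from the abstract (EDSR on one Volta GPU)
HYPOTHETICAL_512_GPU_RATE = 3000.0  # placeholder value, NOT a reported result

efficiency = scaling_efficiency(
    HYPOTHETICAL_512_GPU_RATE, 512, SINGLE_GPU_IMAGES_PER_SEC
)
print(f"scaling efficiency: {efficiency:.1%}")
```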

9:20 - 9:40 Break

9:40 - 10:20 AI for Social Impact: Results from Multi-agent Reasoning and Learning in the Real World

by Prof. Milind Tambe, Harvard University, USA and Google Research, India

Abstract: With the maturing of AI and multiagent systems research, we have a tremendous opportunity to direct these advances towards addressing complex societal problems. I focus on the problems of public health and conservation, and address one key cross-cutting challenge: how to effectively deploy our limited intervention resources in these problem domains. I will present results from work around the globe in using AI for HIV prevention, Maternal and Child care interventions, TB prevention and COVID modeling, as well as for wildlife conservation. Achieving social impact in these domains often requires methodological advances. To that end, I will highlight key research advances in multiagent reasoning and learning, in particular in computational game theory, restless bandits and influence maximization in social networks. In pushing this research agenda, our ultimate goal is to facilitate local communities and non-profits to directly benefit from advances in AI tools and techniques.

Bio: Milind Tambe is Gordon McKay Professor of Computer Science and Director of the Center for Research in Computation and Society at Harvard University; concurrently, he is also Director of "AI for Social Good" at Google Research India. He is a recipient of the IJCAI John McCarthy Award, ACM/SIGAI Autonomous Agents Research Award from AAMAS, AAAI Robert S Engelmore Memorial Lecture award, INFORMS Wagner prize, Rist Prize of the Military Operations Research Society, Columbus Fellowship Foundation Homeland security award, over 25 best papers or honorable mentions at conferences such as AAMAS, AAAI, IJCAI and meritorious commendations from agencies such as the US Coast Guard and the Los Angeles Airport. Prof. Tambe is a fellow of AAAI and ACM.

10:20 - 10:45 Distributed Deep Learning Using Volunteer Computing-Like Paradigm

Medha Atre, Birendra Jha, Ashwini Rao, Eydle Inc., USA

Abstract: Use of Deep Learning (DL) in commercial applications such as image classification, sentiment analysis and speech recognition is increasing. When training DL models with a large number of parameters and/or large datasets, cost and speed of training can become prohibitive. Distributed DL training solutions that split a training job into subtasks and execute them over multiple nodes can decrease training time. However, the cost of current solutions, built predominantly for cluster computing systems, can still be an issue. In contrast to cluster computing systems, Volunteer Computing (VC) systems can lower the cost of computing, but applications running on VC systems have to handle fault tolerance, variable network latency and heterogeneity of compute nodes, and the current solutions are not designed to do so. We design a distributed solution that can run DL training on a VC system by using a data parallel approach. We implement a novel asynchronous SGD scheme called VC-ASGD suited for VC systems. In contrast to traditional VC systems that lower cost by using untrustworthy volunteer devices, we lower cost by leveraging preemptible computing instances on commercial cloud platforms. By using preemptible instances that require applications to be fault tolerant, we lower cost by 70-90% and improve data security.
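
The general idea of data-parallel asynchronous SGD on workers that may disappear can be sketched in a few lines. This is a generic single-process simulation, not the VC-ASGD scheme itself, whose details the abstract does not give: the `ParameterServer` class, the preemption probability, the toy least-squares problem, and the learning rate are all assumptions made for illustration.

```python
import random


class ParameterServer:
    """Holds the shared weight and applies gradients as they arrive."""

    def __init__(self, w=0.0, lr=0.02):
        self.w = w
        self.lr = lr
        self.updates = 0

    def pull(self):
        return self.w

    def push(self, grad):
        self.w -= self.lr * grad
        self.updates += 1


def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(num_workers=4, steps=500, preempt_prob=0.2, seed=0):
    rng = random.Random(seed)
    data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # true weight is 3.0
    shards = [data[i::num_workers] for i in range(num_workers)]
    server = ParameterServer()
    stale = [server.pull()] * num_workers  # each worker's last-pulled weight
    lost = 0
    for _ in range(steps):
        worker = rng.randrange(num_workers)
        if rng.random() < preempt_prob:
            # Simulated preemption: the subtask's result is lost and the
            # restarted worker re-pulls the current weight.
            stale[worker] = server.pull()
            lost += 1
            continue
        # The gradient uses a possibly stale weight (asynchrony), and the
        # server applies it without waiting for the other workers.
        server.push(gradient(stale[worker], shards[worker]))
        stale[worker] = server.pull()
    return server, lost


server, lost = train()
print(f"learned w = {server.w:.3f} after {server.updates} updates, {lost} lost subtasks")
```

Because no worker waits for another, a preempted subtask costs only its own lost gradient; training continues with whichever workers remain.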

10:45 - 11:25 Riding the Composable Systems Wave to Improve DNN Distributed Training Performance

by Prof. Christopher Carothers, RPI, USA

Abstract: Composable systems provide hardware and software mechanisms to decouple physical compute and storage resources from their physical locations. When provisioned, the allocated set of resources more precisely meets application and workload demands, yielding a much more efficient utilization of data center resources. The Compute Express Link (CXL) protocol, which rides on top of the current PCIe bus protocol, is an emerging industry standard that could be used to create high-performance composable systems. For distributed training of large-scale neural networks as well as other HPC workloads, this protocol offers the potential of significantly improving network bandwidth and overall end-to-end packet latency compared to other HPC network technologies like InfiniBand. In this talk, we examine how a modern deep learning training algorithm performs on a future CXL-enabled composable system for the ResNet50 benchmark across a range of architecture configurations.

Bio: Professor Chris Carothers is a faculty member in the Computer Science Department at Rensselaer Polytechnic Institute. He received the Ph.D., M.S., and B.S. from Georgia Institute of Technology in 1997, 1996, and 1991, respectively. Prior to joining RPI in 1998, he was a research scientist at the Georgia Institute of Technology. His research interests are focused on massively parallel computing, which involves the creation of high-fidelity models of extreme-scale networks and computer systems. These models have executed using nearly 2,000,000 processing cores on the largest leadership class supercomputers in the world. Additionally, Professor Carothers serves as the Director for the Rensselaer Center for Computational Innovations (CCI). CCI is a partnership between Rensselaer and IBM. The center provides computational and storage resources to a diverse network of researchers, faculty, and students from Rensselaer, government laboratories, and companies across a number of science and engineering disciplines. The flagship supercomputer is an 8 peta-flop (PF) IBM AC922 hybrid CPU/GPU supercomputer named "AiMOS" which serves as the primary testbed for the IBM AI Hardware Center.

11:25 - 11:40 Ex-NNQMD: Extreme-Scale Neural Network Quantum Molecular Dynamics

Pankaj Rajak (Argonne National Laboratory, USA), Thomas Linker, Ken-ichi Nomura, Anikeya Aditya, Kuang Liu (University of Southern California, USA), Kohei Shimamura, Shogo Fukushima (Kumamoto University, Japan), Ye Luo (Argonne National Laboratory, USA), Fuyuki Shimojo (Kumamoto University, Japan), Rajiv K. Kalia, Aiichiro Nakano and Priya Vashishta (University of Southern California, USA).

Abstract: Deep learning is revolutionizing countless scientific and engineering fields. In particular, the SC20 Gordon Bell award represented a breakthrough in molecular simulation, i.e., 100-million-atom simulation with quantum-mechanical accuracy on the Summit supercomputer at ORNL, using deep potential molecular dynamics (MD). Moving forward, while these simulations were performed only in gentle equilibrium conditions, far-from-equilibrium MD simulation involving light-induced electronic excited states finds numerous scientific and engineering applications. However, it remains a challenge to perform such far-from-equilibrium simulations at larger spatiotemporal scales, where a growing number of unphysical predictions of interatomic force prohibits simulations involving larger numbers of atoms for longer times. In this paper, we propose a physically-based inductive bias, maximally-preserved Maxwell-Boltzmann (MPMB), to overcome this fidelity-scaling problem. Along with hybrid divide-and-conquer parallelization and single-node level optimization using multithreading and data parallel SIMD, the resulting Ex-NNQMD (extreme-scale neural network quantum molecular dynamics) algorithm has achieved unprecedented scales of far-from-equilibrium simulations: 1) a 5.1-billion-atom system with a parallel efficiency of 0.94, and 2) a sustained performance of 6.4 nanoseconds/day for a 10-million-atom system, both on 262,144 cores of the Theta supercomputer at Argonne Leadership Computing Facility. Extended fidelity scaling and efficient parallelization have allowed us for the first time to study light-induced ferroelectric switching under extreme electronic excitation at experimentally relevant spatiotemporal scales with accuracy.

11:40 - 12:00 Break

12:00 - 12:40 Innovating across the AI stack for scale and speed

by Dr. Rania Khalaf, IBM Research AI, USA

Abstract: We look at innovation across the software-hardware stack and focus on two open source toolkits from IBM Research targeted at large scale AI systems: IBM Federated Learning and the IBM Analog Hardware Acceleration Kit. IBM Federated Learning is a software framework for large scale distributed machine learning under data sharing, privacy, and regulatory constraints. The framework provides a pluggable approach allowing new algorithms and protocols to be plugged into different layers for an extensible system and to allow for easy benchmarking and experimentation. The IBM Analog Hardware Acceleration Kit provides a simulation environment to experiment with the latest analog hardware advances like in-memory computing. While in-memory computing is a promising future technology for accelerating AI workloads, noise and non-idealities demand improved algorithmic solutions. The goal is to conveniently estimate the impact of the emerging analog material properties and non-idealities on the accuracy for arbitrary neural networks. Both these toolkits are aimed at creating a vibrant ecosystem of research and innovations for large scale and accelerated AI. We will highlight key capabilities of these technologies through representative use-cases and discuss future directions.

Bio: Rania Khalaf is the Director of AI Platforms and Runtimes at IBM Research where she leads teams pushing the envelope in AI platforms to make creating AI models and applications easy, fast, and safe for data scientists and developers. Her multi-disciplinary teams tackle key problems at the intersection of core AI, distributed systems, human computer interaction and cloud computing. Prior to this role, Rania was Director of Cloud Platform, Programming Models and Runtimes. Rania serves as a Judge for the MIT Solve AI for Humanity Prize, on the Leadership Challenge Group for MIT Solve's Learning for Girls and Women Challenge and on the Advisory Board of the Hariri Institute for Computing at Boston University. She has received several Outstanding Technical Innovation awards for major impact to the field of computer science and was a finalist for the 2019 MassTLC CTO of the Year award.

12:40 - 13:20 Ray as the Unified Compute Substrate for Machine Learning Applications

Invited Talk by Dr. Zhe Zhang, Anyscale Inc., USA

Abstract: Today's applications involve Machine Learning more and more, and in an increasingly integrated way. But using ML successfully is difficult. Each successful model requires complex data engineering, expensive training (both hardware and human resources), and flexible inference with low latency. How to develop ML infrastructure software to support these compute workloads is a huge challenge -- especially with the rapid evolution of ML theory and algorithms, and the explosion of data.

Since its inception a few years ago, Ray has been trying to address the problem by finding the right layer of abstraction, and building a solid architecture to support it. In this talk, I will introduce Ray's basic abstractions -- distributed Class / Method / Object, followed by Ray's main architectural components. Then I'll discuss how ML infrastructure teams have been using Ray to achieve groundbreaking improvements to productivity, scalability, and performance.

Bio: Zhe currently leads open source engineering at Anyscale. Before that, he was at LinkedIn managing the Big Data / AI Compute team (providing Hadoop/Spark/TensorFlow as services). Ever since 2014, Zhe's work has been closely related to open source. He's a committer and PMC member of Apache Hadoop, and a Member of the Apache Software Foundation.

13:20 - 13:35 Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour

Arissa Wongpanich (UC Berkeley, USA), Hieu Pham (Google Research, USA), James Demmel (UC Berkeley, USA), Mingxing Tan, Quoc Le, Yang You and Sameer Kumar (Google Research, USA)

Abstract: EfficientNets are a family of state-of-the-art image classification models based on efficiently scaled convolutional neural networks. Currently, EfficientNets can take on the order of days to train; for example, training an EfficientNet-B0 model takes 23 hours on a Cloud TPU v2-8 node. In this paper, we explore techniques to scale up the training of EfficientNets on TPU-v3 Pods with 2048 cores, motivated by speedups that can be achieved when training at such scales. We discuss optimizations required to scale training to a batch size of 65536 on 1024 TPU-v3 cores, such as selecting large batch optimizers and learning rate schedules as well as utilizing distributed evaluation and batch normalization techniques. Additionally, we present timing and performance benchmarks for EfficientNet models trained on the ImageNet dataset in order to analyze the behavior of EfficientNets at scale. With our optimizations, we are able to train EfficientNet on ImageNet to an accuracy of 83% in 1 hour and 4 minutes.

13:35 - 13:50 Performance Analysis of Deep Learning Workloads on a Composable System

Kaoutar El Maghraoui, Lorraine Herger, Chekuri Choudary, Kim Tran, Todd Deshane, David Hanson, IBM, USA

Abstract: A composable infrastructure is defined as resources, such as compute, storage, accelerators and networking, that are shared in a pool and that can be grouped in various configurations to meet application requirements. This freedom to 'mix and match' resources dynamically allows for experimentation early in the design cycle, prior to the final architectural design or hardware implementation of a system. We describe the design of an enterprise composable infrastructure that we have implemented and evaluate the impact of resource disaggregation on representative deep learning benchmarks.

13:50 - 14:00 Closing Remarks by the organizers