
PhD Thesis Defense

Hao Zhang
Robotics Institute, Carnegie Mellon University
Wednesday, September 2
2:00 pm to 3:00 pm
Machine Learning Parallelism Could Be Adaptive, Composable and Automated

Abstract:
In recent years, researchers in systems for ML (SysML) have created algorithms and systems that parallelize ML training over multiple devices or compute nodes. As ML models grow more structurally complex, many of these systems struggle to deliver all-round performance across a variety of models. In particular, the knowledge and time required to map a model to an appropriate distribution strategy are usually underestimated. Applying parallel training systems to complex models therefore adds non-trivial development overhead on top of model prototyping, and often yields lower-than-expected performance.

In this thesis talk, I will first present a simple design principle, adaptive parallelism, which applies suitable parallelization techniques to model building blocks (e.g., layers) according to their specific ML properties. Following this principle, we derive a series of new parallelization strategies, and corresponding system implementations, that adapt to the model and cluster specifications. We examine these strategies and show that they can boost the scalability of ML training on clusters by 2-10x in their applicable scenarios.
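
The abstract stays at the level of principle; as a purely hypothetical illustration (the layer properties, thresholds, and technique names below are invented for this sketch and are not taken from the talk), adaptive parallelism can be thought of as a per-building-block dispatch:

    # Hypothetical sketch: pick a parallelization technique per building block
    # based on its ML properties. Names and thresholds are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        num_params: int
        sparse_access: bool  # e.g. embedding lookups touch few rows per step

    def choose_strategy(layer: Layer) -> str:
        """Map a layer to a (hypothetical) parallelization technique."""
        if layer.sparse_access:
            # Only a fraction of parameters is updated per step, so sharding
            # them across parameter servers is usually cheaper than all-reduce.
            return "parameter_server_sharded"
        if layer.num_params > 50_000_000:
            # Very large dense layers may benefit from partitioning the tensor.
            return "tensor_partitioned_allreduce"
        # Small dense layers: replicate and synchronize with all-reduce.
        return "replicated_allreduce"

    model = [
        Layer("embedding", 200_000_000, sparse_access=True),
        Layer("transformer_block", 80_000_000, sparse_access=False),
        Layer("classifier_head", 2_000_000, sparse_access=False),
    ]
    for layer in model:
        print(layer.name, "->", choose_strategy(layer))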

Generalizing from these cases, I will then present a composable distributed ML system, AutoDist, that enables system optimizations at sub-model granularity, capturing the different runtime characteristics exhibited by model building blocks, and joint optimization of multiple orthogonal parallelization aspects. In AutoDist, we develop ways to express aspects of ML parallelism, including synchronization architecture, model partitioning, placement, and consistency, within a unified representation. Based on this representation, AutoDist introduces system-level compositionality, which enables rapid composition of parallelization strategies from existing techniques and simplifies parallel ML programming.
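
As an illustration only, the sketch below shows one way a unified, per-block strategy representation covering synchronization architecture, partitioning, placement, and consistency might look; the field names are hypothetical and do not reflect AutoDist's actual API.

    # Hypothetical unified representation: each field captures one orthogonal
    # parallelization aspect, and a full strategy composes per-block choices.
    from dataclasses import dataclass
    from typing import Dict, Optional

    @dataclass
    class BlockStrategy:
        sync_architecture: str         # e.g. "allreduce" or "parameter_server"
        partition_axis: Optional[int]  # tensor axis to shard, or None
        placement: str                 # e.g. "replicated" or "balanced_across_workers"
        consistency: str               # e.g. "sync" or "stale_sync"

    # Sub-model granularity: different building blocks can mix techniques.
    Strategy = Dict[str, BlockStrategy]

    strategy: Strategy = {
        "embedding": BlockStrategy("parameter_server", 0, "balanced_across_workers", "stale_sync"),
        "transformer_block": BlockStrategy("allreduce", None, "replicated", "sync"),
    }
    print(strategy["embedding"])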

Further, on top of AutoDist, I present an ML-based framework, AutoSync, that automatically optimizes synchronization strategies in data-parallel distributed training. AutoSync navigates the space spanned by the proposed representation and generates suitable distributed strategies for unseen models. We show that AutoSync achieves high performance "out of the box": it automatically identifies synchronization strategies that deliver 1.2-1.6x speedups over existing hand-optimized systems, lowering the technical barrier of distributed ML and helping make it accessible to a larger community of users.
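
The abstract does not describe AutoSync's search procedure; purely as a hypothetical sketch of the general idea (sample candidate synchronization strategies over the representation, score them with a cost model, keep the best), one might write something like the following. The cost model here is a toy heuristic, not AutoSync's learned model.

    # Hypothetical strategy search: random sampling plus a stand-in cost model.
    import random

    SYNC_CHOICES = ["allreduce", "parameter_server", "hierarchical_allreduce"]
    CONSISTENCY_CHOICES = ["sync", "stale_sync"]

    def sample_strategy(blocks):
        # One (sync architecture, consistency) choice per model building block.
        return {b: (random.choice(SYNC_CHOICES), random.choice(CONSISTENCY_CHOICES))
                for b in blocks}

    def predicted_throughput(strategy):
        # Stand-in for a learned cost model; just a toy scoring heuristic here.
        return sum(1.0 if sync == "allreduce" else 0.8
                   for sync, _ in strategy.values())

    blocks = ["embedding", "transformer_block", "classifier_head"]
    best = max((sample_strategy(blocks) for _ in range(100)), key=predicted_throughput)
    print(best)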

Thesis Committee Members:
Eric Xing, Chair
Gregory R. Ganger
Deva Ramanan
Jinyang Li, New York University
Christopher Ré, Stanford University