PhD Speaking Qualifier

Hao Zhang
Robotics Institute, Carnegie Mellon University
Thursday, October 3
11:00 am to 12:00 pm
GHC 4405
Scaling Up Deep Learning with Model and Algorithm Awareness

Abstract:
In recent years, the pace of innovation in deep learning has accelerated. To cope with the sheer computational cost of training large ML models on large datasets, researchers in the systems and ML communities have built software systems that parallelize training algorithms across multiple CPUs or GPUs (multi-device parallelism), or even across multiple computing nodes over a network (distributed machine learning). As ML and deep learning models have become more structurally complex, these systems have struggled to deliver consistently strong performance across a wide variety of models. In practice, existing systems usually instantiate only one distribution technique (e.g., parameter server or all-reduce). Consequently, a single monolithic distribution method (and thus a single system implementation) is applied to all models, even though any particular communication and synchronization strategy is well-suited to only a limited range of model architectures.
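
To illustrate the monolithic pattern described above, here is a minimal sketch (assuming PyTorch and a process group launched with, e.g., torchrun; DistributedDataParallel stands in for a generic one-strategy system, not a system from the talk). It synchronizes every gradient tensor with the same all-reduce, whether the layer is a large sparse embedding or a small dense projection:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One communication backend and one synchronization strategy for the whole model.
dist.init_process_group(backend="gloo")

model = torch.nn.Sequential(
    torch.nn.EmbeddingBag(100_000, 128),  # sparse updates: few rows touched per step
    torch.nn.Linear(128, 10),             # dense updates: every weight touched per step
)

# DDP all-reduces every gradient the same way, regardless of how each
# layer's parameters are structured or how sparsely they are updated.
ddp_model = DDP(model)
```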

Thus, we ask whether it is possible to design ML systems that adapt to the characteristics of the models and algorithms they are given. We argue that exploiting these characteristics can improve scalability, ease prototyping, and even improve application results. To support this claim, we present three instantiations in which system design improves through awareness of (1) the layered structure and properties of CNNs, (2) the sparsity of parameter variables, and (3) the recursive nature of RNNs. In all three cases, we characterize the factors that influence the choice of distribution strategy for a given model part or layer. Based on these factors, we develop principled strategies for optimal communication management across the different parts and layers of a model, and validate these strategies with theoretical and empirical justification. Finally, we realize these strategies by building software systems that manage multiple end-to-end ML training tasks in real-world, heterogeneous computing environments.
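
To make the model-aware alternative concrete, the following hypothetical sketch chooses a per-layer communication strategy from a single model-derived factor, the fraction of parameters updated per step. The LayerProfile type, the threshold, and the strategy names are illustrative assumptions, not the strategies developed in this work:

```python
from dataclasses import dataclass

@dataclass
class LayerProfile:
    name: str
    num_params: int
    update_density: float  # fraction of parameters touched per step, in [0, 1]

def choose_strategy(layer: LayerProfile, density_threshold: float = 0.1) -> str:
    """Pick a communication strategy from simple, model-aware factors."""
    if layer.update_density < density_threshold:
        # Sparse updates: a parameter server can ship only the touched rows.
        return "parameter_server"
    # Dense gradients: bandwidth-optimal all-reduce is the better fit.
    return "all_reduce"

layers = [
    LayerProfile("embedding", num_params=12_800_000, update_density=0.01),
    LayerProfile("conv1", num_params=1_000_000, update_density=1.0),
]
for layer in layers:
    print(layer.name, "->", choose_strategy(layer))
```

Real systems would weigh more factors (tensor shapes, cluster bandwidth, device placement), but the per-layer decision structure is the point.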

Committee:
Eric Xing (advisor)
Kris Kitani
Graham Neubig
Xiaolong Wang