Scalable and Efficient I/O for Distributed Deep Learning

Purpose

This project enables scalable and efficient I/O for distributed deep learning training in computer clusters with existing hardware/software stack.

Overview

Emerging Deep Learning (DL) applications introduce heavy I/O workloads on computer clusters. The inherent long lasting, repeated, high volume, and highly concurrent random file access pattern can easily saturate the metadata and data service and negatively impact other users. In this project, we try to design a transient runtime file system that optimizes DL I/O on existing hardware/software stacks. With a comprehensive I/O profile study on real world DL applications, we implemented FanStore. FanStore distributes datasets to the local storage of compute nodes, and maintains a global namespace. With the techniques of function interception, distributed metadata management, and generic data compression, FanStore provides a POSIX-compliant interface with native hardware throughput in an efficient and scalable manner. Users do not have to make intrusive code changes to use FanStore and take advantage of the optimized I/O. Our experiments with benchmarks and real applications show that FanStore can scale DL training to 512 compute nodes with over 90% scaling efficiency.

Impact

  • FanStore dramatically enhances existing clusters' capability of distributed deep learning without hardware or system software change.
  • Besides the well-known ImageNet-based Convolutional Neural Network training, we enable another two real world scientific applications using FanStore, which were prohibitive in computer clusters due to the I/O traffic. The first application is enhancing neural image resolution using super resolution generative adversarial network (SRGAN). It comes with ~600 GB image data and is implemented with TensorLayer, TensorFlow, and Horovod software stack. The second application is predicting disruptions of plasma disruptions using fusion recurrent neural network (FRNN). It trains with 1.7 TB text data and is implemented with TensorFlow and MPI.
  • The impact of FanStore is way beyond the applications used in the study, it is beneficial for almost all deep learning applications that have the datasets in POSIX files.

Funding Source(s)

Base funding

Paper References

Zhang, Zhao, Lei Huang, Uri Manor, Linjing Fang, Gabriele Merlo, Craig Michoski, John Cazes, and Niall Gaffney. "FanStore: Enabling Efficient and Scalable I/O for Distributed Deep Learning." arXiv preprint arXiv:1809.10799 (2018).

Zhao Zhang

Research Associate
zzhang@tacc.utexas.edu

Lei Huang

Research Associate
huang@tacc.utexas.edu

John Cazes

Deputy Director Of High Performance Computing
cazes@tacc.utexas.edu

Niall Gaffney

Director of Data Intensive Computing
ngaffney@tacc.utexas.edu