Scaling up and speeding up deep learning research with Tensor2Tensor
Researchers working on the cutting edge need to be able to experiment with novel model architectures, try them on multiple datasets, compare against strong baselines, scale up seamlessly, reuse basic components, and iterate quickly. Tensor2Tensor is a library built on top of TensorFlow that has enabled exactly that for a group of researchers and engineers within Google Brain, and it has recently been open-sourced.
Cloud TPUs are a new accelerator that Google is offering in alpha. In this talk, we'll discuss:
- A description of TPUs and what they offer for training models.
- The modifications you need to make to your TensorFlow model to work with TPUs (sketched below).
- Best practices for getting the most performance out of these devices.
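As a rough illustration of the second point, here is a minimal, hedged sketch using the TF 1.x-era tf.contrib.tpu API (TPUEstimator, CrossShardOptimizer); exact names and arguments vary across releases, and the tiny model here is a placeholder, not the talk's example:

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Placeholder model: a single dense layer over features['x'].
    logits = tf.layers.dense(features['x'], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    # Key TPU modification: CrossShardOptimizer aggregates gradients
    # across the TPU's cores before applying the update.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    use_tpu=True,
    train_batch_size=1024,  # global batch; must divide evenly across cores
    config=tf.contrib.tpu.RunConfig(),
)
```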
Jeremy Howard, fast.ai / platform.ai
Using GPU acceleration with PyTorch to make your algorithms 2000% faster
Most developers are aware that some algorithms can run on a GPU instead of a CPU and achieve orders-of-magnitude speedups. However, many people assume that:
1. Only specialist areas like deep learning are suitable for GPUs
2. Learning to program a GPU takes years of developing specialist knowledge
It turns out that neither assumption is true! Nearly any non-recursive algorithm that operates on datasets of 1000+ items can be accelerated by a GPU. And recent libraries like PyTorch make writing a GPU-accelerated algorithm nearly as simple as writing a regular CPU algorithm.
In this talk we'll explain what the mean-shift clustering algorithm is and why it matters for many data science applications. We'll first implement it in Python (with NumPy), then show how to port it to PyTorch, achieving a 20x performance improvement in the process.
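To make the porting story concrete, here is a minimal sketch (not the speaker's code) of one mean-shift iteration with a Gaussian kernel, first in NumPy and then in PyTorch; torch.cdist assumes a reasonably recent PyTorch:

```python
import numpy as np
import torch

def meanshift_step_np(X, bw=2.5):
    # X: (n, d) points. Move each point toward the mean of all points,
    # weighted by a Gaussian kernel of the pairwise distances.
    dist = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))  # (n, n) distances
    w = np.exp(-0.5 * (dist / bw) ** 2)                    # (n, n) weights
    return (w @ X) / w.sum(1, keepdims=True)

def meanshift_step_torch(X, bw=2.5):
    # Same computation; cdist and matmul run on the GPU when X lives there.
    w = torch.exp(-0.5 * (torch.cdist(X, X) / bw) ** 2)
    return (w @ X) / w.sum(1, keepdim=True)

# Usage: iterate a few steps; points collapse onto the cluster modes.
X = np.random.randn(1000, 2).astype(np.float32)
Xt = torch.from_numpy(X).cuda() if torch.cuda.is_available() else torch.from_numpy(X)
for _ in range(5):
    X = meanshift_step_np(X)
    Xt = meanshift_step_torch(Xt)
```

The GPU win comes from the (n, n) pairwise-distance matrix: a large, dense tensor operation, which is exactly what GPUs are built for.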
Illia Polosukhin, TensorFlow
Optimizing Distributed TensorFlow
TensorFlow supports distributed training, but making the most of your hardware still takes a lot of work. In this talk, you will learn how to:
- Set up distributed TensorFlow across multiple CPUs and GPUs.
- Analyze the TensorFlow timeline to find bottlenecks (a capture sketch follows below).
- Tune the various components of the training stack to achieve optimal training speed.
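As a starting point for the timeline analysis, here is a hedged sketch using the TF 1.x timeline module (tensorflow.python.client.timeline); the traced op is a stand-in for a real training step, and the resulting JSON opens in chrome://tracing:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# A stand-in for a real training step; trace your own train_op instead.
a = tf.random_normal([1024, 1024])
b = tf.random_normal([1024, 1024])
step = tf.matmul(a, b)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(step, options=run_options, run_metadata=run_metadata)
    # Convert the collected per-op step stats into a Chrome trace.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())
```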
Sharan Narang, Baidu AI Lab
The Need for Speed: Benchmarking Deep Learning
Baidu released its open source benchmarking tool, DeepBench. This tool measures the performance of deep learning operations on various hardware, for both training and inference.
Sharan will discuss the different techniques involved in training deep learning models and the challenges of measuring their performance. Inference differs from training in important ways, and speeding up inference is essential for seeing major improvements and advancing deep learning. Sharan will then discuss the key differences between inference and training, cover various techniques used to speed up deep learning inference, and walk audience members through the results achieved on different platforms.
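As a toy illustration (not DeepBench itself) of why inference needs different treatment, the snippet below times the same layer at a training-style batch size and an inference-style batch size of 1; small batches leave most of the hardware's parallelism idle, so samples/second drops sharply:

```python
import time
import numpy as np

W = np.random.randn(4096, 4096).astype(np.float32)  # one dense layer's weights

def throughput(batch, reps=50):
    x = np.random.randn(batch, 4096).astype(np.float32)
    t0 = time.perf_counter()
    for _ in range(reps):
        x @ W  # the layer's forward pass: a single GEMM
    return batch * reps / (time.perf_counter() - t0)

print('batch 256:', throughput(256), 'samples/s')  # training-style
print('batch   1:', throughput(1), 'samples/s')    # inference-style
```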
Peter Zhokhov, Sentient / StudioML
Studio.ML is a model management framework written in Python that simplifies and speeds up model building. It was developed to minimize the overhead of scheduling, running, monitoring, and managing the artifacts of your machine learning experiments in Python, without intruding on your code.
Peter will demonstrate how to train deep learning models with Studio.ML locally, in the cloud, and on custom compute.
Boris Ginsburg, NVIDIA
Training Deep Convolutional Networks with XL Batch
In this talk we discuss large-batch training for deep convolutional networks. We found that the current recipe (linear scaling of the learning rate with a "warm-up" start) fails on AlexNet with a batch size of 4096.
The loss in accuracy for a large batch is related primarily to two factors: a higher training loss (the "optimization gap") and an increased gap between test loss and training loss (the "generalization gap"). Adding Batch Normalization (BN) layers to AlexNet helps to close the generalization gap, but BN alone is not enough to close the optimization gap. We found that simple linear scaling of the initial learning rate makes the network diverge even with learning-rate warm-up. To overcome this problem we propose the Layer-wise Adaptive Rate Scaling (LARS) algorithm, which dynamically adjusts the learning rate for each layer based on the magnitude of its weights relative to the magnitude of its gradients. Using LARS we were able to train AlexNet and ResNet-50 with a batch size of 16K.
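For reference, here is a minimal sketch of a LARS-style update (assuming plain SGD with weight decay and no momentum; the talk's implementation may differ):

```python
import numpy as np

def lars_step(weights, grads, global_lr=0.01, trust=0.001, weight_decay=5e-4):
    # weights, grads: lists of per-layer arrays, updated in place.
    for w, g in zip(weights, grads):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Per-layer learning rate proportional to ||w|| / ||g||: layers whose
        # gradients are large relative to their weights take smaller, safer
        # steps. The epsilon guards against zero norms at initialization.
        local_lr = trust * w_norm / (g_norm + weight_decay * w_norm + 1e-9)
        w -= global_lr * local_lr * (g + weight_decay * w)
```

The per-layer rate is what lets the global learning rate scale with the batch size without any single layer diverging.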