Putting Deep Learning into Production

Deep learning models are achieving state-of-the-art results in speech, image and video classification, and numerous other areas, but deploying them to production often involves a unique set of challenges, including prediction latency, significant training cost, and device memory requirements.

This conference will focus on best practices for deploying deep learning models into production. Speakers will discuss topics such as:

  • Ways to speed up training time
  • Using pre-trained models
  • Transferring knowledge from a different task
  • Reducing model size to improve prediction latency
  • Fitting models onto devices


Jan 21, 2017, 9:00a - 5:00p



Capital One
201 3rd St, 5th Floor
San Francisco


Jeremy Howard

Techniques for dealing with large datasets

We will look at a number of tricks for handling large datasets, with a particular focus on image datasets. These techniques are equally valuable for training and for production. Using them, we will learn how to process the ImageNet dataset on a single GPU in a relatively short amount of time.

Some of the topics that we will cover include:

  • Decoding and resizing images and caching them in efficient bcolz arrays
  • Processing memory-mapped arrays in chunks
  • Strategies for using fast solid-state storage more efficiently
  • Pre-computing partial data augmentations, zooms, and crops
  • Identifying when and how to pre-compute results of intermediate layers of neural networks
  • Parallelising preprocessing
  • Using SIMD
  • Adding batch normalisation to existing models
  • Changing dropout probabilities without retraining
  • Processing samples on cheap CPU instances to save time and money
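
As a rough illustration of processing a memory-mapped array in chunks, here is a minimal sketch. It is not taken from the talk: it assumes a raw file of native float64 values and uses only Python's standard library (mmap, array) rather than bcolz. The names write_demo_file and chunked_mean are hypothetical.

```python
import mmap
import os
import tempfile
from array import array


def write_demo_file(values):
    """Write values as raw native float64 to a temporary file; return its path."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        array("d", values).tofile(f)
    return path


def chunked_mean(path, chunk_items=1024):
    """Mean of the float64 values in a raw file, touching one chunk at a time.

    The file is memory-mapped, so the operating system pages data in on
    demand instead of the whole array being loaded up front.
    """
    total, count = 0.0, 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            view = memoryview(mm).cast("d")  # interpret the bytes as doubles
            try:
                for start in range(0, len(view), chunk_items):
                    chunk = view[start:start + chunk_items]
                    total += sum(chunk)
                    count += len(chunk)
                    chunk.release()
            finally:
                # Views must be released before the mmap can be closed.
                view.release()
    return total / count


if __name__ == "__main__":
    demo_path = write_demo_file([float(i) for i in range(10)])
    print(chunked_mean(demo_path, chunk_items=3))
    os.remove(demo_path)
```

The same loop shape applies to any per-chunk computation (resizing, normalising, feature extraction); only the body that consumes each chunk changes.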

Illia Polosukhin
Google / TensorFlow Contributor

TensorFlow has taken the deep learning world by storm. This workshop will be led by one of TensorFlow’s main contributors, Illia Polosukhin. Illia’s hands-on workshop will cover:

  • Using TensorFlow Serving and the XLA compiler
  • Reducing the memory footprint and latency of deep learning models for deployment
  • Distilling expensive deep learning models into smaller models
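
For readers unfamiliar with distillation, the following is a minimal sketch of one common formulation, not necessarily the one used in the workshop: the small (student) model is trained to match the large (teacher) model's temperature-softened output distribution. It uses plain Python rather than TensorFlow, and the function names and the default temperature of 2.0 are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    The loss is smallest when the student's softened outputs match
    the teacher's.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]
    print(distillation_loss(teacher, [4.0, 1.0, 0.5]))  # student agrees
    print(distillation_loss(teacher, [0.5, 1.0, 4.0]))  # student disagrees
```

In practice this term is usually combined with the ordinary loss on the true labels, and the gradients are taken with respect to the student's parameters only.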

Arshak Navruzyan

Billions of events are created every day by applications that collect telemetry, capture video, and track the movement of physical assets. Although this data is rich, it can be hard to interpret and to extract value from.

Events may not have explicit labels (i.e., business meaning), but they contain useful structure that deep learning algorithms can discover. A domain expert can quickly turn the discovered structure into labels to automate predictions in a business process.

I will discuss (and demonstrate) some recent work at Adversarial.AI to turn event streams into AI applications.

Abhradeep Guha Thakurta
University of California Santa Cruz

Machine learning has fundamentally transformed the way we interact with many networked devices around us. However, machine learning's effectiveness also raises profound concerns about privacy: how we control the collection and use of our information. The tension between collecting users’ information to improve organizations' sales revenue (e.g., via targeted advertising) and the corresponding privacy concerns is increasing at an alarming rate. In this talk, I will introduce privacy-preserving algorithms for large-scale machine learning. These algorithms preserve a rigorous privacy guarantee (differential privacy) and have provable utility guarantees. Furthermore, they are amenable to highly distributed systems (e.g., learning on data samples from millions of smartphones). We will illustrate this via a case study of classifying emails as junk or non-junk. To that end, we will use variants of classic algorithms like gradient descent and cutting plane, as well as new algorithmic ideas based on functional approximation. If time permits, I will provide some code examples for these algorithms in Python.
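
To give a flavour of what a privacy-oriented variant of gradient descent can look like, here is a minimal sketch of one widely used pattern: clip each example's gradient to bound its influence, then add Gaussian noise before averaging. This is an illustration under stated assumptions, not the speaker's algorithm. The function names and parameters are hypothetical, and the sketch does not calibrate the noise to a privacy budget, so by itself it makes no formal differential privacy claim.

```python
import math
import random


def clip(vector, max_norm):
    """Scale vector down so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vector]


def noisy_gradient_step(weights, per_example_grads, lr, max_norm, noise_std, rng):
    """One gradient-descent step on clipped, noised per-example gradients.

    noise_std is a multiplier on max_norm; choosing it to meet a given
    privacy budget is outside the scope of this sketch.
    """
    n = len(per_example_grads)
    summed = [0.0] * len(weights)
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    noisy_mean = [(s + rng.gauss(0.0, noise_std * max_norm)) / n for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(noisy_gradient_step([0.0, 0.0], grads, lr=1.0,
                              max_norm=1.0, noise_std=0.5, rng=rng))
```

Clipping bounds how much any single example can move the model, which is what lets the added noise mask an individual's contribution.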

Andres Rodriguez
Intel Nervana

Deep learning is unlocking tremendous economic value across various market sectors. Individual data scientists can draw on several open source frameworks and basic hardware resources during the initial investigative phases, but they quickly require significant hardware and software resources to build and deploy production models. Intel offers a range of software and hardware to support a diversity of workloads and user needs. Intel Nervana delivers a competitive deep learning platform that makes it easy for data scientists to start from the iterative, investigatory phase and take models all the way to deployment. This platform is designed for speed and scale, and serves as a catalyst for all types of organizations to benefit from the full potential of deep learning. Examples of supported applications include, but are not limited to, automotive speech interfaces, image search, language translation, agricultural robotics and genomics, financial document summarization, and finding anomalies in IoT data.

Chris Fregly

In this completely demo-based talk, Chris Fregly from PipelineIO will demo the latest 100% open source research in high-scale, fault-tolerant model serving using TensorFlow, Spark ML, Jupyter Notebook, Docker, Kubernetes, and NetflixOSS Microservices.

This talk will discuss the trade-offs of mutable vs. immutable model deployments, on-the-fly JVM byte-code generation, global request batching, microservice circuit breakers, and dynamic cluster scaling, all from within a Jupyter notebook.

Alex Miller

Yelp users have uploaded millions of photos, and the rate of photos being added is only increasing. In order to deliver the best experience for these users, the photo understanding team has used deep learning to identify the most beautiful photos and display them throughout the site. In this talk we discuss the motivation for using a deep learning approach, explain how it was implemented, and show some illustrative results.

Michael Mahoney
UC Berkeley Department of Statistics


One of the most important technical aspects of recent work in deep learning is that it is computationally intensive in ways that most machine learning problems are not. This presents an opportunity to explore the productivity-performance space in high-performance and high-productivity computing, two areas that have developed largely independently in scientific computing and databases.

Motivated by this, we explore the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity), and CX (for data interpretability). We apply these methods to terabyte-sized problems in particle physics, climate modeling, and bio-imaging, as use cases where interpretable analytics is of interest. The data matrices are tall and skinny, which enables the algorithms to map conveniently onto Spark’s data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.

We'll conclude with a discussion of recent work on why traditional approaches to matrix and graph algorithms are not particularly well suited to deep learning applications, possible solutions to this, and the productivity-performance trade-offs these will entail.

