1. Bighead: Airbnb's End-to-End Machine Learning Infrastructure
Andrew Hoh and Krishna Puttaswamy, ML Infra @ Airbnb

2. Q4 2016: Formation of our ML Infra team

In 2016:
● Only a few major models in production
● Models took on average 8 to 12 weeks to build
● Everything built in Aerosolve, Spark, and Scala
● No support for TensorFlow, PyTorch, scikit-learn, or other popular ML packages
● Significant discrepancies between offline and online data

ML Infra was formed with the charter to:
● Enable more users to build ML products
● Reduce time and effort
● Enable easier model evaluation

3. ML has had a massive impact on Airbnb's product

Before ML Infrastructure:
● Search Ranking
● Smart Pricing
● Fraud Detection

4. But there were many other areas with high potential for ML that had yet to realize it.

After ML Infrastructure:
● Paid Growth - Hosts
● Classifying Listings
● Room Type Categorizations
● Experience Ranking + Personalization
● Host Availability
● Business Travel Classifier
● Make Listing a Space Easier
● Customer Service Ticket Routing
● … and many more

5. Vision
Airbnb routinely ships ML-powered features throughout the product.

Mission
Equip Airbnb with shared technology to build production-ready ML applications with no incidental complexity.
(Technology = tools, platforms, knowledge, shared feature data, etc.)

6. Value of ML Infrastructure

Machine learning infrastructure can:
● Remove incidental complexity by providing generic, reusable solutions
● Simplify the workflow by providing tooling, libraries, and environments that make ML development more efficient

And at the same time:
● Establish a standardized platform that enables cross-company sharing of feature data and model components
● "Make it easy to do the right thing" (e.g., consistent training/streaming/scoring logic)

7. Bighead: Motivations

8. Q1 2017: Figuring out what to build

Learnings:
● No consistency between ML workflows
● New teams struggle to begin using ML
● Airbnb has a wide variety of ML use cases
● Existing ML workflows are slow, fragmented, and brittle
● Incidental complexity vs. intrinsic complexity
● "Build and forget" - ML treated as a linear process



11. Key Design Decisions
● Consistent environment across the stack with Docker
● Consistent data transformation
○ Multi-row aggregation happens in the warehouse; single-row transformation is part of the model
○ Model transformation code is the same online and offline
● Common workflow across different ML frameworks
○ Supports scikit-learn, TensorFlow, PyTorch, etc.
● Modular components
○ Easy to customize parts
○ Easy to share data/pipelines
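The "consistent data transformation" decision can be read as: row-level feature logic ships with the model, so the exact same function runs during offline training and online scoring. A minimal sketch of that idea in Python (function and field names are illustrative, not Bighead's actual API):

```python
import math

def transform_row(raw: dict) -> dict:
    """Single-row transform packaged with the model; runs identically
    offline and online. Multi-row aggregates come from the warehouse."""
    return {
        "price_log": math.log(raw["price"]) if raw["price"] > 0 else 0.0,
        "is_superhost": 1 if raw.get("superhost") else 0,
    }

def score_offline(rows, model):
    # Batch path: map the same transform over a training/backfill set.
    return [model(transform_row(r)) for r in rows]

def score_online(request, model):
    # Serving path: one request, same transform, no drift.
    return model(transform_row(request))
```

Because both paths call `transform_row`, there is no hand-translated serving copy of the feature code to fall out of sync with training.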

12. Bighead Architecture
[Architecture diagram. Offline: Zipline features, Feature Repository, Redspot, Zipline Data App for ML, ML Automator (w/ Airflow), ML DAG Generator. Online: Deep Thought Service, BigQueue Service, Bighead Service, Zipline Worker, Models (K/V Store), Clients. Shared: Docker Image Service, Bighead UI, Bighead library, Deployment, user ML models in git repos.]

13. Components
● Data Management: Zipline
● Training: Redspot / BigQueue
● Core ML Library: ML Pipeline
● Productionisation: Deep Thought (online) / ML Automator (offline)
● Model Management: Bighead Service

14. Zipline (ML Data Management Framework)

15. Zipline - Why
● Defining features (especially windowed ones) with Hive was complicated and error-prone
● Backfilling training sets (on inefficient Hive queries) was a major bottleneck
● No feature sharing
● Inconsistent offline and online datasets
● The warehouse is built as of end-of-day; it lacked point-in-time features
● ML data pipelines lacked data quality checks and monitoring
● Ownership of pipelines was in disarray

16. For information on Zipline, please watch the recording of our other Spark Summit session: Zipline: Airbnb's Machine Learning Data Management Platform

17. Redspot (Hosted Jupyter Notebook Service)

18. Bighead Architecture
[Architecture diagram repeated from slide 12.]

19. Redspot - Why
● Started with JupyterHub (an open-source project), which manages multiple Jupyter notebook servers (prototyping environment)
● But users were installing packages locally and then creating virtualenvs for other parts of our infra
○ The environment was very fragile
● Users wanted to run JupyterHub on larger instances or instances with GPUs
● Sharing notebooks with teammates was a common need too
● Files/content needed to be resilient to node failures

20. Containerized environments
● Every user's environment is containerized via Docker
○ Allows customizing the notebook environment without affecting other users
■ e.g., installing system/Python packages
○ Easier to restore state, which helps with reproducibility
● Support for custom Docker images
○ Base images tailored to users' needs
■ e.g., GPU access, pre-installed ML packages



23. Remote Instance Spawner
● For bigger jobs and total isolation, Redspot allows launching a dedicated instance
● Hardware resources are not shared with other users
● Idle instances are automatically terminated by a periodic sweep
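The idle-instance cleanup could be as simple as a periodic sweep comparing each instance's last recorded activity against a timeout. A toy sketch of that pattern (the threshold and data shape are assumptions, not Redspot internals):

```python
import time

IDLE_TIMEOUT_S = 2 * 60 * 60  # hypothetical threshold: 2 hours

def find_idle_instances(last_activity: dict, now: float = None) -> list:
    """Return ids of instances idle longer than the timeout.
    `last_activity` maps instance id -> last-activity unix timestamp."""
    now = time.time() if now is None else now
    return [iid for iid, last in last_activity.items()
            if now - last > IDLE_TIMEOUT_S]
```

A scheduler (e.g., a cron job or Airflow task) would call this periodically and terminate the returned instances.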

24. Redspot Deployment
[Deployment diagram: Users → JupyterHub (backed by MySQL and EFS) running Local Docker Containers (x91) via the Docker daemon on an X1e.32xlarge host (128 vCPUs, 3.9 TB); Remote Instances (x32) running redspot-singleuser, redspot-singleuser-gpu, and redspot-standalone images; Data Backend: S3, Hive, Presto, Spark.]

25. Docker Image Repo/Service
● Native Dockerfiles enforce strict single inheritance
○ Prevents composition of base images
○ Might lead to copy/pasting Dockerfile snippets around
● A git repo of Dockerfiles for each stage, plus a yml file expressing:
○ Pre-build/post-build commands
○ Build-time/runtime dependencies (mounting directories, Docker runtime)
● Image builder:
○ A build-flow tool for chaining stages to produce a single image
○ Builds independent images in parallel

Example yml entry (from the slide):

    ubuntu16.04-py3.6-cuda9-cudnn7:
      base: ubuntu14.04-py3.6
      description: "A base Ubuntu 16.04 image with python 3.6, CUDA 9, CUDNN 7"
      stages:
        - cuda/9.0
        - cudnn/7
      args:
        python_version: '3.6'
        cuda_version: '9.0'
        cudnn_version: '7.0'
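The "build-flow tool for chaining stages" can be pictured as resolving a target image's `base` pointer recursively, yielding an ordered list of builds, base first. A simplified sketch under that assumption (image names other than those on the slide are hypothetical):

```python
def resolve_build_chain(images: dict, target: str) -> list:
    """Given image specs (name -> {'base': ..., 'stages': [...]}),
    return the ordered (name, stages) builds needed to produce
    `target`, base first. Unknown names are treated as external
    base images that need no build."""
    spec = images.get(target)
    if spec is None:
        return []  # external base image: nothing to build
    chain = resolve_build_chain(images, spec.get("base"))
    # Each stage contributes a Dockerfile snippet layered onto the base.
    chain.append((target, spec.get("stages", [])))
    return chain
```

Independent chains (those not sharing intermediate images) can then be built in parallel, as the slide notes.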

26. Redspot Summary
● A multi-tenant notebook environment
● Makes it easy to iterate on and prototype ML models, and to share work
○ Integrated with the rest of our infra, so one can deploy a notebook to prod
● Improved upon open-source JupyterHub
○ Containerized; users can bring custom Docker environments
○ Remote notebook spawner for dedicated instances (P3 and X1 machines on AWS)
○ Notebooks persisted in EFS and shared with teams
○ Reverting to a prior checkpoint
● Supports 200+ weekly active users

27. Bighead Library

28. Bighead Library - Why
● Transformations (NLP, images) are often re-written by different users
● No clear abstraction for data transformation in the model
○ Every user can do data processing in a different way, leading to confusion
○ Users can easily write inefficient code
○ Feature metadata is lost during transformation (can't plot feature importance)
○ CPU/GPU needs special handling
○ No visualization of transformations
● Visualizing and understanding input data is key
○ But few good libraries exist to do so

29. Bighead Library
● A library of transformations; holds more than 100 different transformations, including automated preprocessing for common input formats (NLP, images, etc.)
● Pipeline abstraction to build a DAG of transformations on input data
○ Propagates feature metadata so we can plot feature importance at the end and connect it to feature names
○ Pipelines for data processing are reusable in other pipelines
○ Feature-parallel and data-parallel transformations
○ CPU and GPU support
○ Supports scikit-learn APIs
● Wrappers for model frameworks (XGBoost, TensorFlow, etc.) so they can be easily serialized/deserialized (robust to minor version changes)
● Provides training-data visualization to help identify data issues
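The pipeline abstraction with metadata propagation can be sketched as a chain of named steps, each declaring its output feature names so importances computed downstream can be mapped back to human-readable names. This is an illustrative toy (a linear chain rather than a full DAG, and not the real Bighead Library API):

```python
class Step:
    """A named transformation that declares the features it emits."""
    def __init__(self, name, fn, out_features):
        self.name, self.fn, self.out_features = name, fn, out_features

class Pipeline:
    """Linear chain for brevity; the real abstraction is a DAG."""
    def __init__(self, steps):
        self.steps = steps

    def transform(self, rows):
        for step in self.steps:
            rows = [step.fn(r) for r in rows]
        return rows

    def feature_names(self):
        # Metadata survives the transformation chain, so feature
        # importance plots can reference the final feature names.
        return self.steps[-1].out_features
```

Carrying `out_features` alongside the data is what lets a trained model's importance scores be labeled, addressing the "loss of feature metadata" problem from the previous slide.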