Applying Deep Learning To Airbnb Search


Malay Haldar, Mustafa Abdool, Prashant Ramanathan, Tao Xu, Shulin Yang, Huizhong Duan, Qing Zhang, Nick Barrow-Williams, Bradley C. Turnbull, Brendan M. Collins and Thomas Legrand
Airbnb Inc.
malay.haldar@airbnb.com

arXiv:1810.09591v2 [cs.LG] 24 Oct 2018

ABSTRACT

The application to search ranking is one of the biggest machine learning success stories at Airbnb. Much of the initial gains were driven by a gradient boosted decision tree model. The gains, however, plateaued over time. This paper discusses the work done in applying neural networks in an attempt to break out of that plateau. We present our perspective not with the intention of pushing the frontier of new modeling techniques. Instead, ours is a story of the elements we found useful in applying neural networks to a real life product. Deep learning was steep learning for us. To other teams embarking on similar journeys, we hope an account of our struggles and triumphs will provide some useful pointers. Bon voyage!

CCS CONCEPTS

• Retrieval models and ranking → Learning to rank; • Machine learning approaches → Neural networks; • Electronic commerce → Online shopping;

KEYWORDS

Search ranking, Deep learning, e-commerce

1 INTRODUCTION

The home sharing platform at Airbnb is a two sided marketplace for hosts to rent out their spaces, referred to as listings, to be booked by prospective guests from all around the world. A typical booking starts with the guest issuing a search at airbnb.com for homes available in a particular geographic location. The task of search ranking is to respond to the guest with an ordered list of a handful of listings from the thousands available in the inventory.

The very first implementation of search ranking was a manually crafted scoring function. Replacing the manual scoring function with a gradient boosted decision tree (GBDT) model gave one of the largest step improvements in homes bookings in Airbnb's history, with many successful iterations to follow. The gains in online bookings eventually saturated with long spells of neutral experiments. This made the moment ripe for trying sweeping changes to the system.

Starting with this background, the current paper discusses our experiences in transitioning one of the at-scale search engines on the internet to deep learning. The paper is targeted towards teams that have a machine learning system in place and are starting to think about neural networks (NNs). For teams starting to explore machine learning, we would recommend a look at [27] as well.

The search ranking model under discussion is part of an ecosystem of models, all of which contribute towards deciding which listings to present to the guest. These include models that predict the likelihood of a host accepting the guest's request for booking, models that predict the probability the guest will rate the on-trip experience 5-star, etc. Our current discussion is focused on one particular model in this ecosystem. Considered the most complex piece of the ranking puzzle, this model is responsible for ordering the listings available according to the guest's likelihood of booking.

A typical guest search session is depicted in Figure 1. It is common for guests to do multiple searches, clicking through some of the listings to view their details page. Successful sessions end with the guest booking a listing. The searches performed by a guest and their interactions are logged. While training, a new model has access to the logs of the current and previous models used in production. The new model is trained to learn a scoring function that assigns impressions of the booked listings in the logs as high a rank as possible, similar to [19]. The new model is then tested online in an A/B testing framework to see if it can achieve a statistically significant increase in conversions compared to the current model.

[Figure 1: Example search session]

An overview of the paper: we start off with a summary of how the model architecture evolved over time. This is followed by feature engineering and system engineering considerations. We then describe some of our tooling and hyper-parameter explorations, finishing with a retrospective.

2 MODEL EVOLUTION
Our transition to deep learning wasn't the result of an atomic move; it was the culmination of a series of iterative refinements. Figure 2a shows the comparative improvements in our principal offline metric NDCG (normalized discounted cumulative gain), where the impression of the booked listing is assigned a relevance of 1, and all other impressions 0 relevance. The x-axis depicts the models along with the time they were launched to production.
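With this relevance definition, NDCG collapses to the positional discount at the booked listing's predicted rank, since a single relevant item makes the ideal DCG equal to 1. A minimal sketch of the computation for one search (the helper name is ours, not from our codebase):

```python
import numpy as np

def ndcg_booked(scores, booked_idx):
    """NDCG for a search where only the booked listing has relevance 1.

    scores: model scores for the impressions of one search.
    booked_idx: index of the booked listing among those impressions.
    With one relevant item, IDCG = 1 and NDCG reduces to the discount
    at the booked listing's rank.
    """
    order = np.argsort(-np.asarray(scores))  # indices sorted by descending score
    rank = int(np.where(order == booked_idx)[0][0])  # 0-based rank
    return 1.0 / np.log2(2.0 + rank)

# Example: the booked listing (index 2) is ranked first, so NDCG = 1.0
print(ndcg_booked([0.1, 0.4, 0.9, 0.3], booked_idx=2))
```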

Figure 2b shows the comparative increase in conversions from the models. Overall, this represents one of the most impactful applications of machine learning at Airbnb. The sections to follow briefly describe each of these models.

[Figure 2: Relative gains across models. (a) NDCG gains offline; (b) booking gains online]

2.1 Simple NN

Andrej Karpathy has advice regarding model architecture: don't be a hero [11]. Well, that's not how we started. Driven by "Why can't we be heroes?", we started off with some intricate custom architectures, only to get overwhelmed by their complexity and ended up consuming cycles.

The first architecture that we finally managed to get online was a simple single hidden layer NN with 32 fully connected ReLU activations that proved booking neutral against the GBDT model. The NN was fed by the same features as the GBDT model. The training objective for the NN was also kept invariant w.r.t the GBDT model: minimizing the L2 regression loss where booked listings are assigned a utility of 1.0 and listings that are not booked a utility of 0.0.

The value of the whole exercise was that it validated that the entire NN pipeline was production ready and capable of serving live traffic. Aspects of this pipeline are discussed later under the feature engineering and system engineering sections.

2.2 Lambdarank NN

Not being a hero got us off to a start, but not very far. In time we would adapt Karpathy's advice to: don't be a hero, in the beginning. Our first breakthrough came when we combined a NN with the idea behind Lambdarank [2]. Offline we were using NDCG as our principal metric. Lambdarank gave us a way to directly optimize the NN for NDCG. This involved two crucial improvements over the regression based formulation of the simple NN:

• Moving to a pairwise preference formulation where the listings seen by a booker were used to construct pairs of {booked listing, not-booked listing} as training examples. During training we minimized the cross entropy loss of the score difference between the booked listing and the not-booked listing.
• Weighing each pairwise loss by the difference in NDCG resulting from swapping the positions of the two listings making up the pair. This prioritized the rank optimization of the booked listing towards the top of the search result list, instead of the bottom. For example, improving the rank of a booked listing from position 2 to 1 would get priority over moving a booked listing from position 10 to 9.

Table 1 shows a partial implementation in TensorFlow, in particular, how the pairwise loss was weighted.

```python
import numpy as np
import tensorflow as tf

NUM_SAMPLES = 10  # listings per search: column 0 is booked, the rest are not

def apply_discount(x):
    # Apply positional discount curve.
    return np.log(2.0) / np.log(2.0 + x)

def compute_weights(logit_op, session):
    """Compute loss weights based on delta NDCG.

    logit_op is a [BATCH_SIZE, NUM_SAMPLES] shaped tensor corresponding
    to the output layer of the network. Each row corresponds to a search
    and each column a listing in the search result. Column 0 is the
    booked listing, while columns 1 through NUM_SAMPLES - 1 are the
    not-booked listings.
    """
    logit_vals = session.run(logit_op)
    # Double argsort yields each listing's rank within its search,
    # with rank 0 assigned to the highest scoring listing.
    ranks = NUM_SAMPLES - 1 - logit_vals.argsort(axis=1).argsort(axis=1)
    discounted_non_booking = apply_discount(ranks[:, 1:])
    discounted_booking = apply_discount(np.expand_dims(ranks[:, 0], axis=1))
    discounted_weights = np.abs(discounted_booking - discounted_non_booking)
    return discounted_weights

# logit_op and session come from the surrounding training loop.
# Compute the pairwise loss: booked listing logit vs. each not-booked logit.
pairwise_logits = tf.expand_dims(logit_op[:, 0], axis=1) - logit_op[:, 1:]
pairwise_loss = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.ones_like(pairwise_logits), logits=pairwise_logits)
# Compute the Lambdarank weights based on delta NDCG.
weights = compute_weights(logit_op, session)
# Multiply pairwise loss by Lambdarank weights.
loss = tf.reduce_mean(tf.multiply(pairwise_loss, weights))
```

Table 1: TensorFlow code to adapt pairwise loss to Lambdarank.

2.3 Decision Tree/Factorization Machine NN

While the main ranking model serving production traffic was a NN at this point, we had other models under research. The notable ones were:

• Iterations on the GBDT model with alternative ways to sample searches for constructing the training data.
• A factorization machine (FM) [14] model that predicted the booking probability of a listing given a query, by mapping listings and queries to a 32-dimensional space.

These new ways of looking at the search ranking problem revealed something interesting: although performances of the alternative models on test data were comparable to the NN, the listings upranked by them were quite different. Inspired by NN architectures like [23], the new model was an attempt to combine the strengths of all three models. For the FM model we took the final prediction as a feature into the NN. From the GBDT model, we took the index of the leaf node activated per tree as a categorical feature. Figure 3 gives an overview.

[Figure 3: NN with GBDT tree nodes and FM prediction as features: an output logit fed by a hidden ReLU layer over embeddings of the GBDT leaf indices, the FM prediction, and other features]

2.4 Deep NN

The complexity of the model at this point was staggering, and some of the issues mentioned in [17] began to rear their heads. In our final leap, we were able to deprecate all that complexity by simply scaling the training data 10x and moving to a DNN with 2 hidden layers. Typical configuration of the network: an input layer with a total of 195 features after expanding categorical features to embeddings, feeding the first hidden layer with 127 fully connected ReLUs, and then the second hidden layer with 83 fully connected ReLUs.
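A network of this shape is only a few lines in TensorFlow. The sketch below is our reading of the stated layer sizes, not the production code; the use of tf.layers and the layer names are assumptions:

```python
import tensorflow as tf

def dnn_logits(features):
    """Two-hidden-layer feed-forward ranker matching the stated sizes:
    195 inputs -> 127 ReLUs -> 83 ReLUs -> 1 output logit."""
    hidden1 = tf.layers.dense(features, 127, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, 83, activation=tf.nn.relu, name="hidden2")
    return tf.layers.dense(hidden2, 1, name="logit")

# features: [batch_size, 195] after expanding categoricals to embeddings.
features = tf.placeholder(tf.float32, shape=[None, 195])
logits = dnn_logits(features)
```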
The features feeding the DNN were mostly simple properties of the listings such as price, amenities, historical booking count, etc, fed directly with minimal feature engineering. Exceptions include features output from another model:

• Price of listings that have the Smart Pricing feature enabled, supplied by a specialized model [24].
• Similarity of the listing to the past views of the user, computed based on co-view embeddings [9].

These models tap into data that isn't directly part of the search ranking training examples, providing the DNN with additional information.

To take a closer look at the DNN, we plot its learning curve in Figure 4, comparing NDCG over the training and test data set as a function of the number of training pairs. Training on 1.7 billion pairs, we were able to close the generalization gap between the training and the test data set.

[Figure 4: DNN learning curve]

Could we have launched the DNN directly, skipping all the stages of evolution? We try to answer this in the retrospective section, once more of the context surrounding the model is in place.

As a side note, while DNNs have achieved human level performance on certain image applications [21], it is very hard for us to judge where we stand for a similar comparison. Part of the problem is that it's unclear how to define human level performance. Going through the logs, it's quite challenging for us to identify which listing was booked. We find no objective truth in the logs, only tradeoffs highly conditional upon the budget and tastes of the guest which remain mostly hidden. Other researchers [10] note the difficulty in using human evaluation even for familiar shopping items. For our application these difficulties are further exacerbated due to the novelty of the inventory.

Speaking of difficulties, next we discuss something rarely talked about: failed attempts.

3 FAILED MODELS

The narrative of one successful launch followed by another presented in the previous section doesn't tell the whole story. Reality is studded with unsuccessful attempts that outnumber the successful ones. Retelling every attempt made would be time consuming, so we pick two particularly interesting ones. These models are interesting because they illustrate how some technique that is very effective and popular in the wild may not work as well when brought home.

3.1 Listing ID

Each listing at Airbnb has a corresponding unique id. One of the exciting new opportunities unlocked by NNs was to use these listing ids as features. The idea was to use the listing ids as an index into an embedding, which allowed us to learn a vector representation per listing, encoding their unique properties. A reason for our excitement was the success other applications had achieved in mapping such high cardinality categorical features to embeddings, such as learning embeddings for words in NLP applications [6], learning embeddings for video and user id in recommender systems [4], etc.
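Mechanically, the feature is just a hashed id indexing a trainable table. A sketch of that mechanism (our illustration, not the production setup; the bucket count and embedding dimension are hypothetical):

```python
import tensorflow as tf

NUM_BUCKETS = 1000000  # hypothetical hash bucket count
EMBEDDING_DIM = 32     # hypothetical embedding size

# Trainable table: one learned vector per listing id bucket.
listing_embeddings = tf.get_variable(
    "listing_id_embedding", shape=[NUM_BUCKETS, EMBEDDING_DIM])

listing_ids = tf.placeholder(tf.int64, shape=[None])
bucket = tf.string_to_hash_bucket_fast(tf.as_string(listing_ids), NUM_BUCKETS)
id_vector = tf.nn.embedding_lookup(listing_embeddings, bucket)  # [batch, 32]
```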

However, in the different variations we tried, listing ids mostly led to overfitting. Figure 5 plots the learning curve from one such attempt, where we saw significant improvement in NDCG on the training set, but none on the test set.

[Figure 5: Learning curve with listing id feature]

The reason why such an established technique fails at Airbnb is because of some unique properties of the underlying marketplace. The embeddings need substantial amounts of data per item to converge to reasonable values. When items can be repeated without constraints, such as online videos or words in a language, there is no limit to the amount of user interaction an item can have. Obtaining large amounts of data for the items of interest is relatively easy. Listings, on the other hand, are subjected to constraints from the physical world. Even the most popular listing can be booked at most 365 times in an entire year. Typical bookings per listing are much fewer. This fundamental limitation generates data that is very sparse at the listing level. The overfitting is a direct fallout of this limitation.

3.2 Multi-task learning

While bookings have a physical limitation, user views of the listing details pages are not constrained in the same way, and those we have in spades. Figure 6 shows the distribution of views to bookings ratio for listings, with bookings typically orders of magnitude more sparse than views. Taking a step further, we found long views of listing details pages, unsurprisingly, correlated with bookings.

[Figure 6: Distribution of views to booking ratio across listings]

To tackle the overfitting listing ids, we built a model taking a page out of the multi-task learning playbook [15]. The model simultaneously predicted the probability of booking and long view using two separate output layers; one optimized the loss with the booked listings as positive labels and the other targeted long views. Both output layers shared a common hidden layer as shown in Figure 7. Most importantly, the listing id embedding was shared as it was in the fan-in of the hidden layer. The idea was that the model would be able to transfer its learnings from long views to predict bookings and avoid overfitting. Since the number of long view labels outnumbered the booking labels by orders of magnitude, a compensating higher weight was applied to the booking loss to preserve focus on the booking objective. The loss for each long view label was further scaled by log(view duration) as proposed in [25]. For scoring listings online, we used the booking prediction only.

[Figure 7: Multi-task architecture predicting bookings and long views with two output logits over a shared hidden ReLU layer, fed by the listing id embedding, continuous features, and categorical features]
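A minimal sketch of how such a weighted two-head loss could be wired up (our illustration based on the description above, not the production code; the hidden layer size and the booking weight value are made-up placeholders):

```python
import tensorflow as tf

BOOKING_LOSS_WEIGHT = 10.0  # hypothetical up-weight for the sparse booking head

features = tf.placeholder(tf.float32, [None, 195])
booking_label = tf.placeholder(tf.float32, [None])
long_view_label = tf.placeholder(tf.float32, [None])
view_duration = tf.placeholder(tf.float32, [None])

# Two output heads over a shared hidden layer (which sits on top of the
# shared listing id embedding and other features).
shared = tf.layers.dense(features, 64, activation=tf.nn.relu, name="shared")
booking_logit = tf.squeeze(tf.layers.dense(shared, 1, name="booking"), axis=1)
long_view_logit = tf.squeeze(tf.layers.dense(shared, 1, name="long_view"), axis=1)

booking_loss = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=booking_label, logits=booking_logit)
# Each long view example scaled by log(view duration), per [25].
long_view_loss = tf.log1p(view_duration) * tf.nn.sigmoid_cross_entropy_with_logits(
    labels=long_view_label, logits=long_view_logit)

loss = tf.reduce_mean(BOOKING_LOSS_WEIGHT * booking_loss + long_view_loss)
```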
When tested online, the model increased long views by a large margin. But bookings remained neutral. Manually inspecting listings which had a high ratio of long views compared to bookings, we found several possible reasons which could have resulted in this gap. Such long views could be driven by high-end but high priced listings, listings with long descriptions that are difficult to parse, or extremely unique and sometimes humorous listings, among several other reasons. Once again, it is the uniqueness of the Airbnb marketplace where long views are correlated with bookings, but have a large orthogonal component as well that makes predicting bookings based on them challenging. A better understanding of listing views continues to be a topic of research for us.

4 FEATURE ENGINEERING

The baseline GBDT pipeline we started with had extensive feature engineering. Typical transformations included computing ratios, averaging over windows, and other flavors of composition. The feature engineering tricks had accumulated over years of experimentation. Yet it was unclear if the features were the best they could be, or even up to date with the changing dynamics of the marketplace. A big attraction of NNs was to bring in feature automation, feeding raw data and letting the feature engineering happen in the hidden units of the NN driven by data.

Yet this section is dedicated to feature engineering, because we found that making NNs work efficiently involved a little more than feeding raw data. This flavor of feature engineering is different from the traditional one: instead of deriving the math to perform on the features before feeding them into the model, the focus shifts to ensuring the features comply with certain properties so that the NN can do the math effectively by itself.

4.1 Feature normalization

In our very first attempt at training a NN, we simply took all the features used to train the GBDT model and fed them to the NN. This went down very badly. The loss would saturate in the middle of training and additional steps would have no effect. We traced the issue to the fact that the features were not normalized properly.

For decision trees, the exact numeric values of the features hardly matter, as long as their relative ordering is meaningful. Neural networks on the other hand are quite sensitive to the numeric values the features take. Feeding values that are outside the usual range of features can cause large gradients to back propagate. This can permanently shut off activation functions like ReLU due to vanishing gradients [3]. To avoid this we ensure all features are restricted to a small range of values, with the bulk of the distribution in the {-1, 1} interval and the median mapped to 0. This by and large involves inspecting the features and applying either of two transforms:

• In case the feature distribution resembles a normal distribution, we transform it by (feature_val − µ)/σ, where µ is the feature mean and σ the standard deviation.
• If the feature distribution looks closer to a power law distribution, we transform it by log((1 + feature_val)/(1 + median)).
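Both transforms are one-liners. A sketch of the two formulas above in NumPy (the helper names are ours; statistics are computed over the training data):

```python
import numpy as np

def normalize_gaussian(feature_vals, mean, std):
    # (feature_val - mu) / sigma for roughly normal features.
    return (feature_vals - mean) / std

def normalize_power_law(feature_vals, median):
    # log((1 + feature_val) / (1 + median)) for power-law-like features,
    # mapping the median to 0.
    return np.log((1.0 + feature_vals) / (1.0 + median))

prices = np.array([50.0, 80.0, 120.0, 400.0, 2000.0])
print(normalize_power_law(prices, median=np.median(prices)))
```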
4.2 Feature distribution

In addition to mapping features to a restricted numerical range, we ensured most of them had a smooth distribution. Why obsess over the smoothness of distributions? Below are some of our reasons.

4.2.1 Spotting bugs. When dealing with hundreds of millions of feature samples, how can we verify a small fraction of them are not buggy? Range checks are useful but limited. We found smoothness of distribution an invaluable tool to spot bugs, as the distribution of errors often stands in contrast to the typical distribution. To give an example, we had bugs related to currencies in the prices logged for certain geographies. And for periods greater than 28 days, the logged price was the monthly price instead of the daily price. These errors showed up as spikes on the initial distribution plots.

4.2.2 Facilitating generalization. Answering exactly why DNNs are good at generalizing is a complicated topic at the forefront of research [26]. Meanwhile our working knowledge is based on the observation that in the DNNs built for our application, the outputs of the layers get progressively smoother in terms of their distributions. Figure 8 shows the distribution from the final output layer, while Figure 9 and Figure 10 show some samples from the hidden layers. For plotting the values from the hidden layers, the zeros have been omitted and the transform log(1 + relu_output) applied.

[Figure 8: Distribution of output layer. Figure 9: Example distributions from second hidden layer. Figure 10: Example distributions from first hidden layer.]

These plots drive our intuition for why DNNs may be generalizing well for our application. When building a model feeding on hundreds of features, the combinatorial space of all the feature values is impossibly large, and during training often only a fraction of the feature combinations are covered. The smooth distributions coming from the lower layers ensure that the upper layers can correctly "interpolate" the behavior for unseen values. Extending this intuition all the way to the input layer, we put our best effort to ensure the input features have a smooth distribution.

How can we test if the model is generalizing well beyond logged examples? The real test is of course online performance of the model, but we found the following technique useful as a sanity check: scaling all the values of a given feature in the test set, such as price to 2x, 3x, 4x etc., and observing changes in NDCG. We found that the model's performance was remarkably stable over these values it had never seen before.

Most features attained a smooth distribution once debugged and the appropriate normalization applied. However, for a few we had to do specialized feature engineering. An example is the geo location of a listing, represented by its latitude and longitude. Figure 11a and 11b show the distribution of raw lat/lng. To make the distribution smooth, instead of raw lat/lng we compute the offset from the center of the map displayed to the user. Shown in Figure 11c, the mass seems concentrated at the center due to the tail end of maps which are zoomed out a lot. So we take log() of the lat/lng offset, which yields the distribution in Figure 11d. This allows us to construct two features with smooth distributions, Figure 11e and Figure 11f.

To be clear, the raw lat/lng to offsets from map center transform is a lossy many-to-one function as it can convert multiple lat/lng to the same offset values. This allows the model to learn global properties based on distance rather than properties of specific geographies.
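Since offsets can be negative, a plain log() is undefined on half the range; the sketch below uses a sign-preserving log as one plausible reading of the transform, which is an assumption on our part:

```python
import numpy as np

def geo_offset_features(listing_lat, listing_lng, map_center_lat, map_center_lng):
    """Offset of a listing from the displayed map center, log-compressed.

    A signed log keeps negative offsets meaningful; the exact compression
    used in production is not spelled out in the text.
    """
    d_lat = listing_lat - map_center_lat
    d_lng = listing_lng - map_center_lng
    signed_log = lambda x: np.sign(x) * np.log1p(np.abs(x))
    return signed_log(d_lat), signed_log(d_lng)

print(geo_offset_features(37.79, -122.39, 37.76, -122.44))
```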

[Figure 11: Transforming geo location to smoothly distributed features: (a) distribution of raw lat, (b) distribution of raw lng, (c) heatmap of raw lat/lng offset, (d) heatmap of log lat/lng offset, (e) distribution of log lat offset, (f) distribution of log lng offset]

To learn properties localized to specific geographies we use high cardinality categorical features that we describe later.

4.2.3 Checking feature completeness. In some cases, investigating the lack of smoothness of certain features led to the discovery of features the model was missing. As an example, we had the fraction of available days a listing was occupied in the future as a signal of quality, the intuition being that high quality listings get sold out ahead of time. But the distribution of occupancy turned out to be perplexing in its lack of smoothness, shown in Figure 12a. After investigation we found an additional factor that influenced occupancy: listings had varying minimum stay requirements, sometimes extending to months, due to which they got occupied at different rates. However, we had not added the minimum required stay as a feature in the model as it was calendar dependent and considered too complex. But after looking at the occupancy distribution, we added average length of stay at the listing as a feature. Once occupancy is normalized by average length of stay, we see the distribution in Figure 12b. Some features that lack a smooth distribution in one dimension may become smooth in a higher dimension. It was helpful for us to think through if those dimensions were already available to the model and if not, then to add them.

[Figure 12: (a) Distribution of raw occupancy; (b) distribution of occupancy normalized by average length of stay]

4.3 High cardinality categorical features

The overfitting listing id was not the only high cardinality categorical feature we tried. There were other attempts, where true to the promise of NNs, we got high returns with little feature engineering. This is best demonstrated by a concrete example. The preference of guests for various neighborhoods of a city is an important location signal. For the GBDT model, this information was fed by a heavily engineered pipeline that tracked the hierarchical distribution of bookings over neighborhoods and cities. The effort involved in building and maintaining this pipeline was substantial. Yet it didn't factor in key elements like prices of the booked listings.

In the NN world, handling this information was simplicity itself. We created a new categorical feature by taking the city specified in the query and the level 12 S2 cell [16] corresponding to a listing, then mapping the two together to an integer using a hashing function. For example, given the query "San Francisco" and a listing near the Embarcadero, we take the S2 cell the listing is situated in (539058204), and hash {"San Francisco", 539058204} → 71829521 to build our categorical feature. These categorical features are then mapped to an embedding, which feeds the NN. During training, the model infers the embedding with back propagation, which encodes the location preference for the neighborhood represented by the S2 cell, given the city query.
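A sketch of the construction (our illustration; the hash function, bucket count, and embedding size below are placeholders, not the production choices):

```python
import tensorflow as tf

NUM_BUCKETS = 1000000  # hypothetical hash space for {query, S2 cell} pairs
EMBED_DIM = 16         # hypothetical embedding size

query = tf.constant(["San Francisco"])
s2_cell = tf.constant([539058204], dtype=tf.int64)

# Join the city query and the listing's level 12 S2 cell into one key,
# then hash the pair to a single categorical id.
key = tf.string_join([query, tf.as_string(s2_cell)], separator="_")
cat_id = tf.string_to_hash_bucket_fast(key, NUM_BUCKETS)

# The id indexes a trainable embedding learnt by back propagation.
location_embeddings = tf.get_variable("loc_embedding", [NUM_BUCKETS, EMBED_DIM])
location_vector = tf.nn.embedding_lookup(location_embeddings, cat_id)
```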

Figure 13 visualizes the embedding values learnt for the query "San Francisco". This matched our own intuitive understanding of the area: the embedding values not only highlighted the major points of interest in the city, they indicated a preference for locations a little further south into the west bay instead of closer locations across the bridges that are major traffic snarls.

[Figure 13: Location preference learnt for the query "San Francisco"]

5 SYSTEM ENGINEERING

This section is concerned with speeding up training and scoring. A quick summary of our pipeline: search queries from a guest hit a Java server that does retrieval and scoring. The server also produces logs which are stored as serialized Thrift instances. The logs are processed using a Spark pipeline to create training data. Model training is done using TensorFlow. Various tools written in Scala and Java are used for evaluating the models and computing offline metrics. The models are then uploaded to the Java server that does retrieval and scoring. All these components run on AWS instances.

Protobufs and Datasets. The GBDT model was fed training data in CSV format, and we reused much of this pipeline to feed the TensorFlow models using feed_dict. At first glance this seems like a "non-machine learning" issue, and was quite low in our list of priorities. Our wake up call came when we found our GPU utilizations near ~25%. Most of the training time was spent in parsing CSV and copying data through feed_dict. We were effectively towing a Ferrari with a mule. Retooling the pipeline to produce training data as Protobufs and using Dataset [8] gave a 17x speedup to training and drove GPU utilization to ~90%. This ultimately allowed us to attack the problem by scaling the training data from weeks to months.
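The shape of such an input pipeline is roughly as follows: a minimal sketch assuming TFRecord files of serialized tf.Example protos, with a made-up schema and file name:

```python
import tensorflow as tf

def parse_example(serialized):
    # Hypothetical schema: a dense feature vector and a label per example.
    parsed = tf.parse_single_example(serialized, {
        "features": tf.FixedLenFeature([195], tf.float32),
        "label": tf.FixedLenFeature([], tf.float32),
    })
    return parsed["features"], parsed["label"]

# Stream serialized protos to keep the GPU fed, instead of feed_dict over CSV.
dataset = (tf.data.TFRecordDataset("training_data.tfrecord")
           .map(parse_example, num_parallel_calls=4)
           .batch(200)
           .prefetch(1))
features, labels = dataset.make_one_shot_iterator().get_next()
```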
Refactoring static features. A large number of our features were properties of listings that rarely changed. For instance, location, number of bedrooms, a long list of amenities, guest rules etc. Reading all these features as part of each training example created an input bottleneck. To eliminate this disk traffic, we used only the listing id as a categorical feature. All quasi-static listing features were packed as a non-trainable embedding indexed by the listing id. For the features that had a mutation over the training period, this traded off a small level of noise against training speed. Resident in the GPU memory, the embedding eliminated multiple kilobytes of data per training example that used to get loaded from disk via the CPU. This efficiency made it possible to explore a whole new class of models which took into account fine details about tens of listings the user had interacted with in the past.

Java NN library. Towards the beginning of 2017 when we started shipping the TensorFlow models to production, we found no efficient solution to score the models within a Java stack. Typically a back and forth conversion of the data between Java and another language was required, and the latency introduced in the process was a blocker for us. To adhere to the strict latency requirement of search, we created a custom neural network scoring library in Java. While this has served us well till this point, we expect to revisit the issue to see the latest alternatives available.

6 HYPERPARAMETERS

While there were a few hyperparameters like number of trees, regularization, etc in the GBDT world, NNs take it to a whole new level. During initial iterations, we spent considerable time exploring this world, mostly driven by anxiety akin to FOMO (fear-of-missing-out). The effort spent in surveying all the options and experimenting with the combinations didn't produce any meaningful improvements for us. However, the exercise did give us some confidence in our choices, which we describe below.

Dropout. Our initial impression was that dropout is the counterpart of regularization for neural networks [22], and thereby essential. However for our application, the different flavors of dropout we tried all led to a slight degradation in offline metrics. To reconcile our lack of success with dropout, our current interpretation of it is closer to a data augmentation technique [1], effective when the randomness introduced mimics valid scenarios that may be missing in the training data. For our case, the randomness was simply producing invalid scenarios that distracted the model. As an alternative, we added hand crafted noise shapes taking into account the distribution of particular features, resulting in an improvement of ~1% in offline NDCG. But we failed to get any statistically significant improvement in online performance.

Initialization. Out of sheer habit, we started our first model by initializing all weights and embeddings to zero, only to discover that is the worst way to start training a neural network. After surveying different techniques, our current choice is to use Xavier initialization [5] for the network weights and random uniform in the {-1, 1} range for embeddings.
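In TensorFlow terms, that choice looks roughly like the following (a sketch; the variable names and shapes are placeholders):

```python
import tensorflow as tf

# Xavier (Glorot) initialization for network weights [5].
hidden_weights = tf.get_variable(
    "hidden_weights", shape=[195, 127],
    initializer=tf.glorot_uniform_initializer())

# Random uniform in the {-1, 1} range for embeddings.
listing_embedding = tf.get_variable(
    "listing_embedding", shape=[100000, 32],
    initializer=tf.random_uniform_initializer(-1.0, 1.0))
```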

Learning rate. An overwhelming range of strategies confronted us here, but for our application we found it hard to improve upon the performance of Adam [12] with its default settings. Currently we use a variant, LazyAdamOptimizer [7], which we found faster when training with large embeddings.

Batch size. Varying batch size has a dramatic effect on training speed, but its exact effect on the model itself is hard to grasp. The most useful pointer we found was [18]. We however didn't quite follow the advice in the paper. Having swept the learning rate issue under the LazyAdamOptimizer carpet, we just opted for a fixed batch size of 200 which seemed to work for the current models.

7 FEATURE IMPORTANCE

Estimating feature importance and model interpretability in general is an area where we took a step back with the move to NNs. Estimating feature importance is crucial in prioritizing engineering effort and guiding model iterations. The strength of NNs is in figuring out nonlinear interactions between the features. This is also the weakness when it comes to understanding what role a particular feature is playing, as nonlinear interactions make it very difficult to study any feature in isolation. Next we recount some of our attempts in deciphering NNs.

Score Decomposition. A homegrown partial dependence tool similar to [13] was the backbone of feature analysis in the GBDT world. In the NN world, trying to understand individual feature importance only led to confusion. Our first naive attempt was to take the final score produced by the network, and try to decompose it into contributions coming from each input node. After looking at the results, we realized the idea had a conceptual error: there was no clean way to separate the influence of a particular incoming node across a non-linear activation like ReLU.

Ablation Test. This was another simplistic attack on the problem. The idea here was to ablate the features one at a time, retrain the model and observe the difference in performance. We could then assign the features importance proportional to the drop in performance their absence produced. However, the difficulty here was that any performance difference obtained by dropping a single feature resembled the typical noise in offline metrics observed anyway while retraining models. Possibly due to non-trivial redundancy in our feature set, the model seemed capable of making up for a single absent feature from the remaining ones. This leads to a Ship-of-Theseus paradox: can you keep ablating one feature at a time from the model, claiming it has no significant drop in performance?

Permutation Test. We raised the sophistication in our next attempt, taking inspiration from permutation feature importance proposed for random forests [20]. We observed the performance of the model on a test set after randomly permuting the values of a feature across the examples in the test set. Our expectation was that the more important a feature, the higher the resulting degradation from perturbing it. This exercise however led to somewhat nonsensical results, like one of the most important features for predicting booking probability coming out to be the number of rooms in a listing. The reason was that in permuting the features one at a time, we had baked in the assumption that the features were independent of each other, which was simply false. Number of rooms for instance is closely tied to price, number of guests staying, typical amenities etc. Permuting the feature independently created examples that never occurred in real life, and the importance of features in that invalid space sent us in the wrong direction. The test however was somewhat useful in determining features that were not pulling their weight. If randomly permuting a feature didn't affect the model performance at all, it was a good indication that the model was probably not dependent on it.
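The mechanic of the test itself is simple. A sketch (our illustration; model_fn and eval_fn stand in for the trained model and an offline metric such as the NDCG helper sketched earlier):

```python
import numpy as np

def permutation_importance(model_fn, eval_fn, X_test, y_test, feature_idx, seed=0):
    """Degradation in an offline metric after shuffling one feature column.

    model_fn: maps a feature matrix to model scores.
    eval_fn: maps (labels, scores) to a metric such as NDCG.
    """
    baseline = eval_fn(y_test, model_fn(X_test))
    X_permuted = X_test.copy()
    rng = np.random.RandomState(seed)
    # Shuffling preserves the feature's marginal distribution while breaking
    # its tie to each example, and, as noted above, also breaking its
    # correlation with every other feature.
    rng.shuffle(X_permuted[:, feature_idx])
    permuted = eval_fn(y_test, model_fn(X_permuted))
    return baseline - permuted
```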
TopBot Analysis. A homegrown tool designed to interpret the features without perturbing them in any way provided some interesting insights. Named TopBot, short for top-bottom analyzer, it took a test set as input and used the model to rank the listings per test query. It then generated distribution plots of feature values from the listings ranked at the top for each query, and compared them to the distribution of feature values from the listings at the bottom. The comparison indicated how the model was utilizing the features in the different value ranges. Figure 14 shows an example. The distribution of prices for top ranked listings is skewed towards lower values, indicating the sensitivity of the model to price. However, the distribution of reviews looks very similar when comparing top and bottom ranked listings, indicating this version of the model was not utilizing reviews as expected, providing direction for further investigation.

[Figure 14: Comparison of feature distribution for top and bottom ranked listings in test set]

8 RETROSPECTIVE

Figure 15 summarizes our deep learning journey so far. Feeding on the ubiquitous deep learning success stories, we started at the peak of optimism, thinking deep learning would be a drop-in replacement for the GBDT model and give us stupendous gains out of the box. A lot of initial discussions centered around keeping everything else invariant and replacing the current model with a neural network to see what gains we could get. This set us up for a plunge into the valley of despair, when initially none of those gains materialized. In fact, all we saw in the beginning was regression in offline metrics. Over time we realized that moving to deep learning is not a drop-in model replacement at all; rather it's about scaling the system. As a result, it required rethinking the entire system surrounding the model. Confined to smaller scales, models like GBDT are arguably at par in performance and easier to handle, and we continue to use them for focused medium sized problems.

[Figure 15: Anatomy of a journey: peak of optimism, valley of despair, plateau of reality]

So would we recommend deep learning to others? That would be a wholehearted Yes. And it's not only because of the strong gains in the online performance of the model. Part of it has to do with how deep learning has transformed our roadmap ahead. Earlier the focus was largely on feature engineering, but after the move to deep learning, trying to do better math on the features manually has lost its luster. This has freed us up to investigate problems at a higher level, like how can we improve our optimization objective, and are we accurately representing all our users? Two years after taking the first steps towards applying neural networks to search ranking, we feel we are just getting started.

9 ACKNOWLEDGEMENTS

Most of us have managed to bring down the metrics singlehandedly at some point. But lifting the metrics has always been the work of a collective. While naming everyone individually is not practical, we wish to thank those who directly worked towards making deep learning a success at Airbnb - Ajay Somani, Brad Hunter, Yangbo Zhu and Avneesh Saluja.

REFERENCES

[1] Xavier Bouthillier, Kishore Konda, Pascal Vincent, and Roland Memisevic. 2016. Dropout as data augmentation. In arXiv e-prints.
[2] Chris J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report. https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/
[3] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2016. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In Proceedings of International Conference on Learning Representations 2016 (ICLR'16).
[4] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. New York, NY, USA.
[5] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Yee Whye Teh and Mike Titterington (Eds.), Vol. 9. PMLR, Chia Laguna Resort, Sardinia, Italy, 249–256.
[6] Yoav Goldberg. 2015. A Primer on Neural Network Models for Natural Language Processing. CoRR abs/1510.00726 (2015). arXiv:1510.00726 http://arxiv.org/abs/1510.00726
[7] Google. 2018. TensorFlow Documentation: LazyAdamOptimizer. https://www.tensorflow.org/versions/r1.9/api_docs/python/tf/contrib/opt/LazyAdamOptimizer
[8] Google. 2018. TensorFlow Programmer's Guide: Importing Data. https://www.tensorflow.org/programmers_guide/datasets
[9] Mihajlo Grbovic and Haibin Cheng. 2018. Real-time Personalization using Embeddings for Search Ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[10] Shubhra Kanti Karmaker Santu, Parikshit Sondhi, and ChengXiang Zhai. 2017. On Application of Learning to Rank for E-Commerce Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17). ACM, 475–484. https://doi.org/10.1145/3077136.3080838
[11] Andrej Karpathy. 2018. CS231n Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/convolutional-networks/
[12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). arXiv:1412.6980 http://arxiv.org/abs/1412.6980
[13] Scikit-learn Documentation. 2018. Partial Dependence Plots. http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
[14] Steffen Rendle. 2012. Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology 3, 3, Article 57 (May 2012), 22 pages.
[15] Sebastian Ruder. 2017. An Overview of Multi-Task Learning in Deep Neural Networks. CoRR abs/1706.05098 (2017). arXiv:1706.05098 http://arxiv.org/abs/1706.05098
[16] s2geometry.io. 2018. S2 Geometry. Retrieved April 30, 2018 from https://s2geometry.io
[17] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden Technical Debt in Machine Learning Systems. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'15). 2503–2511.
[18] Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. 2017. Don't Decay the Learning Rate, Increase the Batch Size. CoRR abs/1711.00489 (2017). arXiv:1711.00489 http://arxiv.org/abs/1711.00489
[19] Daria Sorokina and Erick Cantu-Paz. 2016. Amazon Search: The Joy of Ranking Products. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). 459–460.
[20] Leo Breiman. 2001. Random Forests. In Machine Learning. 5–32.
[21] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI.
[22] Stefan Wager, Sida Wang, and Percy Liang. 2013. Dropout Training As Adaptive Regularization. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'13). Curran Associates Inc., USA, 351–359.
[23] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD'17 (ADKDD'17). ACM, New York, NY, USA, Article 12, 7 pages.
[24] Peng Ye, Julian Qian, Jieying Chen, Chen-Hung Wu, Yitong Zhou, Spencer De Mars, Frank Yang, and Li Zhang. 2018. Customized Regression Model for Airbnb Dynamic Pricing. In Proceedings of the 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[25] Xing Yi, Liangjie Hong, Erheng Zhong, Nanthan Nan Liu, and Suju Rajan. 2014. Beyond Clicks: Dwell Time for Personalization. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys '14). ACM, New York, NY, USA, 113–120. https://doi.org/10.1145/2645710.2645724
[26] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalization. https://arxiv.org/abs/1611.03530
[27] Martin Zinkevich. 2018. Rules of Machine Learning. Retrieved April 30, 2018 from https://developers.google.com/machine-learning/rules-of-ml/