Improving Ad Click Prediction by Considering Non-displayed Events

Click-through rate (CTR) prediction is the core problem of building advertising systems. Most existing state-of-the-art approaches model CTR prediction as binary classi!cation problems, where displayed events with and without click feedbacks are respectively considered as positive and negative instances for training and of-“ine validation. However, due to the selection mechanism applied in most advertising systems, a selection bias exists between distributions of displayed and non-displayed events. Conventional CTR models ignoring the bias may have inaccurate predictions and cause a loss of the revenue. To alleviate the bias, we need to conduct counterfactual learning by considering not only displayed events but also non-displayed events. In this paper, through a review of existing approaches of counterfactual learning, we point out some difficulties for applying these approaches for CTR prediction in a real-world advertising system. To overcome these difficulties, we propose a novel framework for counterfactual CTR prediction. In experiments, we compare our proposed framework against state-of-the-art conventional CTR models and existing counterfactual learning approaches.


1.See discussions, stats, and author profiles for this publication at: Improving Ad Click Prediction by Considering Non-displayed Events Conference Paper · September 2019 DOI: 10.1145/3357384.3358058 CITATIONS READS 0 81 7 authors, including: Bowen Yuan Meng-Yuan Yang 4 PUBLICATIONS 18 CITATIONS ETH Zurich 4 PUBLICATIONS 18 CITATIONS SEE PROFILE SEE PROFILE Chih-Jen Lin National Taiwan University 143 PUBLICATIONS 53,343 CITATIONS SEE PROFILE All content following this page was uploaded by Bowen Yuan on 10 September 2019. The user has requested enhancement of the downloaded file.

2. Improving Ad Click Prediction by Considering Non-displayed Events Bowen Yuan Jui-Yang Hsia∗ Meng-Yuan Yang† National Taiwan University National Taiwan University ETH Zurich Hong Zhu Chih-Yao Chang Zhenhua Dong Huawei Noah’s ark lab Huawei Noah’s ark lab Huawei Noah’s ark lab Chih-Jen Lin National Taiwan University ABSTRACT ACM Reference Format: Click-through rate (CTR) prediction is the core problem of build- Bowen Yuan, Jui-Yang Hsia, Meng-Yuan Yang, Hong Zhu, Chih-Yao Chang, Zhenhua Dong, and Chih-Jen Lin. 2019. Improving Ad Click Prediction ing advertising systems. Most existing state-of-the-art approaches by Considering Non-displayed Events. In The 28th ACM International Con- model CTR prediction as binary classi�cation problems, where dis- ference on Information and Knowledge Management (CIKM ’19), Novem- played events with and without click feedbacks are respectively ber 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 16 pages. https: considered as positive and negative instances for training and of- // �ine validation. However, due to the selection mechanism applied in most advertising systems, a selection bias exists between distri- 1 INTRODUCTION butions of displayed and non-displayed events. Conventional CTR models ignoring the bias may have inaccurate predictions and cause Online advertising is a major pro�t model for most Internet com- a loss of the revenue. To alleviate the bias, we need to conduct coun- panies where advertisers pay publishers for promoting ads on their terfactual learning by considering not only displayed events but also services. Advertising systems perform in the role of a dispatcher non-displayed events. In this paper, through a review of existing ap- that decides which and where ads are displayed. Though there are proaches of counterfactual learning, we point out some di�culties many compensation methods in advertising systems, in this work, for applying these approaches for CTR prediction in a real-world ad- we focus on cost-per-click (CPC) systems, where advertisers are vertising system. To overcome these di�culties, we propose a novel only charged if their ads are clicked by users. We give a simpli�ed framework for counterfactual CTR prediction. In experiments, we example of CPC systems in Figure 1. compare our proposed framework against state-of-the-art conven- Because the number of positions from publishers is limited, the tional CTR models and existing counterfactual learning approaches. system must select the most valuable ads to display. Therefore, Experimental results show signi�cant improvements. when the system receives a request, all ads must be ranked accord- ing to the following expected revenue CCS CONCEPTS ECPM = P ⇥ R, (1) • Information systems → Computational advertising. where KEYWORDS P = Pr(click | event) (2) CTR prediction, Recommender system, Missing not at random, is the probability of an event associated with a (request, ad) pair Selection bias, Counterfactual learning being clicked, and R is the price paid by advertisers if the event is clicked. Then the system selects and displays the top events on the ∗ This author contributes equally with Bowen Yuan. publishers. In Figure 1, the example shows that �ve events bid for † Work done at Huawei. three advertising positions, so the system ranks them by (1). Event ‘1’,‘2’ and ‘3’ are the top three, and get the chance to be displayed. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed Once the event is displayed, the system will receive signals if for pro�t or commercial advantage and that copies bear this notice and the full citation a customer clicks it, which can be considered as the feedback of on the �rst page. Copyrights for components of this work owned by others than ACM this event. Then the system gains revenue from the advertiser. As must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior speci�c permission and/or a the example shown in Figure 1, among the three displayed events, fee. Request permissions from event ‘2’ is clicked, while the other two are not clicked. CIKM ’19, November 3–7, 2019, Beijing, China To maximize the revenue, the estimation of (2), also referred as © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6976-3/19/11. . . $15.00 click-through rate (CTR) prediction, is the core problem of build- ing advertising systems. To estimate CTR, the most widely used

3. Table 1: Main notation. E1 Event 1 (x 1 ) 7 L numbers of displayed events Displayed L̂ numbers of displayed and non-displayed events E2 Event 2 (x 2 ) 3 x, feature vector and label D numbers of features set of events logged by the uniform policy Event 3 (x 3 ) E3 7 St Sc set of events logged by non-uniform policies ˆ(·), (·) CTR model and imputation model E4 Event 4 (x 4 ) ? m, n numbers of requests and ads set of displayed (request, ad) pairs E5 Event 5 (x 5 ) ? Y , Ŷ , A label matrix, output matrix, and the matrix of imputed labels Figure 1: A simpli�ed example to illustrate the procedure of selecting events and collecting feedbacks in CPC systems. The symbols ‘3’, ‘7’, and ‘?’ indicate the event is clicked, not However, in Section 2.2, by following some past works (e.g., clicked, and not displayed, respectively. [18, 31]), we argue the importance of handling non-displayed events. That is, through considering non-displayed events as unlabeled in- stances, CTR prediction should be modeled as a learning problem approach is to learn from historically displayed events via machine with labeled and unlabeled instances. Further, through an A/B test learning techniques. Speci�cally, assume that we have the following conducted in our online system, we show that due to the incon- L events sistency between distributions of labeled and unlabeled samples, ( 1 , x 1 ), · · · , ( L , x L ), (3) the ignorance of non-displayed events may cause a strong bias and where l and x l are the label and the feature vector of the lth events, inaccurate predictions. To eliminate the bias in CTR prediction, respectively. To model CTR prediction as binary classi�cation prob- not only displayed events but also non-displayed events should be lems, we assume that the label satis�es considered. However, because of lacking labels for non-displayed ( events, we need to solve a learning problem with partial labels, 1 the event was displayed and clicked, = which is also referred to as “batch learning from logged bandit 1 the event was displayed but not clicked. feedbacks” or “counterfactual learning” [31]. Then we need to �nd a decision function ˆ(x) 2 [ 1, 1] so that it In Section 3, through a review on existing approaches of counter- can be used to predict the label. For example, in a typical classi�ca- factual learning, we point out some di�culties of applying them in tion setting the predicted label is a real-world CPC system. For example, many existing approaches ( required that events are sampled to be displayed through the impor- 1 if ˆ(x) 0, tance sampling technique. However, as illustrated in Figure 1, most 1 otherwise. real-world CPC systems deterministically select the top events with For CTR prediction, a probability output may be needed, so we can the highest expected revenues. Therefore, these approaches are in- transform ˆ(x) to a value in [0, 1] by a sigmoid function. To obtain applicable to practical CTR prediction. Additionally, some existing a good decision function, many state-of-the-art models for CTR approaches need to solve a di�cult optimization problem, a situa- prediction problems have been proposed (e.g., logistic regression tion not impractical for industrial-scale CTR prediction. Our main [4, 24, 28], factorization machines [27], �eld-aware factorization contribution in Section 4 is to propose a novel framework for coun- machines [15, 16], and deep learning approaches [8, 26, 37]). A terfactual CTR prediction to overcome these di�culties. To address review of some of these models will be given in Section 2.1. the issue of the training e�ciency, we link the proposed framework Recall the example in Figure 1. Two events are not displayed to recommender systems with implicit feedbacks and develop e�- so their feedbacks are not available. Most works mentioned above cient algorithms. In Section 5, experiments are conducted on a semi- ignore non-displayed events. We think the reason supporting this synthetic small data set and a real-world data set collected from decision is that some pioneering works (e.g., [28, Section 3]) thought our online system. Results show our framework is superior to both causality exists between a display and a click, where the click is conventional and counterfactual learning approaches. Table 1 gives understood as a consequence of the display. Then, they give the main notation in this paper. Supplementary materials and code for following basic but far-reaching assumption: experiments are at A��������� 1.1. The probability of an event being clicked but not displayed is zero. 2 HANDLING UNOBSERVED DATA IN CTR With the assumption, no click information is included for non- PREDICTION displayed events. Therefore, the probability of an event being clicked In this section, we �rst give a brief review on the widely-used can be measured by CTR de�ned as settings for CTR prediction. Next we show that under this setting, a number of clicks bias may exist and cause inconsistency between o�ine and online CTR = ⇥ 100%. evaluations. number of displayed events

4.2.1 A Review on CTR Prediction by Binary Given an event l, a logging policy Classi�cation decides if it can be displayed As mentioned in Section 1, the task of CTR prediction is to �nd a model, which inputs x, outputs ˆ(x) and generates a probability value as the prediction of (2). If we have L displayed events as Displayed Non-displayed training samples shown in (3), most CTR works solve the following optimization problem. ’ L Clicked Non-clicked Non-observable min R(W) + `( l , ˆl ), (4) l =1 l = 1 l =? W l =1 where ˆl ⌘ ˆ(x l ), W is a set of model parameters, `(a, b) is a Logged bandit feedbacks loss function convex in b, and R(W) is the regularization term with the parameter decided by users. Logistic regression (LR) is a Figure 2: A diagram of CTR prediction being formulized as commonly used model for CTR prediction. Assume D is the number a selective labels problem of features in each vector x l , and a vector w 2 R D is considered as the set W of model parameters. LR then sets where Lp is the number of positive instances, and Rankl is the ˆ(x) = w T x rank of a positive event l in all L events, which are ranked in a and considers the following logistic loss descending order according to their predicted values. `(a, b) = log(1 + e ab ). (5) 2.2 The Importance of Handling Past studies (e.g., [4, 16]) have shown that learning the e�ect of Non-displayed Events feature conjunctions is crucial for CTR prediction. However, due to Most CTR models conducted training and validation on the dis- the high dimensionality and the high sparsity of x, LR may have played events. Let di�culty to learn from all feature conjunctions of x. To overcome PL = Pr(click | event, displayed) (10) the di�culty, factorization machine (FM) [27] and its variants such as �eld-aware factorization machine (FFM) [16] are proposed. They be the probability of displayed events being clicked. Models ob- consider a low-rank approximation to the coe�cient matrix of all tained by solving (4) implicitly assume that feature conjunctions. The output of FM is P = PL . (11) ’ D ’ D ˆFM (x) = (w Tj1 w j2 )x j1 x j2 , (6) However, in the subsequent discussion, we show that this as- j 1 =1 j 2 j 1 +1 sumption is not precise. We start by following [18] to formalize CTR prediction as a selective labels problem. As illustrated in Figure where w j 2 are the latent vectors and k is a pre-speci�ed R k , 8j 2, we can view our system as a labeling platform, where customers value. FFM extends FM to assume that features are grouped to help to label L + L U instances. However, due to the limitation of ca- several “�elds.” pacity, an algorithm (also called logging policy) selects L instances ’D ’D and asks customers to label them. From this view, the displayed ˆFFM (x) = (w Tj , f w j2, f j )x j1 x j2 , (7) events are labeled instances (also called logged bandit feedbacks), 1 j2 1 j 1 =1 j 2 j 1 +1 while non-displayed events are unlabeled instances. Because of where f j1 and f j2 are the corresponding �elds of features j 1 and j 2 , lacking labels, the click distribution of non-displayed events, respectively, and w j1, f j 2 R k is a latent vector which learns the la- PU = Pr(click | event, non-displayed), (12) 2 tent e�ect between feature j 1 and the interacted features belonging is unknown rather than zero in Assumption 1.1. It can be known to the �eld f j2 . It has been shown in some past competitions (e.g., that if the logging policy completely uniformly selects events by [14, 16]) and industry scenarios [15] that FFM gives state-of-the-art random, distributions of click behavior can be the same on displayed results on CTR prediction. and non-displayed events. Then we can predict P via estimating PL . For evaluation, two metrics are commonly used [4, 9, 16]. Sup- Unfortunately, this condition can not be held for CTR prediction pose L is the number of events being evaluated. The �rst criterion due to the non-uniform selection of displayed events as introduced is the negative logarithmic loss (NLL): in Section 1. 1’ L To better explain this, we deployed a uniform policy on a small NLL ⌘ log(1 + e l ˆl ). (8) L part of live tra�cs in our online system, while the rest was still l =1 served by non-uniform policies. According to logging policies, the Another one is the area under the ROC curve (AUC): labeled events are grouped into the following two sets: ✓ ◆ ÕL p Lp S t = {( l , x l ) | event l was logged by the uniform policy}, Rankl l =1 2 AUC ⌘ , (9) Sc = {( l , x l ) | event l was logged by non-uniform policies}. (Lp (L Lp ))

5. more practical ways to eliminate the bias carried in the data logged 2 by non-uniform logging policies. Now we aim to consider both displayed and non-displayed events. 1.5 Assume we have the following L̂ events ( 1 , x 1 ), · · · , ( L̂ , x L̂ ), (15) 1 where L̂ with L̂ L is the number of all possible events. To esti- mate P without bias, we expect to solve the following optimization 0.5 problem. ’ L̂ min R(W) + `( l , ˆl ). (16) 0 W 10 0 10 -1 10 -2 10 -3 10 -4 10 -5 l =1 However, because feedbacks of non-displayed events can not be Figure 3: The coordinates ( , ) of each dot indicate an ad’s observed for most CPC systems, the corresponding l for l = L + displayed ratio de�ned in (13) and the log-scaled CTR ratio 1, · · · , L̂ are not available. Therefore, (16) can not be solved by between events from Sc and S t ; see (14). conventional methods based on supervised learning. Solving (16) with selective labels is also referred to as “batch learning from logged bandit feedbacks” and “counterfactual learning” [31]. We In Figure 3, we check how CTR estimated from logs of non-uniform give a brief review of existing approaches on this topic in the next policies di�ers from the true CTR. The x-axis is the displayed ratio section. of an ad, where each point is calculated by # displayed events associated with ad j 3 EXISTING APPROACHES FOR LEARNING j = . # (displayed+non-displayed) events associated with ad j FROM SELECTIVE LABELS (13) The y-axis is the log-scaled ratio of an ad’s observed CTR in Sc to Existing approaches for solving (16) with selective labels broadly that in S t , where each point is calculated by fall into the following three categories: direct method, the inverse- propensity-scoring method, and the doubly robust method. the CTR of ad j observed in Sc j = log( ). (14) the CTR of ad j observed in S t 3.1 Direct Method As we mentioned earlier, for events logged by a uniform policy, As we mentioned above, the key challenge in solving (16) is that we P = P L . Thus an ad’s CTR value from S t can be considered as lack labels of those non-displayed events. Direct methods are based the true P’s average over all events associated with this ad. From on an intuitive idea that we can �nd a relationship (·) between Figure 3, with j > 0, an ad’s CTR from events in Sc is higher than labels and features of displayed events such that (x l ) ⇡ l , for that from events in S t . This observation con�rms a gap between l = 1, · · · , L. We call (·) as an imputation model, which can impute P L and P when events are logged by non-uniform policies. More labels for all events and have the following L̂ labeled samples importantly, the gap varies on di�erent ads. For example, for those ads with lower scores, the gap enlarges, while for those ads with ( 1 , x 1 ), · · · , ( L̂ , x L̂ ), higher scores, the gap narrows. It can be realized that for an ad favoured by the proceeding policies, more associated events can where l ⌘ (x l ). Then we solve the following standard supervised be displayed and fewer events are not labeled. Then the observed learning problem. CTR is closer to the unbiased one obtained from the fully labeled set S t . We can conclude that if we �t CTR models on events logged ’ L̂ min R(W) + `( l , ˆl ). (17) by non-uniform policies, a biased estimator is obtained. It may W l =1 overestimate the expected revenue of ads with lower scores and wrongly display too many of them. Clearly, the performance of a direct method is determined by the The bias caused by using a non-uniform logging policy is often quality of (·). Many past works found (·) via conducting machine referred to as “selection bias” in the literature of econometrics [10], learning techniques on the labeled data set. For example, ridge linear and “missing not at random” in the �eld of statistics [21]. Ideally, a regression [6], and random forests [19] are considered. simple way to avoid the selection bias is to directly train CTR models on S t . However, in most scenarios of CTR prediction, the feature set 3.2 Inverse-Propensity-Scoring (IPS) Method is high dimensional and sparse, so a large training set is required By assuming the logging policy is stochastic, the IPS method pro- for learning a model with good generalization ability. The higher posed in [11] weights each labeled event with the inverse of its CTR shown in Figure 3 by displaying events in Sc indicates that if propensity score, which refers to the tendency or the likelihood we deploy the uniform policy on more tra�cs, the overall revenue of an event being logged. Speci�cally, given the event l, we can may decrease. Therefore, forming a large S t through deploying the observe an indicator Ol 2 {0, 1}, which indicates if the event was uniform policy on a signi�cant amount of tra�c is not practically displayed. Following [30, 32], we assume that Ol is sampled from a viable for many internet companies. In this paper, we will develop Bernoulli distribution as Ol ⇠ Bern(zl ), where zl is the probability

6.of the event being displayed. Here we consider zl as the propensity exist three main obstacles for applying these approaches for CTR scores of event l. The IPS method solves the following problem. prediction in a real-world CPC system. First, from the theoretic guarantee established by past literatures, ’ L̂ `( l , ˆl ) min R(W) + Ol , IPS works when the logging policy is stochastic such that with the W zl increase of L, su�cient events with various propensity scores can be l =1 observed. That is, though the policy tends to select the events with which, according to the observations of Ol , 8l, is equivalent to high propensity scores, those with low propensity scores still can ’ L `( l , ˆl ) have a chance to be displayed and included in the labeled set. For min R(W) + . (18) events with low propensity scores, the formulation in (18) leads to W zl l =1 larger corresponding instance weights, and then IPS can correct the The following proof has been given in past works, which shows selection bias. However, commercial products such as CPC systems that solving (18) can get an unbiased estimator of P. usually avoid a stochastic policy because of the risk of revenue loss. Instead, as we introduced in Section 1, the policy makes decisions ’ L̂ `( l , ˆl ) ’ L̂ `( l , ˆl ) by considering if the expected revenue is large enough. Therefore, EO [ Ol ] = · EO [Ol ] = (16). zl zl the logging policy of our CPC system can be considered as under l =1 l =1 the following deterministic setting. Before solving (18), we need to know zl for l = 1, · · · , L. The work [30] di�erentiates two settings for the assignment mechanism Ol = I(ECPMl > t), (20) of zl . The �rst is an experimental setting, where any event zl is where ECPMl is the expected revenue of the event l, and t is a decided by us and Ol is generated according to zl . Therefore we dynamic threshold. This policy deterministically selects the events can directly solve (18) with known zl . Many past works and data with high propensity scores to be displayed. In this case, the events sets (e.g., [20, 31]) are based on this setting. The second is an obser- with low propensity scores are absent in the labeled set. The de- vational setting, where Ol is not controlled by us, so zl is implicit. terministic logging policy makes theory and practice of the IPS In this case, we must estimate zl �rst by some techniques. For ex- method inapplicable to practical CTR prediction. ample, naive Bayes [30], survival models [36], and more generally, Secondly, for the direct method and the corresponding part in by assigning ( the doubly robust method, we need an imputation model (·) to 1 the event was displayed, generate labels for non-displayed events. However, as we have Ol = 0 the event was not displayed, shown that PU , PL in Section 2.2, many existing works [6, 19] �nd that imputation models learnt from displayed events will propagate we can �nd a relationship o(·) between Ol and the given features the selection bias to the non-displayed events. Then both the direct such that o(x l ) ⇡ Ol , for l = 1, · · · , L̂ by any machine learning and the doubly robust methods cannot give unbiased predictions. techniques (e.g., logistic regression [30], gradient boosting [25]). The third issue is that for estimators obtained by solving op- Then zl ⌘ o(x l ) for l = 1, · · · , L can be used as the propensity timization problems like (17) and (19), they involve O(L̂) costs of scores in (18). time and space. However, L̂ is extremely large considering all dis- played and non-displayed events. Therefore, to have a practical 3.3 Doubly Robust Method framework for large-scale counterfactual CTR prediction, any O(L̂) Doubly robust method for counterfactual learning was �rst pro- cost of storage and computation should be avoided. posed in [6], which takes advantage of the two aforementioned approaches by combining them together. Therefore, the following 4 A PRACTICAL FRAMEWORK FOR optimization problem is solved. COUNTERFACTUAL CTR PREDICTION ’ L ⇣ ⌘ ’ L̂ To address issues mentioned in Section 3.4, we propose a practical `( l , ˆl ) min R(W)+ `( l , ˆl ) + `( l , ˆl ) . (19) framework for counterfactual CTR prediction. Our framework ex- W zl tends the doubly robust method by dropping the propensity scores | {z } | {z } l =1 l =1 to address the issue of using a deterministic logging policy. We call IPS method part direct method part it as the propensity-free doubly robust method. To further improve The analysis in [6] shows that the bias of the doubly robust method the robustness of our framework, we propose a new way to obtain is the product of the two parts from the direct method and the IPS the imputation model by taking the advantage of the small but method. Therefore, the bias of the doubly robust method can be unbiased set S t de�ned in Section 2.2. Finally, by connecting the eliminated if either the direct method part or the IPS method part is CTR prediction with recommender systems, we propose an e�cient unbiased. Empirical studies con�rm that the doubly robust method training algorithm so that the prohibitive O(L̂) cost can be trickily performs better than solely a direct method or an IPS method. avoided if some conditions are satis�ed. 3.4 Discussion on Feasibility of Existing 4.1 Propensity-free Doubly Robust Method Approaches for CPC Systems A recent work [19] describes an analogous scenario that many Though the estimators obtained by above-mentioned methods are commercial statistical machine translation systems only determin- expected to give unbiased predictions of P, we show that there istically provide the best translation result to users and then collect

7. Uniform Unbiased Learned from Imputation impute labels of all displayed and non-displayed events; see the last Policy Set S t St Model (·) term in (21). Note that in our framework, learning the imputation model and Imputing learning the CTR model are two separate tasks. Di�erent from the Labels task of learning the CTR model which is generally complex and requires a large training set, the task of learning the imputation Displayed Non-uniform Sc [ S t Propensity-free CTR model is much more �exible. We can decide the imputation model Policies Doubly Robust Model ˆ(·) and the corresponding feature set depending on the size of S t . That Non-displayed is, when S t is large enough, the task can be as complex as that of learning the CTR model. But when the size of S t is limited, the task All events can be reduced to impute labels with a single value inferred from Figure 4: A diagram showing components of our framework S t (e.g., the average CTR in S t ). Through the experimental results shown in Section 5.5, we will see that unbiased imputation models help our framework to obtain good performances even if S t is very corresponding feedbacks. Then they �nd that in their scenario, the small. doubly robust method still can perform well even under a deter- ministic logging policy. Their analysis provides a reason behind the 4.3 E�cient Training result that the direct method part of doubly robust imputes labels By imputing labels for all possible events in the direct method part, for non-displayed events, which remedies the absence of unlabeled (21) becomes a standard classi�cation problem with L̂ instances. events and improves the generalization ability of the model. This Because L̂ can be extremely large for a real-world CPC system, any result motivates us to investigate if the similar idea can be extended cost of O(L̂) will make the training process infeasible. Therefore, on CTR prediction. Speci�cally, we follow [19] to set zl = 1 for all many existing optimization techniques, if directly applied, are un- displayed events logged by a deterministic policy. Then we modify able to e�ciently solve (21). For example, the stochastic gradient (19) to have the following problem of the propensity-free doubly (SG) method popularly used for CTR prediction [8, 16] may face robust method. di�culties in the process of sampling O(L̂) instances.1 To overcome L ⇣ ’ ⌘ ’ L̂ the di�culties, we connect CPC systems with a related but di�erent min R(W) + `( l , ˆl ) `( l , ˆl ) + `( l , ˆl ), (21) �eld, called recommender systems, where an analogous issue has W l =1 l =1 been well-studied. where is used as a hyper-parameter to balance the IPS part and In a typical recommender system, we observe part of a rating the direct method part. matrix Y 2 Rm⇥n of m users and n items. For each observed (i, j) Recall that the concept of the doubly robust method is that the pair, Yi j indicates the rating of item j given by user i. The goal is to bias can be eliminated if either one of the two sub-estimators is predict the rating of an item not rated by a user yet. Following the unbiased. Compared with (19), we can see that the IPS part in (21) idea, if we consider an event as a (request, ad) pair, an advertising under the setting of zl = 1 is essentially reduced to the naive binary system is naturally a recommender system. As illustrated in Figure classi�cation problem (4), which de�nitely cannot give unbiased 5, the L displayed events can be considered as L observed (request, predictions. Therefore, we rely on the direct method part to get unbi- ad) entries in a rating matrix Y 2 Rm⇥n , where m is the total ased predictions. However, if we follow existing works of the doubly number of requests and n is the total number of ads. According robust method (e.g., [6, 19]) to learn (·) from displayed events, the to whether the ad was clicked by the user or not, these observed carried selection bias will be propagated to non-displayed events entries can be grouped into the following two sets through the imputation model. In this case, the direct method part + = {(i, j) | ad j was displayed and clicked in request i}, can not give unbiased predictions either. In order to avoid this situ- = {(i, j) | ad j was displayed but not clicked in request i}. ation, we design a new approach to obtain the imputation model in the next section. By assigning ratings to observed entries as ( 4.2 Unbiased Imputation Models 1 (i, j) 2 + , Yi j = Recall that in Section 2.2, to help realize the severity of the selection 1 (i, j) 2 , bias, we collect a labeled set S t through deploying a uniform policy and replacing the index l with the index (i, j), we can re-write (21) on a small percentage of our online tra�c. Though S t is very small, as ’ ⇣ ⌘ it still contains some gold-standard unbiased information. We aim min R(W) + `(Yi j , Ŷi j ) `(Ai j , Ŷi j ) to leverage S t in our framework. W (i, j)2 Past studies that have used S t include, for example, [3, 29]. By (22) ’m ’n considering both Sc and S t as training sets, they jointly learn two + `(Ai j , Ŷi j ), models respectively from the two sets in a multi-task fashion and i=1 j=1 conduct predictions by combining the two models. Our approach di�ers from theirs on the usage of S t . As illustrated in Figure 4, we 1 The failure of SG on handling similar scenarios has been reported in several past learn the imputation model (·) from S t and apply the model to works [33, 35].

8. ad 1 ad 2 ··· ··· ad n • The model function ˆ(·) can not be an arbitrary one. Speci�cally, given a feature vector x i, j , ˆ(x i, j ) can be factorized into multiple request 1 3 7 ? ? ? parts, each of which is only related to i or j and convex in the corresponding parameters. For example, a modi�ed FFM model .. ? ? ? 3 7 proposed in [35] satis�es this condition. . Optimization methods have been proposed so that the O(mn) term in the overall cost can be replaced with a smaller O(m) + O(n) term. request m ? 3 ? ? 7 The main reason why O(mn) can be eliminated is because results Figure 5: An illustration of a rating matrix of m requests on all i and all j can be obtained by using values solely related to and n ads, where L events are displayed. The symbols ‘3’, ‘7’, i and j, respectively. A conceptual illustration is in the following and ‘?’ indicate the event is clicked, not clicked, and not dis- equation. played, respectively. ’ m ’ n ’ m ’n (· · · ) = (· · · ) ⇥ or + (· · · ). (27) i=1 j=1 i=1 j=1 where By considering the similarity between (22) and (26), with the = + [ corresponding restrictions in mind, we propose the following �nal realization of the framework (21): is the set of displayed pairs with | | = L, L̂ = mn, Ŷ 2 Rm⇥n is the ’ ⇣ ⌘ output matrix with Ŷi j = ˆ(x i, j ), and A 2 Rm⇥n is the matrix of min R(W) + `(Yi j , Ŷi j ) (Ai j Ŷi j )2 imputed labels with W (i, j)2 Ai j = (x i, j ). (23) ’m ’n (28) + (Ai j Ŷi j )2 . Under the above setting, L̂ = mn indicates the total number of i=1 j=1 possible events. The second term in (22) involves O(mn) = O(L̂) operations, which are prohibitive when m and n are large. The same For (i, j) in , we allow general convex loss functions in the original O(mn) di�culty occurs in a topic called “recommender systems IPS part to maintain the connection to conventional CTR training, with implicit feedbacks” [12, 13, 33, 35]. It describes a scenario while in the direct method part, we follow the above restriction for where a rating matrix Y includes both positive and negative entries. `¯ in (26) to consider the squared loss. We then compare the two However, only a subset + including some positive entries have O(mn) terms in (26) and (28). By respectively excluding C̄i j and been observed. An approach to handle this scenario is to treat all , ours considers a more complicated setting of A in (23), where unobserved pairs as a negative rating r , such as 0 or -1 in [34, 35]. all mn entries can be various rather than only a constant value in Let A 2 Rm⇥n be an arti�cial rating matrix with (26). There remains a cost of O(mn) caused by A in (28). Inspired by ( the restriction of ˆ(·) that is used to have Ŷi j = ˆ(x i, j ), to extend 1 (i, j) 2 + , existing optimization methods for (26) on our proposed framework, Ai j = (24) r (i, j) < + . an additional condition should be considered: • The imputation model function (·) can not be an arbitrary one. To �nd a model �tting all Ai j values, the cost of O(mn) arises in Speci�cally, given a feature vector x i, j , (x i, j ) can be factorized solving the following problem with mn instances. into multiple parts, each of which is only related to i or j but ’ m ’ n not required to be convex in parameters. Thus the choice of (·) min R(W) + Ci j `(Ai j , Ŷi j ). (25) is broader than that of ˆ(·). For example, the multi-view deep W i=1 j=1 learning approach proposed in [7] satis�es this condition. where Ci j is a cost parameter. To avoid the O(mn) cost, existing In our realization, we implement various imputation models from studies [1, 33–35] impose some resctrictions on terms over values simple to complex, all of which satisfy the above condition; see not in + and change (25) to details in Section 5.2.2. ’ ’ min R(W) + Ci j `(1, Ŷi j ) + ¯ , Ŷi j ) C̄i j `(r To use (28), we consider the FFM model for ˆ(x i, j ) and develop W (i, j)2 (i, j)< an optimization method. Details are in Section A of supplementary ’ ⇣ ⌘ + + materials. = R(W) + Ci j `(1, Ŷi j ) ¯ Ŷi j ) C̄i j `(r, (i, j)2 + 5 EXPERIMENTS ’m ’ n We begin by presenting our experimental settings including data ¯ , Ŷi j ), C̄i j `(r (26) + sets and parameter selections. Then, we design and conduct a series i=1 j=1 of experiments to demonstrate the e�ectiveness and robustness of where the following conditions should apply. our proposed framework in Section 4. • The cost parameters C̄i j in (26) cannot be arbitrary values. In practice, a constant value is often considered. 5.1 Data Sets ¯ ·) is needed. In most existing works • A simple loss function `(·, We consider the following data sets for experiments, where statistics the squared loss is considered. are described in Table 2.

9. 5.2.1 Conventional Approaches for CTR Prediction. This category Non-uniform includes approaches that solve the binary classi�cation problem in Sc Sc0 (4) with the following CTR model. policies: • FFM: This is the CTR model reviewed in Section 2.1. We consider Uniform the slightly modi�ed formulation proposed in [35]. S t S va S te policy: For the training set, we consider the following three combinations 1 2 3—9 days of Sc and S t . Figure 6: An illustration of data segmentation according to • Sc : This is a large set including events logged by non-uniform training, validation, and test periods. policies; see Section 2.2. This simulates the setting of most sys- tems for CTR prediction. Table 2: Data statistics • S t : This is a small set including events logged by a uniform policy; see Section 2.2. Because the set is free from the selection bias, Data set m n D |Sc | |S t | |S va | |S te | some works (e.g., [22]) deploy a model trained on S t , though an Yahoo! R3 14,877 1,000 15,877 176,816 2,631 2,775 48,594 issue is that |S t | is small. CPC 21.7M 140 1.4M 42.8M 0.35M 0.31M 2.0M • Sc [ S t : This is the set of all logged events. 5.2.2 Counterfactual Learning Approaches. This category includes • CPC: This is a large-scale data set for CTR prediction, which the following approaches, where all of them consider FFM as the includes events of nine days from a real-world CPC system. We CTR model and Sc [ S t as the training set. consider a two-slot scenario so that for each request, the two • IPS: The approach was reviewed in Section 3.2. For the estimation ads with the highest expected revenues from 140 candidates of propensity scores, we follow [30] to consider the following are displayed. To have training, validation and test sets under Naive Bayes propensity estimator. uniform and non-uniform policies, as illustrated in Figure 6, we segment the data to six subsets. The �ve subsets used in our P( , O = 1) zl ⇡ P(Ol = 1 | l = ) ⇡ , (29) experiments are described below. P( ) – S t , Sc de�ned in Section 2.2, are used separately or together as the training set depending the approach. where = { 1, 1} is the label, P( , O = 1) is the ratio of events – S va is the set obtained in the validation period under the uni- labeled as in the displayed set, and P( ) is the ratio of the events form logging policy. As an unbiased set, it is used by all meth- labeled as in a selection-bias free set. We estimated them by ods for the parameter selection. respectively using Sc [ S t and S t . – S te is the set obtained in the test period under the uniform • CausE: The “causal embedding” approach [3] jointly learns two logging policy. As an unbiased set, it is used to check the test CTR models ˆc (·) and ˆt (·) respectively from Sc and S t . They performance of all methods. solve the following optimization problem. – Sc0 is the set of events logged by non-uniform policies in the 1 ’ 1 ’ validation period. For methods that have involved Sc for train- min `( , ˆc (x)) + `( , ˆt (x)) Wc , Wt |Sc | |S t | ing in the validation process, after parameters are selected, Sc0 ( ,x )2S c ( ,x )2S t is included for training the �nal model to predict S te . + c R(Wc ) + t R(Wt ) + kt c k kWc Wt k22 , For the feature set, we collect user pro�les, contextual informa- tion (e.g., time, location) of each request and side information of where Wc and Wt are are parameters of ˆc (·) and ˆt (·), respec- each advertisement. All are categorical features. tively, and the last term controls the discrepancy between Wc • Yahoo! R3 [23]: The data set, from a music recommendation ser- and Wt . The original study in [3] only supports matrix factor- vice, includes a training and a test set of rated (user, song) pairs in ization models. We make an extension to support more complex a rating matrix R. For the training set, users deliberately selected models (e.g., FFM). For the prediction, we follow [3] to use only and rated the songs by preference, which can be considered as ˆc (·). a stochastic logging policy by following [30, 32]. For the test • New: This approach is a realization of our proposed framework in set, users were asked to rate 10 songs randomly selected by the (28), where we use the logistic loss de�ned in (5) as `(·, ·). We use system. Because the logging policy of the test set is completely a modi�ed FFM model in [35] for the CTR model ˆ(·). It satis�es random, the ratings contained in the test set can be considered the conditions needed in Section 4.3. For the imputation model as the ground truth without the selection bias. In Section B.1 of (·), we consider three settings. supplementary material, we give details of modifying the prob- – avg: This is a simple setting is to impute labels with the lem to simulate a deterministic logging policy, which is what our average CTR in S t . We use the following logit function to framework intends to handle. transform a probability to a model output in [ 1, 1]. CTR 5.2 Considered Approaches (x i, j ) = log( ), 8i, j 1 CTR The considered approaches in our comparison fall into two main categories. where CTR is the average CTR from events in S t .

10.Table 3: A performance comparison of di�erent approaches. Table 4: The performance when S t becomes smaller sets. |S t | We show the relative improvements over the average (Sc ) on indicates the size of S t as the percentage of total data. We test NLL and AUC scores. The best approach is bold-faced. present the relative improvements over the average (Sc ) on Metric NLL AUC NLL AUC test NLL and AUC scores. The CPC set is considered and the Data set best approach is bold-faced. Approach Yahoo! R3 CPC Metric NLL AUC NLL AUC average (Sc ) +0.0% +0.0% +0.0% +0.0% |S t | average (S t ) +79.1% +0.0% +29.6% +0.0% Approach 0.1% 0.01% FFM (Sc ) -7.7% +36.4% -16.9% +32.2% average (S t ) +29.2% +0.0% +28.7% +0.0% FFM (S t ) -20.7% +4.6% +32.4% +42.8% FFM (S t ) +28.6% +18.6% +26.5% +1.2% FFM (Sc [ S t ) +0.2% +36.4% +10.6% +34.6% FFM (Sc [ S t ) -12.5% +29.6% -16.9% +32.0% IPS +62.2% +23.2% +28.5% +31.2% IPS +26.1% +26.8% +18.7% +23.6% CausE +6.8% +37.8% +25.3% +39.4% CausE +25.6% +37.4% +22.4% +31.6% New (avg) +79.1% +51.8% +32.9% +48.6% New (avg) +33.5% +48.4% +30.4% +47.2% New (item-avg) +76.8% +54.2% +32.4% +48.2% New (item-avg) +31.5% +46.6% +26.7% +46.0% New (complex) -0.2% +37.4% +33.4% +50.6% New (complex) +32.2% +44.6% +32.5% +43.0% – item-avg: This is an item-wise extension of avg as events selected by non-uniform policies may cause a strong se- i, j CTRj lection bias. However, the result of FFM (S t ) on Yahoo! R3 set (x ) = log( ), 8i 1 CTR j is not ideal. The reason is that the feature set of Yahoo! R3 is too sparse by having only two identi�er features2 and S t is too where CTRj is the average CTR of ad j from events in S t . small to learn an FFM model. – complex: This is a complex setting to learn (·) from S t by • IPS has good NLL scores, but its AUC scores are not competitive. solving (4) with the modi�ed FFM model in [35]. In (29) to estimate propensity scores, the set S t is used. This We further include two constant predictors as baselines. may explain why the obtained NLL is similar to that by average • average (Sc ): This is a constant predictor de�ned as (S t ). On the other hand, the unsatisfactory AUC results may be CTR because the Naive Bayes estimator used in (29) is too simple. ˆ(x i, j ) = log( ), 8i, j • For CausE, it performs better than FFM (Sc ) and FFM (Sc [ S t ) 1 CTR but worse than FFM (S t ) on the CPC set. The reason might be where CTR is the average CTR from events in Sc . that models learnt from S t are better than those learnt from Sc , • average (S t ): It is the same as average (Sc ), except that the average but we follow [3] to use only ˆc (·) for prediction. CTR from events in S t is considered. • Our proposed framework performs signi�cantly better than oth- For evaluation criteria, we consider NLL and AUC respectively ers, though an exception is New (complex) on the Yahoo! R3 set. de�ned in (8) and (9) and report relative improvements over the The reason is analogous to the explanation of the bad result of method average (Sc ) as FFM (S t ) on Yahoo! R3 set where S t is too small. In contrast, criterionappr oach criterionaverage (Sc ) for the CPC set, because S t is large enough to learn a complex , imputation model, New (complex) performs the best. |criterionaverage (Sc ) | where criterion is either NLL or AUC, and approach is the approach 5.4 Impact of Unbiased Imputation Models being measured. Because the performance of average (Sc ) is un- Learnt from S t changed in all experiments, results presented in di�erent tables can In Section 4.2, we proposed unbiased imputation models because be comparable. Details of our parameter selection procedure are in past works [6, 19] �nd that learning from biased data sets will propa- supplementary materials. gate the carried selection bias to non-displayed events and worsens 5.3 Comparison on Various Approaches the performance. We conduct an experiment to verify this conjec- ture and the importance of using an unbiased imputation model. By comparing models in Table 3, we have the following observa- Detailed results are in Section B.3 of supplementary materials. tions. • For two constant predictors, NLL scores of average (S t ) are much 5.5 Impact of the Size of S t better than those of average (Sc ) estimated from the biased Sc set. To examine the robustness of the approaches relying on S t to learn The reason is that the distribution of S t is consistent with that an unbiased imputation model, we shrink S t to see how they per- of the test set. However, because average (S t ) is still a constant form on the CPC set. For experiments in Section 5.3, S t includes predictor, its AUC scores are the same as those of average (Sc ). 1% of the data. We now select two subsets respectively having 0.1% • For FFM (S t ), it performs much better than FFM (Sc ) and FFM (Sc [ S t ) on CPC set. This result con�rms our conjecture in Sec- tion 2.2 that conventional approaches considering only displayed 2 See details of the Yahoo! R3 set in supplementary materials

11.and 0.01% of total data. These subsets are then used as S t for exper- [9] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine iments, though each model’s parameters must be re-selected. The Atallah, Ralf Herbrich, Stuart Bowers, and Joaquin Quiñonero Candela. 2014. Practical Lessons from Predicting Clicks on Ads at Facebook. In ADKDD. set Sc remains unchanged. [10] James J. Heckman. 1979. Sample selection bias as a speci�cation error. Econo- The result is in Table 4. By comparing the third column of Table 3, metrica 47, 1 (1979), 153–161. [11] Daniel G. Horvitz and Donovan J. Thompson. 1952. A Generalization of Sampling which shows the result of using 1% data as S t , we observe that with Without Replacement From a Finite Universe. J. Amer. Statist. Assoc. 47 (1952), the shrinkage of S t , the performance of all approaches gradually 663–685. becomes worse. However, our proposed framework still performs [12] Cho-Jui Hsieh, Nagarajan Natarajan, and Inderjit Dhillon. 2015. PU Learning for Matrix Completion. In ICML. signi�cantly better than others. In particular, for other approaches [13] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative �ltering for AUC values are strongly a�ected by the size of S t , but ours remain implicit feedback datasets. In ICDM. similar. Among the three settings of our framework, the simple [14] Machael Jahrer, Andreas Töscher, Jeong-Yoon Lee, Jingjing Deng, Hang Zhang, and Jacob Spoelstra. 2012. Ensemble of Collaborative Filtering and Feature setting New (avg) of using average CTR in S t as the imputation Engineered Model for Click Through Rate Prediction. In KDD Cup 2012 Workshop. model is competitive. This might be because for small S t , training [15] Yuchin Juan, Damien Lefortier, and Olivier Chapelle. 2017. Field-aware factoriza- tion machines in a real-world online advertising system. In WWW. a complicated imputation model is not easy. [16] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware Factorization Machines for CTR Prediction. In RecSys. [17] Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix Factorization 6 CONCLUSION Techniques for Recommender Systems. Computer 42 (2009). In this work, through an A/B test on our system, we show that [18] Himabindu Lakkaraju, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Send- hil Mullainathan. 2017. The selective labels problem: Evaluating algorithmic conventional approaches for CTR prediction considering only dis- predictions in the presence of unobservables. In KDD. played events may cause a strong selection bias and inaccurate [19] Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. Counterfactual predictions. To alleviate the bias, we need to conduct counterfac- Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation. In EMNLP. tual learning by considering non-displayed events. We point out [20] Damien Lefortier, Adith Swaminathan, Xiaotao Gu, Thorsten Joachims, and some di�culties for applying counterfactual learning techniques Maarten de Rijke. 2016. Large-scale validation of counterfactual learning methods: A test-bed. In NIPS Workshop on Inference and Learning of Hypothetical and in real-world advertising systems. Counterfactual Interventions in Complex Systems. [21] Roderick J. A. Little and Donald B. Rubin. 2002. Statistical Analysis with Missing (1) For many advertising systems, the thresholding operation (20) Data (second ed.). Wiley-Interscience. is applied as a deterministic logging policy, but existing ap- [22] David C. Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C. Ma, proaches may require stochastic policies. Zhigang Zhong, Jenny Liu, and Yushi Jing. 2017. Related Pins at Pinterest: The Evolution of a Real-World Recommender System. In WWW. (2) Learning imputation models from displayed events causes the [23] Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative Prediction and selection bias being propagated to non-displayed events. Ranking with Non-random Missing Data. In RecSys. (3) The high cost of handling all displayed and non-displayed [24] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat events causes some approaches impractical for large systems. Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. 2013. Ad Click Prediction: a View from the Trenches. In To overcome these di�culties, we propose a novel framework for KDD. counterfactual CTR prediction. We introduce a propensity-free dou- [25] Yusuke Narita, Shota Yasui, and Kohei Yata. 2019. E�cient Counterfactual bly robust method. Then we propose a setting to obtain unbiased Learning from Bandit Feedback. In AAAI. [26] Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng imputation models in our framework. For e�cient training we Guo, Yong Yu, and Xiuqiang He. 2018. Product-Based Neural Networks for User link the proposed framework to recommender systems with im- Response Prediction over Multi-Field Categorical Data. ACM TOIS 37 (2018), 5:1–5:35. plicit feedbacks and adapt algorithms developed there. Experiments [27] Ste�en Rendle. 2010. Factorization machines. In ICDM. conducted on real-world data sets show our proposed framework [28] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting outperforms not only conventional approaches for CTR prediction clicks: estimating the click-through rate for new ADs. In WWW. [29] Nir Rosenfeld, Yishay Mansour, and Elad Yom-Tov. 2017. Predicting counterfac- but also some existing counterfactual learning methods. tuals from large historical data and small randomized trials. In WWW. [30] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations As Treatments: Debiasing Learning REFERENCES and Evaluation. In ICML. [1] Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Ste�en Rendle. 2017. A [31] Adith Swaminathan and Thorsten Joachims. 2015. Batch learning from logged generic coordinate descent framework for learning from implicit feedback. In bandit feedback through counterfactual risk minimization. JMLR 16 (2015), WWW. 1731–1755. [2] Mathieu Blondel, Masakazu Ishihata, Akinori Fujino, and Naonori Ueda. 2016. [32] Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Debo- Polynomial Networks and Factorization Machines: new Insights and E�cient rah Estrin. 2018. Unbiased o�ine recommender evaluation for missing-not-at- Training Algorithms. In ICML. random implicit feedback. In RecSys. [3] Stephen Bonner and Flavian Vasile. 2018. Causal Embeddings for Recommenda- [33] Hsiang-Fu Yu, Mikhail Bilenko, and Chih-Jen Lin. 2017. Selection of Negative tion. In RecSys. Samples for One-class Matrix Factorization. In SDM. [4] Olivier Chapelle, Eren Manavoglu, and Romer Rosales. 2015. Simple and scalable [34] Hsiang-Fu Yu, Hsin-Yuan Huang, Inderjit S. Dihillon, and Chih-Jen Lin. 2017. response prediction for display advertising. ACM TIST 5, 4 (2015), 61:1–61:34. A Uni�ed Algorithm for One-class Structured Matrix Factorization with Side [5] Wei-Sheng Chin, Bo-Wen Yuan, Meng-Yuan Yang, and Chih-Jen Lin. 2018. An Information. In AAAI. E�cient Alternating Newton Method for Learning Factorization Machines. ACM [35] Bowen Yuan, Meng-Yuan Yang, Jui-Yang Hsia, Hong Zhu, Zhirong Liu, Zhenhua TIST 9 (2018), 72:1–72:31. Dong, and Chih-Jen Lin. 2019. One-class Field-aware Factorization Machines for [6] Miroslav Dudík, John Langford, and Lihong Li. 2011. Doubly robust policy Recommender Systems with Implicit Feedbacks. Technical Report. National Taiwan evaluation and learning. In ICML. University. [7] Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A multi-view deep [36] Weinan Zhang, Tianxiong Zhou, Jun Wang, and Jian Xu. 2016. Bid-aware gradient learning approach for cross domain user modeling in recommendation systems. descent for unbiased learning with censored data in display advertising. In KDD. In WWW. [37] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui [8] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through DeepFM: a factorization-machine based neural network for CTR prediction. In rate prediction. In KDD. IJCAI.

12. Supplementary Materials for “Improving Ad Click Prediction by Considering Non-displayed Events” A DETAILS OF EFFICIENT TRAINING where Fu is the number of user �elds and Fv is number of item As we mentioned in Section 4.3, to solve (28) without O(mn) costs, �elds. The corresponding output function is the CTR model and the imputation model should satisfy some Fu’ +Fv Fu’ +Fv f i, j f i, j conditions. In this section, we take a modi�ed FFM model proposed ˆ(x i, j ) = (Wf 2 T x f )T (H f 1 T x f ) 1 1 2 2 in [35] as an example to present our e�cient training algorithm. f 1 =1 f 2 =f 1 The modi�ed FFM model, which is a multi-block convex function ’ F ’ F i, j i, j [2], is considered as both CTR and imputation models. = pf , f T qf , f , (A.4) 1 2 2 1 We note that [35] considers a slightly modi�ed FFM model to f 1 =1 f 2 =f 1 make each sub-problem convex, where the idea is extended from where related works in [2, 5]. If we let i, j f i, j f p f , f = Wf 2 T x f1 , q f , f = H f 1 T x f2 . (A.5) 1 2 1 2 1 2 2x 1 3 With the reformulation, our proposed (28) has a multi-block ob- 6 7 6 . 7 jective function. At each cycle of the block CD method we sequen- x = 6 .. 7 6 7 tially solve one of F (F + 1) convex sub-problems by the following 6x 7 4 F5 loop. 1: for f 1 {1 · · · F } do be considered as a concatenation of F sub-vectors, where x f 2 R D f 2: for f 2 { f 1 · · · F } do includes D f features which belong to the �eld f , then the modi�ed f 3: Solve a sub-problem of Wf 2 by �xing others function is 1 f 4: Solve a sub-problem of H f 1 by �xing others F ’ ’ F 2 f f 5: end for ˆ(x) = (Wf 2 T x f1 )T (H f 1 T x f2 ) 1 2 6: end for f 1 =1 f 2 =f 1 (A.1) f If we consider to update the block Wf 2 corresponding to �elds ’ F ’ F 1 = p f 1, f 2 T q f 2, f 1 , f 1 and f 2 , then the convex sub-problem is to minimize f 1 =1 f 2 =f 1 ’ (Ai j Ŷi j )2 ) f R(Wf 2 ) + (`(Ai j , Ŷi j ) 2 1 where (i, j)2 ’m ’ n (A.6) Ŷi j )2 , f Wf 2 = [w 1, f2 , · · · , w D f ,f2 ]T 2 R D f1 ⇥k , + (Ai j 1 1 i=1 j=1 ( f1 where l2 regularization is considered. That is, R(W ) = kW kF2 . In Wf f1 , f2 , f Hf 1 = 2 [35], the sub-problem is a special case of (A.6), where the objective 2 [h 1, f1 , · · · , h D f , f1 ]T f1 = f2 , function is quadratic because of the squared loss function. Thus 2 solving the sub-problem in [35] is the same as solving a linear sys- f f tem by taking the �rst derivative to be zero. In contrast, ours in (A.6) p f1, f2 = Wf 2 T x f1 and q f2, f1 = H f 1 T x f2 . (A.2) is a general convex function, so some di�erentiable optimization 1 2 techniques such as gradient descent, Quasi-Newton or Newton are We do not show that (A.1) is a minor modi�cation from (7); for needed. All these methods must calculate the gradient (i.e., the �rst details, please see [35]. derivative) or the Hessian-vector product (where Hessian is the In recommendation, each x is usually a combination of user second-order derivative), but the O(mn) cost is the main concern. feature u and item feature v. For x l corresponding to user i and In next content, by following [34, 35], we develop an e�cient item j, x l can be written as Newton method for solving (A.6), where any O(mn) cost can be  i avoided. u x i, j = . vj A.1 Newton Methods for Solving Sub-problems f i, j We discuss details in solving the sub-problem of Wf 2 in (A.6). Fol- Under any given �eld f , the vector x f consists of two sub-vectors 1 lowing [35], for easy analysis, we write (A.6) to a vector form respectively corresponding to request i and ad j. f (w̃), ( i uf f  Fu , f i, j xf = (A.3) where w̃ = vec(Wf 2 ) and vec(·) stacks the columns of a matrix. We j 1 vf f > Fu . consider Newton methods to solve the sub-problem.

13. An iterative procedure is conducted in a Newton method. Each A.2.1 The Computation of the Right-hand Side of the Linear System. Newton iteration minimizes a second-order approximation of the To e�ciently calculate the gradient, [35] takes a crucial property function to obtain a updating direction s of recommender systems into account. For each request i and ad j pairs, there are F (F + 1) numbers of 1 min rf (w̃)T s + sT r2 f (w̃)s. (A.7) feature embedding vectors according to di�erent �eld combinations s 2 of f 1 and f 2 . From (A.2) and (A.3), Because f (w̃) is convex, the direction s can be obtained by solving 8 the following linear system > <p i > f T = Wf 2 u if f 1  Fu , i, j p f , f ! f 1, f 2 1 1 1 2 >p j > f2 T j r2 f (w̃)s = rf (w̃). (A.8) : f 1, f 2 = Wf1 v f1 f 1 > Fu , We follow [35] to solve (A.7) by a iterative procedure called the and 8 > f T conjugate gradient method (CG). The main reason of using CG is i, j <qi > = H f 1 u if f 1  Fu , that r2 f (w̃) is too large to be stored, and each CG step requires q f ,f ! f2, f1 2 2 1 2 >q j > f1 T j mainly the product between r2 f (w̃) and a vector v : f 2, f 1 = H f2 v f2 f 1 > Fu . f When updating vec(Wf 2 ), the value Ŷi j can be written as r2 f (w̃)v, (A.9) 1 i, j f i, j f i, j i, j which, with the problem structure, may be conducted without ex- x f T Wf 2 q f , f + const. = vec(Wf 2 )T (q f , f ⌦ x f ) + const. 1 1 2 1 1 2 1 1 plicitly forming r2 f (w̃). i, j i, j = w̃ T (q f , f ⌦ x f ) + const. A di�erence from [35] is that the sub-problem in [35] is a qua- 2 1 1 dratic function because of the squared loss. Thus, an optimization i, j i, j T = w̃ T vec(x f q f , f ) + const. procedure like the Newton method is not needed. Instead, only one 1 2 1 (A.13) linear system is solved by CG. and its �rst derivative (Ŷi j ) 0 with respect to w̃ is To ensure the convergence of the Newton method, after obtain- ing an updating direction s, a line search procedure is conducted i, j i, j i, j T x̃ f = vec(x f q f , f ). to �nd a suitable step size . Then we update the w̃ by 1 1 2 1 Then rL+ (w̃) and rL (w̃) can be computed as w̃ w̃ + s. ’ rL+ (w̃) = (` 0 (Yi j , Ŷi j ) (Ŷi j Ai j ))(Ŷi j ) 0 We follow the standard backtracking line search to check a sequence (i, j)2 1, , 2 , · · · and choose the largest such that the su�cient decrease ’ i, j i, j T of the function value is obtained = vec((` 0 (Yi j , Ŷi j ) (Ŷi j Ai j ))x f q f , f ) 1 2 1 (i, j)2 f (w̃ + s) f (w̃)  rf T (w̃)s. (A.10) (A.14) where , 2 (0, 1) are pre-speci�ed constants. where indicates the derivative with respect to Ŷ , and ` 0 (Y , Ŷ ) ’m ’ n A.2 Algorithm Details rL (w̃) = vec( (Ŷi j Ai j )(Ŷi j ) 0 i=1 j=1 To discuss techniques for addressing the issue of O(mn) complexity, ’ m ’ n let i, j i, j T = vec( (Ŷi j Ai j )x f q f , f ). (A.15) f 1 2 1 w̃ = vec(Wf 2 ) i=1 j=1 1 and re-write (A.6) as From (A.4), ’ F ’ F i, j i, j Ŷi j = p , Tq , . f (w̃) = kw̃ k22 + + L (w̃) + L (w̃), (A.11) 2 =1 = where The rL+ (w̃) evaluation requires the summation of O(| |) terms. ’ 1 With | | ⌧ mn, rL+ (w̃) can be easily calculated but the bottleneck L+ (w̃) = `(Yi j , Ŷi j ) (Ai j Ŷi j )2 (A.12) is on rL (w̃), which sum up O(mn) terms. 2 (i, j)2 In order to deal with the O(mn) cost, in [35], an e�cient approach ’m ’n 1 has been provided to compute L (w̃) = (Ai j Ŷi j )2 . ’m ’ n 2 i, j i, j T i=1 j=1 rL (w̃) = vec( (Ŷi j r )x f q f , f ). (A.16) 1 2 1 i=1 j=1 Then the gradient and Hessian-vector product can be respectively re-written as where all Ai j are treated as a constant r . We can leverage their technique by breaking (A.15) into two parts r f˜ (w̃) = w̃ + rL+ (w̃) + rL (w̃), ’m ’ n i, j i, j T vec( Ŷi j x f q f ,f ) (A.17) r2 f˜ (w̃)v = v + r2 L+ (w̃)v + r2 L (w̃)v. i=1 j=1 1 2 1

14.and ’ Fu ’ Fu ’ F ’ F ’ m ’ j j n ai0 = p 0i , T q 0i , , and b j0 = p0 , T q0 , . i, j i, j T vec( Ai j x f q f , f ). (A.18) =1 = =Fu +1 = 1 2 1 i=1 j=1 The overall summation of (A.14) and (A.15) can be written as It is clear that (A.17) is actually a speci�c case in [35] with r = 0. f Furthermore, since a FFM imputation model is used, Ŷi j and Ai j UfT diag(z̄)Q f 1 , 1 2 have an identical structure. where ’ ’F ’ F z̄i = zi zi0 + (` 0 (Yi j , Ŷi j ) (Ŷi j Ai j )) i, j i, j Ai j = p0 , T q0 , (A.19) j2 i =1 = and i = {j | (i, j) 2 }. where 8 • For the second case, let > f T <p 0if ,f = W 0 f2 u if > f 1  Fu , h iT 0i, j Vf1 = v 1f , · · · , v nf 1 2 1 1 p f ,f ! 1 2 > >p 0 j T 0f2 v j 1 1 . : f1,f2 = W f1 f1 f 1 > Fu , Then (A.17) can be computed as and 8 > f1 T f <q 0if , f = H 0 f u if > i, j f 1  Fu , VfT diag(z)Q f 1 , 1 2 q 0 f ,f ! 2 1 2 2 > >q 0 j 1 2 f 0 1 T j where : f 2 , f 1 = H f 2 v f 2 f 1 > Fu . ’ m We have that W 0 and H 0 are the imputation model’s embedding zj = ( ai + mb j + e j ), matrixes. Therefore, we can apply same computation method on i=1 (A.17) and (A.18). and ’ Fu ’ F The computation of (A.17) depends on the following three cases e= (Q (P T 1m⇥1 )) 2 R n , 8 > > < 1 2 > f , f  Fu , =1 =Fu +1 Similarly, (A.18) can be computed as > f 1 , f 2 > Fu , > > f 1  Fu , f 2 > Fu . f : VfT diag(z 0 )Q 0 f1 , 1 2 Let where h iT h iT ’ m P f 2 = p 1f , f , · · · , pm 1 f f1 n f 1, f 2 , Q f2 = q f 2, f 1 , · · · , q f 2, f 1 , z j0 = ( ai0 + mb j0 + e j0 ), 1 1 2 h iT h iT i=1 P 0 f2 = p 0 1f1, f2 , · · · , p 0m f , Q 0 f1 = q 0 1f2, f1 , · · · , q 0nf2,f1 , f and 1 f 1, f 2 2 ’ Fu ’ F h iT e0 = (Q 0 (P 0 T 1m⇥1 )) 2 R n , Uf1 = u 1f , · · · , um f . =1 =Fu +1 1 1 • For the �rst case, (A.17) can be computed by The overall summation of (A.15) and (A.14) can be written as f f UfT diag(z)Q f 1 , VfT diag(z̄ 0 )Q f 1 , 1 2 1 2 where where ’ n ’ zi = (nai + b j + di ), z̄ j = z j z j0 + (` 0 (Yi j , Ŷi j ) (Ŷi j Ai j )) j=1 i2¯j ’ Fu ’ F and ¯ j = {i | (i, j) 2 }. d= (P (Q T 1n⇥1 )) 2 Rm , • Considering the third case, (A.17) can be computed as =1 =Fu +1 ((UfT a)qTo + (UfT 1m⇥1 )qTb + UfT T ), ’ Fu ’ Fu ’ F ’ F 1 1 1 ai = p i , T qi , , and b j = j j p , Tq , . where =1 = =Fu +1 = ’ Fu ’ F f T = (P (Q T Q f 1 )), Similarly, (A.18) can be computed by =1 =Fu +1 2 f UfT diag(z 0 )Q f 1 , f q o = Q f 1 T 1n⇥1 , and q b = Q f 1 T b. f 1 2 2 2 where Similarly, the (A.18) can be computed as ’ n zi0 = (nai0 + b j0 + di0 ), ((UfT a 0 )q oT + (UfT 1m⇥1 )q b0 T + UfT T 0 ), 1 1 1 j=1 where ’ Fu ’ F ’ Fu ’ F f d0 = (P 0 (Q 0 T 1n⇥1 )) 2 Rm , T0 = (P 0 (Q 0 T Q f 1 )) 2 =1 =Fu +1 =1 =Fu +1

15. and For the second case, details are omitted because an analogous q b0 = Q f 1 T b 0 . f derivation of the �rst case can be used. If the third case is considered, 2 (A.22) can be computed as The overall summation of (A.15) and (A.14) can be written as f f 2zT1 3 vec(UfT (Uf1 M)(Q f 1 T Q f 1 )). 6 7 1 2 2 6 . 7 UfT 6 .. 7 + ((UfT a)qTo + (UfT 1m⇥1 )qTb + UfT T ) The overall summation of (A.22) and (A.23) can be written as 1 6 7 6zT 7 1 1 1 4 m5 2 T 3 6 1 7 T 6 .. 7 vec(Uf 6 . 77 )) + f f ((UfT a 0 )q oT + (UfT 1m⇥1 )q b0 T + UfT T 0 ), vec(UfT (Uf1 M)(Q f 1 T Q f 1 )), 1 6 6 T 7 1 1 1 1 2 2 where 4 m5 ’ where ’ j zi = (` 0 (Yi j , Ŷi j ) (Ŷi j Ai j ))q f , f . 2 1 T j j j2 i i = (` 00 (Yi j , Ŷi j ) )(u if M q̃ f , f )q̃ f , f . 1 2 1 2 1 j2 A.2.2 The computation of Matrix-vector Products. From the repre- i sentation of Ŷi j in (A.13), (A.14) and (A.15), For complexity analysis please see [35]. ’ r2 L+ (w̃) = (` 00 (Yi j , Ŷi j ) i, j i, j )x̃ f x̃ f T (A.20) A.2.3 Line Search. To check the su�cient decrease condition in (i, j)2 1 1 (A.10), for the current iterate w̃ and a direction S, we must calculate and f (w̃ + vec(S)) f (w̃). (A.24) ’ m ’ n Because r2 L (w̃) = i, j i, j x̃ f x̃ f T , (A.21) 1 1 1 i=1 j=1 kw̃ k + L (w̃) 2 where ` 00 (Y , Ŷ ) indicates the second derivative with respect to Ŷ . is quadratic, (A.24) is equivalent to Following the calculation in [35], the two products r2 L+ (w̃) vec(M) 1 and r2 L (w̃) vec(M) can be computed as L+ (w̃ + vec(S)) L+ (w̃) + r( kw̃ k 2 + L (w̃))T vec(S) 2 r2 L+ (w̃) vec(M) 1 2 T 2 1 2 vec(S) r ( kw̃ k + L (w̃)) vec(S) ’ + i, j i, j 2 2 = (` 00 (Yi j , Ŷi j ) )x̃ f x̃ f T vec(M) = L+ (w̃ + vec(S)) L+ (w̃) + ( w̃ + rL (w̃))T vec(S) 1 1 (i, j)2 ’ 1 + 2 vec(S)T ( I + r2 L (w̃)) vec(S). i, j i, j T i, j i, j T = (` 00 (Yi j , Ŷi j ) ) vec(x f (x f Mq f , f )q f , f ), (A.22) 2 (i, j)2 1 1 2 1 2 1 (A.25) The value L+ (w̃) is available from the previous iteration. We then and calculate the following two values so that (A.25) can be easily ob- r2 L (w̃) vec(M) tained for any . ’m ’ n ( w̃ + rL (w̃))T vec(S) (A.26) i, j i, j = x̃ f x̃ f T vec(M) 1 1 i=1 j=1 and ’ m ’ n vec(S)T ( I + r2 L (w̃)) vec(S). (A.27) i, j i, j T i, j i, j T = vec(x f (x f Mq f , f )q f , f ). (A.23) Besides, we also have to cache the right hand side of (A.10), which 1 1 2 1 2 1 i=1 j=1 is The cost of evaluating r2 L+ (w̃) vec(M) is proportional to O(| |) rf (w̃)T vec(S). and can be computed directly. However, r2 L (w̃) vec(M) needs a For the value in (A.27) we can calculate the following matrix-vector delicate derivation. We adopt the results in [35] below. product �rst. If �rst situation considered, the r2 L (w̃) vec(M) can be com- puted as ( I + r2 L (w̃)) vec(S) vec(UfT diag(z)Q f 1 ), f = vec(S) + r2 L (w̃) vec(S), 1 2 where the second term can be conducted by the procedure in Section where ⇣ ⌘ f A.2.2. z = n (Uf1 M) Qf1 1k⇥1 2 With the above cached values, to evaluate the function-value The overall summation of (A.22) and (A.23) can be written as di�erence at each new , we only need to calculate UfT diag(z̄)Q f 1 , f L+ (w̃ + vec(S)). (A.28) 1 2 where From (A.12), new Ŷi j values are needed. That is, from (A.4) and ’ T (A.5), z̄ i = z i + 00 (` (Yi j , Ŷi j ) )u if M q̃if2, f1 i, j i, j j2 1 Ŷi j + x f T Sq f , f 8i, j 2 . (A.29) i 1 2 1

16.We can avoid calculating (A.29) for any new by applying a similar Table I: A comparison between learning the imputation trick. Speci�cally, with the cached variables model via S t and Sc [ S t . All results are relative improve- i, j i, j ments over the average (Sc ). The best approach is bold-faced. Yi j = x f T Sq f ,f , 1 2 1 Metric NLL AUC NLL AUC we can calculate (A.28) by Data set Yahoo! R3 CPC Approach L (w̃ + vec(S)) + New (avg) +79.1% +51.8% +32.9% +48.6% ’ 1 = `(Yi j , Ŷi j + Yi j ) (Ai j (Ŷi j + Yi j ))2 St New (item-avg) +76.8% +54.2% +32.4% +48.2% 2 New (complex) -0.2% +37.4% +33.4% +50.6% (i, j)2 New (avg) +0.5% +31.2% +4.34% +32.8% with the complexity proportional to O(| |). Sc [ S t New (item-avg) -0.17% +37.4% +9.0% +38.4% New (complex) -1.4% +36.8% +19.2% +43.0% B ADDITIONAL DETAILS OF EXPERIMENTS B.1 Data Preparation of the Yahoo! R3 Set Table II: Search range of parameters considered in di�erent To simulate events logged by the deterministic policy of (20), we approaches. modify the data set by the following procedures. FFM IPS CausE New (1) We apply rating-based matrix factorization technique [17] on 2 {0, ··· ,7} 2 {0, ··· ,7} - 2 {0, ··· ,7} (user, song) pairs in the training set to �nd a matrix R̃ with - - 2 { 4, ··· , 10} - c, t R̃i j ⇡ Ri j , 8(i, j) 2 , kt c k - - 2 { 4, ··· , 10} - k1 25 25 25 25 where parameters including the size of latent factors and the k2 2 {3, ··· ,6} 2 {3, ··· ,6} 2 ··· ,6} {3, 2 {3, ··· ,6} l2 regularization’s coe�cient are selected through a �ve-fold - - - 2 { 8, 12, 16} cross validation on a grid of values. T [1, 100] [1, 100] [1, 100] [1, 100] (2) We apply a thresholding operation to generate a new set as det = {(i, j) | R̃i j > 2.5, 8(i, j) 2 }. • k 2 : the number of latent factors of each approach for the Yahoo! To generate Sc , where each instance includes a feature vector x and R3 set. a binary label , we apply one-hot encoding and rating binarization • : the coe�cient to balance the IPS part and the direct method as follows part in New. Sc = {(Yi j , x i, j ) | 8(i, j) 2 det }, • T : the number of training iterations of each approach. where The search range of each parameter is listed in Table II. We then z}|{ i choose parameters achieving the highest NLL on S va . For building x i, j = [· · · , 1, · · · , 1, · · · ]T 2 R 15,877 , the �nal model to predict the test set, we should include data in | {z } the validation period. To this end, we consider the following new 14,877+j training set. and ( • For approaches using Sc , we use Sc [ Sc0 as the new training set, 1 Ri j = 5, Yi j = where for the Yahoo! R3 set, because all events in the original 1 Ri j < 5. training set have been included in Sc , Sc0 = ;. For the CPC set, Besides, we randomly split out 5% pairs from the original test set and as shown in Figure 6, Sc0 is the subset of events logged by non- conduct one-hot encoding and binarization to generate S t de�ned uniform policies in the day of collecting S va . in Section 2.2. Analogously, the rest 95% is further split into two • For approaches using S t , we use S t [ S va as the new training set. subsets in ratios of 5% and 90% to generate S va and S te for parameter • For approaches using Sc [ S t , we use Sc [ S t [ S va [ Sc0 as the selections and evaluations, respectively. new training set. We �nally compute NLL and AUC scores of evaluating S te . B.2 Parameter Selection For parameter selections, we conduct grid searches by training B.3 Impact of Unbiased Imputation Models multiple models on the corresponding training set of each approach. Learnt from S t We consider to tune the following parameters. In section 4.2, we proposed unbiased imputation models because • : the l2 regularization coe�cient for FFM, IPS, CausE and New. past works [6, 19] �nd that learning from biased data sets will • c , t : l2 regularization’s coe�cients respectively for Wc and propagate the carried selection bias to non-displayed events and Wt in CausE. worsen the performance. To verify this, we follow these works • kt c k : the coe�cient of the term in CausE, which controls the to learn biased imputation models from Sc [ S t . That is, in the discrepancy between Wc and Wt . work�ow presented in Figure 4, S t is replaced by S t [ Sc . We then • k 1 : the number of latent factors of each approach for the CPC re-select the parameters for the imputation model by a validation set. procedure. From Table I, we observe that the performance of using

17. an imputation model learnt from S t is signi�cantly better than from Sc [ S t . This result con�rms the importance of using an unbiased imputation model. View publication stats