# ICML 2017 Best Paper Award: Understanding Black-box Predictions via Influence Functions

The paper uses influence functions, a classic technique from robust statistics, to trace a model's prediction through the learning algorithm and back to its training data.

## 2. Approach

Consider a prediction problem from some input space $\mathcal{X}$ (e.g., images) to an output space $\mathcal{Y}$ (e.g., labels). We are given training points $z_1, \ldots, z_n$, where $z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$. For a point $z$ and parameters $\theta \in \Theta$, let $L(z, \theta)$ be the loss, and let $\frac{1}{n}\sum_{i=1}^n L(z_i, \theta)$ be the empirical risk. The empirical risk minimizer is given by $\hat\theta \stackrel{\text{def}}{=} \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n L(z_i, \theta)$. (We fold any regularization terms into $L$.) Assume that the empirical risk is twice-differentiable and strictly convex in $\theta$; in Section 4 we explore relaxing these assumptions.

### 2.1. Upweighting a training point

Our goal is to understand the effect of training points on a model's predictions. We formalize this goal by asking the counterfactual: how would the model's predictions change if we did not have this training point?

Let us begin by studying the change in model parameters due to removing a point $z$ from the training set. Formally, this change is $\hat\theta_{-z} - \hat\theta$, where $\hat\theta_{-z} \stackrel{\text{def}}{=} \arg\min_{\theta \in \Theta} \sum_{z_i \neq z} L(z_i, \theta)$. However, retraining the model for each removed $z$ is prohibitively slow.

Fortunately, influence functions give us an efficient approximation. The idea is to compute the parameter change if $z$ were upweighted by some small $\epsilon$, giving us new parameters $\hat\theta_{\epsilon,z} \stackrel{\text{def}}{=} \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n L(z_i, \theta) + \epsilon L(z, \theta)$. A classic result (Cook & Weisberg, 1982) tells us that the influence of upweighting $z$ on the parameters $\hat\theta$ is given by

$$\mathcal{I}_{\text{up,params}}(z) \stackrel{\text{def}}{=} \left.\frac{d\hat\theta_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta), \tag{1}$$

where $H_{\hat\theta} \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \nabla^2_\theta L(z_i, \hat\theta)$ is the Hessian and is positive definite (PD) by assumption. In essence, we form a quadratic approximation to the empirical risk around $\hat\theta$ and take a single Newton step; see appendix A for a derivation. Since removing a point $z$ is the same as upweighting it by $\epsilon = -\frac{1}{n}$, we can linearly approximate the parameter change due to removing $z$ by computing $\hat\theta_{-z} - \hat\theta \approx -\frac{1}{n}\mathcal{I}_{\text{up,params}}(z)$, without retraining the model.

Next, we apply the chain rule to measure how upweighting $z$ changes functions of $\hat\theta$. In particular, the influence of upweighting $z$ on the loss at a test point $z_{\text{test}}$ again has a closed-form expression:

$$\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) \stackrel{\text{def}}{=} \left.\frac{dL(z_{\text{test}}, \hat\theta_{\epsilon,z})}{d\epsilon}\right|_{\epsilon=0} = \nabla_\theta L(z_{\text{test}}, \hat\theta)^\top \left.\frac{d\hat\theta_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta). \tag{2}$$

### 2.2. Perturbing a training input

Let us develop a finer-grained notion of influence by studying a different counterfactual: how would the model's predictions change if a training input were modified?

For a training point $z = (x, y)$, define $z_\delta \stackrel{\text{def}}{=} (x + \delta, y)$. Consider the perturbation $z \to z_\delta$, and let $\hat\theta_{z_\delta,-z}$ be the empirical risk minimizer on the training points with $z_\delta$ in place of $z$. To approximate its effects, define the parameters resulting from moving $\epsilon$ mass from $z$ onto $z_\delta$: $\hat\theta_{\epsilon,z_\delta,-z} \stackrel{\text{def}}{=} \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n L(z_i,\theta) + \epsilon L(z_\delta,\theta) - \epsilon L(z,\theta)$. An analogous calculation to (1) yields:

$$\left.\frac{d\hat\theta_{\epsilon,z_\delta,-z}}{d\epsilon}\right|_{\epsilon=0} = \mathcal{I}_{\text{up,params}}(z_\delta) - \mathcal{I}_{\text{up,params}}(z) = -H_{\hat\theta}^{-1}\left(\nabla_\theta L(z_\delta, \hat\theta) - \nabla_\theta L(z, \hat\theta)\right). \tag{3}$$

As before, we can make the linear approximation $\hat\theta_{z_\delta,-z} - \hat\theta \approx -\frac{1}{n}\left(\mathcal{I}_{\text{up,params}}(z_\delta) - \mathcal{I}_{\text{up,params}}(z)\right)$, giving us a closed-form estimate of the effect of $z \to z_\delta$ on the model. Analogous equations also apply for changes in $y$. While influence functions might appear to only work for infinitesimal (therefore continuous) perturbations, it is important to note that this approximation holds for arbitrary $\delta$: the $\epsilon$-upweighting scheme allows us to smoothly interpolate between $z$ and $z_\delta$. This is particularly useful for working with discrete data (e.g., in NLP) or with discrete label changes.

If $x$ is continuous and $\delta$ is small, we can further approximate (3). Assume that the input domain $\mathcal{X} \subseteq \mathbb{R}^d$, the parameter space $\Theta \subseteq \mathbb{R}^p$, and $L$ is differentiable in $\theta$ and $x$. As $\|\delta\| \to 0$, $\nabla_\theta L(z_\delta, \hat\theta) - \nabla_\theta L(z, \hat\theta) \approx [\nabla_x \nabla_\theta L(z, \hat\theta)]\,\delta$, where $\nabla_x \nabla_\theta L(z, \hat\theta) \in \mathbb{R}^{p \times d}$. Substituting into (3),

$$\left.\frac{d\hat\theta_{\epsilon,z_\delta,-z}}{d\epsilon}\right|_{\epsilon=0} \approx -H_{\hat\theta}^{-1}\,[\nabla_x \nabla_\theta L(z, \hat\theta)]\,\delta. \tag{4}$$

We thus have $\hat\theta_{z_\delta,-z} - \hat\theta \approx -\frac{1}{n} H_{\hat\theta}^{-1}\,[\nabla_x \nabla_\theta L(z, \hat\theta)]\,\delta$. Differentiating w.r.t. $\delta$ and applying the chain rule gives us

$$\mathcal{I}_{\text{pert,loss}}(z, z_{\text{test}})^\top \stackrel{\text{def}}{=} \left.\nabla_\delta L(z_{\text{test}}, \hat\theta_{z_\delta,-z})^\top\right|_{\delta=0} = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top H_{\hat\theta}^{-1}\, \nabla_x \nabla_\theta L(z, \hat\theta). \tag{5}$$

$\mathcal{I}_{\text{pert,loss}}(z, z_{\text{test}})^\top \delta$ tells us the approximate effect that $z \to z + \delta$ has on the loss at $z_{\text{test}}$. By setting $\delta$ in the direction of $\mathcal{I}_{\text{pert,loss}}(z, z_{\text{test}})$, we can construct local perturbations of $z$ that maximally increase the loss at $z_{\text{test}}$. In Section 5.2, we will use this to construct training-set attacks. Finally, we note that $\mathcal{I}_{\text{pert,loss}}(z, z_{\text{test}})$ can help us identify the features of $z$ that are most responsible for the prediction on $z_{\text{test}}$.

### 2.3. Relation to Euclidean distance

To find the training points most relevant to a test point, it is common to look at its nearest neighbors in Euclidean space.
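As a concrete illustration of equations (1) and (2), here is a minimal numpy sketch (not the authors' code; the model, data, and hyperparameters are invented for illustration) that computes $\mathcal{I}_{\text{up,loss}}$ for an L2-regularized logistic regression and compares $-\frac{1}{n}\mathcal{I}_{\text{up,loss}}$ against actual leave-one-out retraining:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_L(theta, x, y, lam):
    # ∇_θ L(z, θ) for L(z, θ) = log(1 + exp(-y θᵀx)) + (lam/2)‖θ‖²
    # (regularization folded into L, as in the paper's footnote)
    return -sigmoid(-y * (x @ theta)) * y * x + lam * theta

def hessian(theta, X, lam):
    # H_θ = (1/n) Σ_i σ(θᵀx_i) σ(-θᵀx_i) x_i x_iᵀ + lam·I
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X) + lam * np.eye(X.shape[1])

def fit(X, y, lam, newton_steps=30):
    # The empirical risk is strictly convex, so a few Newton steps find θ̂.
    theta = np.zeros(X.shape[1])
    for _ in range(newton_steps):
        g = np.mean([grad_L(theta, xi, yi, lam) for xi, yi in zip(X, y)], axis=0)
        theta = theta - np.linalg.solve(hessian(theta, X, lam), g)
    return theta

def point_loss(theta, x, y, lam):
    return np.log1p(np.exp(-y * (x @ theta))) + 0.5 * lam * theta @ theta

def influence_up_loss(theta, X, y, lam, z, z_test):
    # Eq. (2): -∇_θ L(z_test, θ̂)ᵀ H⁻¹ ∇_θ L(z, θ̂)
    s_test = np.linalg.solve(hessian(theta, X, lam), grad_L(theta, *z_test, lam))
    return -s_test @ grad_L(theta, *z, lam)
```

For a small synthetic problem, $-\frac{1}{n}\mathcal{I}_{\text{up,loss}}(z_i, z_{\text{test}})$ should correlate almost perfectly with the actual change in test loss from removing $z_i$ and refitting, mirroring the comparison the paper makes in Section 4.1.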

If all points have the same norm, this is equivalent to choosing the $x$ with the largest inner product $x^\top x_{\text{test}}$ (e.g., Ribeiro et al. (2016)). For intuition, we compare this to $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$ on a logistic regression model and show that influence is much more accurate at accounting for the effect of training.

Let $p(y \mid x) = \sigma(y\,\theta^\top x)$, with $y \in \{-1, 1\}$ and $\sigma(t) = \frac{1}{1+\exp(-t)}$. We seek to maximize the probability of the training set. For a training point $z = (x, y)$, $L(z,\theta) = \log(1+\exp(-y\,\theta^\top x))$, $\nabla_\theta L(z,\theta) = -\sigma(-y\,\theta^\top x)\,y\,x$, and $H_\theta = \frac{1}{n}\sum_{i=1}^n \sigma(\theta^\top x_i)\,\sigma(-\theta^\top x_i)\, x_i x_i^\top$. From (2), $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$ is:

$$-y_{\text{test}}\, y \cdot \sigma(-y_{\text{test}}\,\theta^\top x_{\text{test}}) \cdot \sigma(-y\,\theta^\top x)\cdot x_{\text{test}}^\top H_{\hat\theta}^{-1}\, x.$$

We highlight two key differences from $x^\top x_{\text{test}}$. First, $\sigma(-y\,\theta^\top x)$ gives points with high training loss more influence, revealing that outliers can dominate the model parameters. Second, the weighted covariance matrix $H_{\hat\theta}^{-1}$ measures the "resistance" of the other training points to the removal of $z$; if $\nabla_\theta L(z, \hat\theta)$ points in a direction of little variation, its influence will be higher, since moving in that direction will not significantly increase the loss on other training points. As we show in Fig 1, these differences mean that influence functions capture the effect of model training much more accurately than nearest neighbors.

**Figure 1. Components of influence.** (a) What is the effect of the training loss and $H_{\hat\theta}^{-1}$ terms in $\mathcal{I}_{\text{up,loss}}$? Here, we plot $\mathcal{I}_{\text{up,loss}}$ against variants that are missing these terms and show that they are necessary for picking up the truly influential training points. For these calculations, we use logistic regression to distinguish 1's from 7's in MNIST (LeCun et al., 1998), picking an arbitrary test point $z_{\text{test}}$; similar trends hold across other test points. Green dots are training images with the same label as the test image (7), while red dots are 1's. Left: Without the training loss term, we overestimate the influence of many training points: the points near the $y=0$ line should have $\mathcal{I}_{\text{up,loss}}$ close to 0, but instead have high influence when we remove the training loss term. Mid: Without $H_{\hat\theta}^{-1}$, all green training points are helpful (removing each point increases test loss) and all red points are harmful (removing each point decreases test loss). This is because $\forall x,\ x \succeq 0$ (all pixel values are positive), so $x^\top x_{\text{test}} \geq 0$; but it is incorrect: many harmful training points actually share the same label as $z_{\text{test}}$. See panel (b). Right: Without the training loss or $H_{\hat\theta}^{-1}$, what is left is the scaled Euclidean inner product $y_{\text{test}}\,y \cdot \sigma(-y_{\text{test}}\,\theta^\top x_{\text{test}}) \cdot x_{\text{test}}^\top x$, which fails to accurately capture influence; the scatter plot deviates quite far from the diagonal. (b) The test image and a harmful training image with the same label. To the model, they look very different, so the presence of the training image makes the model think that the test image is less likely to be a 7. The Euclidean inner product does not pick up on these less intuitive, but important, harmful influences.

## 3. Efficiently Calculating Influence

There are two computational challenges to using $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta)$. First, it requires forming and inverting $H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^n \nabla^2_\theta L(z_i, \hat\theta)$, the Hessian of the empirical risk. With $n$ training points and $\theta \in \mathbb{R}^p$, this requires $O(np^2 + p^3)$ operations, which is too expensive for models like deep neural networks with millions of parameters. Second, we often want to calculate $\mathcal{I}_{\text{up,loss}}(z_i, z_{\text{test}})$ across all training points $z_i$.

The first problem is well-studied in second-order optimization. The idea is to avoid explicitly computing $H_{\hat\theta}^{-1}$; instead, we use implicit Hessian-vector products (HVPs) to efficiently approximate $s_{\text{test}} \stackrel{\text{def}}{=} H_{\hat\theta}^{-1}\, \nabla_\theta L(z_{\text{test}}, \hat\theta)$, and then compute $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -s_{\text{test}} \cdot \nabla_\theta L(z, \hat\theta)$. This also solves the second problem: for each test point of interest, we can precompute $s_{\text{test}}$ and then efficiently compute $-s_{\text{test}} \cdot \nabla_\theta L(z_i, \hat\theta)$ for each training point $z_i$.

We discuss two techniques for approximating $s_{\text{test}}$, both relying on the fact that the HVP of a single term in $H_{\hat\theta}$, $[\nabla^2_\theta L(z_i, \hat\theta)]v$, can be computed for arbitrary $v$ in the same time that $\nabla_\theta L(z_i, \hat\theta)$ would take, which is typically $O(p)$ (Pearlmutter, 1994).

**Conjugate gradients (CG).** The first technique is a standard transformation of matrix inversion into an optimization problem. Since $H_{\hat\theta} \succ 0$ by assumption, $H_{\hat\theta}^{-1} v \equiv \arg\min_t \{\frac{1}{2} t^\top H_{\hat\theta}\, t - v^\top t\}$. We can solve this with CG approaches that only require the evaluation of $H_{\hat\theta}\, t$, which takes $O(np)$ time, without explicitly forming $H_{\hat\theta}$. While an exact solution takes $p$ CG iterations, in practice we can get a good approximation with fewer iterations; see Martens (2010) for more details.

**Stochastic estimation.** With large datasets, standard CG can be slow; each iteration still goes through all $n$ training points. We use a method developed by Agarwal et al. (2016) to get an estimator that only samples a single point per iteration, which results in significant speedups.
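The CG transformation described in Section 3 can be sketched in a few lines. This is a generic textbook CG solver, not the authors' implementation; it touches the Hessian only through a Hessian-vector-product callable, so it never forms $H_{\hat\theta}$ explicitly:

```python
import numpy as np

def conjugate_gradient(hvp, v, iters=None, tol=1e-16):
    # Solve H t = v, i.e. minimize (1/2) tᵀHt - vᵀt, using only
    # Hessian-vector products hvp(t) = H t. For the empirical risk,
    # each hvp costs one O(np) pass over the training set.
    t = np.zeros_like(v)
    r = v - hvp(t)          # residual v - Ht
    d = r.copy()            # search direction
    rs = r @ r
    for _ in range(iters or len(v)):
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        t += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if rs_new < tol:    # residual small enough: stop early
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return t
```

To compute $s_{\text{test}}$, one would call this with `hvp` implemented via Pearlmutter's trick (or autodiff) and `v` set to the test-point gradient; in practice far fewer than $p$ iterations already give a usable approximation.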

Dropping the $\hat\theta$ subscript for clarity, let $H_j^{-1} \stackrel{\text{def}}{=} \sum_{i=0}^{j} (I - H)^i$, the first $j$ terms in the Taylor expansion of $H^{-1}$. Rewrite this recursively as $H_j^{-1} = I + (I - H)H_{j-1}^{-1}$. From the validity of the Taylor expansion, $H_j^{-1} \to H^{-1}$ as $j \to \infty$. (We assume w.l.o.g. that $\forall i,\ \nabla^2_\theta L(z_i, \hat\theta) \preceq I$; if this is not true, we can scale the loss down without affecting the parameters. In some cases, we can get an upper bound on $\nabla^2_\theta L(z_i, \hat\theta)$, e.g., for linear models and bounded input, which makes this easy. Otherwise, we treat the scaling as a separate hyperparameter and tune it such that the Taylor expansion converges.) The key is that at each iteration, we can substitute the full $H$ with a draw from any unbiased (and faster-to-compute) estimator of $H$ to form $\tilde H_j$. Since $\mathbb{E}[\tilde H_j^{-1}] = H_j^{-1}$, we still have $\mathbb{E}[\tilde H_j^{-1}] \to H^{-1}$.

In particular, we can uniformly sample $z_i$ and use $\nabla^2_\theta L(z_i, \hat\theta)$ as an unbiased estimator of $H$. This gives us the following procedure: uniformly sample $t$ points $z_{s_1}, \ldots, z_{s_t}$ from the training data; define $\tilde H_0^{-1} v = v$; and recursively compute $\tilde H_j^{-1} v = v + \left(I - \nabla^2_\theta L(z_{s_j}, \hat\theta)\right)\tilde H_{j-1}^{-1} v$, taking $\tilde H_t^{-1} v$ as our final unbiased estimate of $H^{-1} v$. We pick $t$ to be large enough such that $\tilde H_t$ stabilizes, and to reduce variance we repeat this procedure $r$ times and average the results. Empirically, we found this significantly faster than CG.

We note that the original method of Agarwal et al. (2016) dealt only with generalized linear models, for which $[\nabla^2_\theta L(z_i, \hat\theta)]v$ can be efficiently computed in $O(p)$ time. In our case, we rely on Pearlmutter (1994)'s more general algorithm for fast HVPs, described above, to achieve the same time complexity. (To increase stability, especially with non-convex models (see Section 4.2), we can also sample a mini-batch of training points at each iteration, instead of relying on a single training point.)

With these techniques, we can compute $\mathcal{I}_{\text{up,loss}}(z_i, z_{\text{test}})$ on all training points $z_i$ in $O(np + rtp)$ time; we show in Section 4.1 that empirically, choosing $rt = O(n)$ gives accurate results. Similarly, we compute $\mathcal{I}_{\text{pert,loss}}(z_i, z_{\text{test}})^\top = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top H_{\hat\theta}^{-1}\, \nabla_x \nabla_\theta L(z_i, \hat\theta)$ with two matrix-vector products: we first compute $s_{\text{test}}$, then $s_{\text{test}}^\top \nabla_x \nabla_\theta L(z_i, \hat\theta)$, with the same HVP trick. These computations are easy to implement in auto-grad systems like TensorFlow (Abadi et al., 2015) and Theano (Theano D. Team, 2016), as users need only specify $L$; the rest is automatically handled.

**Figure 2. Influence matches leave-one-out retraining.** We arbitrarily picked a wrongly-classified test point $z_{\text{test}}$, but this trend held more broadly. These results are from 10-class MNIST. Left: For each of the 500 training points $z$ with largest $|\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})|$, we plotted $-\frac{1}{n} \cdot \mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$ against the actual change in test loss after removing that point and retraining. The inverse HVP was solved exactly with CG. Mid: Same, but with the stochastic approximation. Right: The same plot for a CNN, computed on the 100 most influential points with CG. For the actual difference in loss, we removed each point and retrained from $\tilde\theta$ for 30k steps.

## 4. Validation and Extensions

Recall that influence functions are asymptotic approximations of leave-one-out retraining under the assumptions that (i) the model parameters $\hat\theta$ minimize the empirical risk, and that (ii) the empirical risk is twice-differentiable and strictly convex. Here, we empirically show that influence functions are accurate approximations (Section 4.1) that provide useful information even when these assumptions are violated (Sections 4.2, 4.3).

### 4.1. Influence functions vs. leave-one-out retraining

Influence functions assume that the weight on a training point is changed by an infinitesimally small $\epsilon$. To investigate the accuracy of using influence functions to approximate the effect of removing a training point and retraining, we compared $-\frac{1}{n}\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$ with $L(z_{\text{test}}, \hat\theta_{-z}) - L(z_{\text{test}}, \hat\theta)$ (i.e., actually doing leave-one-out retraining). With a logistic regression model on 10-class MNIST (trained with L-BFGS (Liu & Nocedal, 1989), with L2 regularization of 0.01, $n = 55{,}000$, and $p = 7{,}840$ parameters), the predicted and actual changes matched closely (Fig 2-Left).

The stochastic approximation from Agarwal et al. (2016) was also accurate with $r = 10$ repeats and $t = 5{,}000$ iterations (Fig 2-Mid). Since each iteration only requires one HVP $[\nabla^2_\theta L(z_i, \hat\theta)]v$, this runs quickly: in fact, we accurately estimated $H^{-1}v$ without even looking at every data point, since $n = 55{,}000 > rt$. Surprisingly, even $r = 1$ worked; while results were noisier, it was still able to identify the most influential points.

### 4.2. Non-convexity and non-convergence

In Section 2, we took $\hat\theta$ as the global minimum. In practice, if we obtain our parameters $\tilde\theta$ by running SGD with early stopping or on non-convex objectives, then $\tilde\theta \neq \hat\theta$. As a result, $H_{\tilde\theta}$ could have negative eigenvalues. We show that influence functions on $\tilde\theta$ still give meaningful results in practice. Our approach is to form a convex quadratic approximation of the loss around $\tilde\theta$, i.e., $\tilde L(z, \theta) = L(z, \tilde\theta) + \nabla_\theta L(z, \tilde\theta)^\top(\theta - \tilde\theta) + \frac{1}{2}(\theta - \tilde\theta)^\top (H_{\tilde\theta} + \lambda I)(\theta - \tilde\theta)$.
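The stochastic inverse-HVP recursion from Section 3 ($\tilde H_j^{-1} v = v + (I - \nabla^2_\theta L(z_{s_j}, \hat\theta))\tilde H_{j-1}^{-1} v$, averaged over $r$ runs) can be sketched as follows. This is an illustrative reimplementation, not the authors' code; `point_hvps` and the toy Hessians in the test are invented for the example:

```python
import numpy as np

def stochastic_inverse_hvp(point_hvps, v, t=500, r=10, seed=0):
    # Estimate H^{-1} v, where H = (1/n) Σ_i ∇²L(z_i, θ̂), using only
    # per-point HVPs; assumes each per-point Hessian is scaled to ⪯ I
    # so that the underlying Neumann series converges.
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(r):                      # r independent runs, averaged
        h = v.copy()                        # H̃_0^{-1} v = v
        for _ in range(t):                  # t sampled recursion steps
            j = rng.integers(len(point_hvps))
            # H̃_j^{-1} v = v + (I - ∇²L(z_{s_j}, θ̂)) H̃_{j-1}^{-1} v
            h = v + h - point_hvps[j](h)
        estimates.append(h)
    return np.mean(estimates, axis=0)
```

A useful sanity check: if the full Hessian itself is passed as a single zero-variance "per-point" estimator, the recursion reduces to the truncated Neumann series $\sum_{i=0}^{t}(I-H)^i v$ and converges deterministically to $H^{-1}v$.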

Freezing all but the top layer of a pre-trained network and retraining only that layer is common in computer vision and is equivalent to training a logistic regression model on the bottleneck features (Donahue et al., 2014). We picked a test image both models got correct (Fig 4-Top) and used SmoothHinge(·, 0.001) to compute the influence for the SVM.

As expected, $\mathcal{I}_{\text{up,loss}}$ in the RBF SVM varied inversely with raw pixel distance, with training images far from the test image in pixel space having almost no influence. The Inception influences were much less correlated with distance in pixel space (Fig 4-Left). Looking at the two most helpful images (most positive $-\mathcal{I}_{\text{up,loss}}$) for each model in Fig 4-Right, we see that the Inception network picked up on the distinctive characteristics of clownfish, whereas the RBF SVM pattern-matched training images superficially.

Moreover, in the RBF SVM, fish (green points) close to the test image were mostly helpful, while dogs (red) were mostly harmful, with the RBF acting as a soft nearest-neighbor function (Fig 4-Left). In contrast, in the Inception network, fish and dogs could be helpful or harmful for correctly classifying the test image as a fish; in fact, some of the most helpful training images were dogs that, to the model, looked very different from the test fish (Fig 4-Top).

**Figure 4. Inception vs. RBF SVM.** Bottom left: $-\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$ vs. $\|z - z_{\text{test}}\|_2^2$. Green dots are fish and red dots are dogs. Bottom right: The two most helpful training images, for each model, on the test. Top right: An image of a dog in the training set that helped the Inception model correctly classify the test image as a fish.

### 5.2. Adversarial training examples

In this section, we show that models that place a lot of influence on a small number of points can be vulnerable to training input perturbations, posing a serious security risk in real-world ML systems where attackers can influence the training data (Huang et al., 2011). Recent work has generated adversarial test images that are visually indistinguishable from real test images but completely fool a classifier (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016). We demonstrate that influence functions can be used to craft adversarial training images that are similarly visually indistinguishable and can flip a model's prediction on a separate test image. To the best of our knowledge, this is the first proof-of-concept that visually-indistinguishable training attacks can be executed on otherwise highly-accurate neural networks.

The key idea is that $\mathcal{I}_{\text{pert,loss}}(z, z_{\text{test}})$ tells us how to modify training point $z$ to most increase the loss on $z_{\text{test}}$. Concretely, for a target test image $z_{\text{test}}$, we can construct $\tilde z_i$, an adversarial version of a training image $z_i$, by initializing $\tilde z_i := z_i$ and then iterating $\tilde z_i := \Pi\left(\tilde z_i + \alpha\, \mathrm{sign}\!\left(\mathcal{I}_{\text{pert,loss}}(\tilde z_i, z_{\text{test}})\right)\right)$, where $\alpha$ is the step size and $\Pi$ projects onto the set of valid images that share the same 8-bit representation with $z_i$. After each iteration, we retrain the model. This is an iterated, training-set analogue of the methods used by, e.g., Goodfellow et al. (2015) and Moosavi-Dezfooli et al. (2016) for test-set attacks.

We tested these training attacks on the same Inception network on dogs vs. fish from Section 5.1, choosing this pair of animals to provide a stark contrast between the classes. We set $\alpha = 0.02$ and ran the attack for 100 iterations on each test image. As before, we froze all but the top layer for training; note that computing $\mathcal{I}_{\text{pert,loss}}$ still involves differentiating through the entire network. Originally, the model correctly classified 591 / 600 test images. For each of these 591 test images, considered separately, we tried to find a visually-indistinguishable perturbation (i.e., same 8-bit representation) to a single training image, out of 1,800 total training images, that would flip the model's prediction. We were able to do this on 335 (57%) of the 591 test images. By perturbing 2 training images for each test image, we could flip predictions on 77% of the 591 test images; and if we perturbed 10 training images, we could flip all but 1 of the 591. The above results are from attacking each test image separately, i.e., using a different training set to attack each test image. We also tried to attack multiple test images simultaneously by increasing their average loss, and found that single training-image perturbations could simultaneously flip multiple test predictions as well (Fig 5).

We make three observations about these attacks. First, though the change in pixel values is small, the change in the final Inception feature layer is significantly larger: using L2 distance in pixel space, the training values change by less than 1% of the mean distance of a training point to its class centroid, whereas in Inception feature space, the change is on the same order as the mean distance. This leaves open the possibility that our attacks, while visually imperceptible, can be detected by examining the feature space. Second, the attack tries to perturb the training ex…
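One step of the attack loop above, $\tilde z_i := \Pi(\tilde z_i + \alpha\, \mathrm{sign}(\mathcal{I}_{\text{pert,loss}}))$, can be sketched as follows. The paper only specifies that $\Pi$ projects onto images with the same 8-bit representation; the half-step interval used here and the $[0, 1]$ pixel range are my assumptions, and `ipert_direction` stands in for a computed $\mathcal{I}_{\text{pert,loss}}$ vector:

```python
import numpy as np

def project_same_8bit(z_adv, z_orig):
    # Π: keep each pixel inside the interval that rounds to the same
    # 8-bit value as the original image (assumed interval; pixels in [0, 1]).
    q = np.round(z_orig * 255.0)               # original 8-bit levels
    lo = np.clip((q - 0.499) / 255.0, 0.0, 1.0)
    hi = np.clip((q + 0.499) / 255.0, 0.0, 1.0)
    return np.clip(z_adv, lo, hi)

def training_attack_step(z, z_orig, ipert_direction, alpha=0.02):
    # One iteration of z̃ := Π(z̃ + α·sign(I_pert,loss(z̃, z_test)));
    # in the paper's attack, the model is retrained after each iteration.
    return project_same_8bit(z + alpha * np.sign(ipert_direction), z_orig)
```

After any number of such steps, the perturbed image still quantizes to the same 8-bit image as the original, which is what makes the attack visually indistinguishable.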