We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find that defenses relying on this effect can be circumvented. We describe the characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study examining the non-certified white-box-secure defenses at ICLR 2018, we find that obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on them. In the original threat model each paper considers, our new attacks completely circumvent 6 defenses and partially circumvent 1.


Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples

Anish Athalye *1, Nicholas Carlini *2, David Wagner 2

arXiv:1802.00420v4 [cs.LG] 31 Jul 2018

Abstract

We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.

1. Introduction

In response to the susceptibility of neural networks to adversarial examples (Szegedy et al., 2013; Biggio et al., 2013), there has been significant interest recently in constructing defenses to increase the robustness of neural networks. While progress has been made in understanding and defending against adversarial examples in the white-box setting, where the adversary has full access to the network, a complete solution has not yet been found.

As benchmarking against iterative optimization-based attacks (e.g., Kurakin et al. (2016a); Madry et al. (2018); Carlini & Wagner (2017c)) has become standard practice in evaluating defenses, new defenses have arisen that appear to be robust against these powerful optimization-based attacks.

We identify one common reason why many defenses provide apparent robustness against iterative optimization attacks: obfuscated gradients, a term we define as a special case of gradient masking (Papernot et al., 2017). Without a good gradient, where following the gradient does not successfully optimize the loss, iterative optimization-based methods cannot succeed. We identify three types of obfuscated gradients: shattered gradients are nonexistent or incorrect gradients caused either intentionally through non-differentiable operations or unintentionally through numerical instability; stochastic gradients depend on test-time randomness; and vanishing/exploding gradients in very deep computation result in an unusable gradient.

We propose new techniques to overcome obfuscated gradients caused by these three phenomena. We address gradient shattering with a new attack technique we call Backward Pass Differentiable Approximation, where we approximate derivatives by computing the forward pass normally and computing the backward pass using a differentiable approximation of the function. We compute gradients of randomized defenses by applying Expectation Over Transformation (Athalye et al., 2017). We solve vanishing/exploding gradients through reparameterization and optimize over a space where gradients do not explode/vanish.

To investigate the prevalence of obfuscated gradients and understand the applicability of these attack techniques, we use as a case study the ICLR 2018 non-certified defenses that claim white-box robustness. We find that obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on this phenomenon. Applying the new attack techniques we develop, we overcome obfuscated gradients and circumvent 6 of them completely, and 1 partially, under the original threat model of each paper. Along with this, we offer an analysis of the evaluations performed in the papers.

Additionally, we hope to provide researchers with a common baseline of knowledge, description of attack techniques, and common evaluation pitfalls, so that future defenses can avoid falling vulnerable to these same attack approaches. To promote reproducible research, we release our re-implementation of each of these defenses, along with implementations of our attacks for each.[1]

* Equal contribution. 1 Massachusetts Institute of Technology, 2 University of California, Berkeley. Correspondence to: Anish Athalye <aathalye@mit.edu>, Nicholas Carlini <npc@berkeley.edu>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

[1] https://github.com/anishathalye/obfuscated-gradients

2. Preliminaries

2.1. Notation

We consider a neural network f(·) used for classification where f(x)_i represents the probability that image x corresponds to label i. We classify images, represented as x ∈ [0, 1]^(w·h·c) for a c-channel image of width w and height h. We use f^j(·) to refer to layer j of the neural network, and f^{1..j}(·) the composition of layers 1 through j. We denote the classification of the network as c(x) = argmax_i f(x)_i, and c*(x) denotes the true label.

2.2. Adversarial Examples

Given an image x and classifier f(·), an adversarial example (Szegedy et al., 2013) x' satisfies two properties: D(x, x') is small for some distance metric D, and c(x') ≠ c*(x). That is, for images, x and x' appear visually similar but x' is classified incorrectly.

In this paper, we use the ℓ∞ and ℓ2 distortion metrics to measure similarity. Two images which have a small distortion under either of these metrics will appear visually identical. We report ℓ∞ distance in the normalized [0, 1] space, so that a distortion of 0.031 corresponds to 8/256, and ℓ2 distance as the total root-mean-square distortion normalized by the total number of pixels (as is done in prior work).

2.3. Datasets & Models

We evaluate these defenses on the same datasets on which they claim robustness. If a defense argues security on MNIST and any other dataset, we only evaluate the defense on the larger dataset. On MNIST and CIFAR-10, we evaluate defenses over the entire test set and generate untargeted adversarial examples. On ImageNet, we evaluate over 1000 randomly selected images in the test set, construct targeted adversarial examples with randomly selected target classes, and report attack success rate in addition to model accuracy. Generating targeted adversarial examples is a strictly harder problem that we believe is a more meaningful metric for evaluating attacks.[2] Conversely, for a defender, the harder task is to argue robustness to untargeted attacks.

We use standard models for each dataset. For MNIST we use a standard 5-layer convolutional neural network which reaches 99.3% accuracy. On CIFAR-10 we train a wide ResNet (Zagoruyko & Komodakis, 2016; He et al., 2016) to 95% accuracy. For ImageNet we use the InceptionV3 (Szegedy et al., 2016) network which reaches 78.0% top-1 and 93.9% top-5 accuracy.

2.4. Threat Models

Prior work considers adversarial examples in white-box and black-box threat models. In this paper, we consider defenses designed for the white-box setting, where the adversary has full access to the neural network classifier (architecture and weights) and defense, but not test-time randomness (only the distribution). We evaluate each defense under the threat model under which it claims to be secure (e.g., bounded ℓ∞ distortion of ε = 0.031). It is often easy to find imperceptibly perturbed adversarial examples by violating the threat model, but by instead succeeding under the original threat model, we show that the original evaluations were inadequate and the claims of defenses' security were incorrect.

2.5. Attack Methods

We construct adversarial examples with iterative optimization-based methods. For a given instance x, these attacks attempt to search for a δ such that c(x + δ) ≠ c*(x), either minimizing ‖δ‖ or maximizing the classification loss on f(x + δ). To generate ℓ∞-bounded adversarial examples we use Projected Gradient Descent (PGD) confined to a specified ℓ∞ ball; for ℓ2, we use the Lagrangian relaxation of Carlini & Wagner (2017c). We use between 100 and 10,000 iterations of gradient descent, as needed to obtain convergence. The specific choice of optimizer is far less important than choosing to use iterative optimization-based methods (Madry et al., 2018).

3. Obfuscated Gradients

A defense is said to cause gradient masking if it "does not have useful gradients" for generating adversarial examples (Papernot et al., 2017); gradient masking is known to be an incomplete defense to adversarial examples (Papernot et al., 2017; Tramèr et al., 2018). Despite this, we observe that 7 of the ICLR 2018 defenses rely on this effect.

To contrast with previous defenses which cause gradient masking by learning to break gradient descent (e.g., by learning to make the gradients point the wrong direction (Tramèr et al., 2018)), we refer to the case where defenses are designed in such a way that the constructed defense necessarily causes gradient masking as obfuscated gradients. We discover three ways in which defenses obfuscate gradients (we use this word because in these cases, it is the defense creator who has obfuscated the gradient information); we briefly define and discuss each of them.

Shattered Gradients are caused when a defense is non-differentiable, introduces numeric instability, or otherwise causes a gradient to be nonexistent or incorrect. Defenses that cause gradient shattering can do so unintentionally, by using differentiable operations but where following the gradient does not maximize classification loss globally.

[2] Misclassification is a less meaningful metric on ImageNet, where a misclassification of closely related classes (e.g., a German shepherd classified as a Doberman) may not be meaningful.
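As an illustrative sketch of the iterative attacks described in §2.5, the following NumPy fragment runs ℓ∞-bounded PGD against a toy logistic classifier. The classifier, dimensions, and hyperparameters here are illustrative assumptions, not the implementation used in our evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
w = rng.normal(size=d)                  # toy logistic classifier weights
x0 = rng.random(d)                      # "image" in [0, 1]^d
y = 1.0                                 # true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(x):
    # cross-entropy loss for label y and its gradient with respect to x:
    # L = -log sigmoid(w.x), dL/dx = (sigmoid(w.x) - y) * w
    p = sigmoid(w @ x)
    return -np.log(p + 1e-12), (p - y) * w

def pgd_linf(x0, eps=0.03, alpha=0.01, steps=40):
    # l_inf PGD: ascend the loss, then project back into the eps-ball
    # around x0 and into the valid image range [0, 1]
    x = x0.copy()
    for _ in range(steps):
        _, g = loss_grad(x)
        x = x + alpha * np.sign(g)              # gradient-sign step
        x = np.clip(x, x0 - eps, x0 + eps)      # project to eps-ball
        x = np.clip(x, 0.0, 1.0)                # stay a valid image
    return x

x_adv = pgd_linf(x0)
assert np.max(np.abs(x_adv - x0)) <= 0.03 + 1e-9
assert loss_grad(x_adv)[0] >= loss_grad(x0)[0]  # loss did not decrease
```

In a real attack the toy loss above is replaced by the network's classification loss, and the loop runs for the 100 to 10,000 iterations mentioned in §2.5.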

Stochastic Gradients are caused by randomized defenses, where either the network itself is randomized or the input is randomly transformed before being fed to the classifier, causing the gradients to become randomized. This causes methods using a single sample of the randomness to incorrectly estimate the true gradient.

Exploding & Vanishing Gradients are often caused by defenses that consist of multiple iterations of neural network evaluation, feeding the output of one computation as the input of the next. This type of computation, when unrolled, can be viewed as an extremely deep neural network evaluation, which can cause vanishing/exploding gradients.

3.1. Identifying Obfuscated & Masked Gradients

Some defenses intentionally break gradient descent and cause obfuscated gradients. However, other defenses unintentionally break gradient descent, where the cause of gradient descent being broken is a direct result of the design of the neural network. We discuss below characteristic behaviors of defenses which cause this to occur. These behaviors may not perfectly characterize all cases of masked gradients.

One-step attacks perform better than iterative attacks. Iterative optimization-based attacks applied in a white-box setting are strictly stronger than single-step attacks and should give strictly superior performance. If single-step methods give performance superior to iterative methods, it is likely that the iterative attack is becoming stuck in its optimization search at a local minimum.

Black-box attacks are better than white-box attacks. The black-box threat model is a strict subset of the white-box threat model, so attacks in the white-box setting should perform better; if a defense is obfuscating gradients, then black-box attacks (which do not use the gradient) often perform better than white-box attacks (Papernot et al., 2017).

Unbounded attacks do not reach 100% success. With unbounded distortion, any classifier should have 0% robustness to attack. If an attack does not reach 100% success with a sufficiently large distortion bound, this indicates the attack is not performing optimally against the defense, and the attack should be improved.

Random sampling finds adversarial examples. Brute-force random search (e.g., randomly sampling 10^5 or more points) within some ε-ball should not find adversarial examples when gradient-based attacks do not.

Increasing distortion bound does not increase success. A larger distortion bound should monotonically increase attack success rate; significantly increasing the distortion bound should result in a significantly higher attack success rate.

4. Attack Techniques

Generating adversarial examples through optimization-based methods requires useful gradients obtained through backpropagation (Rumelhart et al., 1986). Many defenses therefore either intentionally or unintentionally cause gradient descent to fail because of obfuscated gradients caused by gradient shattering, stochastic gradients, or vanishing/exploding gradients. We discuss a number of techniques that we develop to overcome obfuscated gradients.

4.1. Backward Pass Differentiable Approximation

Shattered gradients, caused either unintentionally, e.g. by numerical instability, or intentionally, e.g. by using non-differentiable operations, result in nonexistent or incorrect gradients. To attack defenses where gradients are not readily available, we introduce a technique we call Backward Pass Differentiable Approximation (BPDA).[3]

4.1.1. A Special Case: The Straight-Through Estimator

As a special case, we first discuss what amounts to the straight-through estimator (Bengio et al., 2013) applied to constructing adversarial examples.

Many non-differentiable defenses can be expressed as follows: given a pre-trained classifier f(·), construct a preprocessor g(·) and let the secured classifier be f̂(x) = f(g(x)), where the preprocessor g(·) satisfies g(x) ≈ x (e.g., such a g(·) may perform image denoising to remove the adversarial perturbation, as in Guo et al. (2018)). If g(·) is smooth and differentiable, then computing gradients through the combined network f̂ is often sufficient to circumvent the defense (Carlini & Wagner, 2017b). However, recent work has constructed functions g(·) which are neither smooth nor differentiable, and therefore cannot be backpropagated through to generate adversarial examples with a white-box attack that requires gradient signal.

Because g is constructed with the property that g(x) ≈ x, we can approximate its derivative as the derivative of the identity function: ∇_x g(x) ≈ ∇_x x = 1. Therefore, we can approximate the derivative of f(g(x)) at the point x̂ as:

∇_x f(g(x))|_{x=x̂} ≈ ∇_x f(x)|_{x=g(x̂)}

This allows us to compute gradients and therefore mount a white-box attack. Conceptually, this attack is simple. We perform forward propagation through the neural network as usual, but on the backward pass, we replace g(·) with the identity function. In practice, the implementation can be expressed in an even simpler way: we approximate ∇_x f(g(x)) by evaluating ∇_x f(x) at the point g(x).

[3] The BPDA approach can be used on an arbitrary network, even if it is already differentiable, to obtain a more useful gradient.
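The straight-through special case can be sketched end-to-end. In this illustrative example, a bit-depth-reduction preprocessor stands in for g(·) and a toy linear loss stands in for the classifier (neither is any specific defense's implementation); the backward pass evaluates the gradient of f at the point g(x), exactly as in the approximation above:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
w = rng.normal(size=d)
x0 = rng.random(d)

def g(x):
    # non-differentiable preprocessor: bit-depth reduction to a 1/8 grid.
    # g(x) ~ x, but its true gradient is zero almost everywhere
    return np.round(x * 8) / 8

def loss_and_grad(x):
    # differentiable classifier part (toy): loss -w.x with gradient -w
    return -(w @ x), -w

def bpda_grad(x):
    # straight-through BPDA: the forward pass is f(g(x)); on the backward
    # pass g is replaced by the identity, i.e. evaluate grad f at g(x)
    return loss_and_grad(g(x))[1]

eps, alpha, x = 0.2, 0.05, x0.copy()
for _ in range(10):
    x = np.clip(x + alpha * np.sign(bpda_grad(x)), x0 - eps, x0 + eps)

secured_loss = lambda x: loss_and_grad(g(x))[0]
assert secured_loss(x) > secured_loss(x0)  # attack succeeds despite g
```

Differentiating g directly would give a zero gradient almost everywhere, so ordinary gradient descent stalls; the identity approximation restores a useful attack direction.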

This gives us an approximation of the true gradient, and while not perfect, it is sufficiently useful that, when averaged over many iterations of gradient descent, it still generates an adversarial example. The math behind the validity of this approach is similar to the special case.

4.1.2. Generalized Attack: BPDA

While the above attack is effective for a simple class of networks expressible as f(g(x)) when g(x) ≈ x, it is not fully general. We now generalize the above approach into our full attack, which we call Backward Pass Differentiable Approximation (BPDA).

Let f(·) = f^{1...j}(·) be a neural network, and let f^i(·) be a non-differentiable (or not usefully-differentiable) layer. To approximate ∇_x f(x), we first find a differentiable approximation g(x) such that g(x) ≈ f^i(x). Then, we can approximate ∇_x f(x) by performing the forward pass through f(·) (and in particular, computing a forward pass through f^i(x)), but on the backward pass, replacing f^i(x) with g(x). Note that we perform this replacement only on the backward pass.

As long as the two functions are similar, we find that the slightly inaccurate gradients still prove useful in constructing an adversarial example. Applying BPDA often requires more iterations of gradient descent than without because each individual gradient descent step is not exactly correct. We have found applying BPDA is often necessary: replacing f^i(·) with g(·) on both the forward and backward pass is either completely ineffective (e.g. with Song et al. (2018)) or many times less effective (e.g. with Buckman et al. (2018)).

4.2. Attacking Randomized Classifiers

Stochastic gradients arise when using randomized transformations to the input before feeding it to the classifier or when using a stochastic classifier. When using optimization-based attacks on defenses that employ these techniques, it is necessary to estimate the gradient of the stochastic function.

Expectation over Transformation. For defenses that employ randomized transformations to the input, we apply Expectation over Transformation (EOT) (Athalye et al., 2017) to correctly compute the gradient over the expected transformation to the input.

When attacking a classifier f(·) that first randomly transforms its input according to a function t(·) sampled from a distribution of transformations T, EOT optimizes the expectation over the transformation E_{t∼T} f(t(x)). The optimization problem can be solved by gradient descent, noting that ∇ E_{t∼T} f(t(x)) = E_{t∼T} ∇ f(t(x)), differentiating through the classifier and transformation, and approximating the expectation with samples at each gradient descent step.

4.3. Reparameterization

We solve vanishing/exploding gradients by reparameterization. Assume we are given a classifier f(g(x)) where g(·) performs some optimization loop to transform the input x to a new input x̂. Oftentimes, this optimization loop means that differentiating through g(·), while possible, yields exploding or vanishing gradients.

To resolve this, we make a change-of-variable x = h(z) for some function h(·) such that g(h(z)) = h(z) for all z, but h(·) is differentiable. For example, if g(·) projects samples to some manifold in a specific manner, we might construct h(z) to return points exclusively on the manifold. This allows us to compute gradients through f(h(z)) and thereby circumvent the defense.

5. Case Study: ICLR 2018 Defenses

As a case study for evaluating the prevalence of obfuscated gradients, we study the ICLR 2018 non-certified defenses that argue robustness in a white-box threat model. Each of these defenses argues a high robustness to adaptive, white-box attacks. We find that seven of these nine defenses rely on this phenomenon, and we demonstrate that our techniques can completely circumvent six of those (and partially circumvent one) that rely on obfuscated gradients. We omit two defenses with provable security claims (Raghunathan et al., 2018; Sinha et al., 2018) and one that only argues black-box security (Tramèr et al., 2018). We include one paper, Ma et al. (2018), that was not proposed as a defense per se, but suggests a method to detect adversarial examples.

There is an asymmetry in attacking defenses versus constructing robust defenses: to show a defense can be bypassed, it is only necessary to demonstrate one way to do so; in contrast, a defender must show no attack can succeed.

Table 1 summarizes our results. Of the 9 accepted papers, 7 rely on obfuscated gradients. Two of these defenses argue robustness on ImageNet, a much harder task than CIFAR-10; and one argues robustness on MNIST, a much easier task than CIFAR-10. As such, comparing defenses across datasets is difficult.

5.1. Non-obfuscated Gradients

5.1.1. Adversarial Training

Defense Details. Originally proposed by Goodfellow et al. (2014b), adversarial training solves a min-max game through a conceptually simple process: train on adversarial examples until the model learns to classify them correctly. Given training data X and loss function ℓ(·), standard training chooses network weights θ as

θ* = argmin_θ E_{(x,y)∈X} ℓ(x; y; F_θ).
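The min-max game described above can be sketched on a toy problem: logistic regression on synthetic data, with a few steps of ℓ∞ PGD approximating the inner maximization and gradient descent on the perturbed batch performing the outer minimization. The data, model, and hyperparameters are illustrative assumptions, not the setup of Madry et al. (2018):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 20, 0.1
X = rng.normal(size=(n, d))
y = (X @ np.ones(d) > 0).astype(float)   # labels from a linear ground truth
theta = np.zeros(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perturb(theta, X, y, steps=5, alpha=0.04):
    # inner maximization: l_inf PGD ascent on the logistic loss
    Xa = X.copy()
    for _ in range(steps):
        g = (sigmoid(Xa @ theta) - y)[:, None] * theta[None, :]  # dL/dx
        Xa = np.clip(Xa + alpha * np.sign(g), X - eps, X + eps)
    return Xa

def robust_loss(theta):
    Xa = perturb(theta, X, y)
    p = sigmoid(Xa @ theta)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

loss_before = robust_loss(theta)         # log 2 at theta = 0
for _ in range(100):
    # outer minimization: gradient step on the adversarially perturbed batch
    Xa = perturb(theta, X, y)
    theta -= 0.2 * Xa.T @ (sigmoid(Xa @ theta) - y) / n
loss_after = robust_loss(theta)
assert loss_after < loss_before
```

The key design choice, as in the formulation above, is that the training gradient is taken at the perturbed points rather than the clean ones.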

Defense                    Dataset    Distance      Accuracy
Buckman et al. (2018)      CIFAR      0.031 (ℓ∞)    0%*
Ma et al. (2018)           CIFAR      0.031 (ℓ∞)    5%
Guo et al. (2018)          ImageNet   0.005 (ℓ2)    0%*
Dhillon et al. (2018)      CIFAR      0.031 (ℓ∞)    0%
Xie et al. (2018)          ImageNet   0.031 (ℓ∞)    0%*
Song et al. (2018)         CIFAR      0.031 (ℓ∞)    9%*
Samangouei et al. (2018)   MNIST      0.005 (ℓ2)    55%**
Madry et al. (2018)        CIFAR      0.031 (ℓ∞)    47%
Na et al. (2018)           CIFAR      0.015 (ℓ∞)    15%

Table 1. Summary of Results: Seven of nine defense techniques accepted at ICLR 2018 cause obfuscated gradients and are vulnerable to our attacks. Defenses denoted with * propose combining adversarial training; we report here the defense alone, see §5 for full numbers. The fundamental principle behind the defense denoted with ** has 0% accuracy; in practice, imperfections cause the theoretically optimal attack to fail, see §5.4.2 for details.

We study the adversarial training approach of Madry et al. (2018), which for a given ε-ball solves

θ* = argmin_θ E_{(x,y)∈X} [ max_{δ ∈ [−ε, ε]^N} ℓ(x + δ; y; F_θ) ].

To approximately solve this formulation, the authors solve the inner maximization problem by generating adversarial examples using projected gradient descent.

Discussion. We believe this approach does not cause obfuscated gradients: our experiments with optimization-based attacks do succeed with some probability (but do not invalidate the claims in the paper). Further, the authors' evaluation of this defense performs all of the tests for characteristic behaviors of obfuscated gradients that we list. However, we note that (1) adversarial retraining has been shown to be difficult at ImageNet scale (Kurakin et al., 2016b), and (2) training exclusively on ℓ∞ adversarial examples provides only limited robustness to adversarial examples under other distortion metrics (Sharma & Chen, 2017).

5.1.2. Cascade Adversarial Training

Cascade adversarial machine learning (Na et al., 2018) is closely related to the above defense. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, the authors train a first model, generate adversarial examples (with iterative methods) on that model, add these to the training set, and then train a second model on the augmented dataset with only single-step methods, for efficiency. Additionally, the authors construct a "unified embedding" and enforce that the clean and adversarial logits are close under some metric.

Discussion. Again, as above, we are unable to reduce the claims made by the authors. However, these claims are weaker than other defenses (because the authors correctly performed a strong optimization-based attack (Carlini & Wagner, 2017c)): 16% accuracy with ε = .015, compared to over 70% at the same perturbation budget with adversarial training as in Madry et al. (2018).

5.2. Gradient Shattering

5.2.1. Thermometer Encoding

Defense Details. In contrast to prior work (Szegedy et al., 2013) which viewed adversarial examples as "blind spots" in neural networks, Goodfellow et al. (2014b) argue that the reason adversarial examples exist is that neural networks behave in a largely linear manner. The purpose of thermometer encoding is to break this linearity.

Given an image x, for each pixel color x_{i,j,c}, the l-level thermometer encoding τ(x_{i,j,c}) is an l-dimensional vector where τ(x_{i,j,c})_k = 1 if x_{i,j,c} > k/l, and 0 otherwise, for k = 1, ..., l (e.g., for a 10-level thermometer encoding, τ(0.66) = 1111110000).

Due to the discrete nature of thermometer encoded values, it is not possible to directly perform gradient descent on a thermometer encoded neural network. The authors therefore construct Logit-Space Projected Gradient Ascent (LS-PGA) as an attack over the discrete thermometer encoded inputs. Using this attack, the authors perform the adversarial training of Madry et al. (2018) on thermometer encoded networks.

On CIFAR-10, just performing thermometer encoding was found to give 50% accuracy within ε = 0.031 under ℓ∞ distortion. By performing adversarial training with 7 steps of LS-PGA, robustness increased to 80%.

Discussion. While the intention behind this defense is to break the local linearity of neural networks, we find that this defense in fact causes gradient shattering. This can be observed through their black-box attack evaluation: adversarial examples generated on a standard adversarially trained model transfer to a thermometer encoded model, reducing the accuracy to 67%, well below the 80% robustness to the white-box iterative attack.

Evaluation. We use the BPDA approach from §4.1.2, where we let f(x) = τ(x). Observe that if we define

τ̂(x_{i,j,c})_k = min(max(x_{i,j,c} − k/l, 0), 1)

then

τ(x_{i,j,c})_k = ceil(τ̂(x_{i,j,c})_k)

(the entry is 1 exactly when x_{i,j,c} > k/l, so rounding the surrogate up recovers the hard encoding), so we can let g(x) = τ̂(x) and replace the backwards pass with the function g(·).
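The encoding and its differentiable surrogate can be sketched as follows (an illustrative NumPy fragment, not the defense's released code); note that rounding the surrogate up recovers the hard encoding exactly:

```python
import numpy as np

def thermometer(x, l=10):
    # l-level thermometer encoding: tau(x)_k = 1 if x > k/l, for k = 1..l
    ks = np.arange(1, l + 1) / l
    return (x[..., None] > ks).astype(float)

def thermometer_soft(x, l=10):
    # differentiable surrogate used on the backward pass:
    # tau_hat(x)_k = min(max(x - k/l, 0), 1)
    ks = np.arange(1, l + 1) / l
    return np.clip(x[..., None] - ks, 0.0, 1.0)

x = np.array([0.66, 0.05, 0.95])
tau = thermometer(x)
tau_hat = thermometer_soft(x)
# rounding the surrogate up gives back the discrete encoding
assert np.array_equal(np.ceil(tau_hat), tau)
print(''.join(str(int(b)) for b in tau[0]))  # 1111110000
```

On the forward pass the network sees the discrete encoding tau; a BPDA attack backpropagates through tau_hat instead, whose ramp segments carry nonzero gradient.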

[Figure 1: plot of model accuracy (y-axis, 0.0 to 1.0) versus perturbation magnitude (x-axis, 0.00 to 0.03) for four models: Baseline, Thermometer, Adv. Train, and Adv. Therm.]

Figure 1. Model accuracy versus distortion (under ℓ∞). Adversarial training increases robustness to 50% at ε = 0.031; thermometer encoding by itself provides limited value, and when coupled with adversarial training performs worse than adversarial training alone.

LS-PGA only reduces model accuracy to 50% on a thermometer-encoded model trained without adversarial training (bounded by ε = 0.031). In contrast, we achieve 1% model accuracy with the lower ε = 0.015 (and 0% with ε = 0.031). This shows no measurable improvement over standard models trained without thermometer encoding.

When we attack a thermometer-encoded adversarially trained model[4], we are able to reproduce the 80% accuracy at ε = 0.031 claim against LS-PGA. However, our attack reduces model accuracy to 30%. This is significantly weaker than the original Madry et al. (2018) model that does not use thermometer encoding. Because this model is trained against the (comparatively weak) LS-PGA attack, it is unable to adapt to the stronger attack we present above.

Figure 1 shows a comparison of thermometer encoding, with and without adversarial training, against the baseline classifier, over a range of perturbation magnitudes, demonstrating that thermometer encoding provides limited value.

5.2.2. Input Transformations

Defense Details. Guo et al. (2018) propose five input transformations to counter adversarial examples. As a baseline, the authors evaluate image cropping and rescaling, bit-depth reduction, and JPEG compression. Then the authors suggest two new transformations: (a) randomly drop pixels and restore them by performing total variance minimization; and (b) image quilting: reconstruct images by replacing small patches with patches from "clean" images, using minimum graph cuts in overlapping boundary regions to remove edge artifacts.

The authors explore different combinations of input transformations along with different underlying ImageNet classifiers, including adversarially trained models. They find that input transformations provide protection even with a vanilla classifier.

Discussion. The authors find that a ResNet-50 classifier provides a varying degree of accuracy for each of the five proposed input transformations under the strongest attack with a normalized ℓ2 dissimilarity of 0.01, with the strongest defenses achieving over 60% top-1 accuracy. We reproduce these results when evaluating an InceptionV3 classifier. The authors do not succeed in white-box attacks, crediting lack of access to test-time randomness as "particularly crucial in developing strong defenses" (Guo et al., 2018).[5]

Evaluation. It is possible to bypass each defense independently (and ensembles of defenses usually are not much stronger than the strongest sub-component (He et al., 2017)). We circumvent image cropping and rescaling with a direct application of EOT. To circumvent bit-depth reduction and JPEG compression, we use BPDA and approximate the backward pass with the identity function. To circumvent total variance minimization and image quilting, which are both non-differentiable and randomized, we apply EOT and use BPDA to approximate the gradient through the transformation. With our attack, we achieve 100% targeted attack success rate and accuracy drops to 0% for the strongest defense under the smallest perturbation budget considered in Guo et al. (2018), a root-mean-square perturbation of 0.05 (and a "normalized" ℓ2 perturbation as defined in Guo et al. (2018) of 0.01).

5.2.3. Local Intrinsic Dimensionality (LID)

LID is a general-purpose metric that measures the distance from an input to its neighbors. Ma et al. (2018) propose using LID to characterize properties of adversarial examples. The authors emphasize that this classifier is not intended as a defense against adversarial examples[6]; however, the authors argue that it is a robust method for detecting adversarial examples that is not easy to evade, supporting this by attempting their own adaptive attack and showing that it fails.

Analysis Overview. Instead of actively attacking the detection method, we find that LID is not able to detect high confidence adversarial examples (Carlini & Wagner, 2017a), even in the unrealistic threat model where the adversary is entirely oblivious to the defense and generates adversarial examples on the original classifier.

[4] That is, a thermometer encoded model that is trained using the approach of Madry et al. (2018).

[5] This defense may be stronger in a threat model where the adversary does not have complete information about the exact quilting process used (personal communication with authors).

[6] Personal communication with authors.
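The EOT-plus-BPDA combination used above against transformations that are both randomized and non-differentiable can be sketched on a toy example. Here the transformation and classifier are illustrative stand-ins (not total variance minimization or image quilting): a random mask-and-quantize transform plays the role of g, and the gradient of the differentiable part is averaged over sampled transformations while being evaluated at the transformed points:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
w = rng.normal(size=d)
x0 = rng.random(d)

def g_random(x):
    # stand-in for a randomized, non-differentiable input transformation:
    # drop coordinates at random, crudely "restore" them, then quantize
    keep = rng.random(d) < 0.8
    return np.round(np.where(keep, x, 0.5) * 16) / 16

def loss_and_grad(x):
    # differentiable classifier part: loss -sum(tanh(w*x)) and its gradient
    t = np.tanh(w * x)
    return -np.sum(t), -(1 - t ** 2) * w

def eot_bpda_grad(x, n=20):
    # EOT over the transformation's randomness combined with BPDA: average,
    # over sampled transforms, the gradient of f evaluated at g(x)
    return np.mean([loss_and_grad(g_random(x))[1] for _ in range(n)], axis=0)

eps, alpha, x = 0.2, 0.05, x0.copy()
for _ in range(10):
    x = np.clip(x + alpha * np.sign(eot_bpda_grad(x)), x0 - eps, x0 + eps)

secured_loss = lambda x: np.mean([loss_and_grad(g_random(x))[0] for _ in range(500)])
assert secured_loss(x) > secured_loss(x0)  # expected loss increased
```

A single sample of the randomness gives a noisy gradient; averaging over n draws per step is what makes the optimization stable.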

5.3. Stochastic Gradients

5.3.1. Stochastic Activation Pruning (SAP)

Defense Details. SAP (Dhillon et al., 2018) introduces randomness into the evaluation of a neural network to defend against adversarial examples. SAP randomly drops some neurons of each layer f^i to 0 with probability proportional to their absolute value. That is, SAP essentially applies dropout at each layer where instead of dropping with uniform probability, nodes are dropped with a weighted distribution. Values which are retained are scaled up (as is done in dropout) to retain accuracy. Applying SAP decreases clean classification accuracy slightly, with a higher drop probability decreasing accuracy but increasing robustness. We study various levels of drop probability and find they lead to similar robustness numbers.

Discussion. The authors only evaluate SAP by taking a single step in the gradient direction (Dhillon et al., 2018). While taking a single step in the direction of the gradient can be effective on non-randomized neural networks, when randomization is used, computing the gradient with respect to one sample of the randomness is ineffective.

Evaluation. To resolve this difficulty, we estimate the gradients by computing the expectation over instantiations of randomness. At each iteration of gradient descent, instead of taking a step in the direction of ∇x f(x), we move in the direction of Σ_{i=1}^{k} ∇x f(x), where each invocation is randomized with SAP. We have found that choosing k = 10 provides useful gradients. We additionally had to resolve a numerical instability when computing gradients: this defense caused computing a backward pass to cause exploding gradients due to division by numbers very close to 0.

With these approaches, we are able to reduce SAP model accuracy to 9% at ε = 0.015, and 0% at ε = 0.031. If we consider an attack successful only when an example is classified incorrectly 10 times out of 10 (and consider it correctly classified if it is ever classified as the correct label), model accuracy is below 10% with ε = 0.031.

5.3.2. Mitigating Through Randomization

Defense Details. Xie et al. (2018) propose to defend against adversarial examples by adding a randomization layer before the input to the classifier. For a classifier that takes a 299 × 299 input, the defense first randomly rescales the image to an r × r image, with r ∈ [299, 331), and then randomly zero-pads the image so that the result is 331 × 331. The output is then fed to the classifier.

Discussion. The authors consider three attack scenarios: vanilla attack (an attack on the original classifier), single-pattern attack (an attack assuming some fixed randomization pattern), and ensemble-pattern attack (an attack over a small ensemble of fixed randomization patterns). The authors' strongest attack reduces InceptionV3 model accuracy to 32.8% top-1 accuracy (over images that were originally classified correctly).

The authors dismiss a stronger attack over larger choices of randomness, stating that it would be "computationally impossible" (emphasis ours) and that such an attack "may not even converge" (Xie et al., 2018).

Evaluation. We find the authors' ensemble attack overfits to the ensemble with fixed randomization. We bypass this defense by applying EOT, optimizing over the (in this case, discrete) distribution of transformations.

Using this attack, even if we consider the attack successful only when an example is classified incorrectly 10 times out of 10, we achieve 100% targeted attack success rate and reduce the accuracy of the classifier from 32.8% to 0.0% with a maximum ℓ∞ perturbation of ε = 0.031.

5.4. Vanishing & Exploding Gradients

5.4.1. PixelDefend

Defense Details. Song et al. (2018) propose using a PixelCNN generative model to project a potential adversarial example back onto the data manifold before feeding it into a classifier. The authors argue that adversarial examples mainly lie in the low-probability region of the data distribution. PixelDefend "purifies" adversarially perturbed images prior to classification by using a greedy decoding procedure to approximate finding the highest probability example within an ε-ball of the input image.

Discussion. The authors evaluate PixelDefend on CIFAR-10 over various classifiers and perturbation budgets. With a maximum ℓ∞ perturbation of ε = 0.031, PixelDefend claims 46% accuracy (with a vanilla ResNet classifier). The authors dismiss the possibility of end-to-end attacks on PixelDefend due to the difficulty of differentiating through an unrolled version of PixelDefend, due to vanishing gradients and computation cost.

Evaluation. We sidestep the problem of computing gradients through an unrolled version of PixelDefend by approximating gradients with BPDA, and we successfully mount an end-to-end attack using this technique [7].

[7] In place of a PixelCNN, due to the availability of a pre-trained model, we use a PixelCNN++ (Salimans et al., 2017) and discretize the mixture of logistics to produce a 256-way softmax.
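BPDA, the technique used here, can be illustrated with a toy stand-in: the `purify` function below (simple quantization) replaces the PixelCNN purifier, and the classifier is a made-up linear score. Everything in this sketch is illustrative; only the structure matters: run the exact (non-differentiable) defense on the forward pass, and pretend it is the identity on the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
w_clf = rng.normal(size=16)  # toy linear classifier score f(v) = w_clf . v

def purify(x):
    """Stand-in for a purifier like PixelDefend: a non-differentiable
    mapping (here, quantization to a 16-level grid)."""
    return np.round(x * 15.0) / 15.0

def bpda_attack(x, eps=0.2, steps=40, lr=0.01):
    """BPDA: compute the forward pass through purify(.), and on the
    backward pass approximate purify by the identity, so the attack
    gradient is the classifier's gradient evaluated at the purified point."""
    x_adv = x.copy()
    for _ in range(steps):
        v = purify(x_adv)  # exact forward pass through the defense
        # For a non-linear classifier, g would be recomputed at v each
        # iteration; here the score is linear so its gradient is constant.
        g = w_clf
        x_adv = np.clip(x_adv - lr * np.sign(g), x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

Even though `purify` has zero gradient almost everywhere, the identity approximation on the backward pass lets gradient descent make steady progress on the true (defended) loss.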

With this attack, we can reduce the accuracy of a naturally trained classifier which achieves 95% accuracy to 9% with a maximum ℓ∞ perturbation of ε = 0.031. We find that combining adversarial training (Madry et al., 2018) with PixelDefend provides no additional robustness over just using the adversarially trained classifier.

5.4.2. Defense-GAN

Defense-GAN (Samangouei et al., 2018) uses a Generative Adversarial Network (Goodfellow et al., 2014a) to project samples onto the manifold of the generator before classifying them. That is, the intuition behind this defense is nearly identical to PixelDefend, but using a GAN instead of a PixelCNN. We therefore summarize results here and present the full details in Appendix B.

Analysis Overview. Defense-GAN is not argued secure on CIFAR-10, so we use MNIST. We find that adversarial examples exist on the manifold defined by the generator. That is, we show that we are able to construct an adversarial example x′ = G(z′) so that x′ ≈ x but c(x) ≠ c(x′). As such, a perfect projector would not modify this example x′, because it exists on the manifold described by the generator.

However, while this attack would defeat a perfect projector mapping x to its nearest point on G(z), the imperfect gradient descent based approach taken by Defense-GAN does not perfectly preserve points on the manifold. We therefore construct a second attack using BPDA to evade Defense-GAN, although at only a 45% success rate.

6. Discussion

Having demonstrated attacks on these seven defenses, we now take a step back and discuss the method of evaluating a defense against adversarial examples.

The papers we study use a variety of approaches in evaluating robustness of the proposed defenses. We list what we believe to be the most important points to keep in mind while building and evaluating defenses. Much of what we describe below has been discussed in prior work (Carlini & Wagner, 2017a; Madry et al., 2018); we repeat these points here and offer our own perspective for completeness.

6.1. Define a (realistic) threat model

A threat model specifies the conditions under which a defense argues security: a precise threat model allows for an exact understanding of the setting under which the defense is meant to work. Prior work has used words including white-box, grey-box, black-box, and no-box to describe slightly different threat models, often overloading the same word.

Instead of attempting to, yet again, redefine the vocabulary, we enumerate the various aspects of a defense that might be revealed to the adversary or held secret to the defender: model architecture and model weights; training algorithm and training data; test time randomness (either the values chosen or the distribution); and, if the model weights are held secret, whether query access is allowed (and if so, the type of output, e.g. logits or only the top label).

While there are some aspects of a defense that might be held secret, threat models should not contain unrealistic constraints. We believe any compelling threat model should at the very least grant knowledge of the model architecture and training algorithm, and allow query access.

It is not meaningful to restrict the computational power of an adversary artificially (e.g., to fewer than several thousand attack iterations). If two defenses are equally robust but generating adversarial examples on one takes one second and on another takes ten seconds, the robustness has not increased.

6.2. Make specific, testable claims

Specific, testable claims in a clear threat model precisely convey the claimed robustness of a defense. For example, a complete claim might be: "We achieve 90% accuracy when bounded by ℓ∞ distortion with ε = 0.031, when the attacker has full white-box access."

In this paper, we study all papers under the threat model the authors define. However, if a paper is evaluated under a different threat model, explicitly stating so makes it clear that the original paper's claims are not being violated.

A defense being specified completely, with all hyperparameters given, is a prerequisite for claims to be testable. Releasing source code and a pre-trained model along with the paper describing a specific threat model and robustness claims is perhaps the most useful method of making testable claims. At the time of writing this paper, four of the defenses we study made complete source code available (Madry et al., 2018; Ma et al., 2018; Guo et al., 2018; Xie et al., 2018).

6.3. Evaluate against adaptive attacks

A strong defense is robust not only against existing attacks, but also against future attacks within the specified threat model. A necessary component of any defense proposal is therefore an attempt at an adaptive attack.

An adaptive attack is one that is constructed after a defense has been completely specified, where the adversary takes advantage of knowledge of the defense and is only restricted by the threat model. One useful attack approach is to perform many attacks and report the mean over the best attack per image. That is, for a set of attacks a ∈ A, instead of reporting the value min_{a∈A} mean_{x∈X} f(a(x)), report mean_{x∈X} min_{a∈A} f(a(x)).

If a defense is modified after an evaluation, an adaptive attack is one that considers knowledge of the new defense.
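The suggested reporting can be sketched as a small helper. The attack-success matrix below is hypothetical input; the point is the difference between the two aggregation orders.

```python
import numpy as np

def aggregate_attacks(success):
    """success[a, i] is True if attack a fools the model on image i.
    Returns (accuracy against the best single attack,      # min_a mean_x
             accuracy against the per-image best attack).  # mean_x min_a
    """
    success = np.asarray(success, dtype=bool)
    best_single_attack = (1.0 - success.mean(axis=1)).min()
    per_image_best = 1.0 - success.any(axis=0).mean()
    return best_single_attack, per_image_best

# Two attacks that each fool half the images -- but different halves:
single, combined = aggregate_attacks([[1, 0, 1, 0],
                                      [0, 1, 0, 1]])
```

The second number is never larger than the first; in this example the strongest single attack leaves 50% accuracy, while the per-image best attack leaves 0%, which is the correct robustness estimate against an adversary free to pick any attack per image.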

In this way, concluding an evaluation with a final adaptive attack can be seen as analogous to evaluating a model on the test data.

7. Conclusion

Constructing defenses to adversarial examples requires defending against not only existing attacks but also future attacks that may be developed. In this paper, we identify obfuscated gradients, a phenomenon exhibited by certain defenses that makes standard gradient-based methods fail to generate adversarial examples. We develop three attack techniques to bypass three different types of obfuscated gradients. To evaluate the applicability of our techniques, we use the ICLR 2018 defenses as a case study, circumventing seven of nine accepted defenses.

More generally, we hope that future work will be able to avoid relying on obfuscated gradients (and other methods that only prevent gradient descent-based attacks) for perceived robustness, and use our evaluation approach to detect when this occurs. Defending against adversarial examples is an important area of research and we believe performing a careful, thorough evaluation is a critical step that cannot be overlooked when designing defenses.

Acknowledgements

We are grateful to Aleksander Madry, Andrew Ilyas, and Aditi Raghunathan for helpful comments on an early draft of this paper. We thank Bo Li, Xingjun Ma, Laurens van der Maaten, Aurko Roy, Yang Song, and Cihang Xie for useful discussion and insights on their defenses. This work was partially supported by the National Science Foundation through award CNS-1514457, Qualcomm, and the Hewlett Foundation through the Center for Long-Term Cybersecurity.

References

Amsaleg, L., Chelly, O., Furon, T., Girard, S., Houle, M. E., Kawarabayashi, K.-i., and Nett, M. Estimating local intrinsic dimensionality. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 29–38. ACM, 2015.

Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.

Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Springer, 2013.

Buckman, J., Roy, A., Raffel, C., and Goodfellow, I. Thermometer encoding: One hot way to resist adversarial examples. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S18Su--CW. Accepted as poster.

Carlini, N. and Wagner, D. Adversarial examples are not easily detected: Bypassing ten detection methods. AISec, 2017a.

Carlini, N. and Wagner, D. MagNet and "efficient defenses against adversarial attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478, 2017b.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security & Privacy, 2017c.

Dhillon, G. S., Azizzadenesheli, K., Bernstein, J. D., Kossaifi, J., Khanna, A., Lipton, Z. C., and Anandkumar, A. Stochastic activation pruning for robust adversarial defense. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1uR4GZRZ. Accepted as poster.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014a.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

Guo, C., Rana, M., Cisse, M., and van der Maaten, L. Countering adversarial images using input transformations. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyJ7ClWCb. Accepted as poster.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

He, W., Wei, J., Chen, X., Carlini, N., and Song, D. Adversarial example defenses: Ensembles of weak defenses are not strong. arXiv preprint arXiv:1706.04701, 2017.

Ilyas, A., Jalal, A., Asteri, E., Daskalakis, C., and Dimakis, A. G. The robust manifold defense: Adversarial training using generative models. arXiv preprint arXiv:1712.09196, 2017.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016a.

Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016b.

Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S., Schoenebeck, G., Houle, M. E., Song, D., and Bailey, J. Characterizing adversarial subspaces using local intrinsic dimensionality. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1gJ1L2aW. Accepted as oral presentation.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb. Accepted as poster.

Na, T., Ko, J. H., and Mukhopadhyay, S. Cascade adversarial machine learning regularized with a unified embedding. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyRVBzap-.

Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS '17, pp. 506–519, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4944-4. doi: 10.1145/3052973.3053009. URL http://doi.acm.org/10.1145/3052973.3053009.

Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Bys4ob-Rb.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: A PixelCNN implementation with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.

Samangouei, P., Kabkab, M., and Chellappa, R. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BkJ3ibb0-. Accepted as poster.

Sharma, Y. and Chen, P.-Y. Attacking the Madry defense model with L1-based adversarial examples. arXiv preprint arXiv:1710.10733, 2017.

Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk6kPgZA-.

Song, Y., Kim, T., Nowozin, S., Ermon, S., and Kushman, N. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJUYGxbCW. Accepted as poster.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. ICLR, 2013.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkZvSe-RZ. Accepted as poster.

Xie, C., Wang, J., Zhang, Z., Ren, Z., and Yuille, A. Mitigating adversarial effects through randomization. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Sk9yuql0Z. Accepted as poster.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

A. Local Intrinsic Dimensionality

Defense Details. The Local Intrinsic Dimensionality (Amsaleg et al., 2015) "assesses the space-filling capability of the region surrounding a reference example, based on the distance distribution of the example to its neighbors" (Ma et al., 2018). The authors present evidence that the LID is significantly larger for adversarial examples generated by existing attacks than for normal images, and they construct a classifier that can distinguish these adversarial images from normal images. Again, the authors indicate that LID is not intended as a defense and should only be used to explore properties of adversarial examples. However, it would be natural to wonder whether it would be effective as a defense, so we study its robustness; our results confirm that it is not adequate as a defense. The method used to compute the LID relies on finding the k nearest neighbors, a non-differentiable operation, rendering gradient descent based methods ineffective.

Let S be a mini-batch of N clean examples, and let r_i(x) denote the distance (under a metric d(x, y)) between sample x and its i-th nearest neighbor in S. Then LID can be approximated by

    LID_d(x) = −( (1/k) Σ_{i=1}^{k} log( r_i(x) / r_k(x) ) )^{−1}

where k is a defense hyperparameter that controls the number of nearest neighbors to consider. The authors use the distance function

    d_j(x, y) = ‖f^{1..j}(x) − f^{1..j}(y)‖₂

to measure the distance between the j-th activation layers. The authors compute a vector of LID values for each sample:

    LID(x) = { LID_{d_j}(x) }_{j=1}^{n}.

Finally, they compute LID(x) over the training data and over adversarial examples generated on the training data, and train a logistic regression classifier to detect adversarial examples. We are grateful to the authors for releasing their complete source code.

Discussion. While LID is not a defense itself, the authors assess the ability of LID to detect different types of attacks. Through solving the formulation

    min ‖x − x′‖²₂ + α ( ℓ(x′) + LID-loss(x′) )

the authors attempt to determine if the LID metric is a good metric for detecting adversarial examples. Here, LID-loss(·) is a function that can be minimized to reduce the LID score. However, the authors report that this modified attack still achieves 0% success. Because Carlini and Wagner's ℓ2 attack is unbounded, any time the attack does not reach 100% success indicates that the attack became stuck in a local minimum. When this happens, it is often possible to slightly modify the loss function and return to 100% attack success (Carlini & Wagner, 2017b).

In this case, we observe that the reason performing this type of adaptive attack fails is that gradient descent does not succeed in optimizing the LID loss, even though the LID computation is differentiable. Computing the LID term involves computing the k-nearest neighbors when computing r_i(x). Minimizing the gradient of the distance to the current k-nearest neighbors is not representative of the true direction to travel in for the optimal set of k-nearest neighbors. As a consequence, we find that adversarial examples generated with gradient methods when penalizing for a high LID either (a) are not adversarial; or (b) are detected as adversarial, despite penalizing for the LID loss.

Evaluation. We now evaluate what would happen if a defense would directly apply LID to detect adversarial examples. Instead of performing gradient descent over a term that is difficult to differentiate through, we have found that generating high confidence adversarial examples (Carlini & Wagner, 2017a) (completely oblivious to the detector) is sufficient to fool this detector. We obtain from the authors their detector trained on the Carlini and Wagner ℓ2 attack and train our own on the Fast Gradient Sign attack, both of which were found to be effective at detecting adversarial examples generated by other methods. By generating high-confidence adversarial examples minimizing ℓ∞ distortion, we are able to reduce model accuracy to 2% within ε = 0.015. LID reports these adversarial examples are benign at a 97% rate (unmodified test data is flagged as benign at a 98% rate).

This evaluation demonstrates that the LID metric can be circumvented, and future work should carefully evaluate if building a detector relying on LID is robust to adversarial examples explicitly targeting such a detector. This work also raises questions whether a large LID is a fundamental characteristic of all adversarial examples, or whether it is a by-product of certain attacks.

B. Defense-GAN

Defense Details. The defender first trains a Generative Adversarial Network with a generator G(z) that maps samples from a latent space (typically z ∼ N(0, 1)) to images that look like training data. Defense-GAN takes a trained classifier f(·), and to classify an input x, instead of returning f(x), returns f(arg min_z ‖G(z) − x‖). To perform this projection to the manifold, the authors take many steps of gradient descent starting from different random initializations.
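This projection step can be sketched as follows, with a toy linear generator standing in for the trained GAN; the latent size, learning rate, step count, and number of restarts are all illustrative assumptions, not the defense's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8))  # toy linear "generator": G(z) = W z

def G(z):
    return W @ z

def project(x, steps=500, lr=0.002, restarts=4):
    """Approximate argmin_z ||G(z) - x||^2 by gradient descent from
    several random initializations, keeping the best (as Defense-GAN does)."""
    best_z, best_err = None, np.inf
    for _ in range(restarts):
        z = rng.normal(size=8)
        for _ in range(steps):
            z -= lr * 2.0 * W.T @ (G(z) - x)  # gradient of ||Wz - x||^2
        err = np.linalg.norm(G(z) - x)
        if err < best_err:
            best_z, best_err = z, err
    return best_z

# An input near the generator's range projects back onto it:
x = G(rng.normal(size=8)) + 0.01 * rng.normal(size=64)
z_hat = project(x)
```

With a real (non-linear) generator the inner problem is non-convex, which is exactly why the random restarts are needed and why the projection is only approximate.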

Defense-GAN was not shown to be effective on CIFAR-10. We therefore evaluate it on MNIST (where it was argued to be secure).

Discussion. In Samangouei et al. (2018), the authors construct a white-box attack by unrolling the gradient descent used during classification. Despite an unbounded ℓ2 perturbation size, Carlini and Wagner's attack only reaches 30% misclassification rate on the most vulnerable model and under 5% on the strongest. This leads us to believe that unrolling gradient descent breaks gradients.

Concurrent to our work, Ilyas et al. (2017) also develop a nearly identical approach to Defense-GAN; they also find it is vulnerable to the attack we outline above, but increase the robustness further with adversarial training. We do not evaluate this extended approach.

Evaluation. We find that adversarial examples do exist on the data manifold as described by the generator G(·). However, Defense-GAN does not perform a perfect projection onto the manifold of the generator, and therefore often does not identify these adversarial examples actually on the manifold. We therefore present two evaluations. In the first, we assume that Defense-GAN were able to perfectly project to the data manifold, and give a construction for generating adversarial examples. In the second, we take the actual implementation of Defense-GAN as it is, and perform BPDA to generate adversarial examples with 50% success under reasonable ℓ2 bounds.

Evaluation A. Performing the manifold projection is non-trivial as an inner optimization step when generating adversarial examples. To sidestep this difficulty, we show that adversarial examples exist directly on the projection of the generator. That is, we construct an adversarial example x′ = G(z∗) so that ‖x − x′‖ is small and c(x) ≠ c(x′). To do this, we solve the re-parameterized formulation

    min_z ‖G(z) − x‖²₂ + c · ℓ(G(z)).

We initialize z = arg min_z ‖G(z) − x‖ (also found via gradient descent).

We train a WGAN using the code the authors provide (Gulrajani et al., 2017), and an MNIST CNN to 99.3% accuracy. We run 50k iterations of gradient descent to generate each adversarial example; this takes under one minute per instance. The unsecured classifier requires a mean ℓ2 distortion of 0.0019 (per-pixel normalized; 1.45 un-normalized) to fool. When we mount our attack, we require a mean distortion of 0.0027, an increase in distortion of 1.46×; see Figure 2 for examples of adversarial examples. The reason our attacks succeed with 100% success without suffering from vanishing or exploding gradients is that our gradient computation only needs to differentiate through the generator G(·) once.

Figure 2. Images on the MNIST test set. Row 1: Clean images. Row 2: Adversarial examples on an unsecured classifier. Row 3: Adversarial examples on Defense-GAN.

Evaluation B. The above attack does not succeed on Defense-GAN: while the adversarial examples lie directly on the projection of the generator, the (imperfect) projection process will actually move them off the projection. To mount an attack on the approximate projection process, we use the BPDA attack regularized for ℓ2 distortion. Our attack approach is identical to that used against PixelDefend, except that the manifold projection with a PixelCNN is replaced by Defense-GAN's manifold projection via gradient descent on the GAN. Under these settings, we succeed at reducing model accuracy to 55% with a maximum normalized distortion of 0.0051 for successful attacks.
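The re-parameterized attack of Evaluation A can be sketched in the same spirit with toy stand-ins: a linear generator and a linear classifier score. All sizes and constants below are illustrative, and the paper initializes z at the projection of x rather than at zero; the key property is that the gradient passes through G only once per step.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8)) / np.sqrt(64)  # toy linear generator G(z) = W z
w_clf = rng.normal(size=64)                 # toy linear score; "misclassify" = drive it down

def G(z):
    return W @ z

def attack_on_manifold(x, c=0.1, steps=400, lr=0.05):
    """Solve min_z ||G(z) - x||^2 + c * score(G(z)) by gradient descent
    over z. The result x' = G(z) lies exactly on the generator's manifold,
    so a perfect projector would leave it unchanged."""
    z = np.zeros(8)
    for _ in range(steps):
        # Chain rule through G once: d/dz = W^T (2 (G(z) - x) + c * grad_score)
        grad_z = W.T @ (2.0 * (G(z) - x) + c * w_clf)
        z -= lr * grad_z
    return G(z)

x = G(rng.normal(size=8))  # start from a point already on the manifold
x_adv = attack_on_manifold(x)
```

Because the optimization variable is z rather than x, the returned example is on the manifold by construction, and no gradients ever flow through an unrolled projection.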