# Capacity Control of ReLU Neural Networks by Basis-path Norm

## 展开查看详情

1. Capacity Control of ReLU Neural Networks by Basis-path Norm Shuxin Zheng1,∗ , Qi Meng2 , Huishuai Zhang2 , Wei Chen2 , Nenghai Yu1 , and Tie-Yan Liu2 1 University of Science and Technology of China 2 Microsoft Research Asia zhengsx@mail.ustc.edu.cn, {meq, huzhang, wche, Tie-Yan.Liu}@microsoft.com, ynh@ustc.edu.cn arXiv:1809.07122v1 [cs.LG] 19 Sep 2018 Abstract works, researchers commonly used different norms of net- work parameters to measure the capacity (Bartlett, Foster, Recently, path norm was proposed as a new capacity measure and Telgarsky 2017; Neyshabur, Tomioka, and Srebro 2014; for neural networks with Rectified Linear Unit (ReLU) acti- vation function, which takes the rescaling-invariant property 2016). of ReLU into account. It has been shown that the general- Among different types of deep neural networks, ReLU ization error bound in terms of the path norm explains the networks (i.e., neural networks with ReLU activations (Glo- empirical generalization behaviors of the ReLU neural net- rot, Bordes, and Bengio 2011)) have demonstrated their out- works better than that of other capacity measures. Moreover, standing performances in many fields such as image classi- optimization algorithms which take path norm as the regular- fication (He et al. 2016; Huang et al. 2017), information sys- ization term to the loss function, like Path-SGD, have been tem (Cheng et al. 2016; He et al. 2017), and text understand- shown to achieve better generalization performance. How- ing (Vaswani et al. 2017) etc. It is well known that ReLU ever, the path norm counts the values of all paths, and hence the capacity measure based on path norm could be improperly neural networks are positively scale invariant (Neyshabur, influenced by the dependency among different paths. It is also Salakhutdinov, and Srebro 2015; Neyshabur et al. 2016). known that each path of a ReLU network can be represented That is, for a hidden node with ReLU activation, if all of by a small group of linearly independent basis paths with its incoming weights are multiplied by a positive constant c multiplication and division operation, which indicates that the and its outgoing weights are divided by the same constant, generalization behavior of the network only depends on only the neural network with the new weights will generate ex- a few basis paths. Motivated by this, we propose a new norm actly the same output as the old one for any arbitrary input. Basis-path Norm based on a group of linearly independent (Neyshabur, Salakhutdinov, and Srebro 2015) considered the paths to measure the capacity of neural networks more ac- product of weights along all paths from the input to out- curately. We establish a generalization error bound based on put units as path norm which is invariant to the rescaling of this basis path norm, and show it explains the generalization behaviors of ReLU networks more accurately than previous weights, and proposed Path-SGD which takes path norm as capacity measures via extensive experiments. In addition, we the regularization term to the loss function. develop optimization algorithms which minimize the empiri- In fact, each path in a ReLU network can be represented cal risk regularized by the basis-path norm. Our experiments by a small group of generalized linearly independent paths on benchmark datasets demonstrate that the proposed regu- (we call them basis-path in the sequels) with multiplication larization method achieves clearly better performance on the and division operation as shown in Figure 1. Thus, there is test set than the previous regularization approaches. dependency among different paths. The smaller the percent- age of basis paths, the higher the dependency. As the net- Introduction work is determined only by the basis paths, the generaliza- tion property of the network should depend only on the basis Deep neural networks have pushed the frontiers of a wide va- paths, as well as the relevant regularization methods. In ad- riety of AI tasks in recent years such as speech recognition dition, Path-SGD controls the capacity by solving argmin (Xiong et al. 2016; Chan et al. 2016), computer vision (Ioffe of the regularized loss function, the solution of the argmin and Szegedy 2015; Ren et al. 2015) and neural language pro- problem is approximate because dependency among differ- cessing (Bahdanau, Cho, and Bengio 2014; Gehring et al. ent values of all paths is not considered in the network. This 2017), etc. More surprisingly, deep neural networks gener- motivates us to establish a capacity bound based on only alize well, even when the number of parameters is signifi- the basis paths instead of all the paths. This is in contrast cantly larger than the amount of training data (Zhang et al. to the generalization bound based on the path norm which 2017). To explain the generalization ability of neural net- counts the values of all the paths and does not consider the ∗ This work was done when the author was visiting Microsoft dependency among different paths. To tackle these prob- Research Asia. lems, we define a new norm based on the values of the basis Copyright c 2019, Association for the Advancement of Artificial paths called Basis-path Norm. In previous work, (Meng et Intelligence (www.aaai.org). All rights reserved. al. 2018) constructed the basis paths by skeleton method and

2.proved that the values of all other paths can be calculated us- ing the values of basis paths by multiplication and division operations. In this work, we take one step further and cate- gorize the basis paths into positive and negative basis paths according to the sign of their coefficients in the calculations of non-basis paths. Figure 1: A toy neural network example. The network has 6 In order to control generalization error, we need to keep paths pi,j , where i ∈ {1, 2, 3} and j ∈ {1, 2}, the values of the hypothesis space being small. As we know, loss func- paths vpi,j = w1,i w2,j , We can see the dependency among tion can be computed by paths, hence we keep the values vp ·vp2,1 vp ·vp1,2 of all paths being small. To keep small values of non-basis the paths, i.e., vp2,2 = 1,2 vp1,1 and vp3,2 = 3,1 vp1,1 . paths represented by positive and negative basis paths, we In this group of basis paths, p1,1 is the negative Basis-path, control the positive basis paths not being too large while p1,2 ,p2,1 and p3,1 are the positive basis paths. the negative basis paths not being too small. In addition, to keep small values of basis paths, we control the nega- and Srebro 2016). For example, in (Bartlett, Foster, and Tel- tive basis paths not being too large as well. With this con- garsky 2017), the authors proposed a margin-based gener- sideration, we define the new Basis-path norm. We prove a alization bound for networks that scale with their margin- generalization error bound for ReLU networks in terms of normalized spectral complexity. An analysis of generaliza- the basis-path norm. We then study the relationship between tion bounds based on PAC-Bayes was proposed in (Dziu- this basis-path norm bound and the empirical generalization gaite and Roy 2017). gap – the absolute difference between test error and training error. The experiments included ReLU networks with differ- Among these measures, the generalization bound based ent depths, widths, and levels of randomness to the label. For on path norm is tighter theoretically (Neyshabur, Tomioka, comparison purpose, we also compute the generalization er- and Srebro 2016). Empirically, path norm has been showed ror bounds induced by other capacity measures for neural to be more accurate to describe the tendency of generaliza- networks proposed in the literature. Our experiments show tion error (Neyshabur et al. 2017). Thus, we are interested that the generalization bound based on basis-path norm is in the capacity measure which is related to the path norm. much more consistent with the empirical generalization gap In (Neyshabur, Tomioka, and Srebro 2016), the authors first than those based on other norms. In particular, when the net- proposed group norm and path norm. The results show that work size is small, the ordinary path norm bound fit empir- the path norm is equivalent to a kind of group norm. In ical generalization gap well. However, when the width and (Neyshabur, Salakhutdinov, and Srebro 2015; Neyshabur et depth increases, the percentage of non-basis paths increases, al. 2016), the authors proposed to use path norm as a reg- and the dependency among paths increases and we observe ularization term for ReLU multi-layers perceptron (MLP) that the path norm bound degenerates in reflecting the em- network and recurrent network and designed Path-SGD al- pirical generalization gap. In contrast, our basis-path norm gorithm. In (Neyshabur et al. 2017), the authors empiri- bound fits the empirical generalization gap consistently as cally compared different kinds of capacity measures includ- the network size changes. This validates the efficacy of BP ing path norm for deep neural network generalization. How- norm as a capacity measure. ever, none of those norms considered the dependency among Finally, we propose a novel regularization method, called paths in the networks. Basis-path regularization (BP regularization), in which we penalize the loss function by the BP norm. Empirically, Preliminaries we first conduct experiments on recommendation system In this section, we introduce ReLU neural networks and gen- of MovieLens-1M dataset to compare the multi-layer per- eralization error. ceptron (MLP) model’s generalization with BP regulariza- First of all, we briefly introduce the structure of recti- tion and baseline norm-based regularization, then we verify fier neural network models. Suppose fw : X → Y is a the effectiveness of BP regularization on image classifica- L-layer neural network with weight w ∈ W , where input tion task with ResNet and PlainNet on CIFAR-10 dataset. space X ⊂ Rd and output space Y = RK . In the l-th layer The results of all experiments show that, with our method, (l = 0, . . . , L), there are hl nodes. We denote the nodes and optimization algorithms (i.e., SGD, Adam, Quotient SGD) their values as {Ol , ol }. It is clear that, h0 = d, hL = K. can attain better test accuracy than with other regularization The layer mapping is given as, ol = σ(wlT ol−1 ), where wl methods. is the adjacency matrix in the l-layer, and the rectifier acti- vation function σ(·) = max(·, 0) is applied element-wisely. Related Work We can also calculate the k-th output by paths, i.e., Generalization of deep neural networks has attracted a great L L−1 deal of attention in the community (Zhang et al. 2017; Nwk (x) = wl (il−1 , il ) · I(olil (w, x) > 0) · xp0 Neyshabur et al. 2017; Kawaguchi, Kaelbling, and Bengio (i0 ,··· ,iL =k) l=1 l=1 2017). Norm and margin-based measures have been widely (1) studied, and commonly used in neural network optimiza- tion with capacity control (Bartlett and Mendelson 2002; where (i0 , · · · , iL ) is the path starting from input fea- Evgeniou, Pontil, and Poggio 2000; Neyshabur, Tomioka, ture node Oi00 to output node OiLL via hidden nodes

3.Oi11 , . . . , OiL−1 L−1 , and wl (il−1 , il ) is the weight of the edge con- pm+1 , · · · , pr always have positive exponent. We call the necting nodes Oil−1 and Oill . 1 basis path with negative exponent αi Negative Basis Path l−1 L and denote the negative basis path vector as p− = We denote p(i0 ,··· ,iL ) = l=1 wl (il−1 , il ) and (p1 , · · · , pm ). We call the basis path with positive ex- L−1 l a(i0 ,··· ,iL ) = l=1 I(o il (w, x) > 0). The output can ponent αj Positive Basis Path and denote it as p+ = be represented using paths as (pm+1 , · · · , pr ). k In order to control generalization error, we need to keep Np,a (x) = p(i0 ,··· ,iL ) · a(i0 ,··· ,iL ) · xi0 . the hypothesis space being small. Thus we want all the paths (i0 ,··· ,iL ) to have small values. For non-basis path represented by pi For ease of reference, we omit the explicit index (i0 , · · · , iL ) and pj , we control pi not being too small because αi is neg- and use i be the index of path. We use p = (p1 , p2 , · · · , pM ) ative, and pj not being too large because αj is positive. We L control pi not being too large as well to keep small values of where M = l=0 hl to denote the path vector. The path basis paths. We define the following basis-path norm φ(·) as norm used in Path-SGD (Neyshabur, Salakhutdinov, and follows. 1/2 M Srebro 2015) is defined as Ω(p) = i=1 p2i . Definition 1 The basis norm on the ReLU networks is Given the training set {(x1 , y1 ), · · · , (xn , yn )} i.i.d sam- φ(p) = sup {| log |p1 ||, · · · , | log |pm ||, |pm+1 |, · · · , |pr |} . pled from the underlying distribution P, machine learning (2) algorithms learn a model f from the hypothesis space F by minimizing the empirical loss function l(f (x), y). The uni- We next provide the property of φ(p). form generalization error of empirical risk minimization in hypothesis space F is defined as Theorem 1 φ(p) is a norm in the vector space where p− is a vector in Euclidean space and p+ is a vector in a general- n 1 ized linear space under the generalized addition and gener- gen (F) = sup | l(f (xi ), yi ) − E(x,y)∼P l(f (x), y)|. alized scalar multiplication operations: p− ⊕ (p )− = [p1 · f ∈F n i=1 p1 , · · · , pm ·pm ] and c p− = [sgn(p1 )·|p1 |a , · · · , sgn(pm )· Generalization error gen measures how well a model f |pm |a ] for p− , (p )− ∈ Rm and c ∈ R. learned from the training data S can fit an unknown test sam- Proof: The definition of φ(p) is equivalent to ple (x, y) ∼ P. Empirically, we consider the empirical generalization er- φ(p) = sup φ∞ (p+ ), φ∞ (p− ) . (3) ror which is defined as the difference of empirical loss be- tween the training set and test set at the trained model f . where φ∞ (p− ) = supi {| log |pi ||, i = 1, · · · , m} and φ∞ (p+ ) = supj {|pj |, j = m + 1, · · · , r}. Obviously, Basis-path Norm φ∞ (p+ ) is the ∞ norm in Euclidean space. Thus, it only In this section, we define the Basis-path Norm (abbreviated needs to prove φ∞ (p− ) is a kind of norm. Next, we prove as BP norm) on the networks. Using the BP norm, we de- that φ∞ (p− ) is a norm in the generalized linear space. fine a capacity measure which is called BP-measure and we In the generalized linear space, the zero vector is I, where prove that the generalization error can be upper bounded us- I denotes a vector with all elements being equal to 1. Based ing this measure. on the generalized linear operators, we verify the properties including positive definite, absolutely homogeneous and the The Definition of Basis-path Norm triangle inequality of p− ∞ as follows: (1) (Positive definite) φ∞ (p− ) ≥ 0 and φ∞ (p− ) = 0 First, as shown in (Meng et al. 2018), the authors constructed when p− = I. a group of basis paths by skeleton method. It means that the (2) (Absolutely homogeneous) For arbitrary c ∈ R, we value of non-basis paths can be calculated using the values have of basis paths. In the calculation of non-basis paths’ values, some basis paths always have positive exponents and hence φ∞ (c · p− ) = sup | log |p− c − i | |, i = 1, · · · , m = |c| · φ∞ (p ). appear in the numerator, others have negative exponents and i hence appear in the denominator. We use p˜ to denote a non- (3) (Triangle inequality) basis path and p1 , · · · , pr to denote basis paths. We have the following proposition. φ∞ (p− ⊕ (p )− ) = sup {| log |pi pi ||, i = 1, · · · , m} i Proposition 1 For any non-basis path p˜, p˜ = ≤ sup {| log |pi ||, i = 1, · · · , m} m αi r αj i=1 pi j=m+1 pj , where αi ≤ 0, αj ≥ 0. i Limited by the space, we put the detailed proof in the sup- + sup {| log |pi ||, i = 1, · · · , m} i plementary materials. The proposition shows that basis paths p1 , · · · , pm al- =φ∞ (p− ) + φ∞ ((p )− ). ways have negative exponent in the calculation, while Considered that φ∞ (p+ ) and φ∞ (p− ) are both norms, tak- 1 The paths across the bias node can also be described in the ing supreme of them is still a norm. Thus φ(p) satisfies the same way. For simplicity, we omit the bias term. definition of norm.

4.Generalization Error Bound by Basis-path Norm where αi ≤ 0, αj ≥ 0. We want to use the basis-path norm to define a capacity mea- As shown in skeleton method in (Meng et al. 2018) sure to get the upper bound for the generalization error. Sup- (which can also be referred in supplementary materials), pose the binary classifier is given as g(x) = v T f (x), where basis paths are constructed according to skeleton weights. v represents the linear operator on the output of the deep Here, we clarify the non-basis paths according to the num- network with input vector x ∈ Rd . We consider the follow- ber of non-skeleton weights it contains. We denote the non- ing hypothesis space which is composed of linear operator basis path which contains b non-skeleton weights as p˜b . The m v, and L-layered fully connected neural networks with width proofs of Proposition 1 indicates that for p˜b , i=1 αi = r H and input dimension d: 1 − b, i=m+1 αj = b. Thus we have d,H,L Gγ,v = pb | ≤ eγ(b−1) · γ b . |˜ {g = v ◦ f : L ≥ 2, h0 = d, h1 = · · · = hL−1 = H, φ(p) ≤ γ}. Step 3 (counting the number of different type of paths): Theorem 2 Given the training set {(x1 , y1 ), · · · , (xn , yn )} Based on the construction of basis paths (refer to the skele- with xi ∈ Rd , yi ∈ {0, 1}, and the hypothesis space Gγ,v d,H,L ton method in supplementary), in each hidden layer, there which contains MLPs with depth L ≥ 2, width H and φ(p), are H skeleton weights and H(H −1) non-skeleton weights. for arbitrary z > 0, for every δ > 0, with probability at least We can get that the number of p˜b in a L-layer MLP with b−1 1 − δ, for every hypothesis g ∈ Gγ,vd,H,L , the generalization width H is (d − 1)HCL−2 (H − 1)b−1 + HCL−2 b (H − 1)b L−2 error can be upper bounded as if 1 ≤ b ≤ L − 2 and (d − 1)H(H − 1) if b = L − 1. Step 4: We have: d,H,L 2 ln(4/δ) gen (Gγ,v ) ≤4 + m r L−1 n Ω2 (Fγ ) = (pi )2 + (pj )2 + p˜2b,k . (7) 2Φ(γ; d, H, L)(4H)L−1 · v 2 · maxi xi 22 2 i=1 j=m+1 b=2 k 2 , n The number of negative basis paths is H and e−γ ≤ |pi | ≤ where m eγ , so we have i=1 p2i ≤ He2γ , where m = H. ∆ Φ(γ; d, H, L) = Ω2 (Fγ ) L−2 (He2γ + (d − 1)Hγ 2 ) 1 + (H − 1)γ 2 e2γ . (4) L−2 2γ b−1 ≤He + (d − 1)HCL−2 (H − 1)b−1 + HCL−2 b (H − 1)b We call Φ(γ; d, H, L) Basis-path measure. Therefore, the d,H,L b=1 generalization error gen (Gγ,v ) can be upper bounded by 2 2 b γ(b−1) a function of Basis-path measure. · γ e + (d − 1)H(H − 1)L−2 · γ L−1 eγ(L−2) The proof depends on estimating the value of different L−2 types of paths and counting the number of different types of b paths. We give the proof sketch of Theorem 2. ≤(He2γ + (d − 1)Hγ 2 b CL−2 (H − 1)b γ 2 e2γ Proof of Theorem 2: b=0 Step 1: If we denote Fγ = {f : L ≥ 2, h0 = d, h1 = 2γ L−2 ≤(He + (d − 1)Hγ ) 1 + (H − 1)γ 2 e2γ 2 · · · = hL−1 = H, φ(p) ≤ γ}, the generalization error of a binary classification problem is =Φ(γ; d, H, L), (8) d,H,L 2 2 ln(4/δ) a gen (Gγ,v ≤2 v ) 2 RA(Fγ ) +4 , where Ineq.8 is established by the calculation of (1 + x) . n where RA(·) denotes the Rademacher complexity of a hy- Based on the above theorem, we discuss how pothesis space (Wolf 2018). Following the results of Theo- Φ(γ; d, H, L) changes as width H and depth L. (1) rem 1 and Theorem 5 in (Neyshabur, Tomioka, and Srebro For fixed γ, Φ(γ; d, H, L) increases exponentially as L and 2016) under p = 2 and q = ∞, we have H goes to large. (2) Φ(γ; d, H, L) increases as γ increases. If γ diminishes to zero, we have Φ(γ; d, H, L) → H. In 2Ω2 (Fγ )(4H)L−1 maxi xi 22 RA(Fγ ) ≤ , this case, the feature directly flow into the output, which n means that fk (x) = xk mod d , for k = 1, · · · , H. (3) If where Ω(Fγ ) is the maximal path norm of f, f ∈ Fγ . γ = O √HL 1 , we have Φ(γ; d, H, L) ≤ O H + Ld . Step 2 (estimating path value): We give Ω(Fγ ) an upper bound using Basis-path norm. Using φ(p) ≤ γ, we have It increases linearly as d and H increase and decreases e−γ ≤ |pi | ≤ eγ and |pj | ≤ γ. Then using Proposition 1, linearly as L increases. we have m r Empirical Verification α |˜ p| ≤ pα i i pj j (5) In the previous section, we derived a BP norm induced gen- i=1 j=m+1 eralization error bound for ReLU networks. In this section, m r we study the relationship between this bound and the empir- ≤ e−γ i=1 αi ·γ i=m+1 αj , (6) ical generalization gap – the absolute difference between test

5.error and training error with real-data experiments, in com- Algorithm 1 Optimize ReLU Network with Basis-path Reg- parison with the generalization error bounds given by other ularization capacity measures, including weight norm (Evgeniou, Pon- Input: learning rate ηt , training set S, initial w0 . til, and Poggio 2000), path norm (Neyshabur, Tomioka, and for t = 0, · · · , T do Srebro 2016) and spectral norm (Bartlett, Foster, and Telgar- 1. Draw mini-batch data xt from S. sky 2017). We follow the experiment settings in (Neyshabur 2. Compute gradient of the loss function g(wt ) = et al. 2017), and extend on our BP norm bound. As shown ∇f (wt , xt ). in the previous section, the BP norm with capacity is pro- 3. Compute gradient of the basis-path regularization portional to Eqn.4 We conduct experiments with multi-layer h(wt ) = ∇R(w) by Ineq. (10) and (11). perceptrons (MLP) with ReLU of different depths, widths, 4. Update wt+1 = wt − ηt (g(wt ) + h(wt )). and global minima on MNIST classification task which is end for optimized by stochastic gradient descent. More details of the Initialize: wT . training strategies can be found in the supplementary mate- rials. All experiments are averaged over 5 trials if without explicit note. minimal test error is achieved with 3 hidden layers, and then First, we train several MLP models and force them to con- shows an over fitting along with the increasing of the layers. verge to different global minima by intentionally replacing The weight norm keeps increasing with the growing of net- a different number of training data with random labels, and work size as discussed above, and the Πi hi in spectral norm then calculate the capacity measures on these models. The will be quite large when layers L is increasing. Path norm training set consists of 10000 randomly selected samples can partially explain the decreasing generalization error be- with true labels and another at most 5000 intentionally mis- fore 4 hidden layers and it indicates that the networks with 4, labeled data which are gradually added into the training set. 5 and 6 hidden layers have small generalization error, which The evaluation of error rate is conducted on a fixed 10000 doesn’t match our observations. The amount of non-basis validation set. Figure 2 (a) shows that every network is paths is exponentially growing when layers L is increasing, enough to fit the entire training set regardless of the amount therefore the path norm couldn’t measure the capacity accu- of mislabeled data, while the test error of the learned net- rately by counting all paths’ values. In contrast, the BP norm works increases with increasing size of the mislabeled data. can nearly match the generalization error, these observations As shown in Figure 2 (b), the measure of BP norm is con- verify that BP norm bound is more tight to generalization er- sistent with the behaviors of generalization on the data and ror and can be a better predictor of generalization. indeed is a good predictor of the generalization error, as well as weight norms, path norm, and spectral norm. We further investigate the relationship between general- Basis-path Regularization for ReLU Networks ization error and the network size with different widths. We train a bunch of MLPs with 2 hidden layers and vary- In this section, we propose Basis-path regularization, in ing number of hidden units from 16 to 8192 for each layer. which we penalize the loss function by the BP norm. Ac- The experiment is conducted on the whole training set with cording to the definition of BP norm in Eqn.(2), to make it 60000 images. As shown in Figure 2(c), the networks can fit small, we need to restrict the values of negative basis paths the whole training set when the number of hidden units is to be moderate (neither too large nor too small) and mini- greater than or equal to 32, while the minimal test error is mize the value of positive basis paths. To this end, in our achieved with 512 hidden units, then shows a slightly over proposed method, we penalize the empirical loss by the l2 fitting on training set beyond 1024 hidden units. Figure 2(d) distance between the values of negative basis paths and 1, as shows that the measure of BP norm behaves similarly to the well as the sum of the values of all positive basis paths. trend of generalization errors which decreases at the begin- The constraint φ(p) ≤ γ equals to p+ 2 ≤ γ 2 and ning and then slightly increases, and also achieves minimal log(p− )2 2 ≤ (2γ)2 , which means that the largest ele- value at 512 hidden units. Weight norm and spectral norm ment in a vector is smaller than γ iff all of the element is keep increasing along with the network size growing while smaller than γ. We choose to optimize their square because the trend of generalization error behaves differently. Path of the smoothness. So using the Lagrangian dual methods, norm shows the good explanation of the generalization when we add the constraint λ21 p+ 2 and λ42 log(p− )2 2 in the the number of hidden units is small, but keeps decreasing loss function and then optimize the regularized empirical along with increasing the network size in this experiment. risk function: One possible reason is that the proportion of basis paths in all paths is decreasing, and the vast majority improperly af- L(w, x) = f (w, x) + R(p) fects the capacity measure when the dependency in the net- λ1 + λ2 (9) work becomes large. In contrast, BP norm better explains = f (w, x) + p 2 + log(p− )2 2 . 2 4 the generalization behaviors regardless of the network size. Similar empirical observation is shown when we train the We use g(w) to denote the gradient of loss with respect to w, network with a different number of hidden layers. Each net- i.e., g(w) = ∂f ∂w (w,x) . For the non-skeleton weight wj , since work has 32 hidden units in each layer and can fit the whole it is contained in only one positive basis path pj , we can training set in this experiment. As shown in Figure2(e,f), the calculate the gradient of the regularization term with respect

6. 0.25 0.050 train error 0.06 train error test error 0.20 test error test error 0.045 0.15 0.04 error error error 0.040 0.10 0.02 0.035 0.05 0.00 0.00 0.030 0 1000 2000 3000 4000 5000 16 32 64 128256512 1k 2k 4k 8k 2 3 4 5 6 #mislabeled data #hidden units #layers (a) (c) (e) 1.0 weight norm 1.0 weight norm 1.0 weight norm path norm path norm path norm 0.8 spectral norm 0.8 spectral norm 0.8 spectral norm BP norm BP norm BP norm 0.6 0.6 0.6 measure measure measure 0.4 0.4 0.4 0.2 0.2 0.2 0.0 0.0 0.0 0 1000 2000 3000 4000 5000 32 64 128 256 512 1k 2k 4k 8k 2 3 4 5 6 #mislabeled data #hidden units #layers (b) (d) (f) Figure 2: Left: experiments on different global minima for the objective function on the subset with true labels:(a) the training and test error, (b) different measures w.r.t. the size of random labels. Middle: experiments on different hidden units: (c) the training and test error, (d) different measures w.r.t. the size of hidden units for each layer. Right: (e) the test error, (f) different measure w.r.t. the number of layers of the network. to wj as computation overhead. All the additional computations re- garding Ineq.(10) only introduce very lightweight element- λ1 ∂R(p) ∂pj (w) p2j wise matrix operations, which is small compared with the h(wj ) = = λ1 · . (10) 2 ∂pj ∂wj wj forward and backward process. For the skeleton weight wi , it is contained in only one nega- tive basis path pi (if the neural network has equal number of Experimental Results hidden nodes) and some of the positive basis paths pj . Thus In this section, we evaluate Basis-path Regularization on its gradient can be calculated as follows deep ReLU neural networks with the aim of verifying λ2 ∂R(p) ∂pi (w) λ1 ∂R(p) ∂pj (w) that does our proposed BP regularization outperforms other h(wi ) = + baseline regularization methods and whether it can improve 4 ∂pi ∂wi 2 pj :wi ∂pj ∂wi the generalization on the benchmark datasets. For sake of p2j fairness, we reported the mean of 5 independent runs with λ2 log pi = + λ1 , random initialization. wi p wi j :wi (11) Recommendation System We first apply our basis-path regularization method to rec- where pj : wi denotes all positive basis paths containing wi . ommendation task with MLP networks and conduct experi- Combining them together, we get the gradient of the reg- mental studies based on a public dataset, MovieLens2 . The ularized loss function with respected to the weights. For ex- characteristics of the MovieLens dataset are summarized in ample, if we use stochastic gradient descent to be the opti- Table 1. We use the version containing one million ratings, mizer, the update rule is as follows: where each user has at least 20 ratings. We train an NCF wt+1 = wt − ηt (g(wt ) + h(wt )). (12) framework with similar MLP network proposed in (He et al. 2017) and followed their training strategies with Adam op- Please note that the computation overhead of h(wi ) is high, timizer but without any pre-training. We test the predictive moreover, we observed that the values of the negative basis factors of [8,16,32,64], and set the number of hidden units paths are relatively stable in the optimization, thus we set to the embedding size ×4 in each hidden layer. We calculate h(wi ) to be zero for ease of the computation. Specifically, both metrics for each test user and report the average score. basis-path regularization can be easily combined with the For each method, we perform a wide range grid search of optimization algorithm which is in quotient space. hyper-parameter λ from 10−α where α ∈ 5, 6, 7, 8, 9 and re- The flow of SGD with basis-path regularization is shown port the experimental results based on the best performance in Algorithm 1, it’s trivial to extend basis-path regularization on the validation set. The performance of a ranked list is to other stochastic optimization algorithms. Comparing to 2 weight decay, basis-path regularization has little additional https://grouplens.org/datasets/movielens/1m/

7. MovieLens-1M MovieLens-1M MovieLens-1M MovieLens-1M 0.72 0.30 0.70 0.32 0.6 0.25 0.68 NDCG@10 NDCG@K 0.30 HR@10 0.20 HR@K 0.66 0.4 0.64 0.28 0.15 weight norm weight norm weight norm weight norm 0.62 path norm path norm 0.2 path norm path norm BP norm BP norm BP norm 0.10 BP norm 0.60 0.26 8 16 32 64 8 16 32 64 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 factors factors K K (a) (b) (c) (d) Figure 3: Performance of HR and NDCG w.r.t. the number of predictive factors and Top-K items recommendation. judged by Hit Ratio (HR) and Normalized Discounted Cu- Table 2: Classification error rate (%) on image classification mulative Gain (NDCG) (He et al. 2015). task. Baseline is from (He et al. 2016), and the number of † is 7.51 reported in the original paper. Fig. ?? shows the Table 1: Statistics of the MovieLens datasets. training procedures. Dataset Interaction# Item# User# Sparsity PlainNet34 ResNet34 MovieLens 1,000,209 3,706 6,040 95.53% Algorithm Train Test ∆ Train Test ∆ SGD 0.06 7.76 7.70 0.01 7.13 7.12 Figure 3 (a) and (b) show the performance of HR@10 and SGD + WD 0.06 6.34 6.27 0.01 5.71† 5.70 NDCG@10 w.r.t. the number of predictive factors. From this SGD + BPR 0.06 5.99 5.92 0.01 5.62 5.61 figure, it’s clear to see that basis-path regularization achieve Q-SGD 0.03 7.00 6.97 0.01 6.66 6.65 better generalization performance than all baseline methods. Q-SGD + BPR 0.05 5.73 5.68 0.03 5.36 5.33 Figure 3 (c) and (d) show the performance of Top-K recom- mended lists where the ranking position K ranges from 1 to 10. As can be seen, the basis-path regularization demon- basis-path regularization can always find better generaliza- strates consistent improvement over other methods across tion points during optimization, which is consistent with our positions, which is consistent with our analysis of general- theoretical analysis in the previous section. We further inves- ization error bound in the previous section. tigate the combination of Q-SGD and basis-path regulariza- Image Classification tion. Q-SGD with basis-path regularization achieves the best test accuracy on both PlainNet and ResNet model, which in- In this section, we apply our basis-path regularization to dicates that taking BP norm as the regularization term to the this task and conduct experimental studies based on CIFAR- loss function is helpful for optimization algorithms. 10 (Krizhevsky and Hinton 2009), with 10 classes of im- ages. We employ a popular deep convolutional ReLU model, ResNet (He et al. 2016) for image classification since it Conclusion achieves huge successes in many image related tasks. In ad- In this paper, we define Basis-path norm on the group of ba- dition, we conduct our studies on a stacked deep CNN de- sis paths, and prove that the generalization error of ReLU scribed in (He et al. 2016) (refer to PlainNet), which suf- neural networks can be upper bounded by a function of fers serious dependency among the paths. We train 34 layers BP norm. We then design Basis-path regularization method, ResNet and PlainNet networks on this dataset, and use SGD which shows clearly performance gain on generalization with widely used l2 weight decay regularization (WD) as our ability. For future work, we plan to test basis-path regular- baseline. In addition, we implement Q-SGD, which is pro- ization on larger networks and datasets. Furthermore, we are posed in (Meng et al. 2018) and optimize the networks on also interested in applying basis-path regularization on net- basis paths. We investigate the combination of SGD/Q-SGD works with different architecture. and basis-path regularization (BPR). Similar with the pre- vious task, we perform a wide range grid search of λ from References {0.1, 0.2, 0.5}×10−α , where α ∈ {3, 4, 5, 6}. More training Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural ma- details can be found in supplementary materials. chine translation by jointly learning to align and translate. Table 2 shows the training and test results of each algo- arXiv preprint arXiv:1409.0473. rithms. From the figure and table, we can see that our basis- path regularization indeed improves test accuracy of Plain- Bartlett, P. L., and Mendelson, S. 2002. Rademacher and Net34 and Resnet34 by nearly 1.8% and 1.5% respectively. gaussian complexities: Risk bounds and structural results. Moreover, the training behaviors of SGD with weight de- Journal of Machine Learning Research 3(Nov):463–482. cay and basis-path regularization are quite similar, but the Bartlett, P. L.; Foster, D. J.; and Telgarsky, M. J. 2017.

8.Spectrally-normalized margin bounds for neural networks. Kawaguchi, K.; Kaelbling, L. P.; and Bengio, Y. In Advances in Neural Information Processing Systems, 2017. Generalization in deep learning. arXiv preprint 6241–6250. arXiv:1710.05468. Bayer, I.; He, X.; Kanagal, B.; and Rendle, S. 2017. A Krizhevsky, A., and Hinton, G. 2009. Learning multiple generic coordinate descent framework for learning from im- layers of features from tiny images. plicit feedback. In Proceedings of the 26th International Meng, Q.; Zheng, S.; Ye, Q.; Chen, W.; and Liu, T.-Y. 2018. Conference on World Wide Web, 1341–1350. International Optimization of relu neural networks using quotient stochas- World Wide Web Conferences Steering Committee. tic gradient descent. arXiv preprint arXiv:1802.03713. Chan, W.; Jaitly, N.; Le, Q.; and Vinyals, O. 2016. Lis- Neyshabur, B.; Wu, Y.; Salakhutdinov, R. R.; and Srebro, ten, attend and spell: A neural network for large vocabulary N. 2016. Path-normalized optimization of recurrent neu- conversational speech recognition. In Acoustics, Speech and ral networks with relu activations. In Advances in Neural Signal Processing (ICASSP), 2016 IEEE International Con- Information Processing Systems, 3477–3485. ference on, 4960–4964. IEEE. Neyshabur, B.; Bhojanapalli, S.; McAllester, D.; and Srebro, Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; N. 2017. Exploring generalization in deep learning. In Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; Advances in Neural Information Processing Systems, 5949– et al. 2016. Wide & deep learning for recommender systems. 5958. In Proceedings of the 1st Workshop on Deep Learning for Neyshabur, B.; Salakhutdinov, R. R.; and Srebro, N. 2015. Recommender Systems, 7–10. ACM. Path-sgd: Path-normalized optimization in deep neural net- Dziugaite, G. K., and Roy, D. M. 2017. Computing nonvac- works. In Advances in Neural Information Processing Sys- uous generalization bounds for deep (stochastic) neural net- tems, 2422–2430. works with many more parameters than training data. arXiv Neyshabur, B.; Tomioka, R.; and Srebro, N. 2014. In search preprint arXiv:1703.11008. of the real inductive bias: On the role of implicit regulariza- Evgeniou, T.; Pontil, M.; and Poggio, T. 2000. Regular- tion in deep learning. arXiv preprint arXiv:1412.6614. ization networks and support vector machines. Advances in Neyshabur, B.; Tomioka, R.; and Srebro, N. 2016. Norm- computational mathematics 13(1):1. based capacity control in neural networks. In Conference on Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Learning Theory, 1376–1401. Y. N. 2017. Convolutional sequence to sequence learning. Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: arXiv preprint arXiv:1705.03122. Towards real-time object detection with region proposal net- Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse works. In Advances in neural information processing sys- rectifier neural networks. In Proceedings of the fourteenth tems, 91–99. international conference on artificial intelligence and statis- Ruder, S. 2016. An overview of gradient descent optimiza- tics, 315–323. tion algorithms. arXiv preprint arXiv:1609.04747. He, X.; Chen, T.; Kan, M.-Y.; and Chen, X. 2015. Tri- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, rank: Review-aware explainable recommendation by mod- L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. At- eling aspects. In Proceedings of the 24th ACM International tention is all you need. In Advances in Neural Information on Conference on Information and Knowledge Management, Processing Systems, 6000–6010. 1661–1670. ACM. Wolf, M. M. 2018. Mathematical foundations of supervised He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning. [J]. learning for image recognition. In CVPR. Xiong, W.; Droppo, J.; Huang, X.; Seide, F.; Seltzer, M.; He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.- Stolcke, A.; Yu, D.; and Zweig, G. 2016. Achieving human S. 2017. Neural collaborative filtering. In Proceedings of parity in conversational speech recognition. arXiv preprint the 26th International Conference on World Wide Web, 173– arXiv:1610.05256. 182. International World Wide Web Conferences Steering Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, Committee. O. 2017. Understanding deep learning requires rethinking Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; generalization. ICLR. and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accel- erating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

9. Supplementary: Capacity Control of ReLU where pl (wl (il−1 , il )) is the basis-path which contains non- Neural Networks by Basis-path Norm skeleton weight wl (il−1 , il ) 3 and pj denotes a basis-path which only contains skeleton weights. This document contains supplementary theoretical materials Then for a L-layer FNN, a non-basis path can be calcu- and additional experimental details of the paper "Capacity lated as Control of ReLU Neural Networks by Essential Path Norm". p(i0 ,··· ,iL−1 ,iL ) Skeleton Method to Construct Basis-Path L We briefly review the construction of basis-paths. For an L- = wl (il−1 , il ) layer feedforward neural network with width H, skeleton l=1 methods (Meng et al. 2018) construct basis paths following l≤L−1 pl (wl (il−1 , il )) · wL (wl (il−1 , il )) two steps: =wL (iL−1 , iL ) · j pj · l wL (wl (il−1 , il )) (1) Select skeleton weights: The skeleton weights are the diagonal elements in weight matrix wl , l = 1, · · · , L−1. For pL (wL (iL−1 , iL )) l≤L−1 pl (wl (il−1 , il )) w0 , select elements w0 (p1 mod d, p1 ). For wL , select ele- = L−1 · l=1 wl (wL (iL−1 , iL )) j pj · l≤L−1 wL (wl (il−1 , il )) ments wL (pL−1 , pL−1 mod K). All the selected weights are called skeleton weights. Others are called non-skeleton l≤L pl (wl (il−1 , il )) = , weights. j p˜j (2) Construct basis-paths: The paths contains no more than one non-skeleton weights are basis-paths. where wL (wl (il−1 , il )) is the skeleton weight at layer L In (Meng et al. 2018), the authors also proved the follow- which connects the basis-path that contains wl (il−1 , il ), ing properties of basis-paths: each non-skeleton weight will and wl (wL (iL−1 , iL )) denotes the skeleton weight only appear in one basis-path. at layer l that connects the weight wL (iL−1 , iL ). Because wl (wL (iL−1 , iL )) are all skeleton weights, L−1 Proof of Proposition 1 l=1 wl (wL (iL−1 , iL )) is the value of a basis-path in a L − 1 layer FNN. The establishing of the above equality Proposition 2 For any non-basis path p˜, p˜ = also uses the fact that pl (wl (il−1 , il )) · wL (wl (il−1 , il )) m αi r αj i=1 pi j=m+1 pj , where αi ≤ 0, αj ≥ 0. is a basis-path for a L-layer FNN because it contains Proof: We prove this proposition by induction. For ease of only one non-skeleton weight wl (il−1 , il ). Combining L−1 reference, we use Mi (j) to denote the operator i mod j, l=1 wl (wL (iL−1 , iL )) · j pj · l wL (wl (il−1 , il )) i.e., Mi (j) = i mod j where i, j ∈ Z+ . together, we can get that the denominator p˜j is a basis-path (1) If L = 2, hidden node Oi11 (i1 = 1, · · · , H) has which contains only skeleton weight. an ingoing skeleton weight w1 (Mi1 (d), i1 ) and an outgoing Therefore, we prove that basis-path which contains one skeleton weight w2 (i1 , Mi1 (K)). Then for a non-skeleton non-skeleton weight will only appear in the numerator and path p(i0 ,i1 ,i2 ) where i0 = Mi1 (d) and i2 = Mi1 (K)), it basis-path which only contains skeleton weights will only can be calculated as appear in the denominator in the calculation of the non-basis paths. p(i0 ,i1 ,i2 ) (13) =w1 (i0 , i1 ) · w2 (i1 , i2 ) (14) Experiments - additional material w1 (i0 , i1 )w2 (i1 , Mi1 (K)) · w1 (Mi1 (d), i1 )w2 (i1 , i2 ) All experiments were conducted with Pytorch commit = 2b47480. The ResNet implementation can be found in w1 (Mi1 (d), i1 ) · w2 (i1 , Mi1 (K)) (15) (https://github.com/pytorch/vision/ ). Unless noted the ini- p(i0 ,i1 ,Mi1 (K)) · p(Mi1 (d),i1 ,i2 ) tialization method was used by sampling values from a uni- = . (16) form distribution with range [−a, a], where a = √ 1 , p(Mi1 (d),i1 ,Mi1 (K)) hl−1 hl−1 is the dimension of previous layer, which is the default According to skeleton method, p(i0 ,i1 ,Mi1 (K)) , initialization method for linear layers and convolutional lay- p(Mi1 (d),i1 ,i2 ) , p(Mi1 (d),i1 ,Mi1 (K)) are all basis-paths. ers in PyTorch. We can see that the basis-path such as p(Mi1 (d),i1 ,Mi1 (K)) which contains no non-skeleton weights is in the denomina- Experiment settings tor. The basis-path which contains one skeleton weight such Empirical Verification Multi-layer perceptrons with as p(i0 ,i1 ,Mi1 (K)) and p(Mi1 (d),i1 ,i2 ) is in the numerator. An ReLU activation were trained by SGD in this experi- example is shown in Figure 1 in the main paper. ments. The max iteration number is 500 epochs. The (2) If the proposition is satisfied for a L − 1-layer FNN, initial learning rate is set to 0.05 for each model. i.e., Exponential decay is a wildly used method when training neural network models(Hinton et al. 2012; L−1 l≤L−1 pl (wil−1 ,il ) 3 p(i0 ,··· ,iL−1 ) = wl (il−1 , il ) = , If wl (il−1 , il ) is a non-skeleton weight. Otherwise, it will not j pj appear in the numerator. l=1

10. ResNet34 0.95 ResNet34 = 1e-3 = 5e-4 100 = 2e-4 = 1e-4 0.94 Test accuracy Training Loss = 5e-5 10 1 = 2e-5 = 1e-5 = 5e-6 = 2e-6 0.93 10 2 = 1e-6 10 3 0.92 0 25 50 75 100 125 150 80 100 120 140 160 Epoch Epoch Figure 4: Training loss and test accuracy of ResNet34 models w.r.t. number of effective passes on CIFAR-10. Ruder 2016), which is applied to the learning rate in this experiments with power 0.01. Mini-batch size with 128 was performed in this experiments. Recommendation System For the recommendation sys- tem experiments, to evaluate the performance of item rec- ommendation, we employed the leave-one-out evaluation, which has been widely used in literatures (Bayer et al. 2017; He et al. 2017). For each user, we held-out the latest interac- tion as the test set and utilized the remaining data for train- ing. We followed the common strategy that randomly sam- ples 100 items that are not interacted by the user, ranking the test item among the 100 items. Image Classification For the image classification experi- ment, we use the original RGB image of CIFAR-10 dataset with size 3 × 32 × 32. As before, we re-scale each pixel value to the interval [0, 1]. We then extract random crops (and their horizontal flips) of size 3 × 28 × 28 pixels and present these to the network in mini-batches of size 128. The training and test loss and the test error are only computed from the center patch (3 × 28 × 28). We trained 34 layers ResNet and PlainNet models (refer to resnet34 and plain34 in the original paper respectively) on this dataset. We performed training for 64k iterations, with a mini-batch size of 128, and an initial learning rate of 1.0 which was divided by 10 at 32k and 48k iterations following the practice in the original paper. Experimental Results on the Influence of λ As for different model and norm, the λ should be selected carefully. As described in image classification task section in the main paper, we performed a wide range of grid search for each norm from {0.1, 0.2, 0.5} × 10−α , where α ∈ {3, 4, 5, 6}, and reported the best performance based on the validation set. In this section, we show how λ affect our basis-path regularization. The results are given in Fig- ure 4. Each result is reported by 5 independent runs with random initialization. Please note that too large value of λ (λ > 1e − 3 in this setting) will lead to diverge, meanwhile too small will make the regularization influence nearly dis- appear. A proper λ will lead to significant better accuracy.