# 网络中非光滑分布优化的优化算法

## 展开查看详情

1. Optimal Algorithms for Non-Smooth Distributed Optimization in Networks arXiv:1806.00291v1 [math.OC] 1 Jun 2018 Kevin Scaman∗ Francis Bach† Sébastien Bubeck‡ Yin Tat Lee‡ § Laurent Massoulié† ¶ Abstract In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect √of this result is that, for non-smooth functions, while the dominant term of the error is in O(1/ t), the structure of the communication network only impacts a second-order term in O(1/t), where t is time. In other words, the error due to lim- its in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a d1/4 multiplicative factor of the optimal con- vergence rate, where d is the underlying dimension. 1 Introduction Distributed optimization finds many applications in machine learning, for example when the dataset is large and training is achieved using a cluster of computing units. As a result, many algorithms were recently introduced to minimize the average f¯ = n1 ni=1 fi of functions fi which are respec- tively accessible by separate nodes in a network [1, 2, 3, 4]. Most often, these algorithms alternate between local and incremental improvement steps (such as gradient steps) with communication steps between nodes in the network, and come with a variety of convergence rates (see for example [5, 4, 6, 7]). Recently, a theoretical analysis of first-order distributed methods provided optimal convergence rates for strongly-convex and smooth optimization in networks [8]. In this paper, we extend this analysis to the more challenging case of non-smooth convex optimization. The main contribution of this ∗ Noah’s Ark Lab, Huawei Technologies, Paris, France. Email: kevin.scaman@huawei.com † INRIA - Département d’informatique de l’ENS, Ecole Normale Supérieure, CNRS, INRIA, PSL Research University, 75005 Paris, France. Email: francis.bach@inria.fr ‡ Microsoft Research, Redmond, United States. Email: sebubeck@microsoft.com § University of Washington, Seattle, United States. Email: yintat@uw.edu ¶ MSR-INRIA Joint Centre, Paris, France. Email: laurent.massoulie@inria.fr 1

2.paper is to provide optimal convergence rates and their corresponding optimal algorithms for this class of distributed problems under two regularity assumptions: (1) the Lipschitz continuity of the global objective function f¯, and (2) a bound on the average of Lipschitz constants of local functions fi . Under the local regularity assumption, we provide in Section 4 matching upper and lower bounds of complexity in a decentralized setting in which communication is performed using the gossip algorithm [9]. Moreover, we propose the first optimal algorithm for non-smooth decentralized opti- mization, called multi-step primal-dual (MSPD). Under the more challenging global regularity as- sumption, we show in Section 3 that distributing the simple smoothing approach introduced in [10] yields fast convergence rates with respect to communication. This algorithm, called distributed randomized smoothing (DRS), achieves a convergence rate matching the lower bound up to a d1/4 multiplicative factor, where d is the dimensionality of the problem. Our analysis differs from the smooth and strongly-convex setting in two major aspects: (1) the naïve master/slave distributed algorithm is in this case not optimal, and (2) the convergence rates differ between communication and local computations. More specifically, error due to limits in communi- cation resources enjoys fast convergence rates, as we establish by formulating the optimization prob- lem as a composite saddle-point problem with a smooth term for communication and non-smooth term for the optimization of the local functions (see Section 4 and Eq. (21) for more details). Related work. Many algorithms were proposed to solve the decentralized optimization of an av- erage of functions (see for example [1, 11, 3, 4, 12, 2, 13, 5]), and a sheer amount of work was devoted to improving the convergence rate of these algorithms [5, 6]. In the case of non-smooth optimization, fast communication schemes were developed in [14, 15], although precise optimal convergence rates were not obtained. Our decentralized algorithm is closely related to the recent primal-dual algorithm of [14] which enjoys fast communication rates in a decentralized and stochas- tic setting. Unfortunately, their algorithm lacks gossip acceleration to reach optimality with respect to communication time. Finally, optimal convergence rates for distributed algorithms were investi- gated in [8] for smooth and strongly-convex objective functions, and [16, 17] for totally connected networks. 2 Distributed optimization setting Optimization problem. Let G = (V, E) be a strongly connected directed graph of n computing units and diameter ∆, each having access to a convex function fi over a convex set K ⊂ Rd . We consider minimizing the average of the local functions n 1 min f¯(θ) = fi (θ) , (1) θ∈K n i=1 in a distributed setting. More specifically, we assume that each computing unit can compute a subgradient ∇fi (θ) of its own function in one unit of time, and communicate values (i.e. vectors in Rd ) to its neighbors in G. A direct communication along the edge (i, j) ∈ E requires a time τ ≥ 0. These actions may be performed asynchronously and in parallel, and each machine i possesses a local version of the parameter, which we refer to as θi ∈ K. Regularity assumptions. Optimal convergence rates depend on the precise set of assumptions 2

3.applied to the objective function. In our case, we will consider two different constraints on the regularity of the functions: (A1) Global regularity: the (global) function f¯ is convex and Lg -Lipschitz continuous, in the sense that, for all θ, θ′ ∈ K, |f¯(θ) − f¯(θ′ )| ≤ Lg θ − θ′ 2 . (2) (A2) Local regularity: Each local function is convex and Li -Lipschitz continuous, and we denote 1 n as Lℓ = n i=1 L2i the ℓ2 -average of the local Lipschitz constants. Assumption (A1) is weaker than (A2), as we always have Lg ≤ Lℓ . Moreover, we may have Lg = 0 and Lℓ arbitrarily large, for example with two linear functions f1 (x) = −f2 (x) = ax and a → +∞. We will see in the following sections that the local regularity assumption is easier to analyze and leads to matching upper and lower bounds. For the global regularity assumption, we only provide an algorithm with a d1/4 competitive ratio, where d is the dimension of the problem. Finding an optimal distributed algorithm for global regularity is, to our understanding, a much more challenging task and is left for future work. Finally, we assume that the feasible region K is convex and bounded, and denote by R the radius of a ball containing K, i.e. ∀θ ∈ K, θ − θ0 2 ≤ R , (3) where θ0 ∈ K is the initial value of the algorithm, that we set to θ0 = 0 without loss of general- ity. Black-box optimization procedure. The lower complexity bounds in Theorem 2 and Theorem 3 depend on the notion of black-box optimization procedures of [8] that we now recall. A black-box optimization procedure is a distributed algorithm verifying the following constraints: 1. Local memory: each node i can store past values in a (finite) internal memory Mi,t ⊂ Rd at time t ≥ 0. These values can be accessed and used at time t by the algorithm run by node i, and are updated either by local computation or by communication (defined below), that is, for all i ∈ {1, ..., n}, Mi,t ⊂ Mcomp i,t ∪ Mcomm i,t . (4) 2. Local computation: each node i can, at time t, compute a subgradient of its local function ∇fi (θ) for a value θ ∈ Mi,t−1 in the node’s internal memory before the computation. Mcomp i,t = Span ({θ, ∇fi (θ) : θ ∈ Mi,t−1 }) . (5) 3. Local communication: each node i can, at time t, share a value to all or part of its neighbors, that is, for all i ∈ {1, ..., n}, Mcomm i,t = Span Mj,t−τ . (6) (j,i)∈E 4. Output value: each node i must, at time t, specify one vector in its memory as local output of the algorithm, that is, for all i ∈ {1, ..., n}, θi,t ∈ Mi,t . (7) 3

4.Hence, a black-box procedure will return n output values—one for each computing unit—and our analysis will focus on ensuring that all local output values are converging to the optimal param- eter of Eq. (1). For simplicity, we assume that all nodes start with the simple internal memory Mi,0 = {0}. Note that communications and local computations may be performed in parallel and asynchronously. 3 Distributed optimization under global regularity The most standard approach for distributing a first-order optimization method consists in computing a subgradient of the average function n 1 ∇f¯(θ) = ∇fi (θ) , (8) n i=1 where ∇fi (θ) is any subgradient of fi at θ, by sending the current parameter θt to all nodes, perform- ing the computation of all local subgradients in parallel and averaging them on a master node. Since each iteration requires communicating twice to the whole network (once for θt and once for sending the local subgradients to the master node, which both take a time ∆τ where ∆ is the diameter of the network) and one subgradient computation (on each node and performed in parallel), the time to reach a precision ε with such a distributed subgradient descent is upper-bounded by RLg 2 O (∆τ + 1) . (9) ε Note that this convergence rate depends on the global Lipschitz constant Lg , and is thus applicable under the global regularity assumption. The number of subgradient computations in Eq. (9) (i.e. the term not proportional to τ ) cannot be improved, since it is already optimal for objective functions defined on only one machine (see for example Theorem 3.13 p. 280 in [18]). However, quite surpris- ingly, the error due to communication time may benefit from fast convergence rates in O(RLg /ε). This result is already known under the local regularity assumption (i.e. replacing Lg with Lℓ or even maxi Li ) in the case of decentralized optimization [14] or distributed optimization on a totally con- nected network [17]. To our knowledge, the case of global regularity has not been investigated by prior work. 3.1 A simple algorithm with fast communication rates We now show that the simple smoothing approach introduced in [10] can lead to fast rates for error due to communication time. Let γ > 0 and f : Rd → R be a real function. We denote as smoothed version of f the following function: f γ (θ) = E [f (θ + γX)] , (10) where X ∼ N (0, I) is a standard Gaussian random variable. The following lemma shows that f γ is both smooth and a good approximation of f . L Lemma 1 (Lemma E.3 of [10]). If γ > 0, then f γ is γg -smooth and, for all θ ∈ Rd , √ f (θ) ≤ f γ (θ) ≤ f (θ) + γLg d . (11) 4

5.Algorithm 1 distributed randomized smoothing Input: approximation error ε > 0, communication graph G, α0 = 1, αt+1 = 2/(1 + 1 + 4/α2t ) 20RLg d1/4 5RLg d−1/4 Rαt√ T = ε ,K = ε , γt = Rd−1/4 αt , ηt = 1/4 t+1 . 2Lg (d + K ) Output: optimizer θT 1: Compute a spanning tree T on G. 2: Send a random seed s to every node in T . 3: Initialize the random number generator of each node using s. 4: x0 = 0, z0 = 0, G0 = 0 5: for t = 0 to T − 1 do 6: yt = (1 − αt )xt + αt zt 7: Send yt to every node in T . 1 K 8: Each node i computes gi = K k=1 ∇fi (yt + γt Xt,k ), where Xt,k ∼ N (0, I) 1 9: Gt+1 = Gt + nαt i gi 10: zt+1 = argminx∈K x + ηt+1 Gt+1 22 11: xt+1 = (1 − αt )xt + αt zt+1 12: end for 13: return θT = xT Hence, smoothing the objective function allows the use of accelerated optimization algorithms and provides faster convergence rates. Of course, the price to pay is that each computation of the n smoothed gradient ∇f¯γ (θ) = n1 i=1 ∇fiγ (θ) now requires, at each iteration m, to sample a suf- ficient amount of subgradients ∇fi (θ + γXm,k ) to approximate Eq. (10), where Xm,k are K i.i.d. Gaussian random variables. At first glance, this algorithm requires all computing units to synchro- nize on the choice of Xm,k , which would require to send to all nodes each Xm,k and thus incur a communication cost proportional to the number of samples. Fortunately, computing units only need to share one random seed s ∈ R and then use a random number generator initialized with the provided seed to generate the same random variables Xm,k without the need to communicate any vector. The overall algorithm, denoted distributed randomized smoothing (DRS), uses the random- ized smoothing optimization algorithm of [10] adapted to a distributed setting, and is summarized in Alg. 1. The computation of a spanning tree T in step 1 allows efficient communication to the whole network in time at most ∆τ . Most of the algorithm (i.e. steps 2, 4, 6, 7, 9, 10 and 11) are performed on the root of the spanning subtree T , while the rest of the computing units are responsible for computing the smoothed gradient (step 8). The seed s of step 2 is used to ensure that every Xm,k , although random, is the same on every node. Finally, step 10 is a simple orthogonal projection of the gradient step on the convex set K. We now show that the DRS algorithm converges to the optimal parameter under the global regularity assumption. Theorem 1. Under global regularity (A1), Alg. 1 achieves an approximation error E f¯(θT ) −f¯(θ∗ ) of at most ε > 0 in a time Tε upper-bounded by RLg RLg 2 O (∆τ + 1)d1/4 + . (12) ε ε Proof. See Appendix A. 5

6.More specifically, Alg. 1 completes its T iterations by time RLg d1/4 RLg d1/4 RLg d−1/4 Tε ≤ 40 ∆τ + 100 . (13) ε ε ε Comparing Eq. (13) to Eq. (9), we can see that our algorithm improves on the standard method when the dimension is not too large, and more specifically RLg 4 d≤ . (14) ε In practice, this condition is easily met, as ε ≤ 10−2 already leads to the condition d ≤ 108 (assuming that R and Lg have values around 1). Moreover, for problems of moderate dimension, the term d1/4 remains a small multiplicative factor (e.g. for d = 1000, d1/4 ≈ 6). Finally, note that DRS only needs the convexity of f¯, and the convexity of the local functions fi is not necessary for Theorem 1 to hold. Remark 1. Several other smoothing methods exist in the literature, notably the Moreau envelope [19] enjoying a dimension-free approximation guarantee. However, the Moreau envelope of an average of functions is difficult to compute (requires a different oracle than computing a subgradient), and unfortunately leads to convergence rates with respect to local Lipschitz characteristics instead of Lg . 3.2 Optimal convergence rate The following result provides oracle complexity lower bounds under the global regularity assump- tion, and is proved in Appendix B. This lower bound extends the communication complexity lower bound for totally connected communication networks from [17]. Theorem 2. Let G be a network of computing units of size n > 0, and Lg , R > 0. There exists n functions fi : Rd → R such that (A1) holds and, for any t < (d−2) 2 min{∆τ, 1} and any black-box procedure one has, for all i ∈ {1, ..., n}, RLg 1 1 f¯(θi,t ) − min f¯(θ) ≥ t 2 + . (15) θ∈B2 (R) 36 (1 + 2∆τ ) 1+t Proof. See Appendix B. Assuming that the dimension d is large compared to the characteristic values of the problem (a standard set-up for lower bounds in non-smooth optimization [20, Theorem 3.2.1]), Theorem 2 implies that, under the global regularity assumption (A1), the time to reach a precision ε > 0 with any black-box procedure is lower bounded by RLg RLg 2 Ω ∆τ + , (16) ε ε where the notation g(ε) = Ω(f (ε)) stands for ∃C > 0 s.t. ∀ε > 0, g(ε) ≥ Cf (ε). This lower bound proves that the convergence rate of DRS in Eq. (13) is optimal with respect to computation time and within a d1/4 multiplicative factor of the optimal convergence rate with respect to communica- tion. 6

7.The proof of Theorem 2 relies on the use of two objective functions: first, the standard worst case function used for single machine convex optimization (see e.g. [18]) is used to obtain a lower bound on the local computation time of individual machines. Then, a second function first introduced in [17] is split on the two most distant machines to obtain worst case communication times. By aggre- gating these two functions, a third one is obtained with the desired lower bound on the convergence rate. The complete proof is available in Appendix B. n Remark 2. The lower bound also holds for the average of local parameters n1 i=1 θi , and more generally any parameter that can be computed using any vector of the local memories at time t: in Theorem 2, θi,t may be replaced by any θt such that θt ∈ Span Mi,t . (17) i∈V 4 Decentralized optimization under local regularity In many practical scenarios, the network may be unknown or changing through time, and a local communication scheme is preferable to the master/slave approach of Alg. 1. Decentralized algo- rithms tackle this problem by replacing targeted communication by local averaging of the values of neighboring nodes [9]. More specifically, we now consider that, during a communication step, each machine i broadcasts a vector xi ∈ Rd to its neighbors, then performs a weighted average of the values received from its neighbors: node i sends xi to his neighbors and receives j Wji xj . (18) In order to ensure the efficiency of this communication scheme, we impose standard assumptions on the matrix W ∈ Rn×n , called the gossip matrix [9, 8]: 1. W is symmetric and positive semi-definite, 2. The kernel of W is the set of constant vectors: Ker(W ) = Span(1), where 1 = (1, ..., 1)⊤ , 3. W is defined on the edges of the network: Wij = 0 only if i = j or (i, j) ∈ E. 4.1 Optimal convergence rate Similarly to the smooth and strongly-convex case of [8], the lower bound on the optimal conver- gence rate is obtained by replacing the diameter of the network with 1/ γ(W ), where γ(W ) = λn−1 (W )/λ1 (W ) is the ratio between smallest non-zero and largest eigenvalues of W , also known as the normalized eigengap. Theorem 3. Let Lℓ , R > 0 and γ ∈ (0, 1]. There exists a matrix W of eigengap γ(W ) = γ, and n √ functions fi satisfying (A2), where n is the size of W , such that for all t < d−2 2 min(τ / γ, 1) and all i ∈ {1, ..., n}, RLℓ 1 1 f¯(θi,t ) − min f¯(θ) ≥ √ 2t γ 2 + . (19) θ∈B2 (R) 108 (1 + 1+t τ ) Proof. See Appendix C. 7

8.Assuming that the dimension d is large compared to the characteristic values of the problem, The- orem 3 implies that, under the local regularity assumption (A2) and for a gossip matrix W with eigengap γ(W ), the time to reach a precision ε > 0 with any decentralized black-box procedure is lower-bounded by RLℓ τ RLℓ 2 Ω + . (20) ε γ(W ) ε The proof of Theorem 3 relies on linear graphs (whose diameter is proportional to 1/ γ(L) where L is the Laplacian matrix) and Theorem 2. More specifically, a technical aspect of the proof consists in splitting the functions used in Theorem 2 on multiple nodes to obtain a dependency in Lℓ instead of Lg . The complete derivation is available in Appendix C. 4.2 Optimal decentralized algorithm We now provide an optimal decentralized optimization algorithm under (A2). This algorithm is closely related to the primal-dual algorithm proposed by [14] which we modify by the use of accel- erated gossip using Chebyshev polynomials as in [8]. First, we formulate our optimization problem in Eq. (1) as the saddle-point problem Eq. (21) below, based on a duality argument similar to that in [8]. Then, we consider a square root A of the sym- metric semi-definite positive gossip matrix W , of size n × m for some m, such that AA⊤ = W and A⊤ u = 0 if and only if u is constant, and consider the equivalent problem of minimizing 1 n n n i=1 fi (θi ) over Θ = (θ1 , . . . , θn ) ∈ K with the constraint that θ1 = · · · = θn , or equivalently ΘA = 0. Through Lagrangian duality, we therefore get the equivalent problem: n 1 min max fi (θi ) − tr Λ⊤ ΘA . (21) Θ∈Kn Λ∈Rd×n n i=1 We solve it by applying Algorithm 1 in Chambolle-Pock [21] (we could alternatively apply compos- ite Mirror-Prox [22]), with the following steps at each iteration t, with initialization Λ0 = 0 and Θ0 = Θ−1 = (θ0 , . . . , θ0 ): (a) Λt+1 = Λt − σ(2Θt+1 − Θt )A n 1 1 (22) (b) Θt+1 = argmin fi (θi ) − tr Θ⊤ Λt+1 A⊤ + tr(Θ − Θt )⊤ (Θ − Θt ) , Θ∈K n n i=1 2η where the gain parameters η, σ are required to satisfy σηλ1 (W ) ≤ 1. We implement the algorithm with the variables Θt and Y t = Λt A⊤ = (y1t , . . . , ynt ) ∈ Rd×n , for which all updates can be made locally: Since AA⊤ = W , they now become (a′ ) Y t+1 = Y t − σ(2Θt+1 − Θt )W 1 1 (23) (b′ ) θit+1 = argmin fi (θi ) − θi⊤ yit+1 + θi − θit 2 , ∀i ∈ {1, . . . , n} , θi ∈K n 2η The step (b′ ) still requires a proximal step for each function fi . We approximate it by the outcome of the subgradient method run for M steps, with a step-size proportional to 2/(m + 2) as suggested in [23]. That is, initialized with θ˜i0 = θit , it performs the iterations m ˜m 2 η θ˜im+1 = θi − ∇fi (θ˜im ) − ηyit+1 − θit , m = 0, . . . , M − 1. (24) m+2 m+2 n 8

9.Algorithm 2 multi-step primal-dual algorithm Input: approximation error ε > 0, gossip matrix W ∈ √ Rn×n , K 4ε 1− γ(W ) nR 1−c1 1+c2K K = ⌊1/ γ(W )⌋, M = T = ⌈ RL ⌉, c1 = √ ,η= Lℓ 1+cK , σ= 1 τ (1−cK 2 . ℓ 1+ γ(W ) 1 1 ) Output: optimizer θ¯T 1: Θ0 = 0, Θ−1 = 0, Y0 = 0 2: for t = 0 to T − 1 do 3: Y t+1 = Y t − σ ACCELERATED G OSSIP(2Θt − Θt−1 , W , K) // see [8, Alg. 2] 4: Θ˜ 0 = Θt 5: for m = 0 to M − 1 do 6: m ˜m θ˜im+1 = m+2 2 θi − m+2 η ˜m t+1 n ∇fi (θi ) − ηyi − θit , ∀i ∈ {1, . . . , n} 7: end for 8: Θt+1 = Θ ˜M 9: end for T n 10: return θ¯T = T1 n 1 t=1 i=1 θi t We thus replace the step (b′ ) by running M steps of the subgradient method to obtain θ˜iM . Theorem 4. Under local regularity (A2), the approximation error with the iterative algorithm of Eq. (23) after T iterations and using M subgradient steps per iteration is bounded by RLℓ 1 1 f¯(θ¯T ) − min f¯(θ) ≤ + . (25) θ∈K γ(W ) T M Proof. See Appendix D. Theorem 4 implies that the proposed algorithm achieves an error of at most ε in a time no larger than 2 RLℓ τ RLℓ 1 O + . (26) ε γ(W ) ε γ(W ) While the first term (associated to communication) is optimal, the second does not match the lower bound of Theorem 3. This situation is similar to that of strongly-convex and smooth decentralized optimization [8], when the number of communication steps is taken equal to the number of overall iterations. By using Chebyshev acceleration [24, 25] with an increased number of communication steps, the algorithm reaches the optimal convergence rate. More precisely, since one communication step is a multiplication (of Θ e.g.) by the gossip matrix W , performing K communication steps is equivalent to multiplying by a power of W . More generally, multiplication by any polynomial PK (W ) of degree K can be achieved in K steps. Since our algorithm depends on the eigengap of the gossip matrix, a good choice of polynomial consists in maximizing this eigengap γ(PK (W )). This is the approach followed by [8] and leads to the choice PK (x) = 1 − TK (c2 (1 − x))/TK (c2 ), where c2 = (1 + γ(W ))/(1 − γ(W )) and TK are the Chebyshev polynomials [24] defined as T0 (x) = 1, T1 (x) = x, and, for all k ≥ 1, Tk+1 (x) = 2xTk (x) − Tk−1 (x). We refer the reader to [8] for more details on the method. Finally, as mentioned in [8], chosing K = ⌊1/ γ(W )⌋ leads to an eigengap γ(PK (W )) ≥ 1/4 and the optimal convergence rate. We denote the resulting algorithm as multi-step primal-dual (MSPD) and describe it in Alg. 2. The procedure ACCELERATED G OSSIP is extracted from [8, Algorithm 2] and performs one step 9

10.of Chebyshev accelerated gossip, while steps 4 to 8 compute the approximation of the minimization problem (b’) of Eq. (23). Our performance guarantee for the MSPD algorithm is then the follow- ing: Theorem 5. Under local regularity (A2), Alg. 2 achieves an approximation error f¯(θ¯T ) − f¯(θ∗ ) of at most ε > 0 in a time Tε upper-bounded by RLℓ τ RLℓ 2 O + , (27) ε γ(W ) ε which matches the lower complexity bound of Theorem 3. Alg. 2 is therefore optimal under the the local regularity assumption (A2). Proof. See Appendix D. Remark 3. It is clear from the algorithm’s description that it completes its T iterations by time 2 4RLℓ τ 4RLℓ Tε ≤ + . (28) ε γ(W ) ε T n To obtain the average of local parameters θ¯T = nT1 t=1 i=1 θi , one can then rely on the gossip algorithm [9] to average over the network the individual nodes’ time averages. Let W ′ = I − c3 PK (W ) where c3 = (1 + c2K K 2 ′ 1 )/(1 − c1 ) . Since W is bi-stochastic, semi-definite positive and λ2 (W ′ ) = 1 − γ(PK (W )) ≤ 3/4, using it for gossiping the time averages leads to a time O √τγ ln RL ε ℓ to ensure that each node reaches a precision ε on the objective function (see [9] for more details on the linear convergence of gossip), which is negligible compared to Eq. (27). Remark 4. A stochastic version of the algorithm is also possible by considering stochastic oracles on each fi and using stochastic subgradient descent instead of the subgradient method. Remark 5. In the more general context where node compute times ρi are not necessarily all equal to 1, we may still apply Alg. 2, where now the number of subgradient iterations performed by node i is M/ρi rather than M . The proof of Theorem 5 also applies, and now yields the modified upper bound on time to reach precision ε: RLℓ τ RLc 2 O + , (29) ε γ(W ) ε 1 n where L2c = n i=1 ρi L2i . 5 Conclusion In this paper, we provide optimal convergence rates for non-smooth and convex distributed optimiza- tion in two settings: Lipschitz continuity of the global objective function, and Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide optimal convergence rates that depend on the ℓ2 -average of the local Lipschitz constants and the (normalized) eigengap of the gossip matrix. Moreover, we also provide the first optimal decentralized algorithm, called multi-step primal-dual (MSPD). Under the global regularity assumption, we provide a lower complexity bound that depends on the Lipschitz constant of the (global) objective function, as well as a distributed version of the smoothing 10

11.approach of [10] and show that this algorithm is within a d1/4 multiplicative factor of the optimal convergence rate. √ In both settings, the optimal convergence rate exhibits two different speeds: a slow rate in Θ(1/ t) with respect to local computations and a fast rate in Θ(1/t) due to communication. Intuitively, com- munication is the limiting factor in the initial phase of optimization. However, its impact decreases with time and, for the final phase of optimization, local computation time is the main limiting fac- tor. The analysis presented in this paper allows several natural extensions, including time-varying com- munication networks, asynchronous algorithms, stochastic settings, and an analysis of unequal node compute speeds going beyond Remark 5. Moreover, despite the efficiency of DRS, finding an opti- mal algorithm under the global regularity assumption remains an open problem and would make a notable addition to this work. Acknowledgements We acknowledge support from the European Research Council (grant SEQUOIA 724063). References [1] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent opti- mization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009. [2] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed op- timization and statistical learning via the alternating direction method of multipliers. Founda- tions and Trends in Machine Learning, 3(1):1–122, 2011. [3] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic control, 57(3):592–606, 2012. [4] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015. [5] Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014. [6] Dušan Jakoveti´c, José M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015. [7] Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017. [8] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Opti- mal algorithms for smooth and strongly convex distributed optimization in networks. In Pro- ceedings of the 34th International Conference on Machine Learning ICML, pages 3027–3036, 2017. 11

12. [9] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algo- rithms. IEEE/ACM Transactions on Networking (TON), 14(SI):2508–2530, 2006. [10] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochas- tic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012. [11] Dušan Jakoveti´c, Joao Xavier, and José M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014. [12] Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gra- dient algorithm. Journal of Machine Learning Research, 17(1):2165–2199, 2016. [13] Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012. [14] Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient algorithms for decentral- ized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017. [15] Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hof- mann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems 27, pages 3068–3076, 2014. [16] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Advances in Neural Information Processing Systems 27, pages 163–171, 2014. [17] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems 28, pages 1756– 1764, 2015. [18] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015. [19] J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathéma- tique de France, 93:273–299, 1965. [20] Yurii Nesterov. Introductory lectures on convex optimization : a basic course. Kluwer Aca- demic Publishers, 2004. [21] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex prob- lems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120– 145, May 2011. [22] Niao He, Anatoli Juditsky, and Arkadi Nemirovski. Mirror prox algorithm for multi-term composite minimization and semi-separable problems. Computational Optimization and Ap- plications, 61(2):275–319, 2015. [23] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. Technical Report 1212.2002, arXiv, 2012. [24] W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011. [25] M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014. 12

13.A Proof of the convergence rate of DRS (Theorem 1) Corollary 2.4 of [10] gives, with the appropriate choice of gradient step ηt and smoothing γt , 10RLg d1/4 5RLg E f¯(θT ) − min f¯(θ) ≤ +√ . (30) θ∈K T TK 20RLg d1/4 5RLg d−1/4 Thus, to reach a precision ε > 0, we may set T = ε and K = ε , leading to the desired bound on the time Tε = T (2∆τ + K) to reach ε. B Proof of the lower bound under global regularity (Theorem 2) Let i0 ∈ V and i1 ∈ V be two nodes at distance ∆. The function used by [18] to prove the oracle complexity for Lipschitz and bounded functions is α g1 (θ) = δ max θi + θ 22 . (31) i∈{1,...,t} 2 2 By considering this function on a single node (e.g. i0 ), at least O RL ε subgradients will be necessary to obtain a precision ε. Moreover, we also split the difficult function used in [17] t α g2 (θ) = γ |θi+1 − θi | − βθ1 + θ 22 , (32) i=1 2 on the two extremal nodes i0 and i1 in order to ensure that communication is necessary between the most distant nodes of the network. The final function that we consider is, for all i ∈ {1, ..., n}, k γ i=1 |θ2i − θ2i−1 | + δ maxi∈{2k+2,...,2k+1+l} θi if i = i0 k fi (θ) = γ i=1 |θ2i+1 − θ2i | − βθ1 + α2 θ 22 if i = i1 , (33) 0 otherwise where γ, δ, β, α > 0 and k, l ≥ 0 are parameters of the function satisfying 2k + l < d. The objective function is thus 2k 1 α f¯(θ) = γ |θi+1 − θi | − βθ1 + δ max θi + θ 2 2 (34) n i=1 i∈{2k+2,...,2k+1+l} 2 First, note that reordering the coordinates of θ between θ2 and θ2k+1 in a decreasing order can only decrease the value function f¯(θ). Hence, the optimal value θ∗ verifies this constraint and 1 α ∗ f¯(θ∗ ) = ∗ −γθ2k+1 − (β − γ)θ1∗ + δ max θi∗ + θ 2 2 . (35) n i∈{2k+2,...,2k+1+l} 2 Moreover, at the optimum, all the coordinates between θ2 and θ2k+1 are equal, all the coordi- nates between θ2k+2 and θ2k+1+l are also equal, and all the coordinates after θ2k+1+l are zero. Hence 1 α ∗2 2 2 f¯(θ∗ ) = ∗ −γθ2k+1 − (β − γ)θ1∗ + δθ2k+2 ∗ + ∗ θ + 2kθ2k+1 ∗ + lθ2k+2 , (36) n 2 1 13

14. 1 and optimizing over θ1∗ ≥ θ2k+1 ∗ ∗ ≥ 0 ≥ θ2k+2 leads to, when β ≥ γ(1 + 2k ), −1 γ2 δ2 f¯(θ∗ ) = (β − γ)2 + + . (37) 2αn 2k l Now note that, starting from θ0 = 0, each subgradient step can only increase the number of non-zero coordinates between θ2k+2 and θ2k+1+l by at most one. Thus, when t < l, we have max θt,i ≥ 0 . (38) i∈{2k+2,...,2k+1+l} Moreover, increasing the number of non-zero coordinates between θ1 and θ2k+1 requires at least one subgradient step and ∆ communication steps. As a result, when t < min{l, 2k∆τ }, we have θt,2k+1 = 0 and f¯(θt ) ≥ minθ∈Rd n1 −(β − γ)θ1 + α2 θ 22 2 (39) ≥ −(β−γ) 2αn . Hence, we have, for t < min{l, 2k∆τ }, 1 γ2 δ2 f¯(θt ) − f¯(θ∗ ) ≥ + . (40) 2αn 2k l Optimizing f¯ over a ball of radius R ≥ θ∗ 2 thus leads to the previous approximation error bound, and we choose 1 γ2 δ2 R = θ∗ 2 = (β − γ) 2 + + . (41) α2 2k l Finally, the Lipschitz constant of the objective function f¯ is 1 √ Lg = β + 2 2k + 1γ + δ + αR , (42) n Lg n Lg n and setting the parameters of f¯ to β = γ(1 + √1 ), 2k δ = 9 , γ = √ , 9 k l = ⌊t⌋ + 1, and t k = 2∆τ + 1 leads to t < min{l, 2k∆τ } and RLg 1 1 f¯(θt ) − f¯(θ∗ ) ≥ t 2 + , (43) 36 (1 + 2∆τ ) 1+t while f¯ is L-Lipschitz and θ∗ 2 ≤ R. C Proof of the lower bound under local regularity (Theorem 3) Following the idea introduced in [8], we prove Theorem 3 by applying Theorem 2 on linear graphs and splitting the local functions of Eq. (33) on multiple nodes to obtain Lg ≈ Lℓ . Lemma 2. Let γ ∈ (0, 1]. There exists a graph Gγ of size nγ and a gossip matrix Wγ ∈ Rnγ ×nγ on this graph such that γ(Wγ ) = γ and 2 γ≥ . (44) (nγ + 1)2 When γ ≥ 1/3, Gγ is a totally connected graph of size nγ = 3. Otherwise, Gγ is a linear graph of size nγ ≥ 3. 14

15.Proof. First of all, when γ ≥ 1/3, we consider the totally connected network of 3 nodes, reweight only the edge (v1 , v3 ) by a ∈ [0, 1], and let Wa be its Laplacian matrix. If a = 1, then the network is totally connected and γ(Wa ) = 1. If, on the contrary, a = 0, then the network is a linear graph and γ(Wa ) = 1/3. Thus, by continuity of the eigenvalues of a matrix, there exists a value a ∈ [0, 1] such 1−cos( π ) that γ(Wa ) = γ and Eq. (44) is trivially verified. Otherwise, let xn = 1+cos( πn ) be a decreasing n sequence of positive numbers. Since x3 = 1/3 and limn xn = 0, there exists nγ ≥ 3 such that xnγ ≥ γ > xnγ +1 . Let Gγ be the linear graph of size nγ ordered from node v1 to vnγ , and weighted with wi,i+1 = 1 − a1{i = 1}. If we take Wa as the Laplacian of the weighted graph Gγ , a simple calculation gives that, if a = 0, γ(Wa ) = xnγ and, if a = 1, the network is disconnected and γ(Wa ) = 0. Thus, there exists a value a ∈ [0, 1] such that γ(Wa ) = γ. Finally, by definition of nγ , 2 one has γ > xnγ +1 ≥ (nγ +1) 2. Let γ ∈ (0, 1] and Gγ the graph of Lemma 2. We now consider I0 = {1, ..., m} and I1 = {nγ − n +1 m + 1, ..., nγ } where m = ⌊ γ3 ⌋. When γ < 1/3, the distance d(I0 , I1 ) between the two sets I0 and I1 is thus bounded by nγ + 1 d(I0 , I1 ) = nγ − 2m + 1 ≥ , (45) 3 and we have 1 3d(I0 , I1 ) √ ≤ √ . (46) γ 2 Moreover, Eq. (46) also trivially holds when γ ≥ 1/3. We now consider the local functions of Eq. (33) splitted on I0 and I1 : 1 k m γ i=1 |θ2i − θ2i−1 | + δ maxi∈{2k+2,...,2k+1+l} θi if i ∈ I0 fi (θ) = 1 k α 2 if i ∈ I1 . (47) m γ i=1 |θ2i+1 − θ2i | − βθ1 + 2 θ 2 0 otherwise The average function f¯ remains unchanged and the time to communicate a vector between a node of I0 and a node of I1 is at least d(I0 , I1 )τ . Thus, the same result as Theorem 2 holds with ∆ = d(I0 , I1 ). We thus have RLg 1 1 f¯(θi,t ) − min f¯(θ) ≥ t 2 + . (48) θ∈B2 (R) 36 (1 + 2d(I0 ,I1 )τ ) 1+t Finally, the local Lipschitz constant Lℓ is bounded by nγ Lℓ ≤ Lg ≤ 3Lg , (49) m and Eq. (46), Eq. (48) and Eq. (49) lead to the desired result. 15

16.D Proof of the convergence rate of MSPD (Theorem 4 and The- orem 5) Theorem 1 (b) in [21] implies that, provided τ σλ1 (W ) < 1, the algorithm with exact proximal step leads to a restricted primal-dual gap n n 1 1 sup fi (θi ) − tr Λ′⊤ ΘA − ′inf n fi (θi′ ) − tr Λ⊤ Θ′ A Λ′ F ≤c n i=1 Θ ∈K n i=1 of 1 nR2 c2 + . ε= 2t η σ This implies that our candidate Θ is such that n n 1 1 fi (θi ) + c ΘA F ≤ ′inf n fi (θi′ ) + c Θ′ A F +ε . n i=1 Θ ∈K n i=1 Let θ be the average of all θi . We have: n n n 1 1 1 fi (θ) ≤ fi (θi ) + L i θi − θ n i=1 n i=1 n i=1 n n 1 1 1 ≤ fi (θi ) + √ L2i · Θ(I − 11⊤ /n) F n i=1 n n i=1 n 1 n 1 1 n L2i i=1 ≤ fi (θi ) + √ · ΘA F . n i=1 n λn−1 (W ) 1 n 2 √1 i=1 Li Thus, if we take c = n n λn−1 (W ) , we obtain n n 1 1 fi (θ) ≤ fi (θ∗ ) + ε, n i=1 n i=1 and we thus obtain a ε-minimizer of the original problem. We have n 1 1 1 nR2 λn−1 (W ) n2 i=1 L2i ε≤ + 2T η σ with the constraint σηλ1 (W ) < 1. This leads to, with the choice λn−1 (W )/λ1 (W ) η = nR n 2 i=1 Li /n and taking σ to the limit σηλ1 (W ) = 1, to a convergence rate of n 1 1 λ1 (W ) ε= R L2i . T n i=1 λn−1 (W ) 16

17.Since we cannot use the exact proximal operator of fi , we instead approximate it. If we approximate n (with the proper notion of gap [21, Eq. (11)]) each argminθi ∈K fi (θi ) + 2η θi − z 2 up to δi , then n the overall added gap is n1 i=1 δi . If we do M steps of subgradient steps then the associated gap is L2i η δi = nM (standard result for strongly-convex subgradient [23]). Therefore the added gap is n 1 1 λ1 (W ) R L2i . M n i=1 λn−1 (W ) Therefore after T communication steps, i.e., communication time T τ plus M T subgradient evalua- tions, i.e., time M T , we get an error of 1 1 RLℓ + √ , T M γ where γ = γ(W ) = λn−1 (W )/λ1 (W ). Thus to reach ε, it takes 2 2RLℓ 1 4RLℓ 1 √ τ+ √ . ε γ ε γ The second term is optimal, while the first term is not. We therefore do accelerated gossip in- √ stead of plain gossip. By performing K steps of gossip instead of one, with K = ⌊1/ γ⌋, the eigengap is lower bounded by γ(PK (W )) ≥ 1/4, and the overall time to obtain an error below ε becomes 2 4RLℓ τ 4RLℓ + , ε γ(W ) ε as announced. 17