图论算法在现代数据模型中的应用

在过去十年间,许多分布式计算平台应运而生,例如Map-Reduce;分布式流处理。而本章,首先介绍了Map-Reduce,一个非常成功的想法,它改变了离线分析和批量数据处理。其次介绍了DSG(密集子图)的应用,LP对偶等概念以及其他相关应用。
展开查看详情

1.Graph Algorithms for Modern Data Models Ashish Goel Stanford University Joint work with Kamesh Munagala ; Bahman Bahmani and Abdur Chowdhury

2.Over the past decade, many commodity distributed computing platforms have emerged Two examples: Map-Reduce; Distributed Stream Processing Similar to PRAM models, but have several nuances Carefully calibrated to take latencies of disks vs network vs memory into account C ost of processing is often negligible compared to the cost of data transfer Take advantage of aggregation in disk and network operations Example: the cost of sending 100KB is about the same as sending 1 Byte over a network Modern Data Models

3.Data Model #1: Map Reduce An immensely successful idea which transformed offline analytics and bulk-data processing. Hadoop (initially from Yahoo!) is the most popular implementation. MAP: Transforms a (key, value) pair into other (key, value) pairs using a UDF (User Defined Function) called Map . Many mappers can run in parallel on vast amounts of data in a distributed file system SHUFFLE: The infrastructure then transfers data from the mapper nodes to the “reducer” nodes so that all the (key, value) pairs with the same key go to the same reducer and get grouped into a single large (key, <val 1 , val 2 , ..>) pair REDUCE: A UDF that processes this grouped (key, <val 1 , val 2 , ..>) pair for a single key. Many reducers can run in parallel.

4.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together Number of MapReduce phases [Goel, Munagala ; 2012]

5.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together [Goel, Munagala ; 2012] THE CURSE OF THE LAST REDUCER

6.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together [Goel, Munagala ; 2012] SHUFFLE SIZE

7.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together [Goel, Munagala ; 2012] THE AMOUNT OF WORK DONE TO AGGREGATE ALL THE VALUES FOR A SINGLE KEY (SORTING) IS NOT A COMPLEXITY MEASURE

8.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together [Goel, Munagala ; 2012] THE AMOUNT OF WORK DONE TO AGGREGATE ALL THE VALUES FOR A SINGLE KEY (SORTING) IS NOT A COMPLEXITY MEASURE

9.Densest Subgraph (DSG) Given: an undirected graph G = (V,E) , with N nodes, M edges, and maximum degree d MAX For a subset S of nodes, let E(S) denote the set of edges between nodes in S Goal: Find the set S that maximizes |E(S)|/|S| Applications: Community detection Can be solved in polynomial time A (2+ε) -approximation known on MapReduce O((log N)/ ε ) -phases Each phase has sequential complexity O(M) and key complexity O( d MAX ) [ Bahmani , Kumar, Vassilvitskii ; 2012]

10.0 1 0 1 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

11.0 1 0 1 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

12.LP Formulation Maximize Σ e y e Subject to: Σ v x v ≤ 1 y e ≤ x v [for all nodes v, edges e, such that e is incident on v] x , y ≥ 0

13.LP Formulation Maximize Σ e y e Subject to: Σ v x v ≤ 1 y e ≤ x v [for all nodes v, edges e, such that e is incident on v] x , y ≥ 0 x v indicates whether node v Is part of S

14.LP Formulation Maximize Σ e y e Subject to: Σ v x v ≤ 1 y e ≤ x v [for all nodes v, edges e, such that e is incident on v] x , y ≥ 0 x v indicates whether node v Is part of S y e indicates whether edge e Is part of E(S)

15.LP Formulation Maximize Σ e y e Subject to: Σ v x v ≤ 1 y e ≤ x v [for all nodes v, edges e, such that e is incident on v] x , y ≥ 0 x v indicates whether node v Is part of S y e indicates whether edge e Is part of E(S) Edge e can be in E(S) only if its endpoints are in S

16.LP Formulation Maximize Σ e y e Subject to: Σ v x v ≤ 1 y e ≤ x v [for all nodes v, edges e, such that e is incident on v] x , y ≥ 0 x v indicates whether node v Is part of S y e indicates whether edge e Is part of E(S) Edge e can be in E(S) only if its endpoints are in S Maximizing Σ e y e while setting Σ v x v ≤ 1 maximizes density

17.LP Formulation Maximize Σ e y e Subject to: Σ v x v ≤ 1 y e ≤ x v [for all nodes v, edges e, such that e is incident on v] x , y ≥ 0 x v indicates whether node v Is part of S y e indicates whether edge e Is part of E(S) Edge e can be in E(S) only if its endpoints are in S Maximizing Σ e y e while setting Σ v x v ≤ 1 maximizes density The LP has NO INTEGRALITY GAP

18.General Direction for DSG Write the dual of the LP, and solve it on MapReduce PST type algorithms: Perform multiplicative updates of dual weights. Powerful primal-dual technique, with many applications in online, parallelized, and centralized algorithms. Approach: formulate the dual in a form suitable for PST; reduce width for efficiency; increase width for obtaining the primal back from the dual [ Plotkin , Shmoys , Tardos ; 1995] [General exposition: Arora , Hazan , Kale; 2010] [Many updates, variants: eg . Garg , Konemann 1998]

19.The Primal and its Dual Maximize Σ e y e Subject to: Σ v x v ≤ 1 [D] y e ≤ x v [ ® e,v ] x, y ≥ 0 Minimize D Subject to: ® e,v + ® e,w ≥ 1 [y e ] [for all edges e = ( v,w )] Σ e incident on v ® e,v ≤ D [x v ] [for all nodes v] ® , D ≥ 0 USEFUL FACT: An approximate solution to this dual results in an approximate solution to the primal

20.The Primal and its Dual Maximize Σ e y e Subject to: Σ v x v ≤ 1 [D] y e ≤ x v [ ® e,v ] x, y ≥ 0 Minimize D Subject to: ® e,v + ® e,w ≥ 1 [y e ] [for all edges e = ( v,w )] Σ e incident on v ® e,v ≤ D [x v ] [for all nodes v] ® , D ≥ 0 USEFUL FACT: An approximate solution to this dual results in an approximate solution to the primal

21.Solving the Dual Minimize D Guess D Subject to: Try to find ® , s.t. ® e,v + ® e,w ≥ 1 [for all edges e = ( v,w )] Σ e incident on v ® e,v ≤ D [for all nodes v] ® ≥ 0 ® 2 P

22.Solving the Dual PST: Solve the dual using calls to the following oracle , for given y e : Maximize Σ e y e ( ® e ,u + ® e ,v ) s.t. ® 2 P Width, ½ = max { ® e ,v + ® e ,w } s.t. ® 2 P Guarantee: We get a (1+ ² ) -approximation in O(( ½ log N)/ ² 2 ) steps First Problem: ½ is too large (as large as D) Minimize D Guess D Subject to: Try to find ® , s.t. ® e,v + ® e,w ≥ 1 [for all edges e = ( v,w )] Σ e incident on v ® e,v ≤ D [for all nodes v] ® ≥ 0 ® 2 P

23.The Dual Oracle on MapReduce Need to compute the oracle in each iteration: Maximize Σ e y e ( ® e ,u + ® e ,v ) , subject to : Σ e incident on v ® e ,v ≤ D ; ® ≥ 0 Maps well to MapReduce Map(edge e = ( u,v ), y e ): EMIT(u, (e, y e )); Emit(v, (e, y e )) Reduce(node u, <(e 1 , y e1 ), …>) : Find the largest y e in the values list, and output ® e,u = D and everything else is implicitly 0 Key complexity: O( d MAX ) ; sequential complexity : O(M)

24.The Dual Oracle on MapReduce Need to compute the oracle in each iteration: Maximize Σ e y e ( ® e ,u + ® e ,v ) , subject to : Σ e incident on v ® e ,v ≤ D ; ® ≥ 0 Maps well to MapReduce Map(edge e = ( u,v ), y e ): EMIT(u, (e, y e )); Emit(v, (e, y e )) Reduce(node u, <(e 1 , y e1 ), …>) : Find the largest y e in the values list, and output ® e,u = D and everything else is implicitly 0 Key complexity: O( d MAX ) ; sequential complexity : O(M)

25.PST: Solve the dual using calls to the following oracle , for given y e : Maximize Σ e y e ( ® e ,u + ® e ,v ) s.t. ® 2 P Width, ½ = max { ® e ,v + ® e ,w } s.t. ® 2 P Guarantee: We get a (1+ ² ) -approximation in O(( ½ log N)/ ² 2 ) steps First Problem: ½ is too large (as large as D ) Solving the Dual ® 2 P Minimize D Guess D Subject to: Try to find ® , s.t. ® e,v + ® e,w ≥ 1 [for all edges e = ( v,w )] Σ e incident on v ® e,v ≤ D [for all nodes v] ® ≥ 0

26.Solving the Dual: Reducing Width ® 2 P Minimize D Guess D Subject to: Try to find ® , s.t. ® e,v + ® e,w ≥ 1 [for all edges e = ( v,w )] Σ e incident on v ® e,v ≤ D [for all nodes v] ® ≥ 0; ® ≤ 1

27.Solving the Dual: Reducing Width Width ½ = max { ® e ,v + ® e ,w } s.t. ® 2 P The optimum solution to the dual LP never sets any ® e,u t o be larger than 1 , and hence, adding the “ ® ≤ 1 ” constraints does not change the dual solution Next problem: It no longer holds that an approximate dual leads to an approximate primal ® 2 P Minimize D Guess D Subject to: Try to find ® , s.t. ® e,v + ® e,w ≥ 1 [for all edges e = ( v,w )] Σ e incident on v ® e,v ≤ D [for all nodes v] ® ≥ 0; ® ≤ 1

28.Preserving Approximation Replace “ ® ≤ 1 ” with “ ® ≤ 2 ” The width increases by only O(1) , but: Technical Lemma: A (1+ ² ) -approximate solution to the dual results in a (1+O( ² )) -approximate solution to the primal ® 2 P Minimize D Guess D Subject to: Try to find ® , s.t. ® e,v + ® e,w ≥ 1 [for all edges e = ( v,w )] Σ e incident on v ® e,v ≤ D [for all nodes v] ® ≥ 0; ® ≤ 2

29.Performance O((log N)/ ² 2 ) iterations Each iteration: Reduce-key complexity: O( d MAX ) Sequential complexity: O(M) The greedy algorithm takes O((log N)/ ² ) iterations, but gives a (2+ ² ) -approximation Extends to fractional matchings , and directed graphs [Goel, Munagala ; 2013]