# еӣҫи®әз®—жі•еңЁзҺ°д»Јж•°жҚ®жЁЎеһӢдёӯзҡ„еә”з”Ё

еңЁиҝҮеҺ»еҚҒе№ҙй—ҙпјҢи®ёеӨҡеҲҶеёғејҸи®Ўз®—е№іеҸ°еә”иҝҗиҖҢз”ҹпјҢдҫӢеҰӮMap-ReduceпјӣеҲҶеёғејҸжөҒеӨ„зҗҶгҖӮиҖҢжң¬з« пјҢйҰ–е…Ҳд»Ӣз»ҚдәҶMap-ReduceпјҢдёҖдёӘйқһеёёжҲҗеҠҹзҡ„жғіжі•пјҢе®ғж”№еҸҳдәҶзҰ»зәҝеҲҶжһҗе’Ңжү№йҮҸж•°жҚ®еӨ„зҗҶгҖӮе…¶ж¬Ўд»Ӣз»ҚдәҶDSGпјҲеҜҶйӣҶеӯҗеӣҫпјүзҡ„еә”з”ЁпјҢLPеҜ№еҒ¶зӯүжҰӮеҝөд»ҘеҸҠе…¶д»–зӣёе…іеә”з”ЁгҖӮ
еұ•ејҖжҹҘзңӢиҜҰжғ…

1.Graph Algorithms for Modern Data Models Ashish Goel Stanford University Joint work with Kamesh Munagala ; Bahman Bahmani and Abdur Chowdhury

2.Over the past decade, many commodity distributed computing platforms have emerged Two examples: Map-Reduce; Distributed Stream Processing Similar to PRAM models, but have several nuances Carefully calibrated to take latencies of disks vs network vs memory into account C ost of processing is often negligible compared to the cost of data transfer Take advantage of aggregation in disk and network operations Example: the cost of sending 100KB is about the same as sending 1 Byte over a network Modern Data Models

3.Data Model #1: Map Reduce An immensely successful idea which transformed offline analytics and bulk-data processing. Hadoop (initially from Yahoo!) is the most popular implementation. MAP: Transforms a (key, value) pair into other (key, value) pairs using a UDF (User Defined Function) called Map . Many mappers can run in parallel on vast amounts of data in a distributed file system SHUFFLE: The infrastructure then transfers data from the mapper nodes to the вҖңreducerвҖқ nodes so that all the (key, value) pairs with the same key go to the same reducer and get grouped into a single large (key, &lt;val 1 , val 2 , ..&gt;) pair REDUCE: A UDF that processes this grouped (key, &lt;val 1 , val 2 , ..&gt;) pair for a single key. Many reducers can run in parallel.

4.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together Number of MapReduce phases [Goel, Munagala ; 2012]

5.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together [Goel, Munagala ; 2012] THE CURSE OF THE LAST REDUCER

6.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together [Goel, Munagala ; 2012] SHUFFLE SIZE

7.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together [Goel, Munagala ; 2012] THE AMOUNT OF WORK DONE TO AGGREGATE ALL THE VALUES FOR A SINGLE KEY (SORTING) IS NOT A COMPLEXITY MEASURE

8.Complexity Measures Key-Complexity: The maximum size of a key-value pair T he amount of time taken to process each key The memory required to process each key Sequential Complexity: The total time needed by all the mappers and reducers together The total output produced by all the mappers and reducers together [Goel, Munagala ; 2012] THE AMOUNT OF WORK DONE TO AGGREGATE ALL THE VALUES FOR A SINGLE KEY (SORTING) IS NOT A COMPLEXITY MEASURE

9.Densest Subgraph (DSG) Given: an undirected graph G = (V,E) , with N nodes, M edges, and maximum degree d MAX For a subset S of nodes, let E(S) denote the set of edges between nodes in S Goal: Find the set S that maximizes |E(S)|/|S| Applications: Community detection Can be solved in polynomial time A (2+Оө) -approximation known on MapReduce O((log N)/ Оө ) -phases Each phase has sequential complexity O(M) and key complexity O( d MAX ) [ Bahmani , Kumar, Vassilvitskii ; 2012]

10.0 1 0 1 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

11.0 1 0 1 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

12.LP Formulation Maximize ОЈ e y e Subject to: ОЈ v x v вүӨ 1 y e вүӨ x v [for all nodes v, edges e, such that e is incident on v] x , y вүҘ 0

13.LP Formulation Maximize ОЈ e y e Subject to: ОЈ v x v вүӨ 1 y e вүӨ x v [for all nodes v, edges e, such that e is incident on v] x , y вүҘ 0 x v indicates whether node v Is part of S

14.LP Formulation Maximize ОЈ e y e Subject to: ОЈ v x v вүӨ 1 y e вүӨ x v [for all nodes v, edges e, such that e is incident on v] x , y вүҘ 0 x v indicates whether node v Is part of S y e indicates whether edge e Is part of E(S)

15.LP Formulation Maximize ОЈ e y e Subject to: ОЈ v x v вүӨ 1 y e вүӨ x v [for all nodes v, edges e, such that e is incident on v] x , y вүҘ 0 x v indicates whether node v Is part of S y e indicates whether edge e Is part of E(S) Edge e can be in E(S) only if its endpoints are in S

16.LP Formulation Maximize ОЈ e y e Subject to: ОЈ v x v вүӨ 1 y e вүӨ x v [for all nodes v, edges e, such that e is incident on v] x , y вүҘ 0 x v indicates whether node v Is part of S y e indicates whether edge e Is part of E(S) Edge e can be in E(S) only if its endpoints are in S Maximizing ОЈ e y e while setting ОЈ v x v вүӨ 1 maximizes density

17.LP Formulation Maximize ОЈ e y e Subject to: ОЈ v x v вүӨ 1 y e вүӨ x v [for all nodes v, edges e, such that e is incident on v] x , y вүҘ 0 x v indicates whether node v Is part of S y e indicates whether edge e Is part of E(S) Edge e can be in E(S) only if its endpoints are in S Maximizing ОЈ e y e while setting ОЈ v x v вүӨ 1 maximizes density The LP has NO INTEGRALITY GAP

18.General Direction for DSG Write the dual of the LP, and solve it on MapReduce PST type algorithms: Perform multiplicative updates of dual weights. Powerful primal-dual technique, with many applications in online, parallelized, and centralized algorithms. Approach: formulate the dual in a form suitable for PST; reduce width for efficiency; increase width for obtaining the primal back from the dual [ Plotkin , Shmoys , Tardos ; 1995] [General exposition: Arora , Hazan , Kale; 2010] [Many updates, variants: eg . Garg , Konemann 1998]

19.The Primal and its Dual Maximize ОЈ e y e Subject to: ОЈ v x v вүӨ 1 [D] y e вүӨ x v [ В® e,v ] x, y вүҘ 0 Minimize D Subject to: В® e,v + В® e,w вүҘ 1 [y e ] [for all edges e = ( v,w )] ОЈ e incident on v В® e,v вүӨ D [x v ] [for all nodes v] В® , D вүҘ 0 USEFUL FACT: An approximate solution to this dual results in an approximate solution to the primal

20.The Primal and its Dual Maximize ОЈ e y e Subject to: ОЈ v x v вүӨ 1 [D] y e вүӨ x v [ В® e,v ] x, y вүҘ 0 Minimize D Subject to: В® e,v + В® e,w вүҘ 1 [y e ] [for all edges e = ( v,w )] ОЈ e incident on v В® e,v вүӨ D [x v ] [for all nodes v] В® , D вүҘ 0 USEFUL FACT: An approximate solution to this dual results in an approximate solution to the primal

21.Solving the Dual Minimize D Guess D Subject to: Try to find В® , s.t. В® e,v + В® e,w вүҘ 1 [for all edges e = ( v,w )] ОЈ e incident on v В® e,v вүӨ D [for all nodes v] В® вүҘ 0 В® 2 P

22.Solving the Dual PST: Solve the dual using calls to the following oracle , for given y e : Maximize ОЈ e y e ( В® e ,u + В® e ,v ) s.t. В® 2 P Width, ВҪ = max { В® e ,v + В® e ,w } s.t. В® 2 P Guarantee: We get a (1+ ВІ ) -approximation in O(( ВҪ log N)/ ВІ 2 ) steps First Problem: ВҪ is too large (as large as D) Minimize D Guess D Subject to: Try to find В® , s.t. В® e,v + В® e,w вүҘ 1 [for all edges e = ( v,w )] ОЈ e incident on v В® e,v вүӨ D [for all nodes v] В® вүҘ 0 В® 2 P

23.The Dual Oracle on MapReduce Need to compute the oracle in each iteration: Maximize ОЈ e y e ( В® e ,u + В® e ,v ) , subject to : ОЈ e incident on v В® e ,v вүӨ D ; В® вүҘ 0 Maps well to MapReduce Map(edge e = ( u,v ), y e ): EMIT(u, (e, y e )); Emit(v, (e, y e )) Reduce(node u, &lt;(e 1 , y e1 ), вҖҰ&gt;) : Find the largest y e in the values list, and output В® e,u = D and everything else is implicitly 0 Key complexity: O( d MAX ) ; sequential complexity : O(M)

24.The Dual Oracle on MapReduce Need to compute the oracle in each iteration: Maximize ОЈ e y e ( В® e ,u + В® e ,v ) , subject to : ОЈ e incident on v В® e ,v вүӨ D ; В® вүҘ 0 Maps well to MapReduce Map(edge e = ( u,v ), y e ): EMIT(u, (e, y e )); Emit(v, (e, y e )) Reduce(node u, &lt;(e 1 , y e1 ), вҖҰ&gt;) : Find the largest y e in the values list, and output В® e,u = D and everything else is implicitly 0 Key complexity: O( d MAX ) ; sequential complexity : O(M)

25.PST: Solve the dual using calls to the following oracle , for given y e : Maximize ОЈ e y e ( В® e ,u + В® e ,v ) s.t. В® 2 P Width, ВҪ = max { В® e ,v + В® e ,w } s.t. В® 2 P Guarantee: We get a (1+ ВІ ) -approximation in O(( ВҪ log N)/ ВІ 2 ) steps First Problem: ВҪ is too large (as large as D ) Solving the Dual В® 2 P Minimize D Guess D Subject to: Try to find В® , s.t. В® e,v + В® e,w вүҘ 1 [for all edges e = ( v,w )] ОЈ e incident on v В® e,v вүӨ D [for all nodes v] В® вүҘ 0

26.Solving the Dual: Reducing Width В® 2 P Minimize D Guess D Subject to: Try to find В® , s.t. В® e,v + В® e,w вүҘ 1 [for all edges e = ( v,w )] ОЈ e incident on v В® e,v вүӨ D [for all nodes v] В® вүҘ 0; В® вүӨ 1

27.Solving the Dual: Reducing Width Width ВҪ = max { В® e ,v + В® e ,w } s.t. В® 2 P The optimum solution to the dual LP never sets any В® e,u t o be larger than 1 , and hence, adding the вҖң В® вүӨ 1 вҖқ constraints does not change the dual solution Next problem: It no longer holds that an approximate dual leads to an approximate primal В® 2 P Minimize D Guess D Subject to: Try to find В® , s.t. В® e,v + В® e,w вүҘ 1 [for all edges e = ( v,w )] ОЈ e incident on v В® e,v вүӨ D [for all nodes v] В® вүҘ 0; В® вүӨ 1

28.Preserving Approximation Replace вҖң В® вүӨ 1 вҖқ with вҖң В® вүӨ 2 вҖқ The width increases by only O(1) , but: Technical Lemma: A (1+ ВІ ) -approximate solution to the dual results in a (1+O( ВІ )) -approximate solution to the primal В® 2 P Minimize D Guess D Subject to: Try to find В® , s.t. В® e,v + В® e,w вүҘ 1 [for all edges e = ( v,w )] ОЈ e incident on v В® e,v вүӨ D [for all nodes v] В® вүҘ 0; В® вүӨ 2

29.Performance O((log N)/ ВІ 2 ) iterations Each iteration: Reduce-key complexity: O( d MAX ) Sequential complexity: O(M) The greedy algorithm takes O((log N)/ ВІ ) iterations, but gives a (2+ ВІ ) -approximation Extends to fractional matchings , and directed graphs [Goel, Munagala ; 2013]

жҲ‘е°ұжҳҜжҲ‘пјҒ
е·Іе°Ҷй“ҫжҺҘеӨҚеҲ¶иҮіеүӘиҙҙжқҝ