Efficient computation of aggregate functions in large scale data processing framewor

1.Efficient computation of aggregate functions in large scale data processing frameworks Chao Zhang, Farouk Toumani and Emmanuel Gangler University Clermont Auvergne, France

2.MapReduce Paradigm Shuffle: sending <k,v> with Mapper: same k to identical reducer generating <k,v> Reducer: computing output on <k,v> with same k Communication cost:a bottleneck of MapReduce 2

3.Aggregation The inherent property of aggregation The property to break down the dependency : taking all values as input and : associativity. returning only one value as output. Input : 𝑋 = {𝑥1 , 𝑥2 , 𝑥3 , … , 𝑥𝑛 } Input : 𝑋 = 𝑋1 ∪ 𝑋2 … ∪ 𝑋𝑙 Output : 𝑦 = 𝛼(𝑋) Output : 𝑦 = 𝛼(𝑋) 𝑥1 𝑋1 𝛼(𝑋1 ) 𝑥2 𝑋2 𝛼(𝑋2 ) 𝑥3 𝛼 y 𝑋3 𝛼(𝑋3 ) 𝛼 y . . . . . . . . . 𝑥𝑛 𝑋𝑙 𝛼(𝑋𝑙 ) Partial aggregation:reducing communication cost 3

4.Research agenda Q: given an arbitrary aggregation 𝛼, when it is decomposable and how to decompose it? • Classifying aggregations into symmetric (commutative) and asymmetric families. • Generic frameworks to decompose aggregation. • Properties for the frameworks to be efficient: • Sufficient and necessary condition for symmetric aggregation; • Sufficient condition for a class of asymmetric aggregation. 4

5.An overview of solutions 5