How to Get Your Jobs Tuned While You Sleep

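In this talk we discuss TuneIn, an auto-tuning framework we developed. We describe how we use an iterative optimization approach to find optimal parameters. We cover the optimization algorithms we tried and why we found that particle swarm optimization gives the best results. We explain how we avoid any extra executions by tuning jobs during their regular scheduled runs. We detail the techniques that ensure faster convergence and zero failed executions while tuning. We show how tuning a small set of parameters yields gains of more than 50% in resource usage. We also discuss lessons learned and the future roadmap.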
1. TuneIn: How to get your jobs tuned while sleeping • Manoj Kumar, LinkedIn • Arpan Agrawal, LinkedIn #Res2SAIS

2. OUR VISION Create economic opportunity for every member of the global workforce

3. OUR MISSION Connect the world’s professionals to make them more productive and successful

4. Agenda • Why TuneIn? • How does TuneIn work? • Architecture and framework features • Road ahead

5. Grid Scale at LinkedIn • 2008: 1 cluster, 20 nodes, 5 users, MapReduce, few workflows • 2018: 10+ clusters, 1000s of nodes, 1000s of active users, Pig, Hive, Spark, etc., 10000s of workflows

6. Typical Conversations • Manager: "Hey, this Spark job is running slowly." • Developer: "I will tune it to improve the run time."

7. Typical Conversations • Hadoop Admin: "We have found some jobs which are consuming high resources on the cluster." • Manager: "I will ask my team to tune those jobs to reduce the resource usage."

8. Typical Conversations • Client: "Is there a way we can get this daily report 30 minutes early?" • Developer: "I will try to tune it to reduce the run time."

9. Why Tuning? • Optimal parameter configuration: – leads to better cluster utilization and thus savings – reduces the execution time • Default configuration is not always optimal

10. Manual Tuning • The manual tuning process is a cycle: – Phase 1: Execute job – Phase 2: Observe the execution metrics – Phase 3: Come up with next parameter set

11. Dr. Elephant: Heuristic based tuning • Suggests tuning recommendations based on pre-defined heuristics • No need to worry about the hundreds of counters and parameters • Relies on user’s initiative to use the recommendations • Expects some user expertise • Heuristics-based manual tuning cycle: – Phase 1: Execute job – Phase 2: Look at the Dr. Elephant recommendations – Phase 3: Come up with next parameter set

13. Why Auto Tuning? • 10000s of jobs to tune • Increases developer productivity • Tunes without any extra effort • No expertise is expected • Option of which objective function to tune for – resource usage – execution time, etc.

14. Let’s auto tune!

15. TuneIn • Framework to automatically tune recurring Hadoop and Spark jobs • Iteratively tries to reach the optimal configuration • Results: 20-35% reduction in resource usage

16. Particle Swarm Optimization (PSO) [1] • Mimics the behavior of a swarm of birds searching for food • Introduces a population of candidate solutions (particles) into the search space • Figure source: Wikipedia • [1] J. Kennedy et al., "Particle Swarm Optimization," https://ieeexplore.ieee.org/document/488968/

17. PSO (contd.) • Points of attraction: each particle's personal best known position and the swarm’s best known position • Particles converge to the region with the minimum cost function value • Figure source: Wikipedia
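
The update rule described on the two PSO slides can be sketched roughly as follows. This is a minimal, generic global-best PSO, not TuneIn's actual implementation: the function name `pso_minimize`, the inertia and attraction coefficients, the box-clamping of positions, and the toy cost function are all assumptions made for illustration. In TuneIn's setting each call to `cost` would correspond to one real job execution.

```python
import random


def pso_minimize(cost, bounds, swarm_size=3, iterations=20,
                 inertia=0.7, c_personal=1.5, c_swarm=1.5, seed=0):
    """Minimize cost(position) inside box bounds with a basic global-best PSO.

    All coefficients are illustrative assumptions, not values from the talk.
    """
    rng = random.Random(seed)
    # Start every particle at a random point in the search space, at rest.
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(swarm_size)]
    velocities = [[0.0] * len(bounds) for _ in range(swarm_size)]
    personal_best = [p[:] for p in positions]
    personal_cost = [cost(p) for p in positions]
    leader = min(range(swarm_size), key=lambda i: personal_cost[i])
    swarm_best = personal_best[leader][:]
    swarm_cost = personal_cost[leader]

    for _ in range(iterations):
        for i in range(swarm_size):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the swarm's best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_swarm * r2 * (swarm_best[d] - positions[i][d])
                )
                # Keep the particle inside the allowed range.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            c = cost(positions[i])
            if c < personal_cost[i]:
                personal_best[i] = positions[i][:]
                personal_cost[i] = c
            if c < swarm_cost:
                swarm_best = positions[i][:]
                swarm_cost = c
    return swarm_best, swarm_cost


def toy_cost(p):
    """Hypothetical smooth cost with its minimum at (3, -1)."""
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2


best, best_cost = pso_minimize(toy_cost, [(-10.0, 10.0), (-10.0, 10.0)])
print(best, best_cost)
```

The algorithm only ever compares cost values and never needs a gradient, so it can treat the job as a black box.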

18. Why PSO? • Cost function is noisy – PSO is gradient free and robust to noise [3] • Spark and Hadoop are complex systems – PSO is a metaheuristic black box optimization algorithm • Fastest convergence among the algorithms we tried • [3] K. E. Parsopoulos et al., “Particle Swarm Optimizer in Noisy and Continuously Changing Environments,” in Artificial Intelligence and Soft Computing

19. PSO Details [2] • Swarm size of 3 gives the best result – large enough to cover the search space – small enough to avoid many random searches in the first iteration • Good starting point is important to guide the swarm • [2] Mukhtaj Khan et al., "Optimizing Hadoop parameter settings with gene expression programming guided PSO"

20. Cost function • Resource usage per unit input:
$$\text{Cost} = \frac{\sum_{\text{containers}} \text{Container Memory} \times \text{Container Uptime}}{\text{Total Input Size}}$$
• Approximately input size invariant
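
As a worked illustration of this cost function (the sum over containers of container memory times container uptime, divided by total input size), here is a small sketch. The `ContainerUsage` type, the function name, and the units (MB, seconds, bytes) are assumptions chosen for the example; the slide does not specify them.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ContainerUsage:
    """Hypothetical per-container record; units are assumptions."""
    memory_mb: float
    uptime_sec: float


def resource_usage_per_input(containers: Iterable[ContainerUsage],
                             total_input_bytes: float) -> float:
    """Sum of (memory x uptime) over all containers, per unit of input."""
    if total_input_bytes <= 0:
        raise ValueError("total_input_bytes must be positive")
    usage = sum(c.memory_mb * c.uptime_sec for c in containers)
    return usage / total_input_bytes


small_run = [ContainerUsage(2048, 100), ContainerUsage(1024, 50)]
cost_small = resource_usage_per_input(small_run, 1000)

# A run over twice the input that needs twice the containers scores the same,
# which is what "approximately input size invariant" refers to.
cost_large = resource_usage_per_input(small_run * 2, 2000)
print(cost_small, cost_large)
```

Normalizing by input size lets executions of the same recurring job be compared even when the daily input volume varies.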

21. Search Space • The parameters being tuned constitute the search space • Which parameters to tune depends on the cost function metric • [Figure: three-dimensional search space with axes Param 1, Param 2, Param 3]

22. Search Space • Cost function: Resource Usage • Pig: mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.task.io.sort.mb, mapreduce.task.io.sort.factor • Spark: spark.executor.memory, spark.executor.cores, spark.memory.fraction, spark.yarn.executor.memoryOverhead

23. Search Space Optimization • Important to prevent failures • Speeds up convergence • Boundary parameter values – e.g. spark.executor.cores ∈ [1, 10] • Parameter interdependent constraints – Captures the interdependence among the parameters – e.g. mapreduce.task.io.sort.mb < 0.60 * mapreduce.map.memory.mb
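
The two kinds of restriction on this slide, boundary values and interdependent constraints, could be enforced along these lines before a suggested parameter set is ever run. The helper names `clamp_to_bounds` and `satisfies_constraints` are hypothetical; only the two example rules (cores in [1, 10], sort buffer below 60% of mapper memory) come from the slide.

```python
# Boundary values per parameter. Only this one entry is taken from the slide.
BOUNDS = {
    "spark.executor.cores": (1, 10),
}


def clamp_to_bounds(params, bounds=BOUNDS):
    """Return a copy of params with every bounded value pulled into range."""
    clamped = dict(params)
    for name, (low, high) in bounds.items():
        if name in clamped:
            clamped[name] = min(high, max(low, clamped[name]))
    return clamped


def satisfies_constraints(params):
    """Check the interdependent constraint given as an example on the slide."""
    sort_mb = params.get("mapreduce.task.io.sort.mb")
    map_mb = params.get("mapreduce.map.memory.mb")
    if sort_mb is None or map_mb is None:
        return True  # constraint does not apply to this job type
    return sort_mb < 0.60 * map_mb


suggestion = clamp_to_bounds({"spark.executor.cores": 16})
safe = satisfies_constraints(
    {"mapreduce.task.io.sort.mb": 200, "mapreduce.map.memory.mb": 2048})
risky = satisfies_constraints(
    {"mapreduce.task.io.sort.mb": 1500, "mapreduce.map.memory.mb": 2048})
print(suggestion, safe, risky)
```

Rejecting or repairing a candidate before execution is what keeps a bad suggestion from turning into a failed scheduled run.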

24. Avoiding over optimization • Undesirable to squeeze memory so much that execution time shoots up significantly • Updated cost function:
$$\text{Cost} = \frac{\sum_{\text{containers}} \text{Container Memory} \times \text{Container Uptime} + \text{Penalty}}{\text{Total Input Size}}$$
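
The slide does not say how the penalty is computed, so the following is only one possible sketch. It assumes the penalty is added to the summed resource usage before dividing by input size, and that it is triggered when the run time exceeds some limit. The runtime limit, the `penalty_factor`, and the function name are all hypothetical.

```python
def penalized_cost(usage_mb_sec, total_input_bytes,
                   runtime_sec, runtime_limit_sec, penalty_factor=3.0):
    """Resource usage per unit input, with a penalty for overly slow runs.

    Assumption: the penalty is a multiple of the usage itself, applied only
    when the execution ran longer than runtime_limit_sec.
    """
    if total_input_bytes <= 0:
        raise ValueError("total_input_bytes must be positive")
    penalty = 0.0
    if runtime_sec > runtime_limit_sec:
        penalty = penalty_factor * usage_mb_sec
    return (usage_mb_sec + penalty) / total_input_bytes


within_limit = penalized_cost(1000.0, 10.0, runtime_sec=50, runtime_limit_sec=100)
too_slow = penalized_cost(1000.0, 10.0, runtime_sec=150, runtime_limit_sec=100)
print(within_limit, too_slow)
```

With a penalty like this, a configuration that saves memory but makes the job much slower looks worse to the optimizer than one that stays within the limit.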

25. Convergence • No theoretical bound on the number of steps needed to converge • In practice, converges in 20 job executions • TuneIn is automatically turned off for the job on convergence

26. Results
Job type   Metric           Average reduction
Spark      Resource Usage   30-40% per job
Pig        Resource Usage   20-35% per job

27. Architecture • Components: Dr. Elephant, REST API, TuneIn Framework, Fetchers, MapReduce/Spark • Flow: – 1. Get parameters (REST API) – 2. Parameters returned, e.g. mapper memory: 2048, sort buffer: 200 – 3. Submit job – 4. Fetch metrics

28. Framework Features • Generic Framework – Resource Usage, Execution Time – Pig, Hive, Spark • Tuning During Regular Scheduled Runs – Easy Integration • Failure Avoidance – Constraints on parameters – Automatic Failure Handling • Auto Switch Off

29. Road Ahead • Tuning for execution time • Faster convergence using Intelligent Parameter Space Optimization (IPSO) • Smarter tuning switch on/off