Which Data Broke My Code? Inspecting Spark Transformations

Apache Spark is fast becoming the dominant big-data processing framework for data engineering and data science applications. The simplicity of programming big-data applications in Apache Spark, and the speed gained from in-memory processing, are key factors behind this popularity. However, the tools that help developers build and debug Spark applications have not kept pace.

1. Which Data Broke My Code? Inspecting Spark Transformations
Vinod K. Nair, Director of Product Management @ Pepperdata
#DevSAIS12

2. Talk Outline
• Introduction
• Problem: ‘laziness’ makes debugging hard
• Solution: interactive inspection of RDDs
• Demo
• Q&A

3. Introduction to Pepperdata: Application Performance Management (APM) for Spark (& Hadoop)



6. Problem: ‘laziness’ makes debugging hard
RDD data is unavailable until an ‘action’ triggers execution

7. Transformations are invisible
RDDs support two types of operations:
1. transformations, which create a new dataset from an existing one
2. actions, which return a value to the driver program after running a computation on the dataset
Transformations in Spark are lazy: they are only computed when an action requires a result to be returned to the driver program.
- https://spark.apache.org/docs/latest/rdd-programming-guide.html

8. filteredData.take(10).foreach(println)
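The `take(10)` above is the ‘action’ that finally runs the pipeline. A plain-Python analogy (generators, not Spark) shows why that matters for debugging: transformations are recorded lazily, so a bad record only blows up at the action, far from the line that introduced it.

```python
# Plain-Python analogy for Spark's lazy evaluation (generators, not RDDs).
# The "transformation" below is only recorded, not executed, so the bad
# record causes no error at the point where it is defined.
raw = ["3,448,595", "oops", "1,532,856"]

parsed = (int(s.replace(",", "")) for s in raw)   # lazy, like rdd.map(...)

# Only the "action" (consuming the generator) triggers execution, and only
# then does the bad record surface -- far from where the map was written.
try:
    records = list(parsed)                        # analogous to take()/collect()
except ValueError as err:
    print("failure surfaces at the action:", err)
```

Which transformation produced `"oops"` is invisible from the traceback alone; that is the gap the rest of the talk addresses.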

9. Solution available today
• Sprinkle your code with print statements
• ‘Hopefully’ catch the right transformation causing the problem
• If you don’t catch it, repeat the process
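In the same generator analogy (illustrative, not Spark), the print-statement workaround amounts to forcing a small sample after whichever step you suspect, and rebuilding and re-running the pipeline each time the guess is wrong:

```python
import itertools

# Hypothetical two-step pipeline in the plain-Python analogy (not Spark).
data = ["5", "7", "x", "11"]
step1 = (s.strip() for s in data)        # suspect transformation #1
step2 = (int(s) for s in step1)          # suspect transformation #2

# The workaround: force a small sample after a chosen step and print it.
print(list(itertools.islice(step1, 2)))  # peek after step1: ['5', '7']

# Caveat of the analogy: peeking consumes the generator, so the pipeline
# must be rebuilt before the next guess -- mirroring the edit/re-run loop
# the slide describes.
```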

10. Our solution: interactive inspection of data in flight
Trigger an ‘action’ to enable inspection of any RDD in the DAG

11. Solution requirements
• No user code changes required
• Work with any standard Spark distribution
• Provide a familiar interactive debugger interface

12. Solution overview (standard Spark architecture diagram)
Driver: Spark Context, User Code, RDD Graph, DAG Scheduler, Task Scheduler
Cluster Manager
Workers: Tasks

13. Solution overview (with RDD Inspector added)
Driver: Spark Context, User Code, RDD Graph, DAG Scheduler, Task Scheduler, RDD Inspector, RDD Metadata
Interfaces: UI, REST API, CLI
Cluster Manager
Workers: Tasks, RDDs

14. New Spark job to display RDD[10]
Original stage: RDD[10] is an intermediate transformation
Debug stage: RDD[10] is the output
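The debug-stage idea can be sketched outside Spark with a small pipeline class that records transformations lazily and, on request, builds a new ‘job’ whose output is any intermediate stage. All names here (`LazyPipeline`, `inspect`) are illustrative, not Pepperdata’s actual API.

```python
import itertools

class LazyPipeline:
    """Illustrative stand-in for an RDD lineage: transformations are
    recorded, and inspect() replays them up to a chosen stage."""

    def __init__(self, source):
        self.source = list(source)
        self.stages = []                      # recorded, never eagerly run

    def map(self, fn):
        self.stages.append(("map", fn))
        return self

    def filter(self, pred):
        self.stages.append(("filter", pred))
        return self

    def inspect(self, stage, n=10):
        """Materialize only stages[0..stage] -- a 'debug stage' whose
        output is the intermediate dataset, like the new job for RDD[10]."""
        data = iter(self.source)
        for kind, fn in self.stages[: stage + 1]:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(itertools.islice(data, n))

p = LazyPipeline(range(100)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(p.inspect(stage=0, n=3))   # after the filter: [0, 2, 4]
print(p.inspect(stage=1, n=3))   # after the map:    [0, 20, 40]
```

Because the source and recorded stages are kept, each `inspect()` is an independent replay, which is how a debug job can target any RDD in the DAG without changing user code.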

15. CLI command to inspect RDD[10] (the CLI command serves as the ‘action’)

16. Demo
Interactive ‘debugger’ experience to inspect any RDD in the DAG

17. Analyze CDC’s census data
• CDC’s 500 Cities project (data.gov: 500 Cities: Local Data for Better Health)
• Input: major chronic diseases by city
• Output: diabetes in adults by US region

18. Distribution of diabetics by region (chart; values: 3,448,595; 1,532,856; 2,096,619; 3,591,010)

19. Adult population with diabetes, by region
WEST: 3,448,595
MID WEST: 1,532,856
NORTH EAST: 2,096,619
SOUTH: 3,407,528
UNKNOWN: 183,482

20. RDD transformations through the app
RDD[5] (raw CSV), e.g.:
2014,CA,California,Alameda,City,BRFSS,Health Outcomes,0600562,Diagnosed diabetes among adults aged >=18 Years,%,AgeAdjPrv,Age-adjusted prevalence,8.1,7.9,8.2,,,73812,"(37.7650849031, -122.266489842)",HLTHOUT,DIABETES,0600562,,Diabetes
→ Filter(“DIABETES”) & Map(State, Impacted pop.) → RDD[11], e.g.: Alabama 205,764; Alabama 90,468; Alaska 291,826; Arizona 76,238; …
→ Map(Region, Impacted pop.) → RDD[12], e.g.: South 27,161; South 9,228; West 21,596; West 6,633; …
→ Reduce(Region) → RDD[16]: totals for West, Mid West, North East, South
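The same filter → map → map → reduce pipeline can be sketched in plain Python on synthetic rows. The region lookup and most of the values here are illustrative, not the CDC data:

```python
# Synthetic rows shaped like (measure, state, impacted_pop); the Alabama and
# Arizona figures echo the slide, the rest is illustrative.
rows = [
    ("DIABETES", "Alabama", 205764),
    ("DIABETES", "Alabama", 90468),
    ("OBESITY",  "Alabama", 123456),   # dropped by the filter
    ("DIABETES", "Arizona", 76238),
]
REGION = {"Alabama": "SOUTH", "Arizona": "WEST"}  # illustrative subset

# RDD[5] -> RDD[11]: Filter("DIABETES") & Map(State, Impacted pop.)
rdd11 = [(state, pop) for measure, state, pop in rows if measure == "DIABETES"]

# RDD[11] -> RDD[12]: Map(Region, Impacted pop.)
rdd12 = [(REGION[state], pop) for state, pop in rdd11]

# RDD[12] -> RDD[16]: Reduce(Region), i.e. sum the impacted pop. per region
rdd16 = {}
for region, pop in rdd12:
    rdd16[region] = rdd16.get(region, 0) + pop

print(rdd16)   # {'SOUTH': 296232, 'WEST': 76238}
```

A bug in any one of these steps (say, a malformed state name breaking the `REGION` lookup) would only surface when `rdd16` is materialized, which is exactly why inspecting RDD[11] or RDD[12] in flight is useful.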

21. Spark Debugger UI (screenshot callouts)
• URL of the Spark Web UI
• List of RDDs
• RDD filter
• ‘Breakpoint’ on any transformation
• First 10 records matching the filter

22. Spark Debugger Demo

23. To recap, you can…
• View data sets as they are transformed in your Spark app
– no code changes are required,
– it works with any Spark distribution, and
– it uses a familiar debugger interface to ‘set’ breakpoints and view RDDs in flight

24. What’s next? Roadmap for Pepperdata’s Spark Debugger

25. Areas of focus going forward
• UX improvements
• Attach to a running app (streaming use case)
• Pause a job on hitting a condition
• Spark SQL support
To learn more, visit the booth (#407)