使用BigSift自动调试 Apache Spark中的大数据分析

由于数据集的不洁性质或对数据的错误假设,开发大数据分析经常涉及试验和错误调试。数据科学家通常编写实现数据处理管道的代码,并在他们的本地工作站上用从结核病规模数据仓库下载的小样本数据进行测试。他们互相祈祷,希望程序能在昂贵的生产云中工作。当一个工作失败或者他们得到一个可疑的结果时,数据科学家花费数小时猜测错误的来源。
展开查看详情

1.Automated Debugging of Big Data Analytics in Apache Spark Using BigSift Muhammad Ali Gulzar Miryung Kim University of California, Los Angeles #Res3SAIS

2.Big Data Debugging in the Dark Develop locally Hope it works Run in cloud Bug! Guesswork Map Reduce #Res3SAIS 2

3.BigDebug: Debugging Primitives for Interactive Big Data Processing ICSE 2016 Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Deep Tetali Tyson Condie, Todd Millstein, Miryung Kim Presented at Spark Summit 2017 #Res3SAIS 3

4.Interactive Debugging with BigDebug 1. Simulated Breakpoint 2. On Demand Guarded Watchpoint 3. Crash Culprit Identification 4. Backward and Forward Tracing #Res3SAIS 4

5.Titian: Data Provenance support in Spark VLDB 2015 Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, Tyson Condie #Res3SAIS 5

6.Data Provenance – in SQL Sensors Tuple-ID Time Sendor-ID Temperature T1 11AM 1 34 SELECT time, AVG(temp) T2 11AM 2 35 FROM sensors T3 11AM 3 35 T4 12PM 1 35 GROUP BY time T5 12PM 2 35 T6 12PM 3 100 T7 1PM 1 35 T8 1PM 2 35 Why ID-2 and Result- Time AVG(temp) T9 1PM 3 80 ID-3 have those ID high values? ID-1 11AM 34.6 ID-2 12PM 56.6 Outlier ID-3 1PM 50 Outlier #Res3SAIS 6

7.Step 1: Instrument Workflow in Spark Stage 1 Input Output Input ID Output ID ID ID Hadoop Combiner { id1, id 3} 400 LineageRDD LineageRDD offset1 id1 { id2 } 4 offset2 id2 offset3 id3 lines errors codes pairs Stage 2 Input ID Output ID Input ID Output ID Reducer 400 id1 [p1, p2] 400 LineageRDD 4 id2 [ p1 ] 4 Stage counts reports LineageRDD #Res3SAIS 7

8.Step 2: Example Backward Tracing Hadoop Combiner Worker3 Input ID Input ID OutputID Output ID Input ID Input ID OutputID Output ID Worker1 offset1 offset1 id1 id1, id { id1, id3} 3} 400 offset2 offset2 id2 { id2 }} 4 offset3 offset3 id3 Reducer Stage Input ID Output ID Input ID Input ID OutputID Output ID Input ID Input ID OutputID Output ID p1 400 [p1, p2] [p1, p2] 400 400 id1 p1 ]] [ p1 4 4 id2 Hadoop Combiner Worker2 Input ID Input ID OutputID Output ID Input ID Input ID OutputID Output ID offset1 offset1 id1 id1, …} { id1, …} 400 … … … … Reducer.Output ID Stage.Input ID Input ID Output ID p1 400 #Res3SAIS 8

9.Step 2: Example Backward Tracing Hadoop Combiner Input ID Input ID OutputID Output ID Input ID Input ID OutputID Output ID Worker1 offset1 offset1 id1 id1, id { id1, id3} 3} 400 Combiner.Output ID Reducer.Output ID offset2 offset2 id2 { id2 }} 4 offset3 offset3 id3 Input ID Output ID p1 400 Hadoop Combiner Worker2 Input ID Input ID OutputID Output ID Input ID Input ID OutputID Output ID offset1 offset1 id1 id1, …} { id1, …} 400 … … … … Input ID Output ID Combiner.Output ID Reducer.Output ID p1 400 #Res3SAIS 9

10.Step 2: Example Backward Tracing Hadoop Combiner Input ID Input ID OutputID Output ID Input ID Input ID OutputID Output ID Worker1 offset1 offset1 id1 id1, id { id1, id3} 3} 400 offset2 offset2 id2 { id2 }} 4 offset3 offset3 id3 Hadoop.Output ID Combiner.Input ID Hadoop Combiner Worker2 Input ID Input ID OutputID Output ID Input ID Input ID OutputID Output ID offset1 offset1 id1 id1, …} { id1, …} 400 … … … … Hadoop.Output ID Combiner.Input ID #Res3SAIS 10

11.BigSift: Automated Debugging of Big Data Analytics in DISC Applications SoCC 2017 Muhammad Ali Gulzar, Matteo Interlandi, Xueyuan Han, Tyson Condie, Miryung Kim #Res3SAIS 11

12.Motivating Example for Automated Debugging with BigSift • Alice writes a Spark program that identifies, for each state in the US, the delta between the minimum and the maximum snowfall reading for each day of any year and for any particular year. • An input data record that measures 1 foot of snowfall on January 1st of Year 1992, in the 99504 zip code (Anchorage, AK) area, appears as 99504 , 01/01/1992 , 1ft #Res3SAIS 12

13. Debugging Incorrect Output TextFile FlatMap GroupByKey Map Output AK , 01/01 , 304.8 AK , 01/01 , [304.8, 21336, 245, 85] AK , 01/01 , 21251 99504 , 01/01/1992 , 1ft AK , 1992 , 304.8 AK , 03/01 , [30.5 , 145] AK , 03/01 , 114.5 99504 , 03/01/1992 , 0.1ft AK , 03/01 , 30.5 AK , 1992 , [304.8 , 30.5] AK , 1992 , 274.3 99504 , 01/01/1993 , 70in AK , 1992 , 30.5 AK , 1993 , [21336, 145, 85] AK , 1993 , 21251 99504 , 03/01/1993 , 145mm AK , 01/01 , 21336 AK , 1994 , [245] AK , 1994 , 0 99504 , 01/01/1994 , 245mm AK , 1993 , 21336 CA , 02/01 , [0] CA , 02/01 , 0 99504 , 01/01/1993 , 85mm AK , 03/01 , 145 CA , 1991 , [0] CA , 1991 , 0 90031 , 02/01/1991 , 0mm AK , 1993 , 145 AK , 01/01 , 245 AK , 1994 , 245 It is challenging because …. the …. input records is very large and the program involves computing min and max and a unit conversion #Res3SAIS 13

14. Data Provenance with Titian [VLDB ‘15] TextFile FlatMap GroupByKey Map Output AK , 01/01 , 304.8 AK , 01/01 , [304.8, 21336, 245, 85] AK , 01/01 , 21251 99504 , 01/01/1992 , 1ft AK , 1992 , 304.8 AK , 03/01 , [30.5 , 145] AK , 03/01 , 114.5 99504 , 03/01/1992 , 0.1ft AK , 03/01 , 30.5 AK , 1992 , [304.8 , 30.5] AK , 1992 , 274.3 99504 , 01/01/1993 , 70in AK , 1992 , 30.5 AK , 1993 , [21336, 145, 85] AK , 1993 , 21251 99504 , 03/01/1993 , 145mm AK , 01/01 , 21336 AK , 1994 , [245] AK , 1994 , 0 99504 , 01/01/1994 , 245mm AK , 1993 , 21336 CA , 02/01 , [0] CA , 02/01 , 0 99504 , 01/01/1993 , 85mm AK , 03/01 , 145 CA , 1991 , [0] CA , 1991 , 0 90031 , 02/01/1991 , 0mm AK , 1993 , 145 AK , 01/01 , 245 It over-approximates the AK , 1994 ,scope of failure-inducing inputs i.e. records in 245 …. …. the faulty key-group are all marked as faulty #Res3SAIS 14

15. Delta Debugging • Alice tests the output from different input configurations using a test function TextFile FlatMap GroupByKey Map Output AK , 01/01 , 304.8 AK , 1992 , 304.8 delta: Float) : Boolean = def test(key:String, { AK , 03/01< , 6000 delta 30.5 99504 , 01/01/1992 }, 1ft AK , 1992 , 30.5 AK , 01/01 , [304.8, 21336, 245, 85] AK , 01/01 , 21251 1 99504 , 03/01/1992 , 0.1ft AK , 01/01 , 21336 99504 , 01/01/1993 , 70in AK , 03/01 , [30.5 , 145] AK , 03/01 , 114.5 AK , 1993 , 21336 99504 , 03/01/1993 , 145mm AK , 1992 , [304.8 , 30.5] AK , 1992 , 274.3 AK , 03/01 , 145 99504 , 01/01/1994 , 245mm AK , 1993 , [21336, 145, 85] AK , 1993 , 21251 AK , 1993 , 145 99504 , 01/01/1993 , 85mm AK , 1994 , [245] AK , 1994 , 0 2 AK , 01/01 , 245 90031 , 02/01/1991 , 0mm CA , 02/01 , [0] CA , 02/01 , 0 AK , 1994 , 245 CA , 1991 , [0] CA , 1991 , 0 …. …. Run 1 fails the test #Res3SAIS 15

16. Delta Debugging • DD is an systematic debugging procedure guided by a test function. DD is inefficient because it does not eliminate irrelevant input records upfront. TextFile FlatMap GroupByKey Map Output AK , 01/01 , 304.8 99504 , 01/01/1992 , 1ft AK , 1992 , 304.8 1 99504 , 03/01/1992 , 0.1ft AK , 03/01 , 30.5 AK , 01/01 , [304.8, 21336] AK , 01/01 , 21031.2 AK , 03/01 , [30.5] AK , 03/01 , 0 2 99504 , 01/01/1993 , 70in AK , 1992 , 30.5 AK , 1992 , [304.8 , 30.5] AK , 1992 , 274.3 AK , 01/01 , 21336 AK , 1993 , [21336] AK , 1993 , 0 AK , 1993 , 21336 Run 2 fails the test #Res3SAIS 16

17.Delta Debugging • DD is an systematic debugging procedure guided by a test function. DD is inefficient because it does not eliminate irrelevant input records upfront. TextFile FlatMap GroupByKey Map Output AK , 01/01 , 304.8 99504 , 01/01/1992 , 1ft AK , 1992 , 304.8 99504 , 03/01/1992 , 0.1ft AK , 03/01 , 30.5 AK , 01/01 , [304.8] AK , 01/01 , 0 99504 , 01/01/1993 , 70in AK , 03/01 , [30.5] AK , 03/01 , 0 AK , 1992 , 30.5 AK , 1992 , [304.8 , 30.5] AK , 1992 , 274.3 Run 3 passes the test #Res3SAIS 17

18.Delta Debugging • DD is an systematic debugging procedure guided by a test function. DD is inefficient because it does not eliminate irrelevant input records upfront. TextFile FlatMap GroupByKey Map Output 99504 , 01/01/1992 , 1ft 99504 , 03/01/1992 , 0.1ft AK , 01/01 , 21336 AK , 01/01 , [21336] AK , 01/01 , 0 99504 , 01/01/1993 , 70in AK , 1993 , [21336] AK , 1993 , 0 AK , 1993 , 21336 Run 4 passes the test #Res3SAIS 18

19.Delta Debugging • DD is an systematic debugging procedure guided by a test function. DD is inefficient because it does not eliminate irrelevant input records upfront. TextFile FlatMap GroupByKey Map Output 99504 , 01/01/1992 , 1ft 99504 , 03/01/1992 , 0.1ft AK , 01/01 , 304.8 AK , 01/01 , [304.8] AK , 01/01 , 0 99504 , 01/01/1993 , 70in AK , 1992 , [304.8] AK , 1992 , 0 AK , 1992 , 304.8 Run 5 passes the test #Res3SAIS 19

20.Delta Debugging • DD is an systematic debugging procedure guided by a test function. DD is inefficient because it does not eliminate irrelevant input records upfront. TextFile FlatMap GroupByKey Map Output 99504 , 01/01/1992 , 1ft 99504 , 03/01/1992 , 0.1ft AK , 03/01 , 30.5 AK , 03/01 , [30.5] AK , 03/01 , 0 99504 , 01/01/1993 , 70in AK , 1992 , [30.5] AK , 1992 , 0 AK , 1992 , 30.5 Run 6 passes the test #Res3SAIS 20

21.Delta Debugging • DD is an systematic debugging procedure guided by a test function. DD is inefficient because it does not eliminate irrelevant input records upfront. TextFile FlatMap GroupByKey Map Output 99504 , 01/01/1992 , 1ft 99504 , 03/01/1992 , 0.1ft AK , 01/01 , 21336 AK , 01/01 , [21336] AK , 01/01 , 0 99504 , 01/01/1993 , 70in AK , 1993 , [21336] AK , 1993 , 0 AK , 1993 , 21336 Run 7 passes the test #Res3SAIS 21

22.Delta Debugging • DD is an systematic debugging procedure guided by a test function. DD is inefficient because it does not eliminate irrelevant input records upfront. TextFile FlatMap GroupByKey Map Output 99504 , 01/01/1992 , 1ft AK , 03/01 , 30.5 AK , 03/01 , [30.5] AK , 03/01 , 0 99504 , 03/01/1992 , 0.1ft AK , 1992 , 30.5 AK , 1992 , [30.5] AK , 1992 , 0 99504 , 01/01/1993 , 70in AK , 01/01 , 21336 AK , 01/01 , [21336] AK , 01/01 , 0 AK , 1993 , 21336 AK , 1993 , [21336] AK , 1993 , 0 Run 8 passes the test #Res3SAIS 22

23.Delta Debugging • DD is an systematic debugging procedure guided by a test function. DD is inefficient because it does not eliminate irrelevant input records upfront. TextFile FlatMap GroupByKey Map Output 99504 , 01/01/1992 , 1ft AK , 01/01 , 304.8 AK , 01/01 , [304.8 , 21336] AK , 01/01 , 21031.2 99504 , 03/01/1992 , 0.1ft AK , 1992 , 304.8 AK , 1992 , [304.8] AK , 1992 , 0 99504 , 01/01/1993 , 70in AK , 01/01 , 21336 AK , 1993 , [21336] AK , 1993 , 0 AK , 1993 , 21336 Run 9 fails the test #Res3SAIS 23

24.Automated Debugging in DISC with BigSift Input: A Spark Output: Minimum Fault- Program, A Test Inducing Function Input Records Data Provenance + Delta Debugging Test Prioritizing Bitmap based Predicate Backward Test Pushdown Traces Memoization #Res3SAIS 24

25.Optimization 1: Test Predicate Pushdown • Observation: During backward tracing, data provenance traces through all the partitions even though only a few partitions are faulty Test Test Test Test Test Test Test If applicable, BigSift pushes down the test function to test the output of combiners in order to isolate the faulty partitions. #Res3SAIS 25

26. Optimization 2 :Prioritizing Backward Traces • Observation: The same faulty input record may contribute to multiple output records failing the test. TextFile FlatMap GroupByKey Map Output AK , 01/01 , 304.8 AK , 01/01 , [304.8, 21336, 245, 85] AK , 01/01 , 21251 99504 , 01/01/1992 , 1ft AK , 1992 , 304.8 AK , 03/01 , [30.5 , 145] AK , 03/01 , 114.5 99504 , 03/01/1992 , 0.1ft AK , 03/01 , 30.5 AK , 1992 , [304.8 , 30.5] AK , 1992 , 274.3 99504 , 01/01/1993 , 70in AK , 1992 , 30.5 AK , 1993 , [21336, 145, 85] AK , 1993 , 21251 99504 , 03/01/1993 , 145mm AK , 01/01 , 21336 AK , 1994 , [245] AK , 1994 , 0 99504 , 01/01/1994 , 245mm AK , 1993 , 21336 CA , 02/01 , [0] CA , 02/01 , 0 99504 , 01/01/1993 , 85mm AK , 03/01 , 145 CA , 1991 , [0] CA , 1991 , 0 90031 , 02/01/1991 , 0mm AK , 1993 , 145 In case of multiple faulty AK , 01/01 , 245 outputs, BigSift overlaps two backward AK , 1994 , 245 traces to minimize the scope of fault-inducing input records …. …. #Res3SAIS 26

27.Optimization 3: Bitmap based Test Memoization 0 • Observation: Delta debugging may 0 0 try running a program on the same 1 subset of input redundantly. 1 • BigSift leverages bitmap to compactly 0 1 encode the offsets of original input to 0 1 ! refer to an input subset 0 Input Data Bitmap Test Outcome We use a bitmap based test memoization technique to avoid redundant testing of the same input dataset. #Res3SAIS

28.Demo

29.Debugging Time Subject Program Running Time (sec) Debugging Time (sec) Subject Program Fault Original Job DD BigSift Improvement Movie Histogram Code 56.2 232.8 17.3 13.5X Inverted Index Code 107.7 584.2 13.4 43.6X Rating Histogram Code 40.3 263.4 16.6 15.9X Sequence Count Code 356.0 13772.1 208.8 66.0X Rating Frequency Code 77.5 437.9 14.9 29.5X College Student Data 53.1 235.3 31.8 7.4X Weather Analysis Data 238.5 999.1 89.9 11.1X Transit Analysis Code 45.5 375.8 20.2 18.6X BigSift provides up to a 66X speed up in isolating the precise fault-inducing input records, in comparison to the baseline DD #Res3SAIS 29