Using Apache Spark to Tune Spark

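We have developed a workload-aware performance tuning framework for Spark that collects and analyzes telemetry information about all the Spark applications in a cluster. Based on this analysis (which uses the batch processing, real-time streaming, and ML analysis that Spark excels at), the framework can identify many ways to improve the overall performance of the Spark workload: by identifying datasets with skewed distributions that cause significant performance degradation, by identifying tables and DataFrames that would benefit from caching, by identifying queries where broadcast joins can outperform repartitioned joins, by identifying the best cluster-level default values for driver and executor container sizes, and by identifying the best cloud machine types for the workload.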

1. Using Spark to Tune Spark
Adrian Popescu, Shivnath Babu
#AI7SAIS

2. Meet the speakers
Adrian Popescu
• Data engineer at Unravel
• PhD from EPFL, Switzerland
• 8+ years of experience in performance monitoring & modeling of data management systems
• Focusing on tuning and optimization of Big Data apps
Shivnath Babu
• Cofounder and CTO at Unravel, Adjunct Professor at Duke University
• Focusing on ease-of-use and manageability of data-intensive systems
• Recipient of US National Science Foundation CAREER Award, three IBM Faculty Awards, HP Labs Innovation Research Award
#Exp8SAIS

3. Many apps are being built in Spark

4. But, let us face it: Running Spark apps in production is hard

5. "My app often fails with Out of Memory…" (Data scientist)

6. "My app is too slow…" (Data engineer)

7. "My app is missing SLA…" (Data pipeline owner)

8. "This rogue app is wasting resources and reducing cluster throughput" (Operations teams)

9. Many factors affect app performance

10. To add to Spark's complexity
• Many types of Spark apps
  – SQL
  – Streaming
  – AI/ML
  – Graph
  – Scala/Python/R
Simple SQL and Programming Interface
• Many app submission methods in Spark
  – CLI
  – Thrift Server
  – Notebooks like Zeppelin, Jupyter, Hue
  – ETL tools like Informatica, Pentaho, and Talend
  – Schedulers like Airflow, Autosys, Control M, Oozie, Tidal, TWS
• Many infrastructure choices for Spark
  – On-premises multi-tenant clusters
  – Transient cloud clusters
  – Auto-scaling clusters
  – Containerized deployments

11. Can we convert this problem into a data problem?

12. First: Bring all monitoring data to a single platform
• Resource Manager API
• History Server API
• Container Metrics
• Data Statistics
• SQL Query Plans
• Logs
• Metadata
• Configuration
One complete correlated view.

13. Then: Apply intelligent algorithms to analyze the data automatically
• Resource Manager API
• History Server API
• Container Metrics
• Data Statistics
• SQL Query Plans
• Logs
• Metadata
• Configuration
One complete correlated view. Built-in intelligence.

15. Why not use Spark itself?
• Resource Manager API
• History Server API
• Container Metrics
• Data Statistics
• SQL Query Plans
• Logs
• Metadata
• Configuration
What application & cluster management tasks can we automate with intelligent algorithms in Spark?

16. Let us take three (hard) tasks
• Failures in Spark
• SLA management for real-time data pipelines
• Application autotuning

17. Manual Root Cause Analysis of Spark Failures
Typical Failure in Spark
• Many levels of correlated stack traces
• Identifying the root cause is hard and time consuming

18. Automated Root Cause Analysis of Spark Failures
• Reduce troubleshooting time from days to seconds
• Improve productivity of data scientists and analysts

19. Automatic Root Cause Analysis
Container Logs → Error Template Extraction → Feature vectors → Learning Algorithm for Predictive Model → Predictive Model
Root causes → Learning Algorithm for Predictive Model

20. We have created a Failure Taxonomy
Root Node
Category of failure:
• Configuration Errors
• Data Errors
• Resource Errors
• Deployment Errors
Root cause labels:
• Input Path Not Available
• Number Format Exception
• SparkSQL
• JsonProcessing Exception
• …

21. Two Ways to get Root-Cause Labels
• Manual diagnosis by a domain expert
• Automatic injection of the root cause

22. Unravel's Large-scale Lab Framework for Automatic Root Cause Analysis
Environment:
  – Lab created on demand on cloud or on-premises
  – Workloads are run and failures are injected
Spark and multi-tenant Workloads:
  – Variety of workloads: Batch, ML, SQL, Streaming, etc.
Failures:
  – Large set of root causes learned from customers & partners. Constantly updated
  – Continuously inject these root causes to train & test models for root-cause prediction

23. Injecting Failures
Application Execution → Application Monitor → FAILED → Input Feature Extraction → Labeled Failures
Injected Failure → Label
Injected failure examples:
• Invalid input
• Invalid memory configuration
• OOME: Java heap space
• OOME: GC overhead limit
• Container killed by YARN
• Runtime incompatibility
• No space left on device
• Transformations inside other transformations
• Runtime error
• Arithmetic error
• Invalid configuration settings
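
The labeling idea on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the framework's code: the two fault functions, the `INJECTIONS` table, and `run_with_injection` are hypothetical names, and a Python traceback stands in for a Spark container log. Because the harness chooses which fault to inject, it already knows the root-cause label for every failure it observes.

```python
import traceback

# Hypothetical injected faults. Each one raises an error on purpose; the
# dictionary key is the root-cause label (names echo the slide's examples).
def inject_arithmetic_error():
    return 1 / 0

def inject_invalid_input():
    return int("not-a-number")

INJECTIONS = {
    "Arithmetic error": inject_arithmetic_error,
    "Invalid input": inject_invalid_input,
}

def run_with_injection(label, fault):
    """Run one faulty 'application'; return (log_text, label) if it fails."""
    try:
        fault()
    except Exception:
        # The captured traceback plays the role of the container log.
        return traceback.format_exc(), label
    return None  # the run did not fail, so there is nothing to label

labeled_failures = [
    example
    for example in (run_with_injection(l, f) for l, f in INJECTIONS.items())
    if example is not None
]

for log_text, label in labeled_failures:
    # Show the label next to the last line of its log (the error message).
    print(label, "->", log_text.strip().splitlines()[-1])
```

Each `(log_text, label)` pair is one labeled training example: the log is the input to feature extraction and the label is the prediction target.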

24. Extracting Input Features from Logs
  java.lang.OutOfMemoryError: Java heap space
    at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:114)
    at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:112)
    at …
• Extracting stack traces and error messages
• Tokenize by class names and words
• Create a vocabulary of words from all words collected
Tokens example:
  java.lang.OutOfMemoryError Java heap space at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:114)
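
One possible way to tokenize a log by class names and words is a single regular expression, sketched below in plain Python. The pattern, `tokenize`, and `build_vocabulary` are assumptions made for illustration; the slides do not show the actual tokenizer. The pattern tries three alternatives in order: a full stack frame, a dotted class name, then a plain word.

```python
import re

LOG = """java.lang.OutOfMemoryError: Java heap space
  at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:114)
  at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:112)"""

# Alternatives, tried left to right at each position:
#   1. a stack frame such as pkg.Class.method(File.scala:114)
#   2. a dotted class name such as java.lang.OutOfMemoryError
#   3. a plain word such as heap
TOKEN = re.compile(r"[\w$.]+\([\w$.]+:\d+\)|[\w$]+(?:\.[\w$]+)+|\w+")

def tokenize(log_text):
    """Split a log excerpt into class-name, stack-frame, and word tokens."""
    return TOKEN.findall(log_text)

def build_vocabulary(token_lists):
    """Collect every distinct token seen across all logs, in sorted order."""
    return sorted({token for tokens in token_lists for token in tokens})

tokens = tokenize(LOG)
vocabulary = build_vocabulary([tokens])
print(tokens)
print(len(vocabulary), "distinct tokens")
```

Keeping a whole stack frame as one token preserves the line number, which distinguishes the two `newArray` frames in this excerpt.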

25. Input Feature extraction
• Bag of Words with TF-IDF
  – Computes a vocabulary of words
  – Uses TF-IDF to reflect importance of words in a document
• Doc2Vec
  – Maps words, paragraphs, or documents to multi-dimensional vectors
  – Evaluates the placement of words with respect to neighboring words
  – Uses a 3-layer neural network
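
To make the bag-of-words option concrete, here is a minimal TF-IDF sketch in plain Python. It is a small stand-in rather than the Spark implementation, and it assumes one common smoothed IDF variant, log((1 + N) / (1 + df)) + 1; the function name `tf_idf` and the three toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns (vocabulary, feature vectors)."""
    vocabulary = sorted({tok for doc in documents for tok in doc})
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each token.
    df = Counter(tok for doc in documents for tok in set(doc))
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero for an empty document
        vectors.append([counts[tok] / length * idf[tok] for tok in vocabulary])
    return vocabulary, vectors

docs = [
    ["java.lang.OutOfMemoryError", "Java", "heap", "space"],
    ["java.lang.OutOfMemoryError", "GC", "overhead", "limit", "exceeded"],
    ["No", "space", "left", "on", "device"],
]
vocabulary, vectors = tf_idf(docs)

# "heap" occurs in one document and "space" in two, so "heap" gets the
# larger weight inside the first document.
heap, space = vocabulary.index("heap"), vocabulary.index("space")
print(vectors[0][heap] > vectors[0][space])
```

Each row of `vectors` has one entry per vocabulary token, so logs of different lengths map to feature vectors of the same dimension.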

26. System Architecture
Training: Container Logs → Error Template Extraction → Feature vectors → Learning Algorithm for Predictive Model → Predictive Model
Root causes → Learning Algorithm for Predictive Model
Prediction: New failure → Container Logs → Error Template Extraction → Feature vector → Predictive Model → Root cause of the failure

27. Predictive Models
• Shallow Learning
  – Logistic Regression
  – Random forests
  Very easy to implement these in Spark
• Deep Learning
  – Neural networks

28. Predicting the Root Cause of Failures
• Training and testing with injected failures
• Test to train data set ratio 75% to 25%
• Models: logistic regression, random forests
Work with deep learning is in progress
[Chart: Accuracy Score [%], axis from 80 to 100, for Logistic Regression and Random Forests with TF-IDF and Doc2Vec features]
See our talk at Strata, NY 2017 for more details
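
The evaluation loop behind an accuracy score like this can be sketched as follows. Everything here is an assumption for illustration: the toy examples, the helper names, and above all the model, which is a majority-class baseline standing in for the logistic regression and random forest models named on the slide. The split fraction is a parameter, so either reading of the slide's 75%/25% ratio can be plugged in.

```python
import random
from collections import Counter

def split(examples, first_fraction, seed=0):
    """Shuffle reproducibly and cut the examples into two parts."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_majority_baseline(train_set):
    """Placeholder model: always predicts the most frequent training label."""
    most_common = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda features: most_common

def accuracy(predict, test_set):
    """Percentage of test examples whose predicted label matches the true one."""
    correct = sum(1 for features, label in test_set if predict(features) == label)
    return 100.0 * correct / len(test_set)

# Toy (features, label) pairs; real features would be TF-IDF or Doc2Vec vectors.
examples = [([i], "OOME" if i % 4 else "Invalid input") for i in range(8)]

train_set, test_set = split(examples, first_fraction=0.75)
model = train_majority_baseline(train_set)
print(f"accuracy: {accuracy(model, test_set):.1f}%")
```

Swapping in a real classifier only requires replacing `train_majority_baseline` with a function that returns a `predict(features)` callable.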

29. Let us take three (hard) tasks
• Failures in Spark
• SLA management for real-time data pipelines
• Application autotuning