Parallelism in Spark ML Model Cross-Validation

Tuning Spark ML models with cross-validation can be an extremely expensive computation. As the number of hyperparameter combinations grows, so does the number of models to be evaluated. Spark's default configuration evaluates these models one at a time to select the best performer. When this process runs over many models, any failure of a single model's training and evaluation to fully utilize the available cluster resources is compounded for each model, leading to long run times.

1. Model Parallelism in Spark ML Cross-Validation
Nick Pentreath, Principal Engineer
Bryan Cutler, Software Engineer
DBG / June 5, 2018 / © 2018 IBM Corporation

2. About Nick
@MLnick on Twitter & Github
Principal Engineer, IBM CODAIT (Center for Open-Source Data & AI Technologies)
Machine Learning & AI
Apache Spark committer & PMC member
Author of Machine Learning with Spark
Speaker at various conferences & meetups

3. About Bryan
Software Engineer, IBM CODAIT
Apache Spark committer
Apache Arrow committer
Python, Machine Learning, OSS
@BryanCutler on Github

4. Center for Open Source Data and AI Technologies (CODAIT)
codait.org
CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.
Improving the Enterprise AI Lifecycle in Open Source.
Relaunch of the Spark Technology Center (STC) to reflect an expanded mission.

5. Agenda
• Model Tuning in Spark
• Scaling Model Tuning
• Performance Results
• Best Practices
• Future Directions in Optimizing Pipelines

6. Model Tuning in Spark

7. Model Tuning in Spark
Model selection: a workflow within a workflow.
Outer workflow: Data Ingest → Feature Processing → Feature Engineering → Model Selection → Final Model
Inner loop over candidate models: Adjust → Train → Evaluate

8. Model Tuning in Spark: Pipeline cross-validation
Spark ML Pipeline: Tokenizer → CountVectorizer → LogisticRegression
Parameters: CountVectorizer # features: 10, 100; LogisticRegression regParam: 0.001, 0.1

9. Model Tuning in Spark: Pipeline cross-validation
The parameter grid expands into four candidate pipelines:
• Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.001)
• Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.1)
• Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.001)
• Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.1)
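The grid expansion above is just a cross product of the parameter values. A minimal plain-Python sketch of the idea (this is not Spark's `ParamGridBuilder` API, just an illustration using the same stage and parameter names as the slide):

```python
# Sketch: how a parameter grid expands into candidate pipelines.
from itertools import product

num_features = [10, 100]    # CountVectorizer # features
reg_params = [0.001, 0.1]   # LogisticRegression regParam

# Every combination of the two parameters yields one candidate pipeline.
candidates = [
    {"numFeatures": n, "regParam": r}
    for n, r in product(num_features, reg_params)
]

for c in candidates:
    print(c)
# 2 x 2 parameter values -> 4 candidate pipelines
```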

10. Model Tuning in Spark: Pipeline cross-validation
Tokenizer → CountVectorizer (# features: 10, 100) → LogisticRegression (regParam: 0.001, 0.1)

11. Model Tuning in Spark: Pipeline cross-validation (diagram)


14. Model Tuning in Spark
Cross-validation is expensive!
(Based on the XKCD comic: https://xkcd.com/303/ & https://github.com/mislavcimpersak/xkcd-excuse-generator)
• 5 x 5 x 5 hyperparameters = 125 pipelines
• ... across 4 machine learning models = 500
• If training & evaluation does not fully utilize available cluster resources, that waste is compounded for each model

15. Scaling Model Tuning

16. Scaling Model Tuning: Parallel model evaluation
The same four candidate pipelines, now evaluated in parallel:
• Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.001)
• Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.1)
• Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.001)
• Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.1)


20. Scaling Model Tuning: Parallel model evaluation
• Added in SPARK-19357 and SPARK-21911 (PySpark)
• The parallelism parameter governs the maximum number of models to be trained at once
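In Spark this is exposed as the `parallelism` param on `CrossValidator` (e.g. `CrossValidator(..., parallelism=3)` in PySpark, Spark 2.3+). The underlying idea can be sketched in plain Python: a thread pool whose size caps how many models are fitted concurrently. Here `fit_and_evaluate` is a hypothetical stand-in for fitting one pipeline and scoring it on the validation fold, not Spark code:

```python
from concurrent.futures import ThreadPoolExecutor

def fit_and_evaluate(params):
    # Stand-in for fitting one candidate pipeline and evaluating it;
    # returns (params, metric). The metric here is a dummy that peaks
    # at regParam == 0.01 purely for illustration.
    return params, -abs(params["regParam"] - 0.01)

param_grid = [{"regParam": r} for r in (0.001, 0.01, 0.1)]
parallelism = 3  # at most 3 models trained at once

with ThreadPoolExecutor(max_workers=parallelism) as pool:
    results = list(pool.map(fit_and_evaluate, param_grid))

best_params, best_metric = max(results, key=lambda pr: pr[1])
print(best_params)  # -> {'regParam': 0.01}
```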

21. Scaling Model Tuning: Parallel model evaluation
Tokenizer → CountVectorizer (# features: 10, 100) → LogisticRegression (regParam: 0.001, 0.1)

22. Scaling Model Tuning: Parallel model evaluation (diagram)


25. Scaling Model Tuning: Implementation considerations
• The parallelism parameter sets the size of a thread pool under the hood
• A dedicated ExecutionContext is created to avoid deadlocks with using the default thread pool
• Futures are used instead of parallel collections, which is more flexible
• Model-specific parallel fitting implementations are not supported (SPARK-22126)
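The dedicated-pool point deserves a word: if a model fit itself schedules sub-tasks on the same bounded pool that runs the fits, the fits can occupy every worker while waiting on sub-tasks that can never be scheduled. A Python sketch of the safe pattern, mirroring the Scala-side rationale with hypothetical fit and sub-task functions (not Spark's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

# Inner pool: work submitted *from inside* each model fit.
inner_pool = ThreadPoolExecutor(max_workers=2)

def fit_model(reg_param):
    # A fit that fans out sub-tasks and blocks on their results. If these
    # sub-tasks were queued to the SAME size-limited pool that runs the
    # fits, the fits could fill every worker and deadlock.
    futures = [inner_pool.submit(lambda x: x * x, i) for i in range(4)]
    return reg_param, sum(f.result() for f in futures)

# Outer (dedicated) pool: one worker per concurrently trained model.
with ThreadPoolExecutor(max_workers=3) as outer_pool:
    results = list(outer_pool.map(fit_model, [0.001, 0.01, 0.1]))

inner_pool.shutdown()
print(results)  # each fit returns (regParam, 0 + 1 + 4 + 9)
```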

26. Scaling Model Tuning: Performance tests
• Data size: 100,000 -> 5,000,000 samples
• Compared parallel CV to serial CV with a varying number of samples
• Number of features: 10
• Number of partitions: 10
• Simple LogisticRegression with regParam and fitIntercept; parameter grid size 12
• Number of CV folds: 5
• Parallelism: 3
• Measured elapsed time for cross-validation
• Standalone cluster with 30 cores

27. Scaling Model Tuning: Results
• Roughly 2.4x speedup
• Stays roughly constant as the number of samples increases

28. Scaling Model Tuning: Best practices
• A single integer parameter is the only thing you can set (for now)
• Too low => under-utilizes resources
• Too high => can lead to memory issues or an overloaded cluster
• Rough rule: # cores / # partitions, but it depends on data and model sizes
• On a mid-sized cluster, probably <= 10
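The rule of thumb above can be written down as a small helper. This is a hypothetical function encoding the slide's heuristic, not a Spark API, and the slide itself cautions that the best value depends on data and model sizes:

```python
def suggest_parallelism(num_cores, num_partitions, cap=10):
    """Heuristic from the slide: roughly # cores / # partitions,
    capped (e.g. <= 10 on a mid-sized cluster), and at least 1."""
    return max(1, min(num_cores // num_partitions, cap))

# The 30-core test cluster with 10 partitions suggests parallelism 3,
# matching the setting used in the performance tests above.
print(suggest_parallelism(30, 10))  # -> 3
```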

29. Optimizing Tuning for Pipeline Models