TimeSeriesForecastingAutoML_Shan_191116

Topic 2:
Time Series Analysis with Distributed Automated Machine Learning

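Shan Yu is a software engineer on Intel's big data analytics team. She currently focuses on developing automated machine learning components for Analytics Zoo, a big data and AI platform. Before joining Intel, she received her bachelor's and master's degrees from Zhejiang University.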

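Abstract:
Building a machine learning application for time series forecasting is tedious and requires considerable experience. To provide a simple, easy-to-use forecasting tool, we apply automated machine learning to time series forecasting and automate feature generation, model selection, and hyperparameter tuning. Our tool is built on Ray (a distributed framework for advanced AI applications, open-sourced by UC Berkeley RISELab) and is available to users as part of Analytics Zoo (a unified big data analytics and AI platform open-sourced by Intel).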

1. Time Series Analysis with Distributed Automated Machine Learning
Shengsheng Huang, Shan Yu, Jason Dai
Intel Analytics Zoo Team
Nov 11, 2019

2. Agenda
A Unified Analytics + AI Platform: Analytics Zoo
Background about Time Series Forecasting
• Time Series Forecasting and its applications
• Pain points & how we address them
Time Series Forecasting with AutoML in Analytics Zoo
• Architecture & Training Workflow
• Features & Usage

3. A Unified Analytics + AI Platform

4. What is Analytics Zoo
Accelerating Data Analytics + AI Solutions At Scale
• BigDL: Distributed, High-Performance Deep Learning Framework for Apache Spark
  https://github.com/intel-analytics/bigdl
• Analytics Zoo: Unified Analytics + AI Platform for Distributed TensorFlow, Keras, PyTorch and BigDL on Apache Spark
  https://github.com/intel-analytics/analytics-zoo

5. Unified Big Data Analytics and AI Platform
Seamless Scaling from Laptop to Production
[Figure: Prototype on laptop using sample data; Experiment on clusters with history data; Production deployment w/ distributed data pipeline; Production Data pipeline]
• Easily prototype the integrated data analytics & AI solution
• “Zero” code change from laptop to distributed cluster
• Directly access production data (Hadoop/Hive/HBase) without data copy
• Seamlessly deployed on production big data clusters

6. Analytics Zoo: Unified Big Data Analytics and AI Platform
• Use case: Recommendation, Anomaly Detection, Text Classification, Text Matching
• Model: Image Classification, Object Detection, Seq2Seq, Transformer, BERT
• Feature Engineering: image, 3D image, text, time series
• Integrated Analytics & AI Pipelines: tfpark (Distributed TF on Spark); Distributed Keras w/ autograd on Spark; nnframes (Spark Dataframes & ML Pipelines for Deep Learning); Distributed Model Serving (batch, streaming & online)
• Backend/Library: TensorFlow, Keras, PyTorch, BigDL, NLP Architect, Apache Spark, Apache Flink, Ray, MKLDNN, OpenVINO, Intel® Optane™ DCPMM, DL Boost (VNNI)
https://github.com/intel-analytics/analytics-zoo

7. More Information on Analytics Zoo
• Project website
  • https://github.com/intel-analytics/analytics-zoo
• Tutorials
  • CVPR 2018: https://jason-dai.github.io/cvpr2018/
  • AAAI 2019: https://jason-dai.github.io/aaai2019/
• “BigDL: A Distributed Deep Learning Framework for Big Data”
  • In Proceedings of the ACM Symposium on Cloud Computing 2019 (SoCC ’19)
• Use cases
  • Azure, CERN, MasterCard, Office Depot, Tencent, Midea, etc.
  • https://analytics-zoo.github.io/master/#powered-by/

8. Background about Time Series Forecasting

9. Time Series Data
• What is Time Series
  • A time series is a series of data points indexed/listed in time order.
  • Usually numerical
    • scalar (univariate)
    • vector (multivariate)
  • Unstructured data (video, songs, etc.)
• Examples
  • Stock prices, sales volume, IoT sensor readings, CPU/IO monitoring data, etc.
[Figure: Total volume of taxi passengers in NYC from 2014/07-2015/02 (source: https://github.com/intel-analytics/analytics-zoo/blob/master/apps/anomaly-detection/anomaly-detection-nyc-taxi.ipynb)]

10. Time Series Forecasting
• What is Time Series Forecasting
  • Given all history observations $y_1, \dots, y_t$, predict the values of the next $h$ steps, $y_{t+1}, \dots, y_{t+h}$
  • Usually only look back $k$ steps, $y_{t-k+1}, \dots, y_t$
[Figure: timeline $y_1, y_2, \dots, y_{t-k+1}, \dots, y_t, y_{t+1}, \dots, y_{t+h}$, marked "look back k steps" and "predict h steps forward"]
• Applications
  • Sales volume/demand prediction, etc.
  • As the 1st step for Anomaly Detection
  • AIOps (anomaly detection, root cause analysis, resource planning, etc.)
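The look-back/predict-forward framing can be illustrated with a small windowing helper. This is a minimal sketch in plain Python, not code from Analytics Zoo; the function name `make_windows` and its parameters are assumptions made for illustration.

```python
def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (past, future) training pairs.

    Each input holds `lookback` consecutive values (the k look-back steps);
    the matching target holds the next `horizon` values (the h steps to predict).
    """
    series = list(series)
    n_samples = len(series) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    inputs = [series[i:i + lookback] for i in range(n_samples)]
    targets = [series[i + lookback:i + lookback + horizon] for i in range(n_samples)]
    return inputs, targets


# Ten observations, look back k=3 steps, predict h=2 steps forward.
inputs, targets = make_windows(range(10), lookback=3, horizon=2)
print(len(inputs), inputs[0], targets[0])
```

Each (input, target) pair can then be fed to any supervised model, which is how a forecasting problem is usually turned into a regression problem.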

11. Pain points and how we address them
• Pain Points of Traditional Methods
  • Widely-used statistics-based models (AR, MA, ES, ARIMA, etc.)
  • Hard to capture complex non-linear, cross-series patterns in (multivariate) data
  • Make (unreasonable) assumptions about underlying distribution
  • Some methods are computationally costly (e.g. Gaussian Process based methods)
  • Hard to integrate & scale with production solutions/pipelines
• What’s in Analytics Zoo
  • Neural network based (hybrid) models: more flexible and expressive
  • Additional data processing, features, and metrics for time series
  • AutoML for hyper-parameter tuning, model selection, feature selection, etc.
  • Scalability and E2E Pipelines

12. Time Series Forecasting w/ AutoML in Analytics Zoo

13. AutoML Overview
Source: “Taking the Human out of Learning Applications: A Survey on Automated Machine Learning”, Yao, Q., Wang, et al.

14. AutoML + Time Series Prediction in Analytics Zoo
• AutoML Framework
  • FeatureTransformer
  • Model
  • SearchEngine
  • Pipeline
• Time Series Prediction w/ AutoML
  • TimeSequencePredictor
  • TimeSequencePipeline
https://medium.com/riselab/scalable-automl-for-time-series-prediction-using-ray-and-analytics-zoo-b79a6fd08139
*Other names and brands may be claimed as the property of others.

15. Typical Workflow of Training w/ AutoML
[Figure: workflow implemented in TimeSequencePredictor. Search presets configure the SearchEngine, which launches trial jobs on Ray Tune; each trial runs a different combination of hyperparameters over a FeatureTransformer and a Model, both with tunable parameters. The best model/parameters are used to build a Pipeline configured with the best parameters/model.]
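The trial-based search in this workflow can be sketched as a plain random search. The example below is a sequential, single-process stand-in and does not use Ray Tune or any Analytics Zoo class; the toy moving-average model and the names `run_trial` and `random_search` are assumptions made for illustration.

```python
import random


def moving_average_forecast(history, lookback):
    """Toy model: predict the next value as the mean of the last `lookback` values."""
    window = history[-lookback:]
    return sum(window) / len(window)


def run_trial(config, train, valid):
    """One trial: score a single hyperparameter combination by validation MSE."""
    history = list(train)
    squared_errors = []
    for actual in valid:
        predicted = moving_average_forecast(history, config["lookback"])
        squared_errors.append((predicted - actual) ** 2)
        history.append(actual)  # walk forward one step
    return sum(squared_errors) / len(squared_errors)


def random_search(search_space, train, valid, num_trials, seed=0):
    """Sample `num_trials` configurations and keep the one with the lowest MSE."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(num_trials):
        config = {name: rng.choice(choices) for name, choices in search_space.items()}
        score = run_trial(config, train, valid)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


series = [float(i % 4) for i in range(40)]  # repeating 0, 1, 2, 3 pattern
train, valid = series[:32], series[32:]
space = {"lookback": [1, 2, 4, 8]}
best_config, best_score = random_search(space, train, valid, num_trials=20)
print(best_config, best_score)
```

In a distributed setting the trials are independent of one another, which is what allows a framework such as Ray Tune to run them in parallel.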

16. General API Usage
• Training a Predictor
  • fit (w/ automl)
  • recipe
  • distributed
• Using a Pipeline
  • save/load
  • evaluate/predict
  • fit (incremental)
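The two-object pattern listed above (a predictor whose fit returns a pipeline that can be saved, loaded, evaluated and used for prediction) can be sketched with toy classes. `ToyPredictor` and `ToyPipeline` are hypothetical stand-ins written for this illustration; they are not the Analytics Zoo classes and do not reproduce their signatures.

```python
import json
import os
import tempfile


class ToyPipeline:
    """Hypothetical stand-in for a fitted pipeline (not the Analytics Zoo class)."""

    def __init__(self, lookback):
        self.lookback = lookback

    def predict(self, history):
        """Forecast the next value as the mean of the last `lookback` values."""
        window = history[-self.lookback:]
        return sum(window) / len(window)

    def evaluate(self, history, actual):
        """Return the squared error of the one-step forecast."""
        return (self.predict(history) - actual) ** 2

    def save(self, path):
        """Persist the chosen configuration as JSON."""
        with open(path, "w") as f:
            json.dump({"lookback": self.lookback}, f)

    @classmethod
    def load(cls, path):
        """Rebuild a pipeline from a file written by save()."""
        with open(path) as f:
            return cls(**json.load(f))


class ToyPredictor:
    """Hypothetical stand-in for a predictor whose fit() returns a pipeline."""

    def fit(self, train, valid, lookback_choices):
        # Try each candidate and keep the one with the lowest validation error.
        def score(lookback):
            candidate = ToyPipeline(lookback)
            return sum(
                candidate.evaluate(train + valid[:i], valid[i])
                for i in range(len(valid))
            )

        return ToyPipeline(min(lookback_choices, key=score))


train, valid = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0]
pipeline = ToyPredictor().fit(train, valid, lookback_choices=[1, 2, 3])
path = os.path.join(tempfile.mkdtemp(), "toy.ppl")
pipeline.save(path)
restored = ToyPipeline.load(path)
print(restored.lookback, restored.predict(train + valid))
```

Separating the search (predictor) from the fitted artifact (pipeline) means the expensive tuning step runs once, while the saved pipeline can be reloaded wherever predictions are needed.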

17. State-of-the-Art Neural Networks for Time Series Forecasting
• Non-linear (NN) + Linear (AR)
• NN handles time series as a sequence modeling problem (strategies usually seen in NLP are used, e.g. LSTM/GRU, encoder-decoder, attention, memory networks, transformer, etc.)
Reference: “A Memory-Network Based Solution for Multivariate Time-Series Forecasting”, https://arxiv.org/abs/1809.02105
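One common way to combine a non-linear network with a linear autoregressive component is to sum their outputs. The formula below is a sketch under assumptions: it takes the AR window to be the same $k$ look-back steps used by the network, and the symbols $f_{\mathrm{NN}}$, $w_i$ and $b$ are introduced here for illustration and do not appear in the slides. The exact formulation in the cited paper may differ.

```latex
\hat{y}_{t+h}
  = \underbrace{f_{\mathrm{NN}}\left(y_{t-k+1}, \dots, y_t\right)}_{\text{non-linear part}}
  + \underbrace{\sum_{i=0}^{k-1} w_i \, y_{t-i} + b}_{\text{linear AR part}}
```

The linear term keeps the forecast sensitive to the scale of recent inputs, which a purely non-linear network can struggle to track.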

18. Future Work
• Time Series
  • Additional models (e.g. statistical, MLP, transformer, etc.)
  • Additional features (e.g. auto-encoder, etc.)
• AutoML
  • Model Ensemble
  • Neural Architecture Search

19. More Information about AutoML + Time Series in Analytics Zoo
• Resources
  • Source code as a branch of analytics-zoo repo @ https://github.com/intel-analytics/analytics-zoo/tree/automl
  • README @ https://github.com/intel-analytics/analytics-zoo/blob/automl/pyzoo/zoo/automl/README.md
  • A demo notebook @ https://github.com/intel-analytics/analytics-zoo/blob/automl/apps/automl/nyc_taxi_dataset.ipynb
  • Blog https://medium.com/riselab/scalable-automl-for-time-series-prediction-using-ray-and-analytics-zoo-b79a6fd08139
• Contact the Analytics Zoo team or community
  • Discuss it in the analytics-zoo user group @ https://groups.google.com/forum/#!forum/bigdl-user-group
  • Raise issues or questions @ https://github.com/intel-analytics/analytics-zoo/issues
  • Contact me @ shan.yu@intel.com

20. Using Analytics Zoo on Alibaba Cloud E-MR
Analytics Zoo is already integrated into the Alibaba Cloud E-MR platform.
For technical and resource support, please contact:
Email: wesley.du@intel.com
DingTalk:

22.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit intel.com/performance.
• Intel does not control or audit the design or implementation of third-party benchmark data or websites referenced in this document. Intel encourages all of its customers to visit the referenced websites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
• Optimization notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com/benchmarks.
• Intel, the Intel logo, Intel Inside, the Intel Inside logo, Intel Atom, Intel Core, Iris, Movidius, Myriad, Intel Nervana, OpenVINO, Intel Optane, Stratix, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
• *Other names and brands may be claimed as the property of others.
• © Intel Corporation