Automated Time Series Analysis Using Deep Learning, Ray, and Analytics Zoo

This talk introduces the features and application scenarios of automated time series analysis built on deep learning, Ray, and Analytics Zoo.

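A time series is a sequence of data points ordered by the time at which they occur. Time series forecasting uses the characteristics of an event over a past period to predict its characteristics at future times. Many real-world applications rely on time series forecasting, such as network quality analysis for telcos, log analysis for data center operations, and predictive maintenance for high-value equipment. Forecasting can also serve as a preliminary step for anomaly detection: an alert is raised when the actual value deviates too far from the predicted value.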
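
The forecast-then-alert idea can be sketched in a few lines of Python. This is a minimal illustration and not the method used in Analytics Zoo: the function name `flag_anomalies`, the fixed absolute `threshold`, and the sample values are all assumptions made for the example.

```python
def flag_anomalies(actual, predicted, threshold):
    """Return the indices where |actual - predicted| exceeds the threshold.

    Hypothetical helper for illustration; a fixed absolute threshold is an
    assumption, not something prescribed by the talk.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [
        i
        for i, (a, p) in enumerate(zip(actual, predicted))
        if abs(a - p) > threshold
    ]


# Made-up observations and forecasts; only the fourth point deviates strongly.
actual = [100.0, 102.0, 98.0, 150.0, 101.0]
predicted = [101.0, 100.0, 99.0, 100.0, 100.0]

print(flag_anomalies(actual, predicted, threshold=10.0))  # prints [3]
```

In practice the threshold would likely be derived from the distribution of forecast errors on historical data rather than chosen by hand.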

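Traditional time series forecasting methods typically extrapolate curves using descriptive (statistical) models. Such methods usually make assumptions about patterns in the data and decompose the series into components such as seasonality, trend, and noise. Newer machine learning methods make fewer and looser assumptions about the data. Neural network models in particular usually treat time series forecasting as a sequence modeling problem, and they have recently been applied successfully to time series analysis.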
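On the other hand, building a machine learning application for time series forecasting is tedious and requires considerable expertise. To provide an easy-to-use forecasting tool, we apply automated machine learning (AutoML) to time series forecasting and automate feature generation, model selection, and hyperparameter tuning. Our tool is built on Ray (a distributed framework for advanced AI applications, open-sourced by UC Berkeley RISELab) and is offered to users as part of Analytics Zoo (a unified big data analytics and AI platform open-sourced by Intel).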
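
Treating forecasting as a sequence modeling problem usually starts by turning the raw series into (past window, future target) training pairs. The sketch below shows that step in plain Python. The names `make_windows`, `past_len`, and `horizon` are hypothetical and chosen for this example; they are not taken from the Analytics Zoo API.

```python
def make_windows(series, past_len, horizon=1):
    """Slice a series into (past window, future values) training pairs.

    Each pair uses `past_len` consecutive points as the model input and the
    next `horizon` points as the target. Illustrative helper only.
    """
    samples = []
    last_start = len(series) - past_len - horizon
    for start in range(last_start + 1):
        x = series[start:start + past_len]
        y = series[start + past_len:start + past_len + horizon]
        samples.append((x, y))
    return samples


series = [10, 12, 13, 15, 18, 21]
for x, y in make_windows(series, past_len=3):
    print(x, "->", y)
# prints:
# [10, 12, 13] -> [15]
# [12, 13, 15] -> [18]
# [13, 15, 18] -> [21]
```

The window length is itself a tunable parameter, which is one reason feature generation is a natural target for automation.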


1. Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo
Intel Analytics Zoo Team
Mar 13, 2020

2. Agenda
• Background
  • Introduction of Analytics Zoo
  • Background about Time Series Analysis
  • Background about AutoML and Ray
• Time Series Analysis using AutoML and Ray on Analytics Zoo
• Use Case Sharing

3. Background

4. What is Analytics Zoo
• BigDL: Distributed, High-Performance Deep Learning Framework for Apache Spark
  https://github.com/intel-analytics/bigdl
• Analytics Zoo: Unified Analytics + AI Platform. Distributed TensorFlow, Keras, PyTorch and BigDL on Apache Spark
  https://github.com/intel-analytics/analytics-zoo
Accelerating Data Analytics + AI Solutions At Scale

5. Unified Big Data Analytics and AI Platform: Seamless Scaling from Laptop to Production
[Diagram: production data pipeline in three stages: prototype on laptop using sample data; experiment on clusters with history data; production deployment w/ distributed data pipeline]
• Easily prototype the integrated data analytics & AI solution
• “Zero” code change from laptop to distributed cluster
• Directly access production data (Hadoop/Hive/HBase) without data copy
• Seamlessly deployed on production big data clusters

6. Analytics Zoo: Unified Big Data Analytics and AI Platform
[Architecture diagram, top to bottom]
• Models & Algorithms: Recommendation, Time Series, Computer Vision, NLP
• ML Workflow: AutoML for Time Series, Automatic Cluster Serving
• Integrated Analytics & AI Pipelines: Distributed TensorFlow & PyTorch on Spark, RayOnSpark, Spark Dataframes & ML Pipelines for DL, Model Serving
• Library & Distributions (Cloudera/Databricks/…), Distributed Analytics Framework (Spark/Flink/Ray/…), DL Frameworks (TF/PyTorch/…), Python Libraries (Numpy/Pandas/…)
https://github.com/intel-analytics/analytics-zoo

7. Time Series Analysis
• Time Series data
  • A series of data that is observed sequentially in time.
  • Numerical & unstructured
  • Stock prices, sales volume, CPU/IO monitoring data, etc.
• Example of time series analysis
  • Product demand prediction
  • Network quality analysis
  • Predictive maintenance for high-value equipment
[Figure: Total volume of taxi passengers in NYC from 2014/07-2015/02 (source: https://github.com/intel-analytics/analytics-zoo/blob/master/apps/anomaly-detection/anomaly-detection-nyc-taxi.ipynb)]

8. AutoML Overview
[Figure source: Yao, Q., Wang, et al., "Taking the Human out of Learning Applications: A Survey on Automated Machine Learning"]

9. Ray and Ray On Spark
• Ray
  • A distributed framework for emerging AI applications
• RayOnSpark
  • Directly run Ray programs on big data cluster
  • Seamlessly integrate Ray into Spark data processing pipeline
https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a

10. Time Series Analysis using AutoML and Ray on Analytics Zoo

11. Time Series Solution In Analytics Zoo
• Time series Applications
  • Time series forecasting
  • Anomaly detection
  • Time Series Clustering
  • etc
• AutoML Framework
• Seamless scaling
• Full-stack Intel SW+HW Optimization w/ Analytics Zoo
[Architecture diagram, top to bottom: User Applications and the Time Series Solution (Trend Prediction, Anomaly Detection, …); Built-in Algorithms and Models (Recommendation, Time Series, Computer Vision, NLP); ML Workflow (AutoML with Feature Generation, Model Selection, Hyper-Parameter Tuning, …; Cluster Serving); Integrated Analytics and AI Pipelines (Distributed TensorFlow & PyTorch on Spark, RayOnSpark, Spark Dataframes & ML Pipelines for DL, Model Inference); Laptop, K8s Cluster, YARN Cluster, Spark Cluster]

12. AutoML + Time Series Analysis Framework In Analytics Zoo
• AutoML Framework
  • FeatureTransformer
  • Model
  • SearchEngine
  • Pipeline
• Time Series Prediction w/ AutoML
  • TimeSequencePredictor
  • TimeSequencePipeline
https://medium.com/riselab/scalable-automl-for-time-series-prediction-using-ray-and-analytics-zoo-b79a6fd08139

13. Typical Workflow of Training w/ AutoML
[Workflow diagram]
• Inputs: search presets, a FeatureTransformer with tunable parameters, and a Model with tunable parameters
• The SearchEngine launches trial jobs on Ray Tune; each trial runs a different combination of hyper parameters
• The search returns the best model/parameters
• Output: a Pipeline configured with the best parameters/model
• Workflow implemented in TimeSequencePredictor
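
The trial-and-select loop in this workflow can be sketched in plain Python. This is a toy illustration under loud assumptions: `run_trial`, `random_search`, and `search_space` are hypothetical names, a moving-average forecaster stands in for the real feature transformer and neural model, and a sequential random search stands in for the trials that the SearchEngine schedules on Ray Tune. None of it is the Analytics Zoo API.

```python
import random


def run_trial(config, series):
    """Score one hyperparameter combination (lower is better).

    Toy stand-in for a real trial: forecasts each point as the mean of the
    previous `window` points and returns the mean squared error.
    """
    window = config["window"]
    if window >= len(series):
        raise ValueError("window must be shorter than the series")
    errors = []
    for t in range(window, len(series)):
        forecast = sum(series[t - window:t]) / window
        errors.append((series[t] - forecast) ** 2)
    return sum(errors) / len(errors)


def random_search(series, search_space, num_trials, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(num_trials):
        # Each trial draws a different combination of tunable parameters.
        config = {name: rng.choice(choices) for name, choices in search_space.items()}
        score = run_trial(config, series)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


series = [float(i) for i in range(10)]   # made-up linear ramp
search_space = {"window": [1, 2, 4]}     # hypothetical search preset
best_config, best_score = random_search(series, search_space, num_trials=20)
print("best config:", best_config, "mse:", best_score)
```

The returned best configuration plays the role of the "best parameters/model" from which the final pipeline is configured; a distributed search engine would run the trials in parallel rather than in a loop.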

14. General API Usage
• Training a Predictor
  • fit (w/ automl)
  • recipe
  • distributed
• Using a Pipeline
  • save/load
  • evaluate/predict
  • fit (incremental)

15. Application: Time Series Forecasting

16. Application: Anomaly Detection

17. Use case sharing

18. Time Series Based Network Quality Prediction in SK Telecom
[Architecture diagram. Recoverable labels: Data Loading (File, HTTP, Kafka); Data Source APIs; Spark-SQL; Data Loader; Preprocess; RDD of Tensor; Data Model; Model Code of TF; DL Training & Inferencing; DRAM Store; Flash Store; tiering; SIMD Acceleration; components marked "forked" or "customized"]
https://databricks.com/session_eu19/apache-spark-ai-use-case-in-telco-network-quality-analysis-and-prediction-with-geospatial-visualization

19. Unsupervised Time Series Anomaly Detection for Baosight
https://software.intel.com/en-us/articles/lstm-based-time-series-anomaly-detection-using-analytics-zoo-for-apache-spark-and-bigdl

20. Yunda: Anomaly Detection for AIOps
• AIOps
  • Monitoring log/metrics analysis for data center operations
  • AIOps helps reduce cost and MTTR (mean time to repair)
https://www.intel.cn/content/www/cn/zh/analytics/artificial-intelligence/yunda-brings-quality-change-to-the-express-delivery-industry.html

21. More Information about AutoML + Time Series in Analytics Zoo
• Scalable AutoML for Time Series Analysis
  • Source code as a branch of analytics-zoo repo @ https://github.com/intel-analytics/analytics-zoo/tree/automl
  • README @ https://github.com/intel-analytics/analytics-zoo/blob/automl/pyzoo/zoo/automl/README.md
  • Blog https://medium.com/riselab/scalable-automl-for-time-series-prediction-using-ray-and-analytics-zoo-b79a6fd08139
• Anomaly Detection Reference Examples
  • Time Series Forecast w/ AutoML https://github.com/intel-analytics/analytics-zoo/blob/automl/apps/automl
  • Anomaly Detection based on Forecast https://github.com/intel-analytics/analytics-zoo/tree/master/apps/anomaly-detection
  • Anomaly Detection based on AutoEncoder https://github.com/intel-analytics/analytics-zoo/tree/master/apps/anomaly-detection-hd
• Real-world Customer Applications
  • Baosight’s anomaly detection for intelligent equipment management. For details, refer to http://software.intel.com/en-us/articles/lstm-based-time-series-anomaly-detection-using-analytics-zoo-for-apache-spark-and-bigdl
  • Yunda anomaly detection for AIOps https://www.intel.cn/content/www/cn/zh/analytics/artificial-intelligence/yunda-brings-quality-change-to-the-express-delivery-industry.html

22. More Information on Analytics Zoo
• Project website
  • https://github.com/intel-analytics/analytics-zoo
  • https://github.com/intel-analytics/bigdl
• Tutorials
  • CVPR 2018: https://jason-dai.github.io/cvpr2018/
  • AAAI 2019: https://jason-dai.github.io/aaai2019/
• “BigDL: A Distributed Deep Learning Framework for Big Data”
  • In Proceedings of the ACM Symposium on Cloud Computing 2019 (SOCC’19)
• Use cases
  • Azure, CERN, MasterCard, Office Depot, Tencent, Midea, etc.
  • https://analytics-zoo.github.io/master/#powered-by/
