NASAC2019-What’s on-going in Spark + AI Community

NASAC 2019 Forum of Senior Maintainers of Open-Source Big Data System Software Communities
What’s on-going in Spark + AI Community


1. Shengsheng Huang, Intel Analytics Zoo team

2. Agenda
• Efforts for building unified data analytics + AI in production
• Efforts to support emerging AI applications

4. What’s New in the Spark + AI Community
Spark 3.0
• Optimizations on SQL execution (adaptive query execution, dynamic partition pruning)
• DataSourceV2
• Project Hydrogen (barrier execution mode, accelerator-aware scheduling, optimized data exchange)
• Spark Graph
• Spark on Kubernetes
• …
MLflow: ML lifecycle management
• Tracking: log code, data, config, and results of experiments, then compare and query them
• Projects: code packaging format for reproducible runs on any platform
• Models: model packaging format for sending models to diverse deployment tools
Koalas: pandas API on Spark
Delta Lake: ACID layer upon data lakes
All for productivity
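To make one of the SQL optimizations listed above more concrete, here is a minimal pure-Python sketch of the idea behind dynamic partition pruning: evaluate the filter on a small dimension table first, then scan only the fact-table partitions whose keys survive. The tables, column names, and function names are hypothetical, and this is a toy model of the idea rather than how Spark implements it.

```python
# Toy model of dynamic partition pruning; not Spark's implementation.
from typing import Dict, List, Set, Tuple

Row = Tuple[str, int, float]  # (date, item_id, amount)

# Hypothetical fact table, stored as one partition per date.
FACT_PARTITIONS: Dict[str, List[Row]] = {
    "2019-11-01": [("2019-11-01", 1, 10.0), ("2019-11-01", 2, 5.0)],
    "2019-11-02": [("2019-11-02", 1, 7.5)],
    "2019-11-03": [("2019-11-03", 3, 2.0), ("2019-11-03", 1, 1.0)],
}

# Hypothetical small dimension table: date -> is_holiday.
DIM_DATES: Dict[str, bool] = {
    "2019-11-01": False,
    "2019-11-02": True,
    "2019-11-03": False,
}


def join_without_pruning() -> Tuple[List[Row], int]:
    """Scan every fact partition, then keep rows whose date is a holiday."""
    rows: List[Row] = []
    scanned = 0
    for date, partition in FACT_PARTITIONS.items():
        scanned += 1
        rows.extend(r for r in partition if DIM_DATES[date])
    return rows, scanned


def join_with_pruning() -> Tuple[List[Row], int]:
    """Evaluate the dimension filter first, then scan only matching partitions."""
    wanted: Set[str] = {d for d, is_holiday in DIM_DATES.items() if is_holiday}
    rows: List[Row] = []
    scanned = 0
    for date in wanted:
        scanned += 1
        rows.extend(FACT_PARTITIONS.get(date, []))
    return rows, scanned


full_rows, full_scans = join_without_pruning()
pruned_rows, pruned_scans = join_with_pruning()
print(full_scans, pruned_scans)  # prints "3 1"
```

Both functions return the same joined rows; the pruned version simply touches fewer partitions, which is where the savings would come from on a large partitioned table.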

5. Rationale Behind the Efforts in the Community
“Hidden Technical Debt in Machine Learning Systems”, Sculley et al., Google, NIPS 2015 paper
• Integration/Injection of heterogeneous data models/sources, computation models, software/hardware components, … (e.g. DataSourceV2, Project Hydrogen, Spark Graph)
• E2E workflow, ML lifecycle, serving, deployment, orchestration, … (e.g. MLflow, KubeFlow, Seldon, TFX)
• Efficiency & reliability (e.g. SQL-related optimizations, Delta Lake)
• Friendly APIs (e.g. Koalas)

6. AI on Big Data
• High-performance deep learning framework for Apache Spark
• Unified analytics + AI platform: distributed TensorFlow*, PyTorch*, Keras* and BigDL on Apache Spark
Accelerating data analytics + AI solutions deployment at scale
*Other names and brands may be claimed as the property of others.

7. Analytics Zoo: Unified End-to-End Data Analytics + AI Platform
• Use case: Recommendation, Anomaly Detection, Text Classification, Text Matching
• Model: Image Classification, Object Detection, Seq2Seq, Transformer, BERT
• Feature Engineering: image, 3D image, text, time series
• Integrated Analytics/AI Pipelines: tfpark (Distributed TF on Spark); Distributed Keras/PyTorch on Spark; nnframes (Spark Dataframes & ML Pipelines for Deep Learning); Distributed Model Serving (batch, streaming & online)
• Backend/Library: TensorFlow, Keras, PyTorch, BigDL, NLP Architect, Apache Spark, Apache Flink, Ray, MKLDNN, OpenVINO, Intel® Optane™ DCPMM, DL Boost (VNNI)

8. Distributed TensorFlow on Spark
Write TensorFlow code inline in PySpark program
• Data wrangling and analysis using PySpark
• Deep learning model development using TensorFlow or Keras
• Distributed training / inference on Spark

    # pyspark code
    train_rdd = spark.hadoopFile(…).map(…)
    dataset = TFDataset.from_rdd(train_rdd, …)

    # tensorflow code
    import tensorflow as tf
    slim = tf.contrib.slim
    images, labels = dataset.tensors
    with slim.arg_scope(lenet.lenet_arg_scope()):
        logits, end_points = lenet.lenet(images, …)
    loss = tf.reduce_mean(
        tf.losses.sparse_softmax_cross_entropy(
            logits=logits, labels=labels))

    # distributed training on Spark
    optimizer = TFOptimizer.from_loss(loss, Adam(…))
    optimizer.optimize(end_trigger=MaxEpoch(5))
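The slide does not say how the distributed training step works internally. One common scheme is synchronous data-parallel SGD, where each data partition computes a gradient on its own shard and the averaged gradient drives a single update. The sketch below illustrates that scheme in plain Python under that assumption; the one-parameter model, the data, and the function names are hypothetical and are not Analytics Zoo code.

```python
# Toy synchronous data-parallel SGD; an illustration, not Analytics Zoo code.
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean of 0.5 * (w * x - y) ** 2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def train(shards: List[List[Sample]], lr: float = 0.02, steps: int = 100) -> float:
    """Fit y = w * x by averaging per-shard gradients at every step."""
    w = 0.0
    for _ in range(steps):
        # On a cluster, each shard's gradient would be computed by its own task.
        grads = [local_gradient(w, shard) for shard in shards]
        # Synchronous step: average the per-shard gradients, then update once.
        w -= lr * sum(grads) / len(grads)
    return w


# Hypothetical data for y = 3x, split into four equally sized shards.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i:i + 2] for i in range(0, len(data), 2)]
w = train(shards)
print(round(w, 6))  # prints 3.0
```

Because the shards have equal size, averaging the shard gradients gives the same update as one pass over the full dataset, which is why the result matches single-machine training in this toy case.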

9. Object Detection and Image Feature Extraction at JD*
• Reuse existing Hadoop/Spark clusters for deep learning with no changes (image search, IP protection, etc.)
• Efficiently scale out on Spark with superior performance (3.83x speed-up vs. GPU servers) as benchmarked by JD

10. Product Recommendations in Office Depot*
us/articles/real-time-product-recommendations-for-office-depot-using-apache-spark-and-analytics-zoo-on

11. Computer Vision Based Product Defect Detection in Midea*
tensorflow-on-analytics

12. Particle Classifier for High Energy Physics in CERN*
Deep learning pipeline for physics data
Model serving using Apache Kafka and Spark

13. Wrap Up
The community is making efforts to make Spark a unified Analytics + AI platform.
Analytics Zoo is also working towards a similar goal, by
• Seamlessly integrating various components, e.g. TensorFlow, PyTorch, BigDL, etc.
• Providing full-stack optimizations involving hardware/software (VNNI, MKL-DNN, OpenVINO, etc.)
• Providing ease of use, end-to-end, from laptop to production platform
We are both contributors and practitioners. We use, learn, and contribute.

14. Agenda
• Efforts for building unified data analytics + AI in production
• Efforts to support emerging AI applications

15. Towards General AI
Strategies to build AI for game playing, robots, autonomous driving, etc.
• Imitation Learning: Expert Demonstrations → State/Action Pairs → Supervised Learning
• Deep Reinforcement Learning (DRL): the Agent sends an Action to the Environment and receives a State and a Reward in return
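The two strategies on this slide can be contrasted on a toy problem. The sketch below uses a hypothetical five-cell corridor with the goal at the right end: imitation learning fits a per-state majority vote to expert state/action pairs, while reinforcement learning uses tabular Q-learning driven only by reward. To keep the run deterministic, the Q-learning loop sweeps every state/action pair instead of exploring at random, and a lookup table stands in for the deep network that DRL would use; all names and constants are assumptions for illustration.

```python
# Toy contrast of imitation learning and reinforcement learning; illustration only.
from collections import Counter
from typing import Dict, List, Tuple

GOAL = 4            # hypothetical corridor with cells 0..4, goal at cell 4
ACTIONS = (-1, 1)   # move left or right


def step(state: int, action: int) -> Tuple[int, float, bool]:
    """Environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def expert_policy(state: int) -> int:
    """Hypothetical expert: always walk right."""
    return 1


def imitation_learning(num_demos: int = 3) -> Dict[int, int]:
    # 1) Expert demonstrations become state/action pairs.
    pairs: List[Tuple[int, int]] = []
    for _ in range(num_demos):
        state, done = 0, False
        while not done:
            action = expert_policy(state)
            pairs.append((state, action))
            state, _, done = step(state, action)
    # 2) Supervised learning: a majority-vote classifier per state.
    votes: Dict[int, Counter] = {}
    for state, action in pairs:
        votes.setdefault(state, Counter())[action] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}


def q_learning(sweeps: int = 200, alpha: float = 0.5, gamma: float = 0.9) -> Dict[int, int]:
    q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(GOAL):
            for a in ACTIONS:
                next_state, reward, done = step(s, a)
                future = 0.0 if done else max(q[(next_state, b)] for b in ACTIONS)
                # Move the estimate towards reward plus discounted future value.
                q[(s, a)] += alpha * (reward + gamma * future - q[(s, a)])
    # Greedy policy with respect to the learned Q-values.
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}


il_policy = imitation_learning()
rl_policy = q_learning()
print(il_policy == rl_policy)  # prints True
```

Both routes arrive at the same "always move right" policy here; the difference is the supervision signal, labeled expert actions in one case and a scalar reward in the other.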

16. Parallel Architecture for Deep RL
Reference: “Massively Parallel Methods for Deep Reinforcement Learning”
[Figure: architecture diagram with multiple parallel Actors]
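One ingredient of such parallel architectures is a set of actors that gather experience at the same time and feed it to a learner. The sketch below shows only that ingredient, under simplifying assumptions: threads stand in for separate processes or machines, each actor takes random actions in its own copy of a hypothetical corridor environment, and the merged list plays the role of a replay buffer. It is not the architecture from the referenced paper, and all names are illustrative.

```python
# Toy parallel actors feeding one replay buffer; illustration only.
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Transition = Tuple[int, int, float, int]  # (state, action, reward, next_state)
GOAL = 4  # hypothetical corridor with cells 0..4, goal at cell 4


def run_actor(seed: int, num_steps: int) -> List[Transition]:
    """One actor explores its own copy of the corridor with random actions."""
    rng = random.Random(seed)
    state = 0
    transitions: List[Transition] = []
    for _ in range(num_steps):
        action = rng.choice((-1, 1))
        next_state = min(max(state + action, 0), GOAL)
        reward = 1.0 if next_state == GOAL else 0.0
        transitions.append((state, action, reward, next_state))
        # Start a new episode after reaching the goal.
        state = 0 if next_state == GOAL else next_state
    return transitions


def collect(num_actors: int = 4, num_steps: int = 25) -> List[Transition]:
    """Run all actors concurrently and merge their experience for a learner."""
    with ThreadPoolExecutor(max_workers=num_actors) as pool:
        results = pool.map(run_actor, range(num_actors), [num_steps] * num_actors)
        return [t for actor_transitions in results for t in actor_transitions]


replay_buffer = collect()
print(len(replay_buffer))  # prints 100
```

A learner would then sample from `replay_buffer` to update the policy; in a real deployment a framework such as Ray would typically replace the thread pool.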

17. Ray On Spark
Ray
• a distributed framework for emerging AI applications
• open-sourced by UC Berkeley RISELab
RayOnSpark
• a feature recently added to Analytics Zoo
• allows users to directly run Ray programs on Apache Hadoop*/YARN
• Ray applications can be seamlessly integrated into the Spark pipeline and operate directly on Spark RDDs or DataFrames
applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a

18. Building AI to Play FIFA
FIFA18*: a real-time 3D soccer simulation video game by Electronic Arts*
Our Experiment Platform (collaborations w/ SJTU)
• runs alongside FIFA game in a non-intrusive way
• provides abstraction of game environment (observations, actions, rewards, scores, semantics, etc.)
• Implemented agents: RL, IL, Hybrid (IL + RL)
Results on Shooting Bronze Scenario

    Player                        Score       Goal Ratio
    Human: master                 10112.78    92%
    Human: demonstrator           7284.98     84.96%
    Agent: IL                     10345.18    92.54%
    Agent: RL (Policy Gradient)   5606.31     40.25%
    Agent: Hybrid (RL+IL)         10514.43    95.59%

Future Work:
• Transfer between Google Research Football and FIFA?
• Train agents in massive scale w/ Ray & RayOnSpark
• Additional models/scenarios, etc.
game-using-distributed-tensorflow-on-analytics-zoo

19. Scalable AutoML for Time Series Analysis
AutoML Framework
Time Series Forecasting w/ AutoML
• Data processing and feature engineering
• Neural network based (hybrid) models
• Automated feature selection, model selection, hyperparameter tuning
using-ray-and-analytics-zoo-b79a6fd08139
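The automated model selection and hyperparameter tuning mentioned above can be pictured as a search loop over candidate configurations scored on held-out data. The following is a minimal sketch under strong simplifications: two trivial forecasters replace the neural models, an exhaustive grid replaces a scalable search, and the series is synthetic. None of the names reflect the Analytics Zoo AutoML API.

```python
# Toy model and hyperparameter search for forecasting; illustration only.
from itertools import product
from typing import List, Sequence, Tuple


def forecast(history: List[float], model: str, window: int) -> float:
    """Predict the next value from past values with one of two toy models."""
    if model == "naive_lag":
        return history[-window]                 # repeat the value `window` steps back
    if model == "moving_average":
        return sum(history[-window:]) / window  # mean of the last `window` values
    raise ValueError(f"unknown model: {model}")


def validation_mse(series: List[float], model: str, window: int, start: int) -> float:
    """One-step-ahead mean squared error over series[start:]."""
    errors = [
        (forecast(series[:t], model, window) - series[t]) ** 2
        for t in range(start, len(series))
    ]
    return sum(errors) / len(errors)


def search(
    series: List[float],
    models: Sequence[str] = ("naive_lag", "moving_average"),
    windows: Sequence[int] = range(1, 7),
) -> Tuple[Tuple[str, int], float]:
    """Score every (model, window) pair and return the best one with its error."""
    start = max(windows)  # every configuration is scored on the same time steps
    trials = [
        ((m, w), validation_mse(series, m, w, start))
        for m, w in product(models, windows)
    ]
    return min(trials, key=lambda trial: trial[1])


# Hypothetical seasonal series with period 3.
series = [1.0, 2.0, 3.0] * 6
best_config, best_mse = search(series)
print(best_config, best_mse)  # prints ('naive_lag', 3) 0.0
```

The search recovers the period of the series because a lag of 3 reproduces it exactly; a scalable framework would distribute the trials across a cluster and use a smarter search strategy than a full grid.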

20. Wrap Up
We’re extending the Spark stack to support emerging AI applications
• RayOnSpark
We’re building emerging AI applications
• Building AI to play FIFA
• Scalable AutoML for Time Series Analysis

21. More Information on Analytics Zoo
• Project website
• Tutorials
  • CVPR 2018
  • AAAI 2019
• “BigDL: A Distributed Deep Learning Framework for Big Data”, in Proceedings of the ACM Symposium on Cloud Computing 2019 (SoCC ’19)
• Use cases: Azure, CERN, MasterCard, Office Depot, Tencent, Midea, etc.


23. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit

Intel does not control or audit the design or implementation of third-party benchmark data or websites referenced in this document. Intel encourages all of its customers to visit the referenced websites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.

Optimization notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at

Intel, the Intel logo, Intel Inside, the Intel Inside logo, Intel Atom, Intel Core, Iris, Movidius, Myriad, Intel Nervana, OpenVINO, Intel Optane, Stratix, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.
© Intel Corporation