Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World

In this talk, we will share how we benefited from using Apache Spark to build Workday's new analytics product, as well as some of the challenges we faced along the way. Workday Prism Analytics was launched in September 2017 and went from zero to one hundred enterprise customers in under 15 months. Leveraging innovative technologies from the Platfora acquisition gave us a jump-start, but it still required a considerable engineering effort to integrate with the Workday ecosystem. We enhanced workflows, added new functionality, and transformed Hadoop-based on-premises engines to run in the Workday cloud. None of this would have been possible without Spark, to which we migrated most of the earlier MapReduce code. This enabled us to shorten time to market while adding advanced functionality with high performance and rock-solid reliability.

One of the key components of our product is Self-Service Data Prep. A powerful and intuitive UI empowers users to create ETL-like pipelines that blend Workday and external data, while providing immediate feedback by re-executing the pipelines on sampled data. Behind the scenes, we compile these pipelines into plans executed by Spark SQL, taking advantage of the years of work done by the open source community to improve the engine's query optimizer and physical execution. We will outline the high-level implementation of product features: mapping logical models and sub-systems, adding new data types on top of Spark, and using caches effectively and securely across multiple Spark clusters running under YARN while sharing HDFS resources. We will also describe several real-life war stories caused by customers stretching the product's boundaries in complexity and performance. We conclude with the unique Spark tuning guidelines distilled from our experience of running it in production, which ensure that the system can execute complex, nested pipelines with multiple self-joins and self-unions.

1. Lessons Learned Using Apache Spark for Self-Service Data Prep (and More) in SaaS World
Pavel Hardak (Product Manager, Workday)
Jianneng Li (Software Engineer, Workday)

2. Safe Harbor Statement
This presentation may contain forward-looking statements for which there are risks, uncertainties, and assumptions. If the risks materialize or assumptions prove incorrect, Workday's business results and directions could differ materially from results implied by the forward-looking statements. Forward-looking statements include any statements regarding strategies or plans for future operations; any statements concerning new features, enhancements or upgrades to our existing applications or plans for future applications; and any statements of belief. Further information on risks that could affect Workday's results is included in our filings with the Securities and Exchange Commission, which are available on the Workday investor relations webpage: www.workday.com/company/investor_relations.php Workday assumes no obligation for and does not intend to update any forward-looking statements. Any unreleased services, features, functionality or enhancements referenced in any Workday document, roadmap, blog, our website, press release or public statement that are not currently available are subject to change at Workday's discretion and may not be delivered as planned or at all. Customers who purchase Workday, Inc. services should make their purchase decisions upon services, features, and functions that are currently available.

3. Agenda
● Workday - Finance and HCM in the cloud
● Workday Platform - "Power of One"
● Prism Analytics - Powered by Apache Spark
● Production Stories & Lessons Learned
● Questions

4. (diagram: Plan → Execute → Analyze cycle spanning Financial Management, Human Capital Management, Planning, and Prism Analytics and Reporting)
● "Pure" SaaS apps suite
  ○ Finance and HCM
● Customers: 2,500+
  ○ 200+ of Fortune 500
● Revenue: $2.82B
  ○ Growth: 32% YoY

5. Workday Confidential

6. One Source for Data | One Security Model | One Experience | One Community
One Platform (diagram): Business Process Framework, Object Data Model, Reporting and Analytics, Security, Machine Learning, Integration Cloud

7. One Source for Data | One Security Model | One Experience | One Community
One Platform (diagram, highlighting the Object Data Model): Business Process Framework, Object Data Model, Reporting and Analytics, Security, Machine Learning, Integration Cloud
● Object Data Model: Durable, Extensible, Metadata

8. One Source for Data | One Security Model | One Experience | One Community
One Platform (diagram, highlighting Security): Business Process Framework, Object Data Model, Reporting and Analytics, Security, Machine Learning, Integration Cloud
● Security: Encryption, Privacy and Trust, Compliance

9. One Source for Data | One Security Model | One Experience | One Community
One Platform (diagram, highlighting Reporting and Analytics): Business Process Framework, Object Data Model, Reporting and Analytics, Security, Machine Learning, Integration Cloud
● Reporting and Analytics: Dashboards, Distribution, Collaboration

10. (diagram: Plan → Execute → Analyze cycle with Financial Management, Human Capital Management, Planning, and Prism Analytics and Reporting)

11. (diagram: the Plan → Execute → Analyze cycle, mapped to products)
● Plan: Workday Planning
● Execute: Workday Financial Management, Workday Human Capital Management
● Analyze: Workday Prism Analytics and Reporting
● Integrate: 3rd Party Data
● Prism Analytics capabilities: Data Management, Data Preparation, Data Discovery, Report Publishing

12. Workday Prism Analytics
The full spectrum of Finance and HCM insights, all within Workday.
Workday Data + Non-Workday Data

13. Prism Analytics Workflow (diagram: Acquisition → Preparation → Analysis)
● Acquisition - Ingest from: Finance, HCM, Operational, CRM, Service ticketing, Surveys, Point of Sale, Worksheets, Industry systems, Stock grants, Legacy systems, More…
● Preparation: Cleanse and Transform, Map, Blend Datasets, Apply Security Permissions, Publish Data Source
● Analysis: Data Discovery, Reporting

14. Spark in Prism Analytics (diagram: three Spark applications, Interactive Data Prep, Data Prep Publishing, and Query Engine, each pairing a Prism component with a Spark Driver and Spark Executors, all running under YARN over shared HDFS / S3)

15. Interactive Data Prep in Prism (screenshot, annotated with: number of samples, examples and statistics, transform stages)

16. Interactive Data Prep in Prism (screenshot)

17. Interactive Data Prep in Prism
Powered by Spark (screenshot: Edit Transform)

18. Data Prep Publishing in Prism
Also powered by Spark

19. Data Prep: Interactive vs. Publishing

            Interactive           Publishing
Data size   100 - 100K rows       Billions of rows
Sampling    Yes                   No
Caching     Yes                   No
Latency     Seconds               Minutes to hours
Result      Returned in memory    Written to disk
SLA         Best effort           Consistent performance

20. Data Prep: Interactive vs. Publishing
Same plan! (both modes compile to the same Spark SQL plan)

21. Prism Logical Model (diagram)

22. Prism Logical Model
• Superset of SQL operators
• Compiles to Spark plans through Spark SQL
• Implements custom Catalyst rules and strategies (sketched below)
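To make the last bullet concrete, here is a minimal sketch (our illustration, not Prism's actual code) of how Spark's experimental hooks let an application register an extra Catalyst optimizer rule; PrismStyleRewrite is a hypothetical placeholder name:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical optimizer rule: a real rule would pattern-match specific plan
// shapes and rewrite them; this placeholder walks the tree and leaves it unchanged.
object PrismStyleRewrite extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case node => node // replace with real rewrites, e.g. collapsing redundant projections
  }
}

val spark = SparkSession.builder().appName("catalyst-hooks").master("local[*]").getOrCreate()

// Register the rule so Catalyst runs it alongside the built-in optimizations.
spark.experimental.extraOptimizations = Seq(PrismStyleRewrite)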

23. Example: Interactive Data Prep Operators (one operator across the plan layers; a simplified sketch follows)
IngestSampler - Prism Logical Plan
LogicalIngestSampler - Spark Logical Plan
IngestSamplerExec - Spark Physical Plan
IngestSamplerRDD - RDD
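A compressed sketch of that lowering path, assuming simplified shapes for the operators (the real LogicalIngestSampler and IngestSamplerExec carry sampling logic that is omitted here):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
import org.apache.spark.sql.execution.{LeafExecNode, SparkPlan}

// Spark logical node wrapping Prism's IngestSampler operator (simplified).
case class LogicalIngestSampler(output: Seq[Attribute], sampleSize: Int) extends LeafNode

// Physical operator; a real implementation would produce an IngestSamplerRDD
// that reads and samples the ingested files instead of returning an empty RDD.
case class IngestSamplerExec(output: Seq[Attribute], sampleSize: Int) extends LeafExecNode {
  override protected def doExecute(): RDD[InternalRow] = sparkContext.emptyRDD[InternalRow]
}

// Planner strategy translating the logical node into the physical one.
object IngestSamplerStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case LogicalIngestSampler(out, n) => IngestSamplerExec(out, n) :: Nil
    case _ => Nil
  }
}

// Registered with: spark.experimental.extraStrategies = Seq(IngestSamplerStrategy)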

24. Prism Data Types (diagram)

25. Implementing Additional Data Types
• Prism has a richer type system than Catalyst
• Uses StructType and StructField to implement additional data types

26. Example: Prism Currency Type

import org.apache.spark.sql.types._

// A currency value is modeled as a struct pairing a fixed-precision amount
// with a currency code.
object CurrencyType extends StructType(
  Array(
    StructField("amount", DecimalType(26, 6)),
    StructField("code", StringType)))

>> { "amount": 1000.000000, "code": "USD" }
>> { "amount": -999.000000, "code": "YEN" }
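A small usage sketch (our own wiring, not Prism's API): since CurrencyType is just a StructType, standard Spark APIs can create and query a column of this type:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}

val spark = SparkSession.builder().appName("currency-demo").master("local[*]").getOrCreate()

// One-column schema using the custom type; Catalyst treats it as an ordinary struct.
val schema = StructType(Seq(StructField("salary", CurrencyType)))
val rows = Seq(Row(Row(new java.math.BigDecimal("1000.000000"), "USD")))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Struct fields are addressable with dot syntax.
df.select("salary.amount", "salary.code").show()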

27. Lessons Learned

28. Lesson #1: Nested SQL

29. Lesson #1: Nested SQL
• SQL requires computed columns to be nested
  – SELECT 1 as c1, c1 + 1 as c2; /* ✗ */
  – SELECT c1 + 1 as c2 FROM (SELECT 1 as c1); /* ✓ */
• First version: one nesting per computed column (see the sketch below)
  – Does not scale to 100s of columns
  – Takes a long time to compile and optimize
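A DataFrame analogue of the scaling problem (our illustration, not the Prism compiler itself): stacking one projection per computed column creates hundreds of nested Project nodes that Catalyst must analyze and optimize, while inlining the expressions into a single select keeps the plan shallow:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("nested-projections").master("local[*]").getOrCreate()
import spark.implicits._

val base = Seq(1).toDF("c0")

// One nesting per computed column: each withColumn wraps the plan in another
// Project, so 300 computed columns produce 300 stacked projections.
val nested = (1 to 300).foldLeft(base) { (df, i) =>
  df.withColumn(s"c$i", col(s"c${i - 1}") + 1)
}

// Flattened alternative: inline each column's expression (here c_i = c0 + i)
// into one select, producing a single Project node.
val flat = base.select(col("c0") +: (1 to 300).map(i => (col("c0") + i).as(s"c$i")): _*)

// nested.explain(true) shows the deep plan; flat.explain(true) shows one projection.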