工作日交易数据的闪电快速分析

在本文中,我们将分享在为事务性数据构建高度可伸缩的多租户分析服务时所吸取的教训。我们将从大局和业务需求开始。然后描述具有用于数据准备、发布和查询引擎的批处理和交互式模块的体系结构,并注意相关的Spark技术。然后我们将深入Prism查询引擎的内部,重点介绍所使用的Spark SQL、DataFrames和Catalyst编译器特性。我们将描述在编译和执行复杂管道和查询时遇到的问题,以及如何使用缓存、采样和查询编译技术来支持交互式用户体验。
展开查看详情

1.Lightning-Fast Analytics for Workday Transactional Data Pavel Hardak, Dir Product (Workday) Ned Borisov (Ph.D), Sr Eng Mgr (Workday) #ExpSAIS18

2.Agenda • Workday (Pavel H) – Introduction to Workday – Business challenges – Platform for Transactional Apps • Prism Analytics (Ned B) – High Level Architecture – Functional Modules – Problems encountered • Wrap-up (Pavel H) #ExpSAIS18 2

3.Workday • Pure SaaS company (founded in 2005) • Enterprise cloud apps – HCM and Finances – Named as “Leader” in Gartner Magic Quadrants • 2200+ customers, 175+ of Fortune 500 – Revenue: $2.1B, 36% YoY • 8600+ employees worldwide – #7 in FORTUNE "100 Best Companies to Work For” – Pleasanton (HQ), San Mateo, San Francisco – Boulder (CO), Dublin (Ireland), Victoria (BC), … #ExpSAIS18 3

4.Workday Confidential #ExpSAIS18

5.Continuous Innovation in Cloud #ExpSAIS18 5

6.#ExpSAIS18 6

7.Enterprise SaaS Challenges • Concurrency – From small to huge companies - every ‘worker’ is Workday user • Reliability – All users add and change data, generating many transactions • Security – Customers trust us with very confidential and private information • Scalability – Import several years from the previous system(s) and keep growing • Speed – Everybody wants fast response time J #ExpSAIS18 7

8.One Source for Data | One Security Model | One Experience | One Community One Platform Business Process Object Reporting and Framework Data Model Analytics Security Machine Integration Learning Cloud #ExpSAIS18

9. One Source for Data | One Security Model | One Experience | One Community One Platform Object Data Model Object Data Model Durable Extensible Metadata #ExpSAIS18

10. One Source for Data | One Security Model | One Experience | One Community One Platform Reporting and Analytics Reporting and Analytics Dashboards Distribution Collaboration

11.But we want more… • Import 3rd party data from external sources – Unknown schema, need validations and cleansing • Blend external data with Workday data – Self Service Data Preparation – Publish custom report sources – Leverage the same security paradigms • Data Discovery and Reporting – Visualize, slice and dice by any dimension – Perform faster than ever before #ExpSAIS18 11

12.#ExpSAIS18 12

13.Just add some … • Water (?) • Coffee (?) • Energy drink (?) • Apache Spark (!) #ExpSAIS18 13

14.Why Apache Spark • Wanted to standardize on ONE data processing technology which keeps evolving • Needed extensibility to handle diverse use cases • Scalability for on-disk views and in-memory processing • SQL processing is a HUGE plus #ExpSAIS18

15.High Level Prism Architecture Report Queries Web UI Requests Prism Server Data Prep: Interactive Transforms Samples Workday Data HDFS External Data #ExpSAIS18

16.Data Preparation • A dataset may import other datasets to transform them (think SQL View) • Transforms include: Filter, Join, Union, Group By, etc. • Example data are shown to help verify the transformation #ExpSAIS18

17.High Level Prism Architecture Report Queries Web UI Requests Prism Server Data Prep: Interactive Transforms Lens Build: Batch Transforms Samples Data Workday Data HDFS External Data #ExpSAIS18

18. Lens Build • Materializing all transforms • Columnar format with further split into small Spark blocks Lens Jobs #ExpSAIS18

19.High Level Prism Architecture Report Queries Web UI Requests Prism Server Query Engine: Interactive BI Queries Data Prep: Lens Interactive Transforms Lens Build: Batch Transforms Samples Data Workday Data HDFS External Data #ExpSAIS18

20.Query Engine • Analyst-driven Analysis • Drag & drop chart creation • Analyst defined computed fields • Quick measurement aggregates • Execution • Query Engine executes the queries • Interactive response is required #ExpSAIS18

21.High Level Prism Architecture Report Queries Web UI Requests Prism Server Query Engine: Interactive BI Queries Data Prep: Lens Interactive Transforms Lens Build: Batch Transforms Samples Data Workday Data HDFS External Data #ExpSAIS18

22. Spark in Prism Architecture Prism Analytics launches and maintains lifecycle of three types of Spark Applications • Data Prep: a single (smaller) always-on Spark Application – executes dataset transformations over small samples of data • Lens Build: on-demand batch Application – one per Lens Build process – executes dataset transformations over full datasets • Query Engine: a single (larger) always-on Application – executes reporting queries over Lens data – caches columns of Lenses in memory #ExpSAIS18

23.Query Engine & Spark Prism Server Spark Spark Spark Prism Executor Spark Executor Spark Executor Spark Spark Spark Executor Query Engine Spark Executor Spark Server Driver Executor Spark Executor Spark Executor Spark Executor Spark Data Prep Executor Executor ... Executor #ExpSAIS18

24.Notable Observations • Memory Allocation Strategy • Row Level Security #ExpSAIS18

25.Memory Allocation Strategy • Executors 30% 60% 10% Column Data Execution Buffer Cache Executor JVM à 20% faster queries • Driver 20% 60% 20% Accumulators Streaming Buffer Driver JVM #ExpSAIS18

26.Row-Level Security • Implemented as a dimension predicate. For example: SELECT employee, SUM(quantity) FROM Employee_Stock_Grants WHERE supervisory_org IN (org1, org33, org_508) GROUP BY employee; • In-List for supervisory_org could be very large • More than one In-List • Complex list values (e.g. nested conjunctions) #ExpSAIS18

27.Scenario Details • Customer Use Case – Predicates with 10+ In-Lists – Values between 6K and 12K – Additional mix of conjunctions and disjunctions • The Same Query With Security = 100X Without Security #ExpSAIS18

28.Analysis • Finding 1 – Parsing, planning and optimizing was taking ~27 seconds – We did it 4 times • Finding 2 – Major cause is the number of times the Catalyst expressions (In and InSet) and their arguments were being traversed and copied during plan analysis and optimization. – Minor cause is the amount of time spent in serializing Scala’s TrieSet when shipping the plan to executors #ExpSAIS18

29.Solution • Custom InSet-Like expressions (case classes) – Hide the large literals sets through a curried-argument – Resulted in queries going from 27 sec to 4 sec. • Further Optimizations – Our InSet-Like expression did not materialize the target in-sets until after the plan was de-serialized on the executors – Resulted in improvement from 4 sec to 2 sec. #ExpSAIS18