Presents how Adobe uses Apache Spark

1. Spark @ Adobe Cloud Platform • Yogesh Natarajan | Sr. Software Engineer

2. Adobe • Adobe Creative Cloud • Adobe Document Cloud • Adobe Experience Cloud • ADOBE CLOUD PLATFORM © 2018 Adobe Systems Incorporated. All Rights Reserved.

3. SOLUTIONS • Advertising Cloud • Marketing Cloud • Analytics Cloud • Display • Search • Video • Campaign • Experience Manager • Target • Analytics • Audience Manager • PLATFORM • Core Services • People • Places • Assets • Mobile • Activation • Sensei AI Framework and Tools • Exchange & Adobe I/O • Experience Data Models • Content • Data • Sync • Search • Collaboration • Ingestion • Profile • Governance

4. Adobe Cloud Platform • Multi-PB Data Lake • > 150 TB data ingested per day • > 100K Spark jobs / day
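
To put the daily figures on slide 4 into per-second terms, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and an even load across the day, neither of which the slide states. Because the slide gives lower bounds, the results are lower bounds on the average rate too.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Lower bounds quoted on the slide; decimal TB is an assumption.
ingest_bytes_per_day = 150 * 10**12
jobs_per_day = 100_000

# Average rates, assuming load is spread evenly over the day.
ingest_gb_per_s = ingest_bytes_per_day / SECONDS_PER_DAY / 10**9
jobs_per_s = jobs_per_day / SECONDS_PER_DAY

print(f"{ingest_gb_per_s:.2f} GB/s, {jobs_per_s:.2f} jobs/s")
```

Real traffic is unlikely to be uniform, so peak rates would be higher than these averages.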

5. Spark Demands • Support for different workloads • Batch • Streaming • Interactive • Existing frameworks should be supported • Mesos • YARN • Databricks • Bring your own compute • Single tenant clusters • Developer tools

7. Spark Workloads • Batch • Streaming • Interactive

9. Batch • Data Ingestion • Data Validation • Data Cleansing • Data re-partitioning and landing for optimal access • Statistics • Identity and Profile • Ingestion of Profile and Event XDM • Segmentation and export • Metrics • Data generation for simulations

10. Batch • Intelligent Services • Ingest of Event XDM • Conversion Rates • Statistics

12. Streaming • Data Ingestion • Data Validation • Data Cleansing • Data Landing • Statistics • Identity and Profile • Ingest real-time XDM events • Intelligent Services • Ingest real-time XDM events

14. Interactive • Identity and Profile • Preview Service – Sampling and Query • Intelligent Services • Notebooks for analysts and data scientists

15. Compute Platform • Compute gateway • RESTful APIs • Supports multiple execution profiles • Compute UI • Spark on Mesos • Support for containers • Long-running apps through Marathon • Multi-tenant • Spark on Databricks • Upcoming support for Spark on YARN and K8s

16. Experience Query Service • Andrew Chen | Sr. Software Engineer

17. Agenda • Use case and requirements • System architecture • Spark performance optimizations

18. Use Case and Requirements • Provide both interactive and batch SQL-based query interfaces for customers to facilitate BI queries on their customer data • Support stateful SQL sessions in a multi-tenant architecture • Persistent connections allow creating temporary views and user-defined functions

19. Interactive Architecture

20. PostgreSQL Protocol • Abstraction layer allows for customizations • Allows flexibility of underlying query engine • Existing client support • Well documented
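
As a rough illustration of what a server speaking the PostgreSQL wire protocol has to parse, the sketch below frames and unframes a simple Query message. In the documented protocol, every regular message after startup is a one-byte type tag, a big-endian Int32 length that counts itself but not the tag, and then the payload. The function names `encode_query` and `decode_message` are made up for this example; the talk's actual implementation uses Akka Streams and is not shown in the slides.

```python
import struct

def encode_query(sql: str) -> bytes:
    """Frame a simple Query ('Q') message: type byte, Int32 length, C string."""
    payload = sql.encode("utf-8") + b"\x00"  # query text is null-terminated
    # The length field counts its own 4 bytes plus the payload, not the type byte.
    return b"Q" + struct.pack("!i", 4 + len(payload)) + payload

def decode_message(buf: bytes):
    """Split one framed message off the front of buf.

    Returns (type, payload, rest), or None if buf does not yet hold a
    complete message (the caller should wait for more bytes).
    """
    if len(buf) < 5:
        return None
    msg_type = buf[0:1]
    (length,) = struct.unpack("!i", buf[1:5])
    end = 1 + length
    if len(buf) < end:
        return None
    return msg_type, buf[5:end], buf[end:]

frame = encode_query("SELECT 1")
msg_type, payload, rest = decode_message(frame)
print(msg_type, payload.rstrip(b"\x00").decode("utf-8"))
```

Returning `None` on a short buffer reflects that TCP delivers a byte stream, so a message may arrive split across reads.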

21. Presto SQL Parser • Open source • Uses ANTLR4 • Supports PostgreSQL out of the box • Supports plugins / customization

22. Akka Framework • Akka Streams for TCP-level server interaction and implementation of PostgreSQL protocol • Asynchronous reactive architecture fits well with a streaming, message-based protocol • Scala-based like Spark

23. Challenges • Query translation • Requires implementing PostgreSQL system catalogs (schema metadata) • Used by connected clients to query schema metadata • 50+ metadata tables stored as Spark temp views
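
The idea of exposing catalog metadata as temporary views can be sketched with SQLite standing in for Spark SQL. This is only an analogy, not the service's implementation: the `datasets` table and its contents are hypothetical. `pg_tables`, with its `schemaname` and `tablename` columns, is a real PostgreSQL system view that clients commonly query to list tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical internal registry of datasets known to the query service.
conn.execute("CREATE TABLE datasets (schema_name TEXT, table_name TEXT)")
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?)",
    [("public", "events"), ("public", "profiles")],
)

# Expose the registry under the name and column names a PostgreSQL client expects.
conn.execute(
    """
    CREATE TEMP VIEW pg_tables AS
    SELECT schema_name AS schemaname, table_name AS tablename
    FROM datasets
    """
)

# A metadata query of the kind a BI tool might issue when it connects.
rows = conn.execute(
    "SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY tablename"
).fetchall()
print(rows)
```

A real client sends many such queries against dozens of catalog tables, which is consistent with the slide's count of 50+ metadata views.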

24. Performance optimizations • Performance requirements • Up to 8-10 billion events/month per customer • Typical event schema can have thousands of columns with nested structures • Spark SQL 2.3 supports projection pruning on Parquet but not for nested columns

25. Nested Column Pruning • SPARK-4502 PR #16578 for Spark 2.3 (user mallman) • Applied bugfixes for window functions and case sensitivity • PR #21320 merged to Spark 2.4
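
To show what nested column pruning buys, here is a toy model in plain Python. It is not Spark's implementation. A schema is modeled as nested dicts, and `prune_schema` (a name invented for this sketch) keeps only the fields a query actually references, so a reader would need to fetch only those leaf columns instead of the whole top-level struct. The example schema is hypothetical. In Spark 2.4 the feature sits behind the `spark.sql.optimizer.nestedSchemaPruning.enabled` setting.

```python
def prune_schema(schema: dict, paths: list) -> dict:
    """Keep only the fields named by dotted paths such as 'web.page.url'."""
    # Group the requested paths by their first segment.
    by_head = {}
    for path in paths:
        head, _, tail = path.partition(".")
        by_head.setdefault(head, []).append(tail)

    pruned = {}
    for name, field in schema.items():  # preserves the original field order
        if name not in by_head:
            continue  # field is never referenced: drop it
        tails = by_head[name]
        if isinstance(field, dict) and all(tails):
            # Every request reaches below this struct: keep only those children.
            pruned[name] = prune_schema(field, tails)
        else:
            # A leaf, or the whole struct was requested: keep it unchanged.
            pruned[name] = field
    return pruned

# Hypothetical event schema; real ones can have thousands of columns.
schema = {
    "id": "string",
    "web": {
        "page": {"url": "string", "title": "string"},
        "referrer": "string",
    },
    "commerce": {"order": {"total": "double"}},
}

print(prune_schema(schema, ["id", "web.page.url"]))
```

Without nested pruning, selecting `web.page.url` would force reading every field under `web`. With it, only the `url` leaf is needed.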

26. Demo

27. We are hiring!!! https://www.adobe.com/careers.html

28. Questions?