Parallelizing with Apache Spark in Unexpected Ways

Out of the box, Spark provides rich and extensive APIs for performing in memory, large-scale computation across data. Once a system has been built and tuned with Spark Datasets/Dataframes/RDDs, have you ever been left wondering if you could push the limits of Spark even further? In this session, we will cover some of the tips learned while building retail-scale systems at Target to maximize the parallelization that you can achieve from Spark in ways that may not be obvious from current documentation. Specifically, we will cover multithreading the Spark driver with Scala Futures to enable parallel job submission. We will talk about developing custom partitioners to leverage the ability to apply operations across understood chunks of data and what tradeoffs that entails. We will also dive into strategies for parallelizing scripts with Spark that might have nothing to with Spark to support environments where peers work in multiple languages or perhaps a different language/library is just the best thing to get the job done. Come learn how to squeeze every last drop out of your Spark job with strategies for parallelization that go off the beaten path.

1.WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

2.Parallelizing With Apache Spark In Unexpected Ways Anna Holschuh, Target #UnifiedAnalytics #SparkAISummit

3.What This Talk is About • Tips for parallelizing with Spark • Lots of (Scala) code examples • Focus on Scala programming constructs #UnifiedAnalytics #SparkAISummit 3

4.Who am I • Lead Data Engineer/Scientist at Target since 2016 • Deep love of all things Target • Other Spark Summit talks: o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler o 2019: Lessons In Linear Algebra At Scale With Apache Spark : Let’s Make The Sparse Details A Bit More Dense #UnifiedAnalytics #SparkAISummit 4

5.Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data #UnifiedAnalytics #SparkAISummit 5

6.Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data #UnifiedAnalytics #SparkAISummit 6

7. Introduction > Hello, Spark!_ Driver Application Partition Executor Job Action Dataset Dataframe Stage Transformation RDD Task Shuffle #UnifiedAnalytics #SparkAISummit 7

8.Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data #UnifiedAnalytics #SparkAISummit 8

9.Parallel Job Submission and Schedulers Let’s do some data exploration • We have a system of Authors, Articles, and Comments on those Articles • We would like to do some simple data exploration as part of a batch job • We execute this code in a built jar through spark-submit on a cluster with 100 executors, 5 executor cores, 10gb/driver, and 10gb/executor. • What happens in Spark when we kick off the exploration? #UnifiedAnalytics #SparkAISummit 9

10.Parallel Job Submission and Schedulers The Execution Starts • One job is kicked off at a time. • We asked a few independent questions in our exploration. Why can’t they be running at the same time? #UnifiedAnalytics #SparkAISummit 10

11.Parallel Job Submission and Schedulers • All of our questions run as separate The Execution Completes jobs. • Examining the timing demonstrates that these jobs run serially. #UnifiedAnalytics #SparkAISummit 11

12.Parallel Job Submission and Schedulers One more sanity check • All of our questions, running serially. #UnifiedAnalytics #SparkAISummit 12

13.Parallel Job Submission and Schedulers Can we potentially speed up our exploration? • Spark turns our questions into 3 Jobs • The Jobs run serially • We notice that some of our questions are independent. Can they be run at the same time? • The answer is yes. We can leverage Scala Concurrency features and the Spark Scheduler to achieve this… #UnifiedAnalytics #SparkAISummit 13

14. Parallel Job Submission and Schedulers Scala Futures • A placeholder for a value that may not exist. • Asynchronous • Requires an ExecutionContext • Use Await to block • Extremely flexible syntax. Supports for- comprehension chaining to manage dependencies. #UnifiedAnalytics #SparkAISummit 14

15. Parallel Job Submission and Schedulers Let’s rework our original code using Scala Futures to parallelize Job Submission • We pull in a reference to an implicit ExecutionContext • We wrap each of our questions in a Future block to be run asynchronously • We block on our asynchronous questions all being completed • (Not seen) We properly shut down the ExecutorService when the job is complete #UnifiedAnalytics #SparkAISummit 15

16.Parallel Job Submission and Schedulers Our questions are now • All of our questions run as separate asked concurrently jobs. • Examining the timing demonstrates that these jobs are now running concurrently. #UnifiedAnalytics #SparkAISummit 16

17.Parallel Job Submission and Schedulers One more sanity check • All of our questions, running concurrently. #UnifiedAnalytics #SparkAISummit 17

18.Parallel Job Submission and Schedulers A note about Spark Schedulers • The default scheduler is FIFO • Starting in Spark 0.8, Fair sharing became available, aka the Fair Scheduler • Fair Scheduling makes resources available to all queued Jobs • Turn on Fair Scheduling through SparkSession config and supporting allocation pool config • Threads that submit Spark Jobs should specify what scheduler pool to use if it’s not the default Reference: #UnifiedAnalytics #SparkAISummit 18

19.Parallel Job Submission and Schedulers The Fair Scheduler is enabled #UnifiedAnalytics #SparkAISummit 19

20.Parallel Job Submission and Schedulers Creating a DAG of Futures on the Driver • Scala Futures syntax enables for- comprehensions to represent dependencies in asynchronous operations • Spark code can be structured with Futures to represent a DAG of work on the Driver • When reworking all code into futures, there will be some redundancy with Spark’s role in planning and optimizing, and Spark handles all of this without issue #UnifiedAnalytics #SparkAISummit 20

21.Parallel Job Submission and Schedulers Takeaways Why use this strategy? • Actions trigger Spark to do things (i.e. create • To maximize resource utilization in your Jobs) cluster • Spark can certainly handle running multiple • To maximize the concurrency potential of your Jobs at once, you just have to tell it to job (and thus speed/efficiency) • This can be accomplished by multithreading • Fair Scheduling pools can support different the driver. In Scala, this can be accomplished notions of priority of work in jobs using Futures. • Fair Scheduling pools can support multi-user • The way tasks are executed when multiple environments to enable more even resource jobs are running at once can be further allocation in a shared cluster configured through either Spark’s FIFO or Fair Scheduler with configured supporting pools. #UnifiedAnalytics #SparkAISummit 21

22.Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data #UnifiedAnalytics #SparkAISummit 22

23.Partitioning Strategies A first experience with partitioning #UnifiedAnalytics #SparkAISummit 23

24.Partitioning Strategies Getting started with partitioning • .repartition() vs .coalesce() • Custom partitioning is supported with the RDD API only (specifically through implicitly added PairRDDFunctions) • Spark supports the HashPartitioner and RangePartitioner out of the box • One can create custom partitioners by extending Partitioner to enable custom strategies in grouping data #UnifiedAnalytics #SparkAISummit 24

25.Partitioning Strategies How can non-standard partitioning be useful? #1 : Collocating data for joins • We are joining datasets of Articles and Authors together by the Author’s id. • When we pull the raw Article dataset, author ids are likely to be distributed somewhat randomly throughout partitions. • Joins can be considered wide transformations depending on underlying data and could result in full shuffles. • We can cut down on the impact of the shuffle stage by collocating data by the id to join on within partitions so there is less cross chatter during this phase. #UnifiedAnalytics #SparkAISummit 25

26.Partitioning Strategies #1: Collocating data for joins Articles Authors Articles Authors {author_id: 1} {author_id: 1} {author_id: 2} {author_id: 1} {id: 1} {id: 1} {author_id: 3} {author_id: 1} {author_id: 5} {author_id: 1} {author_id: 1} {author_id: 2} {author_id: 2} {author_id: 2} {author_id: 3} {id: 2} {author_id: 2} {id: 2} {author_id: 4} {author_id: 2} {author_id: 1} {author_id: 3} {author_id: 2} {author_id: 3} {author_id: 3} {id: 3} {id: 3} {author_id: 3} {author_id: 4} {author_id: 4} {author_id: 1} {author_id: 4} {author_id: 2} {id: 4} {id: 4} {author_id: 4} {author_id: 4} {id: 5} {id: 5} {author_id: 5} {author_id: 5} {author_id: 5} #UnifiedAnalytics #SparkAISummit 26

27.Partitioning Strategies How can non-standard partitioning be useful? #2 : Grouping data to operate on partitions as a whole • We need to calculate an Author Summary report that needs to have access to all Articles for an Author to generate meaningful overall metrics • We could leverage .map and .reduceByKey to combine Articles for analysis in a pairwise fashion or by gathering groups for processing • Operating on a whole partition grouped by an Author also accomplishes this goal #UnifiedAnalytics #SparkAISummit 27

28.Partitioning Strategies Implementing a Custom Partitioner #UnifiedAnalytics #SparkAISummit 28

29.Partitioning Strategies Takeaways • Partitioning can help even out data skew for more reliable and performant processing. • The RDD API supports more fine-grained partitioning with Hash and Range Partitioners. • One can implement a custom partitioner to have even more control over how data is grouped, which creates opportunity for more performant joins and operations on partitions as a whole. • There is expense involved in repartitioning that has to be balanced against the cost of an operation on less organized data. #UnifiedAnalytics #SparkAISummit 29