Internals of Speeding up PySpark with Arrow

Back in the old days of Apache Spark, using Python with Spark was an exercise in patience. Data was moving up and down from Python to Scala, being serialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, likewise did the constant improvement of the optimisers (Catalyst and Tungsten). But, after Spark 2.3, PySpark has sped up tremendously thanks to the addition of the Arrow serialisers. In this talk you will learn how the Spark Scala core communicates with the Python processes, how data is exchanged across both sub-systems and the development efforts present and underway to make it as fast as possible.

展开查看详情

1.

2.

3.

4. WHOAMI > Ruben Berenguel (@berenguel) > PhD in Mathematics > (big) data consultant > Lead data engineer using Python, Go and Scala > Right now at Affectv #UnifiedDataAnalytics #SparkAISummit

5. What is Pandas? #UnifiedDataAnalytics #SparkAISummit

6. What is Pandas? > Python Data Analysis library #UnifiedDataAnalytics #SparkAISummit

7. What is Pandas? > Python Data Analysis library > Used everywhere data and Python appear in job offers #UnifiedDataAnalytics #SparkAISummit

8. What is Pandas? > Python Data Analysis library > Used everywhere data and Python appear in job offers > Efficient (is columnar and has a C and Cython backend) #UnifiedDataAnalytics #SparkAISummit

9.#UnifiedDataAnalytics #SparkAISummit

10.#UnifiedDataAnalytics #SparkAISummit

11.#UnifiedDataAnalytics #SparkAISummit

12.#UnifiedDataAnalytics #SparkAISummit

13.#UnifiedDataAnalytics #SparkAISummit

14.#UnifiedDataAnalytics #SparkAISummit

15.#UnifiedDataAnalytics #SparkAISummit

16.#UnifiedDataAnalytics #SparkAISummit

17.#UnifiedDataAnalytics #SparkAISummit

18.#UnifiedDataAnalytics #SparkAISummit

19. HOW DOES PANDAS MANAGE COLUMNAR DATA? #UnifiedDataAnalytics #SparkAISummit

20.#UnifiedDataAnalytics #SparkAISummit

21.#UnifiedDataAnalytics #SparkAISummit

22.

23.

24.

25.

26.

27. WHAT IS ARROW? #UnifiedDataAnalytics #SparkAISummit

28. WHAT IS ARROW? > Cross-language in-memory columnar format library #UnifiedDataAnalytics #SparkAISummit

29. WHAT IS ARROW? > Cross-language in-memory columnar format library > Optimised for efficiency across languages #UnifiedDataAnalytics #SparkAISummit