Apache Spark Side of Funnels

Last year we decided to build an in-house solution for Funnel analysis which should be accessible to our business user through our BI tool. Backend part should run on Apache Spark and since the BI tool can only run SQL queries that implies that the solution is a pure Spark SQL implementation of Funnel analysis. In this talk we will cover various Spark SQL features we have used to optimize query performance and implement various filters which enable end users to get actionable insights. KEY TAKEAWAYS: – single query approach to Funnel analysis (can be applied to any funnel-like problem) – using window functions to ensure ordering of the events in the funnel – examples of higher order functions to calculate funnel metrics

展开查看详情

1.WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics

2.Spark Side of the Funnels Stipanicev Zoran, GetYourGuide #UnifiedDataAnalytics #SparkAISummit

3.About me Software engineer for the past 13 years and started working with data 12 years ago with Oracle and moved to reporting and BI over the years. Last 4 years enabling business users at GetYourGuide to make better decisions with data. Senior BI Engineer, Data Platform 3

4.Agenda 1. Intro to GetYourGuide 2. Introduction to Funnels 3. Deep Dive 4. Further Possibilities 5. End Result 4

5.Intro to GetYourGuide

6.We make it simple to book and enjoy incredible experiences

7.Europe’s largest marketplace for travel experiences 50k+ 25M+ Products in 150+ Tickets sold countries 150+ 600+ $650M+ Traveler nationalities Strong global team In VC funding

8.Introduction to Funnels #UnifiedDataAnalytics #SparkAISummit 8

9.Requirements 1. Looker as frontend and Spark as backend 2. Respect the order of events 3. Each step can consist of multiple events 4. Anything can happen between two steps of the Funnel 5. Support for Funnel wide and step specific filters 6. Sessions based on Touch-points (for some use cases) 7. Performance: 4 weeks of data in under 60 sec ideally under 30 sec 8. Option to ignore the order of events :) #UnifiedDataAnalytics #SparkAISummit 9

10.Internal vs External solution 1. Performance ○ External is faster because it’s custom built bottom up 2. Where is the data? ○ With internal solution we are not sending data to 3rd parties 3. Flexibility ○ With internal solution we can join it to all of our internal data to bring more insights to our stakeholders ○ Adding it to internal dashboards 4. Cost ○ Differs for each company :)

11.Touchpoints explained TOUCH POINTS G f Direct Booking Events Events Events Funnel Session 1 Funnel Session 2 Funnel Session 3

12.Funnel filtering explained If you define that you want see visitors in the funnel A - B - C - D - E How can the actual funnel look like? 1) xyzDBz A CyxD B ABAB C BC DE 2) xyzlmnop A CCCC How many steps do we match in this cases 1) A B C D E 2) A B A B E 3) A B C E D

13.Deep Dive

14.How do we build a Funnel 1) We filter for only selected Events 2) Concatenate all the events in a single session into a string a) We use an alias for each step (A,B,C…) i) It streamlines the rest of the query ii) And nothing changed when we added step specific filters 3) We compare the generated string with the Funnel specified by filters in our BI tool

15.Pseudo SQL of the implementation SELECT CASE … WHEN FROM ( SELECT concat_ws('', collect_list( CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’ WHEN event = ‘ProductPage’ AND product_id = ‘123’ THEN ‘B’ ... END)OVER(PARTITION BY session ORDER BY timestamp ROWS …) AS funnel // funnel => “ABABBBCDE...” FROM event_log WHERE ((event IN (‘LandingPage’,‘HomePage’)) OR (event = ‘ProductPage’ AND product_id = ‘123’) …) AND date = ... )t WHERE t.funnel RLIKE '...'

16.Pseudo SQL - Innere Where SELECT CASE … WHEN FROM ( SELECT concat_ws('', collect_list( CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’ WHEN event = ‘ProductPage’ AND product_id = ‘123’ THEN ‘B’ ... END)OVER(PARTITION BY session ORDER BY timestamp ROWS …) AS funnel FROM event_log WHERE 1=1 AND ((event IN (‘LandingPage’,‘HomePage’)) OR (event = ‘ProductPage’ AND product_id = ‘123’) …) AND date BETWEEN ... )t WHERE t.funnel RLIKE '...'

17. Explain plan before the fix + - Filter ( ((event = Landing Page) || (event = Home Page)) || ((event = ProductPage) && (product_id = … +- FileScan parquet … Location: PrunedInMemoryFileIndex[dbfs:…/event=AboutView ... PartitionCount: 1502, PartitionFilters: [isnotnull(date), (date >= 18135), (date < 18142)], … What do we see in the explain plan? ● Event partition listed is not for any of the filtered events ● Partition count is really high (date range is 7 days) ● Date partition pruning is applied

18.Pseudo SQL - Innere Where Fixed SELECT CASE … WHEN FROM ( SELECT concat_ws('', collect_list( CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’ …. FROM event_log WHERE 1=1 AND date BETWEEN ... AND ( (event IN (‘LandingPage’,‘HomePage’)) OR (event = ‘ProductPage’) OR (event …) ) AND ( (event IN (‘LandingPage’,‘HomePage’)) OR (event = ‘ProductPage’ AND product_id = ‘123’) …) )t WHERE t.funnel RLIKE '...'

19. Explain plan after the fix + - Filter ( ((event = LandingPage) || (event = HomePage)) || ((event = ProductPage) && (product_id = … +- FileScan parquet … Location: PrunedInMemoryFileIndex[dbfs:…/event=LandingPage ... PartitionCount: 21, PartitionFilters: [isnotnull(date), (date >= 18135), (date < 18142)], (event_name = Landing… What do we see in the explain plan? ● Event name partitions are pruned ● Partition count is a lot lower (7 days x 3 events) ● Date partition pruning is still applied

20.Sample of rows from inner to outer query Event Alias Funnel Visitor ID A BBABABABCBC A1B1 B BBABABABCBC A1B1 C BBABABABCBC A1B1 A BAAA C9D9 B BAAA C9D9

21.Pseudo SQL - Outer Where SELECT CASE … WHEN FROM ( SELECT concat_ws('', collect_list( CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’ …. FROM event_log WHERE 1=1 AND date BETWEEN ... )t WHERE t.funnel RLIKE 'A.*B.*C.*D.*E'

22.Pseudo SQL - Outer Where SELECT CASE … WHEN FROM ( SELECT concat_ws('', collect_list( CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’ …. FROM event_log WHERE 1=1 AND date BETWEEN ... )t WHERE locate('A', funnel) > 0

23.Pseudo SQL - Outer Select SELECT CASE WHEN alias = ‘A’ THEN 1 WHEN alias = ‘B’ AND funnel RLIKE ‘A.*B’ THEN 2 WHEN alias = ‘C’ AND funnel RLIKE ‘A.*B.*C’ THEN 3 ELSE -1 END AS step FROM ( SELECT concat_ws('', collect_list( CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’ …. FROM event_log WHERE 1=1 AND date BETWEEN ... )t WHERE locate('A', funnel) > 0

24.Pseudo SQL - Outer Select SELECT CASE WHEN alias = ‘A’ THEN 1 WHEN alias = ‘B’ AND locate('B', funnel, locate('A', funnel)) > 0 THEN 2 WHEN alias = ‘C’ AND locate('C', funnel, locate('B', funnel, locate('A', funnel))) > 0 THEN 3 ELSE -1 END AS step FROM ( SELECT concat_ws('', collect_list( CASE WHEN event IN (‘LandingPage’,‘HomePage’) THEN ‘A’ …. FROM event_log WHERE 1=1 AND date BETWEEN ... )t WHERE locate('A', funnel) > 0

25.Sample of rows from outer to BI tool query Event Alias Funnel Visitor ID Step A BBABABABCBC A1B1 1 B BBABABABCBC A1B1 2 C BBABABABCBC A1B1 3 A BAAA C9D9 1 B BAAA C9D9 -1

26.Further Possibilities

27.Slicing the funnel ProductPg ProductPg Checkout LandingPg AddToCart LandingPg ID = 1 ID = 2 Funnel ProductPg Checkout LandingPg AddToCart ID = 3 Funnel

28.Slicing the funnel ● First we get all the values that satisfy the filters ● Then we collect them into an array with a window function to apply them to every step of the funnel ● Last step is to explode using LATERAL VIEW to support multiple dimensions ● And now we can expose it as a dimension to users

29.Slicing – Inner select SELECT CASE … WHEN FROM ( SELECT … collect_set(CASE WHEN product_id IN (1,2,3) THEN product_id END) OVER(PARTITION BY session ORDER BY timestamp ROWS …) AS product_id_array // Distinct values … FROM event_log WHERE ((event IN (‘LandingPage’,‘HomePage’)) OR (event = ‘ProductPage’ AND product_id = ‘123’) …) AND date = ... )t WHERE t.funnel RLIKE '...'