Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph

Spark’s graph capabilities are well suited to network analysis for use cases such as fraud detection, illicit network detection, and supply chain risk analysis. However, before a data scientist can run analytics on a network (e.g., PageRank, community detection), they often spend most of their time fighting data integration challenges. This talk focuses on one of those challenges: connecting entities in a network within and across data domains. We will explore how you can leverage the Spark ecosystem’s graph capabilities to perform massive-scale entity resolution (ER). As a result, your data scientists will be able to perform graph analytics that drive business and mission value more quickly and effectively. Key takeaways: 1) The Spark ecosystem enables you to quickly get started with graph analytics use cases at scale. 2) Complementing traditional ER techniques with the context of graph relationships allows you to connect entities that you could not easily connect before.

2.Massive-Scale Entity Resolution Using Spark + Graph Max Melnick, Deloitte Consulting LLP #UnifiedAnalytics #SparkAISummit

3.About Me • Passion for building tech products • Engineering Lead / Architect / Developer • Spark Certified Developer • Based in Washington, DC • UVA Systems Engineering • Love sports, travel, cooking/eating, and listening to podcasts • maxmelnick.com • maxmelnick@gmail.com • linkedin.com/in/maxmelnick

4. MissionGraph™ by MissionGraph™ is an open architecture, data integration, enhancement, and exploration platform that powers massive-scale analysis.

5.Agenda • Entity Resolution (ER) Overview • Spark + Graph ER Solution Walkthrough – Technical Architecture – Example Patterns • Graph gotchas and tips

6.ER enables analytics

7.ER Use-Cases • Customer 360 • Fraud Detection • Network Analysis • Recommendation Engines

8.Logical ER Flow

9.Simple ER Example

10.Simple ER Example (cont.)

11.Simple ER Example (cont.)

12.Simple ER Example (cont.)

13.ER is hard • Difficult to scale algorithms vertically (more of the same data) or horizontally (new types of data) • Prohibitively expensive to compare each record with every other record • Heterogeneous datasets • Data lacks strong keys • Difficult to manage changes over time • Similarity varies significantly across types of entities, languages, etc. • Data quality issues
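To make the "prohibitively expensive" bullet concrete, here is a minimal sketch in plain Python (not Spark) comparing the number of all-pairs comparisons with the number left after grouping records by a shared key. The record layout, the `phone` field, and the function names are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of comparisons if every record is compared with every other."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Only compare records that share the same value for `key`."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample records.
records = [
    {"id": 1, "name": "Jon Smith", "phone": "555-0100"},
    {"id": 2, "name": "John Smith", "phone": "555-0100"},
    {"id": 3, "name": "Ann Lee", "phone": "555-0199"},
    {"id": 4, "name": "Anne Lee", "phone": "555-0199"},
]

print(all_pairs_count(len(records)))          # 6
print(sorted(blocked_pairs(records, "phone")))  # [(1, 2), (3, 4)]
print(all_pairs_count(1_000_000))             # 499999500000
```

The all-pairs count grows quadratically, which is why a candidate selection step that narrows the pairs comes before any detailed comparison.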

14.Improve ER with Spark + Graph • Spark + Graph = Better ER

15.Technical Architecture

16.Flexible graph candidate selection • The flexibility of graph enables you to easily add new attributes to your candidate selection query

17.Flexible graph candidate selection – Spark GraphFrames query

18.Flexible graph candidate selection – query by phone

19.Flexible graph candidate selection – query by phone • GraphFrames vs. SparkSQL
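The query shown on this slide is not preserved in the transcript. As a rough stand-in, the sketch below expresses the idea of "query by phone" in plain Python rather than GraphFrames: records and phone numbers are both nodes, and two records are candidates when they point at the same phone node. The edge list, node ids, and the `candidate_pairs` name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical edges: (record node id, attribute node id).
edges = [
    ("r1", "phone:555-0100"),
    ("r2", "phone:555-0100"),
    ("r3", "phone:555-0199"),
]


def candidate_pairs(edges):
    """Return record pairs that are connected to the same attribute node."""
    by_attr = defaultdict(set)
    for record_id, attr_node in edges:
        by_attr[attr_node].add(record_id)
    pairs = set()
    for recs in by_attr.values():
        pairs.update(combinations(sorted(recs), 2))
    return pairs


print(candidate_pairs(edges))  # {('r1', 'r2')}
```

In a graph engine the same pattern is typically written as a two-hop traversal (record to phone to record) instead of a self-join on a phone column.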

20.Flexible graph candidate selection – query by phone or address

21.Flexible graph candidate selection – query by phone or address • SparkSQL: candidate selection query changes • GraphFrames: same candidate selection query

22.Flexible graph candidate selection – query by phone or address or email

23.Flexible graph candidate selection – query by phone or address or email • SparkSQL: candidate selection query changes • GraphFrames: same candidate selection query
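One way to read the "same candidate selection query" claim: when every attribute value is modeled as a node, adding address or email only adds edges, and the traversal that finds candidates stays untouched. The sketch below illustrates this in plain Python under that assumption; the records, field names, and helper functions are hypothetical and are not the speaker's GraphFrames or SparkSQL code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical flat records with sparse attributes.
records = [
    {"id": "r1", "phone": "555-0100", "address": "1 Main St", "email": None},
    {"id": "r2", "phone": "555-0100", "address": None, "email": "js@example.com"},
    {"id": "r3", "phone": None, "address": "1 Main St", "email": None},
    {"id": "r4", "phone": None, "address": None, "email": "js@example.com"},
]


def to_edges(records, attributes):
    """Emit one (record, attribute node) edge per non-null attribute value."""
    return [
        (rec["id"], f"{attr}:{rec[attr]}")
        for rec in records
        for attr in attributes
        if rec.get(attr) is not None
    ]


def candidate_pairs(edges):
    """Attribute-agnostic query: records sharing any attribute node."""
    by_attr = defaultdict(set)
    for record_id, attr_node in edges:
        by_attr[attr_node].add(record_id)
    pairs = set()
    for recs in by_attr.values():
        pairs.update(combinations(sorted(recs), 2))
    return pairs


# Only the data loaded into the graph changes; candidate_pairs does not.
phone_only = candidate_pairs(to_edges(records, ["phone"]))
all_three = candidate_pairs(to_edges(records, ["phone", "address", "email"]))

print(sorted(phone_only))  # [('r1', 'r2')]
print(sorted(all_three))   # [('r1', 'r2'), ('r1', 'r3'), ('r2', 'r4')]
```

A relational version of the same logic would usually need a new join condition or an extra `OR` predicate for each attribute added, which is the contrast the slide draws.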

24.Simplify entity canonicalization

25.Simplify entity canonicalization (cont.)
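The canonicalization slides carry no text beyond their titles, so the following is only one plausible reading: treat confirmed matches as edges, and let each connected component become a single canonical entity. The sketch uses a small union-find in plain Python; the record ids, the match list, and the choice of the smallest id as the canonical id are assumptions for illustration.

```python
def connected_components(nodes, edges):
    """Map every node to the smallest node id in its connected component."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Attach the larger root under the smaller one, so each root
            # is always the smallest id in its component.
            parent[max(ra, rb)] = min(ra, rb)
    return {n: find(n) for n in nodes}


# Hypothetical records and pairwise matches produced by an earlier step.
nodes = ["r1", "r2", "r3", "r4", "r5"]
matches = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]

canonical = connected_components(nodes, matches)
print(canonical)
# {'r1': 'r1', 'r2': 'r1', 'r3': 'r1', 'r4': 'r4', 'r5': 'r4'}
```

Note that r1 and r3 end up in the same entity even though they were never matched directly; the link through r2 is enough. GraphFrames ships a connected components algorithm that plays this role at scale.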

26.Graph context helps when data is limited

27.Graph context helps when data is limited (cont.)

28.Graph context helps when data is limited (cont.)
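The slides do not spell out how graph context is scored. As a hedged illustration of the general idea, the sketch below measures how much two sparse records overlap in their graph neighbors (for example a shared employer or address) using Jaccard similarity. The adjacency data and the `neighbor_jaccard` function are hypothetical, and Jaccard is a stand-in choice, not necessarily the method used in the talk.

```python
def neighbor_jaccard(adj, a, b):
    """Jaccard similarity of the neighbor sets of a and b (excluding each other)."""
    na = adj.get(a, set()) - {b}
    nb = adj.get(b, set()) - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Hypothetical neighbors for three person records that all share a common name
# and have no strong keys of their own.
adj = {
    "p1": {"org:acme", "addr:1 Main St", "person:bob"},
    "p2": {"org:acme", "addr:1 Main St"},
    "p3": {"org:globex"},
}

print(neighbor_jaccard(adj, "p1", "p2"))  # 2 shared of 3 total neighbors
print(neighbor_jaccard(adj, "p1", "p3"))  # 0.0
```

A score like this can be combined with attribute similarity so that records with weak attributes but strongly overlapping relationships still surface as likely matches.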

29.Graph gotchas • Supernodes • Graph adoption learning curve • Not a silver bullet • Less streaming support than traditional SQL-based workflows
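On the supernode bullet: an attribute node linked to a very large number of records (a placeholder phone number, for instance) makes any "records sharing a node" query blow up. A common mitigation, sketched here in plain Python as an assumption rather than as the talk's approach, is to drop attribute nodes whose degree exceeds a threshold before selecting candidates. The edge data and `max_degree` value are hypothetical.

```python
from collections import Counter


def drop_supernodes(edges, max_degree):
    """Remove edges that touch attribute nodes with more than max_degree records."""
    degree = Counter(attr_node for _, attr_node in edges)
    return [(rec, attr) for rec, attr in edges if degree[attr] <= max_degree]


# Five records share a placeholder phone number; two share a real email.
edges = [(f"r{i}", "phone:000-000-0000") for i in range(1, 6)]
edges += [("r1", "email:a@example.com"), ("r2", "email:a@example.com")]

kept = drop_supernodes(edges, max_degree=3)
print(kept)
# [('r1', 'email:a@example.com'), ('r2', 'email:a@example.com')]
```

The right threshold depends on the data; too low and genuine shared attributes (a family phone line) are lost, too high and placeholder values dominate the candidate set.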