Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
1 .WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
2 .Massive-Scale Entity Resolution Using Spark + Graph Max Melnick, Deloitte Consulting LLP #UnifiedAnalytics #SparkAISummit
3 .About Me • Passion for building tech products • Engineering Lead / Architect / Developer • Spark Certified Developer • Based in Washington, DC • UVA Systems Engineering • Love sports, travel, cooking/eating, and listening to podcasts • maxmelnick.com • maxmelnick@gmail.com • linkedin.com/in/maxmelnick
4 .MissionGraph™ • MissionGraph™ is an open-architecture platform for data integration, enhancement, and exploration that powers massive-scale analysis.
5 .Agenda • Entity Resolution (ER) Overview • Spark + Graph ER Solution Walkthrough – Technical Architecture – Example Patterns • Graph gotchas and tips
6 .ER enables analytics
7 .ER Use-Cases • Customer 360 • Fraud Detection • Network Analysis • Recommendation Engines
8 .Logical ER Flow
9 .Simple ER Example
10 .Simple ER Example (cont.)
11 .Simple ER Example (cont.)
12 .Simple ER Example (cont.)
13 .ER is hard • Difficult to scale algorithms vertically (more of the same data) or horizontally (new types of data) • Prohibitively expensive to compare each record with every other record • Heterogeneous datasets • Data lacks strong keys • Difficult to manage changes over time • Similarity varies significantly across types of entities, languages, etc. • Data quality issues
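To put a number on "prohibitively expensive": comparing every record with every other record takes n(n-1)/2 comparisons, which grows quadratically. The sketch below is illustrative arithmetic only; the function names and the block sizes are assumptions, not figures from the talk. It also shows how restricting comparisons to candidate groups (blocks) cuts the count.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_comparisons(block_sizes) -> int:
    """Comparisons needed when records are only compared within their own block."""
    return sum(pairwise_comparisons(size) for size in block_sizes)


if __name__ == "__main__":
    # All-pairs cost at a few (hypothetical) dataset sizes.
    for n in (1_000, 1_000_000, 100_000_000):
        print(f"{n:>12,} records -> {pairwise_comparisons(n):,} pairs")

    # The same 1,000,000 records split into 100,000 hypothetical blocks of 10.
    print(f"blocked: {blocked_comparisons([10] * 100_000):,} pairs")
```

Candidate selection, covered in the following slides, is what keeps the number of compared pairs far below the all-pairs count.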
14 .Improve ER with Spark + Graph • Spark + Graph = Better ER
15 .Technical Architecture
16 .Flexible graph candidate selection • The flexibility of a graph makes it easy to add new attributes to your candidate selection query
17 .Flexible graph candidate selection – Spark GraphFrames query
18 .Flexible graph candidate selection – query by phone
19 .Flexible graph candidate selection – query by phone [Figure panels: GraphFrames, SparkSQL]
20 .Flexible graph candidate selection – query by phone or address
21 .Flexible graph candidate selection – query by phone or address • SparkSQL: candidate selection query changes • GraphFrames: same candidate selection query
22 .Flexible graph candidate selection – query by phone or address or email
23 .Flexible graph candidate selection – query by phone or address or email • SparkSQL: candidate selection query changes • GraphFrames: same candidate selection query
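The slides' queries did not survive extraction, so the following is a minimal plain-Python sketch of the idea rather than the speaker's GraphFrames code. It assumes a graph in which each record has an edge to a node for every attribute value it carries. The record IDs, field names, and values are hypothetical. Because the traversal only asks "which records share an attribute node?", it never names an attribute type, so adding address or email edges changes the data, not the query.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; field names and values are illustrative only.
RECORDS = {
    "r1": {"phone": "555-0100", "address": "1 Main St"},
    "r2": {"phone": "555-0100", "email": "a@example.com"},
    "r3": {"address": "1 Main St", "email": "b@example.com"},
    "r4": {"email": "b@example.com"},
    "r5": {"phone": "555-0199"},
}


def build_edges(records):
    """Turn each (record, attribute type, value) into a record -> attribute-node edge."""
    return [
        (rid, (attr_type, value))
        for rid, attrs in records.items()
        for attr_type, value in attrs.items()
    ]


def candidate_pairs(edges):
    """Return record pairs that share at least one attribute node.

    The traversal never mentions an attribute type, so it stays the same
    when new kinds of attribute edges are loaded into the graph.
    """
    records_by_node = defaultdict(set)
    for rid, node in edges:
        records_by_node[node].add(rid)

    pairs = set()
    for rids in records_by_node.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


if __name__ == "__main__":
    all_edges = build_edges(RECORDS)
    phone_edges = [edge for edge in all_edges if edge[1][0] == "phone"]

    # Same function, different edge sets: phone only, then phone + address + email.
    print(sorted(candidate_pairs(phone_edges)))
    print(sorted(candidate_pairs(all_edges)))
```

In a SQL formulation, each new attribute typically means another join or `OR` branch in the query text. In GraphFrames the analogous traversal would presumably be a motif query along the lines of `g.find("(a)-[]->(x); (b)-[]->(x)")`, though that exact form is an assumption here.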
24 .Simplify entity canonicalization
25 .Simplify entity canonicalization (cont.)
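The canonicalization slides are image-only, so the approach below is an assumption: a common way to canonicalize on a graph is to treat each accepted match as an edge and give every record in a connected component the same canonical ID. This sketch uses a small union-find structure in plain Python. The function name and the rule "smallest record ID wins" are illustrative choices, not the speaker's.

```python
def canonical_ids(record_ids, matches):
    """Map each record ID to the canonical ID of its connected component.

    `matches` is an iterable of (record_a, record_b) pairs judged to be the
    same entity. The canonical ID is the smallest record ID in the component.
    """
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as root so the result is deterministic.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {rid: find(rid) for rid in record_ids}


if __name__ == "__main__":
    ids = ["r1", "r2", "r3", "r4", "r5"]
    matched = [("r1", "r2"), ("r3", "r4"), ("r2", "r3")]
    for rid, canonical in canonical_ids(ids, matched).items():
        print(rid, "->", canonical)
```

On Spark, GraphFrames provides a `connectedComponents()` method that does the equivalent at scale. Whether the talk used it is not recoverable from the extracted slides.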
26 .Graph context helps when data is limited
27 .Graph context helps when data is limited (cont.)
28 .Graph context helps when data is limited (cont.)
29 .Graph gotchas • Supernodes • Graph adoption learning curve • Not a silver bullet • Less streaming support than traditional SQL-based workflows
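To illustrate the supernode gotcha: an attribute node shared by a very large number of records (for example a placeholder phone number) generates a quadratic number of candidate pairs on its own. One common mitigation, shown here as an assumption rather than as the talk's recommendation, is to drop attribute nodes above a degree threshold before candidate selection. The function name, threshold, and sample data are hypothetical.

```python
from collections import Counter


def drop_supernodes(edges, max_degree):
    """Remove edges into attribute nodes linked to more than `max_degree` records.

    `edges` is a list of distinct (record_id, attribute_node) pairs.
    Returns (kept_edges, dropped_nodes).
    """
    degree = Counter(node for _, node in edges)
    kept = [(rid, node) for rid, node in edges if degree[node] <= max_degree]
    dropped = {node for node, count in degree.items() if count > max_degree}
    return kept, dropped


if __name__ == "__main__":
    # 1,000 hypothetical records that all carry the same placeholder phone number,
    # plus two records sharing a real-looking email address.
    placeholder = ("phone", "000-000-0000")
    edges = [(f"r{i}", placeholder) for i in range(1000)]
    edges += [("r1", ("email", "a@example.com")), ("r2", ("email", "a@example.com"))]

    kept, dropped = drop_supernodes(edges, max_degree=100)
    print("dropped nodes:", dropped)
    print("edges kept:", len(kept))
```

Left in place, the placeholder node alone would contribute 1000 × 999 / 2 = 499,500 candidate pairs. The right threshold depends on the data, and dropped nodes are worth reviewing because high-degree values often point to data quality issues.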