Graph Features in Spark 3.0 - Integrating Graph Querying and Algorithms in Spark Graph

Spark 3.0 introduces a new module: Spark Graph. Spark Graph adds the popular query language Cypher, its accompanying Property Graph Model and Graph Algorithms to the data science toolbox. Graphs have many useful applications in recommendation, fraud detection and research. This tutorial aims to help attendees understand when graphs should be used and how Spark Graph can extend analytical workflows. We will explore the concepts and motivations behind graph querying and graph algorithms, the components of the new Spark Graph module and their APIs, and how those APIs allow you to write your own graph applications and integrate them into your data science workflows.

The tutorial is a mixture of presentation, code examples, and notebooks. We will demonstrate how to write an end-to-end Graph application that operates on different kinds of input data. We will show how Spark Graph interacts with Spark SQL and openCypher Morpheus, a Spark Graph extension that allows you to easily manage multiple graphs and provides built-in Property Graph Data Sources for the Neo4j graph database as well as Cypher language extensions.

At the end of the tutorial, attendees will have a good understanding of when to apply graphs in their data science workflows, how to bring Spark Graph into an existing Spark workflow and how to make best use of the new APIs. The tutorial will be both led by the presenters and a hands-on interactive session. The tutorial material will be made available during the presentation.


1.WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics

2.#UnifiedDataAnalytics #SparkAISummit

3. Graphs are everywhere

4. … and growing

5. Graphs at Spark Summit

6. Property Graphs & Big Data
• The Property Graph data model is becoming increasingly mainstream
  – Cloud graph data services like Azure CosmosDB or Amazon Neptune
  – Simple graph features in SQL Server 2017, multiple new graph DB products
  – New graph query language to be standardized by ISO
  – Neo4j becoming a common operational store in retail, finance, telcos … and more
  – Increasing interest in graph algorithms over graph data as a basis for AI
• Apache® Spark is the leading scale-out clustered memory solution for Big Data
• Spark 2: Data viewed as tables (DataFrames), processed by SQL, in function chains, using queries and user functions, transforming immutable tabular data sets

7. Graphs are coming to Spark
[SPARK-25994] SPIP: Property Graphs, Cypher Queries, and Algorithms
Goal
● bring Property Graphs and the Cypher Query language to Spark
● the SparkSQL for graphs
Status
● Accepted by the community
● Implementation still Work in Progress

8. Demonstration

9. The Property Graph

10. The Whiteboard Model Is the Physical Model
• Eliminates graph-to-relational mapping
• In your data, bridges the gap between the logical model and DB models

11. Property Graph Model Components
Nodes
• The objects in the graph
• Can have name-value properties
• Can be labeled
Relationships
• Relate nodes by type and direction
• Can have name-value properties
[Diagram: User node (name: "Dan", born: May 29, 1970, twitter: "@dan"), User node (name: "Ann", born: Dec 5, 1975) and Business node (name: "Cars, Inc", sector: "automotive"), connected by KNOWS, FOLLOWS and REVIEWS relationships; one relationship carries the property date: Jan 10, 2011]
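The components above can be sketched as two small record types. This is a minimal illustration in plain Python, not the Spark Graph API: the class names `Node` and `Relationship`, the helper `neighbours`, the relationship directions and the placement of the `date` property on REVIEWS are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Node:
    """An object in the graph: labels plus name-value properties."""
    id: int
    labels: Tuple[str, ...]
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """Relates two nodes by type and direction; may carry properties."""
    id: int
    rel_type: str
    source: int  # id of the start node
    target: int  # id of the end node
    properties: Dict[str, Any] = field(default_factory=dict)


dan = Node(0, ("User",), {"name": "Dan", "born": "May 29, 1970", "twitter": "@dan"})
ann = Node(1, ("User",), {"name": "Ann", "born": "Dec 5, 1975"})
cars = Node(2, ("Business",), {"name": "Cars, Inc", "sector": "automotive"})
nodes = {n.id: n for n in (dan, ann, cars)}

# Directions and the owner of the date property are assumed, not taken from the slide.
relationships = [
    Relationship(0, "KNOWS", dan.id, ann.id),
    Relationship(1, "KNOWS", ann.id, dan.id),
    Relationship(2, "FOLLOWS", ann.id, dan.id),
    Relationship(3, "REVIEWS", dan.id, cars.id, {"date": "Jan 10, 2011"}),
]


def neighbours(node_id: int, rel_type: str) -> List[str]:
    """Names of nodes reachable from node_id over one outgoing relationship of rel_type."""
    return [
        nodes[r.target].properties["name"]
        for r in relationships
        if r.source == node_id and r.rel_type == rel_type
    ]
```

For instance, `neighbours(dan.id, "KNOWS")` follows only outgoing KNOWS relationships, which shows why both type and direction are part of a relationship.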

12. Relational Versus Graph Models
[Figure: the relational model (User, User-Business and Business tables) next to the graph model (Alice connected by REVIEWS relationships to Burgers, Inc, Pizza, Inc and Pretzels)]

13. Graphs in Spark 3.0

14. Tables for Labels
• In Spark Graph, PropertyGraphs are represented by Node Tables and Relationship Tables
• Tables are represented by DataFrames
  – Require a fixed schema
• Property Graphs have a Graph Type
  – Node and relationship types that occur in the graph
  – Node and relationship properties and their data type
[Diagram: a Property Graph consists of a Graph Type, Node Tables and Rel. Tables]

15. Tables for Labels
Graph: (:User:ProAccount {name: Alice}) -[:REVIEWS]-> (:Business {name: Burgers, Inc})
Graph Type {
  :User:ProAccount ( name: STRING ),
  :Business ( name: STRING ),
  :REVIEWS
}
Node table :User:ProAccount
  id  name
  0   Alice
Node table :Business
  id  name
  1   Burgers, Inc
Relationship table :REVIEWS
  id  source  target
  0   0       1
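To make the relation between tables and Graph Type concrete, the sketch below derives a graph type from the three example tables. It uses plain Python lists of dicts as a stand-in for DataFrames; the function names `table_schema` and `graph_type`, and every type name other than STRING, are assumptions for illustration and not part of Spark Graph.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

# One node table per label combination, one relationship table per type.
node_tables: Dict[Tuple[str, ...], List[Row]] = {
    ("User", "ProAccount"): [{"id": 0, "name": "Alice"}],
    ("Business",): [{"id": 1, "name": "Burgers, Inc"}],
}
rel_tables: Dict[str, List[Row]] = {
    "REVIEWS": [{"id": 0, "source": 0, "target": 1}],
}

ID_COLUMNS = {"id", "source", "target"}  # structural columns, not properties
# Only STRING appears on the slide; the other names are assumed.
TYPE_NAMES = {str: "STRING", int: "INTEGER", float: "FLOAT", bool: "BOOLEAN"}


def table_schema(rows: List[Row]) -> Dict[str, str]:
    """Property name -> type name; fails if the rows do not share one schema."""
    schemas = [
        {col: TYPE_NAMES[type(val)] for col, val in row.items() if col not in ID_COLUMNS}
        for row in rows
    ]
    if any(schema != schemas[0] for schema in schemas):
        raise ValueError("rows do not share a fixed schema")
    return schemas[0] if schemas else {}


def graph_type(nodes: Dict[Tuple[str, ...], List[Row]],
               rels: Dict[str, List[Row]]) -> Dict[str, Dict[str, str]]:
    """Collect the node and relationship types with their properties and data types."""
    result = {}
    for labels, rows in nodes.items():
        result[":" + ":".join(labels)] = table_schema(rows)
    for rel_type, rows in rels.items():
        result[":" + rel_type] = table_schema(rows)
    return result


example_type = graph_type(node_tables, rel_tables)
```

The fixed-schema requirement shows up as the `ValueError`: a table whose rows disagree on property names or types cannot be described by a single entry in the graph type.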

16. Creating a graph
Property Graphs are created from a set of DataFrames. There are two possible options:
- Using Wide Tables
  - one DF for nodes and one for relationships
  - column name convention identifies label and property columns
- Using NodeFrames and RelationshipFrames
  - requires a single DataFrame per node label combination and relationship type
  - allows mapping DF columns to properties
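The difference between the two options can be sketched by splitting one wide node table into one table per label combination. The slide does not spell out the column name convention, so the `$ID` identifier column and the `:Label` boolean columns used here are assumptions, as is the helper `split_wide_nodes`; rows are plain dicts standing in for DataFrame rows.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

# A wide node table: every label and every property is a column (assumed convention).
wide_nodes: List[Row] = [
    {"$ID": 0, ":User": True, ":ProAccount": True, ":Business": False,
     "name": "Alice", "sector": None},
    {"$ID": 1, ":User": False, ":ProAccount": False, ":Business": True,
     "name": "Burgers, Inc", "sector": "food"},
]


def split_wide_nodes(rows: List[Row]) -> Dict[Tuple[str, ...], List[Row]]:
    """Group wide rows by label combination and keep only the properties that are set."""
    tables: Dict[Tuple[str, ...], List[Row]] = {}
    for row in rows:
        # Columns starting with ':' are label flags.
        labels = tuple(col[1:] for col, val in row.items() if col.startswith(":") and val)
        # Everything else that is not a '$' column is a property; nulls are dropped.
        properties = {
            col: val for col, val in row.items()
            if not col.startswith((":", "$")) and val is not None
        }
        tables.setdefault(labels, []).append({"id": row["$ID"], **properties})
    return tables


per_label_tables = split_wide_nodes(wide_nodes)
```

The wide form is convenient when the data already lives in one DataFrame, while the per-label form avoids null-padded columns and makes the mapping from columns to properties explicit.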

17. Storing and Loading
[Diagram: review.json, user.json, business.json → Create Node and Relationship Tables → Create Property Graph → Store Property Graph as Parquet]
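The first step of this pipeline, turning JSON records into node and relationship tables, might look like the sketch below. It works on in-memory JSON lines with the standard library only; the field names `user_id`, `business_id` and `rating` and the helper `node_table` are hypothetical, and the final step of storing the graph as Parquet is left out because it needs Spark.

```python
import json
from itertools import count
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

# Stand-ins for the contents of user.json, business.json and review.json.
user_json = ['{"user_id": "u1", "name": "Alice"}']
business_json = [
    '{"business_id": "b1", "name": "Burgers, Inc"}',
    '{"business_id": "b2", "name": "Pizza, Inc"}',
]
review_json = ['{"user_id": "u1", "business_id": "b1", "rating": 5}']

next_node_id = count()  # node ids are unique across all node tables


def node_table(lines: List[str], key: str) -> Tuple[List[Row], Dict[str, int]]:
    """Build a node table and an index from the source key to the new node id."""
    table, index = [], {}
    for line in lines:
        record = json.loads(line)
        node_id = next(next_node_id)
        index[record.pop(key)] = node_id
        table.append({"id": node_id, **record})
    return table, index


users, user_ids = node_table(user_json, "user_id")
businesses, business_ids = node_table(business_json, "business_id")

# Each review becomes one REVIEWS relationship from a user to a business.
reviews: List[Row] = []
for rel_id, line in enumerate(review_json):
    record = json.loads(line)
    source = user_ids[record.pop("user_id")]
    target = business_ids[record.pop("business_id")]
    reviews.append({"id": rel_id, "source": source, "target": target, **record})
```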

18. Demonstration

19. Graph Querying with Cypher

20. What is Cypher?
• Declarative query language for graphs
  – "SQL for graphs"
• Based on pattern matching
• Supports basic data types for properties
• Functions, aggregations, etc.

21. Pattern matching
[Figure: a query graph, a data graph and the matching result]

22. Basic Pattern: Alice's reviews?
    (:User {name:'Alice'})-[:REVIEWS]->(business:Business)
• (:User {name:'Alice'}) is a NODE with LABEL User and PROPERTY name
• -[:REVIEWS]-> is a RELATIONSHIP with TYPE REVIEWS
• (business:Business) is a NODE bound to VAR business with LABEL Business

23. Cypher query structure
• Cypher operates over a graph and returns a table
• Basic structure:
    MATCH pattern
    WHERE predicate
    RETURN/WITH expression AS alias, ...
    ORDER BY expression
    SKIP ...
    LIMIT ...

24. Basic Query: Businesses Alice has reviewed?
    MATCH (user:User)-[r:REVIEWS]->(b:Business)
    WHERE user.name = 'Alice'
    RETURN b.name, r.rating
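One way to read this query is as a join of the relationship table with the two node tables, followed by a filter and a projection. The sketch below does that over small in-memory tables; the sample data, the function `businesses_reviewed_by` and the output column names are assumptions, and it assumes the query returns the business name together with the rating.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

users: List[Row] = [{"id": 0, "name": "Alice"}, {"id": 1, "name": "Bob"}]
businesses: List[Row] = [
    {"id": 10, "name": "Burgers, Inc"},
    {"id": 11, "name": "Pizza, Inc"},
]
reviews: List[Row] = [
    {"id": 0, "source": 0, "target": 10, "rating": 5},
    {"id": 1, "source": 0, "target": 11, "rating": 3},
    {"id": 2, "source": 1, "target": 10, "rating": 4},
]


def businesses_reviewed_by(user_name: str) -> List[Row]:
    """MATCH a (User)-[REVIEWS]->(Business) pattern, filter on the user name, project a table."""
    users_by_id = {u["id"]: u for u in users}
    businesses_by_id = {b["id"]: b for b in businesses}
    table = []
    for r in reviews:
        user = users_by_id.get(r["source"])       # bind the start node
        business = businesses_by_id.get(r["target"])  # bind the end node
        if user is None or business is None:
            continue  # relationship does not connect a User to a Business
        if user["name"] == user_name:             # WHERE predicate
            table.append({"business": business["name"], "rating": r["rating"]})
    return table


alice_reviews = businesses_reviewed_by("Alice")
```

The result is a list of rows, which mirrors the point that Cypher operates over a graph but returns a table.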

25. Query Comparison: Colleagues of Tom Hanks?
SQL:
    SELECT co.name AS coReviewer, count(co) AS nbrOfCoReviews
    FROM User AS user
    JOIN UserBusiness AS ub1 ON (user.id = ub1.user_id)
    JOIN UserBusiness AS ub2 ON (ub1.b_id = ub2.b_id)
    JOIN User AS co ON (co.id = ub2.user_id)
    WHERE user.name = "Alice"
    GROUP BY co.name
Cypher:
    MATCH (user:User)-[:REVIEWS]->(:Business)<-[:REVIEWS]-(co:User)
    WHERE user.name = 'Alice'
    RETURN co.name AS coReviewer, count(*) AS nbrOfCoReviews
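The co-reviewer pattern amounts to a self-join of the review relationships on the shared business. The sketch below counts co-reviews per user in plain Python; the sample data and the function `co_reviewers` are assumptions. It skips pairs made of the same relationship, mirroring Cypher's rule that a relationship is matched at most once within a pattern, so Alice is not reported as her own co-reviewer.

```python
from collections import Counter
from typing import Dict, List, Tuple

users: Dict[int, str] = {0: "Alice", 1: "Bob", 2: "Carol"}

# (relationship id, user id, business id) for each REVIEWS relationship.
reviews: List[Tuple[int, int, int]] = [
    (0, 0, 10),  # Alice reviews business 10
    (1, 0, 11),  # Alice reviews business 11
    (2, 1, 10),  # Bob reviews business 10
    (3, 1, 11),  # Bob reviews business 11
    (4, 2, 11),  # Carol reviews business 11
]


def co_reviewers(user_name: str) -> Dict[str, int]:
    """Count, per co-reviewer, the businesses reviewed by both them and user_name."""
    counts: Counter = Counter()
    for rel1, user_id, business1 in reviews:
        if users[user_id] != user_name:
            continue
        for rel2, co_id, business2 in reviews:
            # Same business, but never the same relationship twice.
            if business1 == business2 and rel1 != rel2:
                counts[users[co_id]] += 1
    return dict(counts)


alice_co_reviewers = co_reviewers("Alice")
```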

26. Variable-length patterns
    MATCH (a:User)-[r:KNOWS*2..6]->(other:User)
    RETURN a, other, length(r) AS length
Allows the traversal of paths of variable length.
Returns all results between the minimum and maximum number of hops.
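A variable-length pattern can be pictured as a bounded depth-first expansion that reports every path whose hop count lies between the minimum and the maximum. The sketch below is one possible reading in plain Python, not how Spark Graph evaluates such patterns; the function `var_length_paths` and the sample KNOWS edges are assumptions, and it expects `min_hops` to be at least 1.

```python
from typing import Dict, List, Tuple

# Directed KNOWS edges as (source, target); the list index serves as the relationship id.
knows: List[Tuple[int, int]] = [(0, 1), (1, 2), (2, 3)]


def var_length_paths(edges: List[Tuple[int, int]],
                     min_hops: int, max_hops: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, hops) for every path with min_hops <= hops <= max_hops."""
    outgoing: Dict[int, List[Tuple[int, int]]] = {}
    for edge_id, (source, target) in enumerate(edges):
        outgoing.setdefault(source, []).append((edge_id, target))

    results: List[Tuple[int, int, int]] = []

    def extend(start: int, current: int, used: Tuple[int, ...]) -> None:
        if len(used) >= min_hops:
            results.append((start, current, len(used)))
        if len(used) == max_hops:
            return  # do not expand beyond the maximum number of hops
        for edge_id, target in outgoing.get(current, []):
            if edge_id not in used:  # a relationship is used at most once per path
                extend(start, target, used + (edge_id,))

    for start in sorted({source for source, _ in edges}):
        extend(start, start, ())
    return results


paths = var_length_paths(knows, 2, 3)
```

Because every path length in the range is reported, the same pair of nodes can appear several times with different lengths when the graph has more than one route between them.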

27. Aggregations
• Cypher supports a number of aggregators
  – min(), max(), sum(), avg(), count(), collect(), ...
• When aggregating, non-aggregation projections form a grouping key:
    MATCH (u:User)
    RETURN u.name, count(*) AS count
  The above query will return the count per unique name
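The grouping-key rule can be sketched as follows: every projected column that is not an aggregate becomes part of the key, and the aggregate is computed once per distinct key. The helper `count_by` and the sample rows are assumptions; rows are plain dicts standing in for matched nodes.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

matched_users: List[Row] = [
    {"name": "Alice", "city": "Berlin"},
    {"name": "Bob", "city": "Berlin"},
    {"name": "Alice", "city": "Malmö"},
]


def count_by(rows: List[Row], key_columns: List[str]) -> List[Row]:
    """Group rows by the non-aggregated columns and add a count(*) column per group."""
    counts: Dict[Tuple[Any, ...], int] = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)  # the implicit grouping key
        counts[key] = counts.get(key, 0) + 1
    return [
        {**dict(zip(key_columns, key)), "count": n}
        for key, n in counts.items()
    ]


per_name = count_by(matched_users, ["name"])
per_name_and_city = count_by(matched_users, ["name", "city"])
```

Adding a second non-aggregated projection, as in `per_name_and_city`, changes the grouping key and therefore the counts, without any explicit GROUP BY clause.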

28. Projections
• UNWIND
    UNWIND ['a', 'b', 'c'] AS list
  Result:
    list
    'a'
    'b'
    'c'
• WITH
  – Behaves similarly to RETURN
  – Allows projection of values into new variables
  – Controls scoping of variables
    MATCH (n1)-[r1]->(m1)
    WITH n1, collect(r1) AS r // r1, m1 not visible after this
    RETURN n1, r
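Both clauses can be read as operations on a table of rows: UNWIND turns one list into one row per element, and WITH projects a new table in which only the listed variables survive. The sketch below illustrates this in plain Python; the helpers `unwind` and `with_collect` and the sample match rows are assumptions.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def unwind(values: List[Any], alias: str) -> List[Row]:
    """UNWIND values AS alias: one output row per list element."""
    return [{alias: value} for value in values]


def with_collect(rows: List[Row], keep: str, collect: str, alias: str) -> List[Row]:
    """WITH keep, collect(collect) AS alias: group by keep, drop every other variable."""
    grouped: Dict[Any, List[Any]] = {}
    for row in rows:
        grouped.setdefault(row[keep], []).append(row[collect])
    return [{keep: key, alias: values} for key, values in grouped.items()]


unwound = unwind(["a", "b", "c"], "list")

# Rows produced by MATCH (n1)-[r1]->(m1) on a small hypothetical graph.
matches: List[Row] = [
    {"n1": "A", "r1": "rel0", "m1": "B"},
    {"n1": "A", "r1": "rel1", "m1": "C"},
    {"n1": "B", "r1": "rel2", "m1": "C"},
]
projected = with_collect(matches, keep="n1", collect="r1", alias="r")
```

After the projection, the rows contain only `n1` and `r`, which is the scoping effect noted in the query comment: `r1` and `m1` are no longer available to later clauses.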

29. Expressions
• Arithmetic (+, -, *, /, %)
• Logical (AND, OR, NOT)
• Comparison (<, <=, =, <>, >=, >)
• Functions
  – Math functions (sin(), cos(), asin(), ceil(), floor())
  – Conversion (toInteger(), toFloat(), toString())
  – String functions
  – Date and Time functions
  – Containers (Nodes, Relationships, Lists, Maps)
  – …