Extending Spark Graph for the Enterprise with Morpheus and Neo4j

Spark 3.0 introduces a new module: Spark Graph. Spark Graph adds the popular query language Cypher, its accompanying Property Graph Model and graph algorithms to the data science toolbox. Graphs have a plethora of useful applications in recommendation, fraud detection and research.

Morpheus is an open-source library that is API compatible with Spark Graph and extends its functionality with:

A Property Graph catalog to manage multiple Property Graphs and Views
Property Graph Data Sources that connect Spark Graph to Neo4j and SQL databases
Extended Cypher capabilities including multiple graph support and graph construction
Built-in support for the Neo4j Graph Algorithms library

In this talk, we will walk you through the new Spark Graph module and demonstrate how we extend it with Morpheus to help enterprise users integrate Spark Graph into their existing Spark and Neo4j installations. We will demonstrate how to explore data in Spark, use Morpheus to transform data into a Property Graph, and then build a Graph Solution in Neo4j.



2. Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Martin Junghanns & Sören Reichardt, Neo4j
#UnifiedDataAnalytics #SparkAISummit

3. Motivation

4. Graphs are everywhere

5. … and growing

6. The Property Graph Model
Node
● Represents an entity within the graph
● Can have labels
Relationship
● Connects a start node with an end node
● Has one type
Property
● Describes a node or relationship, e.g. name, age, weight
● Key-value pair: String key; typed value (string, number, list, ...)
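
The model on this slide can be sketched as two small record types. This is only an illustration in plain Python: the class names `Node` and `Relationship`, their field names, and the sample data are assumptions made for this example and are not the Spark Graph or Morpheus API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet


@dataclass(frozen=True)
class Node:
    """An entity in the graph; it may carry any number of labels."""
    id: int
    labels: FrozenSet[str] = frozenset()
    # Properties are key-value pairs: string key, typed value.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Relationship:
    """A connection from a start node to an end node with exactly one type."""
    id: int
    start: int  # id of the start node
    end: int    # id of the end node
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data: two people and one KNOWS relationship.
alice = Node(1, frozenset({"Person"}), {"name": "Alice", "age": 42})
bob = Node(2, frozenset({"Person", "Employee"}), {"name": "Bob"})
knows = Relationship(10, start=alice.id, end=bob.id, rel_type="KNOWS",
                     properties={"since": 2019})
```

Note that a node holds a set of labels while a relationship holds a single type string, which mirrors the "can have labels" versus "has one type" distinction on the slide.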

7. Graph Patterns with Cypher

8. Graphs are coming to Spark

9. Spark Project Improvement Proposal (SPIP)
https://git.io/fjqp6
● Defines a Cypher-compatible Property Graph type based on DataFrames
● Replaces GraphFrames querying with Cypher
● Reimplements GraphFrames/GraphX algorithms on the Property Graph type
● Running PoC: [SPARK-27299][GRAPH][WIP] Spark Graph API design proposal

10. SPIP: What are we trying to do?
● “Spark Cypher”
○ Run a Cypher query on a Property Graph returning a tabular result
● Implementation is based on Spark SQL
○ Property Graphs are composed of one or more DataFrames
● Provide Scala, Python and Java APIs
● Deep dive: Graph Features in Spark 3.0: Thursday 11AM, Room G104
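
To make "a Property Graph composed of tables, queried with Cypher, returning a table" concrete, here is a minimal sketch in plain Python. It does not use Spark: lists of dicts stand in for DataFrames, and the column names `id`, `source` and `target` as well as the function `match_knows` are assumptions chosen for this example. The function hand-evaluates one fixed pattern, `MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name`, as two joins against the node table.

```python
# Node table: one row per :Person node (an id column plus property columns).
persons = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carol"},
]

# Relationship table: one row per :KNOWS relationship.
knows = [
    {"id": 10, "source": 1, "target": 2},
    {"id": 11, "source": 2, "target": 3},
]


def match_knows(nodes, rels):
    """Evaluate MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name.

    The pattern becomes two joins: rels.source = a.id and rels.target = b.id.
    """
    by_id = {n["id"]: n for n in nodes}
    return [
        {"a.name": by_id[r["source"]]["name"],
         "b.name": by_id[r["target"]]["name"]}
        for r in rels
        if r["source"] in by_id and r["target"] in by_id
    ]


# The result is tabular: one row per match, one column per RETURN item.
result = match_knows(persons, knows)
```

In an engine built on Spark SQL the same joins would presumably be planned over DataFrames rather than computed with a dictionary lookup; the sketch only shows the shape of the input and output.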

11. SPIP: What does it look like?
[Module diagram labeled SPIP: spark-graph-api, spark-cypher, spark-sql]

12. Spark Graph Demo

13. SPIP: What are we not solving?
● Addresses the Cypher Property Graph Model
○ Does not deal with variants of that model (e.g. RDF)
● No multiple graph features
○ API is flexible enough to support this in future iterations
● No Property Graph Catalog
○ Also no Property Graph specific Data Sources

14. ... but ...

15. Morpheus: Spark Graph for the enterprise

16. The OLTP / OLAP landscape
                             | Tables                        | Graphs
Transactional                | PostgreSQL, Oracle, SQLServer | Neo4j
Data Integration & Analytics | Spark SQL                     | Morpheus

17. Morpheus creates Property Graphs ...
[Diagram labels: Hive, DF, JDBC; TABLES; SUB-GRAPH SOURCES; FS snapshot; Morpheus; composing DataFrames; PROPERTY GRAPH]

18. … wrangles Property Graphs ...
[Diagram labels: DataFrame driving table; Property Graph; Cypher QUERY with Property Graph result (twice); Cypher QUERY with DataFrame table result (SPIP)]

19. … analyses graphs in Spark and Neo4j ...
[Diagram labels: Property Graph; GRAPH ALGOS; ANALYSIS toolsets; DataFrame]

20. … and stores Property Graphs
[Diagram labels: Property Graph; Morpheus; SUBGRAPH STORE; FS snapshot]

21. Spark and Neo4j
Spark is an immutable data processing engine
○ Spark SQL organizes data in tables (DataFrames)
○ DataFrames can be queried via SQL
○ Spark SQL programs are optimized by Catalyst
Neo4j is a native transactional CRUD database
○ Neo4j graphs use a native graph data representation
○ Neo4j graphs can be queried using Cypher
○ Neo4j has optimized in-process multi-threaded graph algorithms

22. Morpheus: SQL + Cypher in one session
Graphs and tables are both useful data models
○ Finding paths and subgraphs, and transforming graphs
○ Viewing, aggregating and ordering values
The Morpheus project parallels Spark SQL
○ PropertyGraph type (composed of DataFrames)
○ Catalog of graph data sources, named graphs and views
○ Cypher query language
A CypherSession adds graphs to a SparkSession

23. What is Morpheus used for?
Data integration
○ Integrate (non-)graphy data from multiple, heterogeneous data sources into one or more property graphs
Distributed Cypher execution
○ OLAP-style graph analytics
Data science
○ Integration with other Spark libraries
○ Feature extraction using Neo4j Graph Algorithms

24. Neo4j Graph Algorithms
https://bit.ly/2oUfnA5
neo4j.com/docs/graph-algorithms/current/
Pathfinding & Search
• Parallel Breadth First Search
• Parallel Depth First Search
• Shortest Path
• Single-Source Shortest Path
• All Pairs Shortest Path
• Minimum Spanning Tree
• A* Shortest Path
• Yen’s K Shortest Path
• K-Spanning Tree (MST)
• Random Walk
Centrality / Importance
• Degree Centrality
• Closeness Centrality
• CC Variations: Harmonic, Dangalchev, Wasserman & Faust
• Betweenness Centrality
• Approximate Betweenness Centrality
• PageRank
• Personalized PageRank
• ArticleRank
• Eigenvector Centrality
Community Detection
• Triangle Count
• Clustering Coefficients
• Connected Components (Union Find)
• Strongly Connected Components
• Label Propagation
• Louvain Modularity – 1 Step & Multi-Step
• Balanced Triad (identification)
Link Prediction
• Adamic Adar
• Common Neighbors
• Preferential Attachment
• Resource Allocations
• Same Community
• Total Neighbors
Similarity
• Euclidean Distance
• Cosine Similarity
• Jaccard Similarity
• Overlap Similarity
• Pearson Similarity
* Available in GraphFrames
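
As an illustration of one entry in the list above, Connected Components (Union Find), here is a minimal single-threaded sketch in plain Python. It is not the Neo4j Graph Algorithms library implementation or its API; the function name `connected_components` and the sample graph are assumptions made for this example, and relationships are treated as undirected.

```python
def connected_components(node_ids, edges):
    """Return {node_id: component_id} using union-find with path halving.

    The component id is the smallest node id in the component.
    """
    parent = {n: n for n in node_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one, so every root
            # stays the minimum node id of its component.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {n: find(n) for n in node_ids}


# Hypothetical graph: nodes 1-2-3 are connected, and so are nodes 4-5.
components = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

The resulting component id per node is the kind of value that could be written back as a node property and used as a feature, in the spirit of the "feature extraction" use case mentioned earlier.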

25. Free O’Reilly Book
neo4j.com/graph-algorithms-book
• Spark & Neo4j Examples
• Machine Learning Chapter

26. Cypher: An open language for graph querying

27. Cypher query language
Cypher 9 is the latest full version of openCypher
○ Implemented in Neo4j 3.5
○ Implemented in whole/part by six other vendors
○ Several other partial and research implementations
○ Cypher for Gremlin is another openCypher project

28. Cypher 9 in Morpheus and Spark Graph (SPIP)
Cypher is a full CRUD language
○ RETURNs only tabular results: not composable
○ Results can include graph elements (paths, relationships, nodes) or property values
Morpheus and SPIP implement most of read-only Cypher
○ No MERGE or DELETE
○ Spark immutable data + transformations

29. Cypher 10 in Morpheus - Multiple graphs
Cypher 10 proposes support for Multiple Graphs
○ Multiple Graph CIP: https://git.io/fjmrx
Allows for Cypher Query composition
○ Similar to chaining transformations on DataFrames
Support Graph Catalog for managing Graphs
○ Analogous to Spark SQL catalog
Query support for Graph Construction
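
The catalog and graph construction ideas can be sketched in a few lines of plain Python. Everything here is hypothetical and only meant to convey the concept: the class `GraphCatalog`, its methods `store` and `graph`, the helper `union`, the qualified names such as `session.friends`, and the dict-of-sets graph representation are assumptions for this example, not the Morpheus API or Cypher 10 syntax.

```python
class GraphCatalog:
    """Maps qualified graph names to graphs, much as a SQL catalog maps names to tables."""

    def __init__(self):
        self._graphs = {}

    def store(self, name, graph):
        self._graphs[name] = graph

    def graph(self, name):
        return self._graphs[name]


def union(left, right):
    """Construct a new graph from two graphs; the inputs are not modified."""
    return {
        "nodes": left["nodes"] | right["nodes"],
        "rels": left["rels"] | right["rels"],
    }


catalog = GraphCatalog()
catalog.store("session.friends",
              {"nodes": frozenset({1, 2}),
               "rels": frozenset({(1, "KNOWS", 2)})})
catalog.store("session.purchases",
              {"nodes": frozenset({2, 3}),
               "rels": frozenset({(2, "BOUGHT", 3)})})

# Composition: the output of one step is a graph, so it can be stored in the
# catalog and used as the input of a later query.
combined = union(catalog.graph("session.friends"),
                 catalog.graph("session.purchases"))
catalog.store("session.combined", combined)
```

Because each step returns a graph rather than a table, steps can be chained in the same way transformations are chained on DataFrames.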
