SQLGraph -- When ClickHouse marries graph processing

首先介绍了关于作者,关于他的实验,如何使用ClickHouse.
然后,说明了为什么Graph processing engine和关于Graph processing engined的相关情况。
最后是一些实例。

展开查看详情

1. SQLGraph When ClickHouse marries graph processing 郑天祺 (Amos Bird) 中科院 计算所 网络数据实验室

2.About Me • Active ClickHouse Contributor • ~70 valid PRs • ~40 Stack Overflow Answers • Open Source Enthusiast (Hacked 20+ Projects) • DB: Impala, Greenplum, Cockroach, Citus, Kudu • Misc: emacs, tmux, gdb, fish-shell, tdesktop ... • SQLGraph Author https://github.com/amosbird

3.About My Lab http://www.ict.ac.cn/jgsz/kyxt/wlzdsys/ • CCF Task Force on Big Data (大数据专家委员会) • Organizing BDTC (大数据大会) • Practitioner on Big Data

4.About My Lab http://www.ict.ac.cn/jgsz/kyxt/wlzdsys/ • CCF Task Force on Big Data (大数据专家委员会) • Organizing BDTC (大数据大会) • Practitioner on Big Data My role

5.How We Use ClickHouse Crawler Extractor tens of millions docs per day billions of docs in total Web RocketMQ Before Natural Elastic MongoDB Language Search Old Stuff Understanding Full Text & Search GIS Data MySQL Neo4J Sloooow The NLU module outputs well structured, clean, (almost) immutable data. NLU : Natural Language Understanding

6.How We Use ClickHouse Crawler Extractor tens of millions docs per day Web billions of docs in total RocketMQ After Natural Elastic MongoDB Language Search Understanding Some Full Text GIS Data Search in CH? ClickHouse w/ Graph Process Engine Swift Graph process engine is a specialized component to run analytical graph algorithms efficiently. The NLU module outputs well structured, clean, (almost) immutable data. NLU : Natural Language Understanding

7.Why Graph Processing Engine • Finding Important Nodes and Edges • PageRank • Personalized PageRank • Shortest Path • Community Detection • Label Propagation • Connection Components • Louvain • Recommendation • Collaborative Filtering • Random Walk • Neural Embeddings

8.Why Graph Processing Engine in RDBMS • Hidden graph structures all over the place • Graph traversal via joins is both slow and inconvenient • Almost impossible to do iterative algorithms (recursion) • Cumbersome graph extraction process if acting as a graph data source • Powerful in data management and processing • Full fledged processing (Instead of awk, perl … ) • Mature concepts of metadata handling • Widely adopted and ease of use interface Mining what exists, aiding what emerges

9.Why Graph Processing Engine in ClickHouse • From Architect’s Perspective • Fast data storage and fast data processing • Can handle many data sources • Rich user interface • From Developer’s Perspective • Nice building blocks for performance critical applications • Versatile SQL pipelines for mixed SQL/graph processing • Good code quality Think OLAP but in the graph field.

10.What to Expect from an OLAP Graph DB • General Users • Easy to use (think as a graph) • Easy to understand (view as a graph) • Faster than Non-Graph databases or OLTP graph databases • Graph Experts • Extremely fast analysis • Interactive graph algorithm designing • Ability to do low level optimizations Be efficient to end users and computer hardware.

11.System Design Goals 1. Visualization Usability 2. Graphs Being First-Class Citizens 3. SQL accessibility 1. Graph Algorithm SDK Versatile 2. Graph Inspector and Rich Operators 3. Reusable Graph Modeling 1. Outperform graph compute engines Performance 2. Quick development iterations 3. Fast graph data digestion

12.SQLGraph Interactive Explorative UI (RESTful, JDBC, cmd, ) Graph SQL Relational SQL SQL Plus Graph API ClickHouse Graph Engine Edge Tables In Memory Graph Vertex Tables Graph Algorithms Tables Unified Data View Source Data Kafka CSV MySQL Mongo

13.Some Results Calculate PageRank value per person PageRank 100000 Relative Slow Down Factor (Log) 11351.0 SELECT vp(v, 'name') AS name, pagerank 10000 2533.1 FROM pagerank(wz) 519.3 1000 ORDER BY pagerank DESC LIMIT 5 131.6 100 54.4 43.0 18.6 ┌─name─┬────pagerank─┐ 10 │ li │ 0.06964007 │ 1 1 │ zhao │ 0.063854694 │ 1 │ qian │ 0.06365149 │ │ sun │ 0.06347877 │ │ shen │ 0.019985389 │ graph500 twitter └──────┴─────────────┘ Find a longest path which ends at ‘shen’ Three-Hop Path Query 100000 20590.5 28844.5 15389.3 15457.3 18579.0 Relative Slow Down Factor (Log) SELECT pp(p, 'name', 'w') AS path 10000 1419.0 FROM edgePath(wz) 1000 428.2 WHERE vp(p[-1].2, 'name') = 'shen' 100 32.0 ORDER BY length(path) DESC 10 3.4 LIMIT 1 1 1 1 ┌─path─────────────────────────────────┐ │ zhou--[3]->wang--[2]->chu--[4]->shen │ └──────────────────────────────────────┘ graph500 twitter

14.Let’s Dive in examples and implementations

15.Graph Query Interface • Graph Algorithms are table functions • Graphs are special tables that used as the first argument • Collection of functions to retrieve graph info • Example : Get the top 5 pagerank value from graph “wz” SELECT vp(v, 'name’), -- retrieve the vertex property “name” pagerank FROM pagerank( -- the graph algorithm wz, -- the graph table 5, -- iterations 0.85, -- damping factor 0.01 -- epsilon ) ORDER BY pagerank DESC LIMIT 5

16.Graph Query Output • Three kinds of outputs : vertices, edges or graph info • Vertices : PageRank, CC, BC, Radii, etc. • Has ‘v : Vertices’ column in the output list • Cover all vertices, no duplications • Can be inserted back into the graph as a global vertex property SELECT id, groupArray(vp(v, 'name')) AS names FROM cc(wz) GROUP BY id ┌─id─┬─names─────────────────────────────────────────────────────────┐ │ 0 │ ['zhao','qian','sun','li'] │ │ 4 │ ['zhou','wu','zheng','wang','feng','chen','chu','wei','shen'] │ │ 14 │ ['han','yang'] │ │ 12 │ ['jiang'] │ └────┴───────────────────────────────────────────────────────────────┘

17.Graph Query Output • Three kinds of outputs : vertices, edges or graph info • Edges : Hop, SSSP, CommonNeighbors, LinkCircle, etc. • Has ‘e : Edges’ column in the output list • Support function combinators: Out, In, All, Path • Can be used to derive subgraphs: create graph as select … • Has ‘Graph’ format to support Non-Structural output (for visualization) SELECT pp(p, 'name', 'w') FROM hopInPath(wz, ( SELECT v FROM vertex(wz) WHERE name = 'zhao' )) ┌─pp(p, 'name', 'w')────────────────────────────┐ │ zhao<-[1]--li<-[2]--sun<-[3]--qian<-[2]--zhao │ │ zhao<-[1]--li<-[4]--zhou │

18.Graph Query Output • Three kinds of outputs : vertices, edges or graph info • Graph info : TriangleCount, MaxClique, ClusteringCoef, etc. • Results are graph characters SELECT * FROM triangle(wz) ┌─count─┐ │ 2 │ └───────┘ Mainly for graph mining algorithms

19.Graph Query Output Pipeline (Compared) Single Run Multiple Runs Post (Reusable) Post Processing Path Processing Builder Vertex Vertex & Graph Info Hybrid Compute Output Queue (Synchronous) Edge or Path Output γv (Asynchronous) Graph ClickHouse V.id = E.to Algorithm Table Functions Join V.id = E.from Special Join Message Graph Table Scan Vertex Edge Scan Scan Vertex Edge Scan Scan Vertexica SQLGraph

20.How to Get/Build a Graph What does a graph look like in RDBMS? Movie User Movie Producer Name Name Name 北极 Age Genre Capital 风 父亲 2M User 3 2 3M Producer UM 5 PM 2M User 李四 暴雪 Producer 5 红星 张三 Movie Movie 王五 Rating Invest How can we make the graph structure efficient? We need CSR/CSC to be fast! CSR/C : Compressed Sparse Row/Column

21.CSR/C Explained Compressed sparse storage 1 2 CSR CSC Src 0 1 2 3 4 Dest 0 1 2 3 4 Index 0 4 5 5 7 Index 0 0 2 5 6 0 Dest 1 2 3 4 2 2 4 1 Src 0 4 0 1 3 0 0 3 4 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 Similar to Arrays in ClickHouse Vanilla edge list storage 0,1 0,2 0,3 0,4 1,2 3,2 3,4 4,1 Easy to store and modify (normal tables)

22. Movie 北极 风 父亲 From Tables to CSR/C 2M User 3 2 3M Producer 5 2M 李四 暴雪 5 红星 张三 王五 Define vertex tables to encode key columns CREATE TABLE Producer (Name String key, Capital UInt64) ENGINE V; CREATE TABLE Movie (Name String key, Genre String) ENGINE V; CREATE TABLE User (Name String key, Age int) ENGINE V; Producer 红星:0 Key:HashMap Name: Key 暴雪:1 -> [0, P-1] Capital E.g. Property:Columns Movie 风:0 Key:HashMap Vertex Table: Producer Name: Key 北极:1 -> [0, M-1] 父亲:2 Genre Property:Columns E.g. User 张三:0 Vertex Table: Movie Key:HashMap Name: Key 李四:1 -> [0, N-1] 王五:2 Age Property:Columns E.g. Vertex Table: User

23. Movie 北极 风 父亲 From Tables to CSR/C 2M User 3 2 3M Producer 5 2M 李四 暴雪 5 红星 张三 王五 Define edge tables to store relations CREATE TABLE Producer_Movie (Src VS(Producer), Dst VD(Movie), Invest UInt64) ENGINE E; CREATE TABLE User_Movie (Src VS(User), Dst VD(Movie), Rating int) ENGINE E; src:_VS (Producer) Producer_Movie 红星-风: <0, 0> src:_VS (User) User_Movie 张三-北极: <0, 1> Producer: VS 红星-北极: <0, 1> User: VS 李四-风: <1, 0> dst:_VD (Movie) dst:_VD (Movie) Movie: VD 暴雪-父亲: <1, 1> 李四-父亲: <1, 2> order by <_VS, _VD> order by <_VS, _VD> Movie: VD Invest E.g. 王五-父亲: <2, 2> Rating Property:Columns Property:Columns E.g. Edge Table: Producer_Movie Edge Table: User_Movie Storing hidden columns _vs, _vd and use ‘MergeTree order by _vs, _vd’ as the underlying engine

24. Movie 北极 风 父亲 From Tables to CSR/C 2M User 3 2 3M Producer 5 2M 李四 暴雪 5 红星 张三 王五 Define MetaSQL to specify how to build a graph CREATE GRAPH User_Movie_Producer AS SELECT * FROM edgeGroup(User_Movie, Producer_Movie) MetaSQL Defines the following workflow: 1. Adjusting IDs of vertices and edges 2. Aggregating edge properties 3. Grouping edges (also assigning Eid) 张三:0 Eid edge group 李四:1 User: Movie: Producer: 张三-北极: <0, 4> 1 vertex num=N vertex num=M vertex num=P _vs 李四-风: <1, 3> 2 王五:2 风:3 key=Name key=Name key=Name _vd 李四-父亲: <1, 5> 3 北极:4 Property Handle Property Handle Property Handle Rating 王五-父亲: <2, 5> 4 父亲:5 Invest 红星-风: <6, 3> 5 红星:6 Vertex Table Proxy 红星-北极: <6, 4> 6 暴雪:7 暴雪-父亲: <7, 5> 7 E.g. MetaSQL: User_Movie_Producer E.g.

25. Movie 北极 风 父亲 From Tables to CSR/C 2M User 3 2 3M Producer 5 2M 李四 暴雪 5 红星 张三 王五 Execute MetaSQL to build a graph REFRESH User_Movie_Producer [FULL] Rating Invest Eid _VS 0 1 2 3 4 5 6 7 5 NULL 1 Index 0 2 3 3 3 3 4 6 3 NULL 2 2 NULL 3 _VD 4 3 5 5 3 4 5 5 NULL 4 NULL 2M 5 Index 0 1 2 3 4 5 6 NULL 3M 6 Forward Topology NULL 2M 7 _VD 0 1 2 3 4 5 6 7 Edge Properties Index 0 2 3 3 3 3 4 6 Eid 1 6 0 6 1 2 7 _VS 4 3 5 5 3 4 5 Index 0 1 2 3 4 5 6 Index 0 1 2 3 4 5 6 Edge Index Backward Topology Graph: User_Movie_Producer

26.How to Write a Graph Algorithm • Two ways of writing graph algorithms: python or c++ • Python: for ad-hoc queries, debugging and performance tuning Argument List: Vertices UInt64 DateTime Output List: e Edge w Weight def run(pygraph, start_vertices, steps, time): Python JIT UDF ClickHouse OutList InList /python_folder VertexProperties EdgeProperties BuiltinOperators Each file under <python_folder> is treated as an python model and loaded and jitted to native code Graph Algorithm NoteBook

27.How to Write a Graph Algorithm • Two ways of writing graph algorithms: python or c++ • C++: for long use/well known algorithms, efficiency Real Part Virtual Part Dest 0 1 2 3 4 Chunk A Offset 0 4 4 4 6 ... A 1 2 3 4 2 4 1 A Source 1 2 3 4 2 4 1 ... B B 1 2 C C subgraph 0 D D Real E E Virtual 4 3 Parts Parts F F VR-Chunk G G RS : Real Part Starting Vertex H H RE : Real Part Ending Vertex RB : Real Part Starting Edge I I VV : Virtual Part Vertex VE : Virtual Part Ending Edge J J CSC Edge List Chunking Line (Edge Num per Chunk) D-Chunk Starting Vertex, First Edge, Last Edge, 0 Frontier Offset 3 Also Needed by Ligra 7 Degree Array Traverse 8 Collect 12 Sparse Degree Array 14 Parallel Scan Frontier Degree Prefix Sum Array CSR s Edge List Binary search over the degree prefix sum array SilverChunk Framework ((14, 2), 8) Chunk1 Chunk2 Chunk3 Chunk4

28.Graph Query Helper • vp, ep, pp for outputting related properties • v for getting the internal id(s) of a vertex/vertices • edge, vertex for retrieving edges/vertices of a graph respectively • show graphs/algos for global inspections • desc graph/algo for detail info • graphInfo as a binary interface to access graph info ……

29. Graph Query Helper (E.g.) DESCRIBE GRAPH wz ┌──────────────────────────┬─name─────────┬─value or type─┬─comment────────────────────┐ │ Edge properties: │ │ │ │ │ │ region │ String │ │ │ │ time │ Date │ │ │ │ w │ Float64 │ │ │ Vertex group: │ │ │ │ │ │ │ │ │ │ Vertex name: │ default.node │ │ │ │ Key name: │ id │ UInt32 │ vertex id range : 0 --- 15 │ │ Vertex properties: │ │ │ │ │ │ id │ UInt32 │ │ │ │ age │ UInt32 │ │ │ │ name │ String │ │ │ │ │ │ │ │ Meta info: │ │ │ │ │ │ vertex_num │ 16 │ │ │ │ edge_num │ 16 │ │ │ │ symmetric │ false │ │ │ │ load │ true │ │ └──────────────────────────┴──────────────┴───────────────┴────────────────────────────┘