钱正平--为并行图数据处理提供高层抽象/语言

为并行图数据处理提供高层抽象/语言
展开查看详情

1.FLASH: Towards a higher level of abstraction in parallel graph computing ԅଚᤈࢶහഝ॒ቘ൉‫׀‬ṛ੶ು᨝᧍᥺ Zhengping Qian (᰸ྋଘ) Senior Staff Engineer Alibaba

2.Graph scenarios at Graph technologies Project Flink FLASH On-going research Alibaba and challenges FLASHᶱፓ ๅग़Ꮈᑪૡ֢ ᴨ᯾ࢶ࣋ว ࢶ॒ቘದ๞Ө೴౴

3.Graph scenarios at Graph technologies Project Flink FLASH On-going research Alibaba and challenges FLASHᶱፓ ๅग़Ꮈᑪૡ֢ ᴨ᯾ࢶ࣋ว ࢶ॒ቘದ๞Ө೴౴

4.Graph analytics for better decisions හഝᳵጱ॔๥‫ى‬ᘶӨᕮ຅ᇙ à ๅᔜ‫ݢ޾ٵ‬ᶌጱ٬ᒽ Graph Nodes Better Machine Learning Decision Properties Relationships

5.https://www.artificial-intelligence.video/undergraduate-machine-learning-4-introduction-to-probability-linear-algebra-and-pagerank

6.Graph matters at Alibaba, too ᴨ᯾ӿ੄ጱࢶ࣋ว • Billions of vertices (‫܈پ‬Պጱᅩ) • Hundred billions of edges (౮ጯӤ‫܉‬Պጱᬟ) • Real-time updates (ਫ෸ๅෛ) o e.g., 320K Tmall transactions/s in 2017

7.Graph data processing paradigms See ፡ ࢶහഝ॒ቘ᝜ୗ Insights Get statistics ၏ᥠ ᇙ஄ᕹᦇ Pattern mining On-line inference ཛྷୗ‫܃‬ᯈ ࣁᕚᶼၥ Machine learning ๢࢏਍ԟ

8.

9.Graph algorithms …… ࢶᓒဩ Product User OneID Location ……

10.Pattern-based fraud detection चԭৼࢶཛྷୗጱ୑ଉ༄ၥ transfer $100.00 real-time updates transaction “accomplices” $99.00 Click farming-simple Click farming-complex On-line gambling ‫ܔڬ‬1 ‫ܔڬ‬2 ᕚӤᩚ‫ܗ‬

11.Community-based recommendation in Taobao चԭࢶጱ๢࢏਍ԟଫአԭႣਪവគ !"#$" %$&'()*" +**$,-(*#" +**$,-(*#" 80s color = red ? Male brand = Gap IT professional for summer We encode the following “intuitions” ࢶᕮ຅‫ڥํ௳מ‬ԭᥴ٬‫ۖސٯ‬/ᕮຎ‫ݢ‬ᥴ᯽௔ᒵᵙ᷌҅Ԟᚆ୚‫ف‬ग़໏௔Ѻ • Reflect each user/item’s uniqueness • Utilize structural information (transaction or viewing edges) • Utilize user/item attributes (handling cold start problem in CF) • Introduce nice “surprise” via graph exploration

12.On-line knowledge graph reasoning चԭᎣᦩࢶᨏጱࣁᕚവቘ Knowledge-powered intelligent service ᦊᎣ๐‫ۓ‬ Bundle recommendation ‫ܔٻ‬വគ

13.Graph scenarios at Graph technologies Project Flink FLASH On-going research Alibaba and challenges FLASHᶱፓ ๅग़Ꮈᑪૡ֢ ᴨ᯾ࢶ࣋ว ࢶ॒ቘದ๞Ө೴౴

14.Vision ౄว Lookup of Graph Graph Pattern Machine vertex/edge traversal algorithms matching Learning Graph Store any graph, run any algorithms, and serve the results on-line as needed ਂ‫ؙ‬ձ֜ࢶֵ҅አձ֜‫ړ‬ຉૡٍ޾຅ୌᥴ٬ොໜ҅ኜᛗࣁᕚ๐‫ۓ‬

15.Build an extensible stack to allow “play-and-play” existing graph technologies ຅ୌ‫ݢ‬ಘ઀ጱ๐‫֛ۓ‬ᔮ҅‫ێ׵‬ሿํࢶದ๞ኞா Graph Pattern Machine Graph traversal algorithms matching Learning TinkerPop Gremlin Vertex- openCypher ? centric APIs Neo4j GraphLab, JanusGraph Flink Gelly etc. YARN HDFS

16.Build an extensible stack to allow “play-and-play” existing graph technologies ຅ୌ‫ݢ‬ಘ઀ጱ๐‫֛ۓ‬ᔮ҅‫ێ׵‬ሿํࢶದ๞ኞா Graph Pattern Machine Graph traversal algorithms matching Learning TinkerPop Gremlin FLASH openCypher ? Shared graph operations & primitives Flink YARN HDFS

17.Graph scenarios at Graph technologies Project Flink FLASH On-going research Alibaba and challenges FLASHᶱፓ ๅग़Ꮈᑪૡ֢ ᴨ᯾ࢶ࣋ว ࢶ॒ቘದ๞Ө೴౴

18.Vertex-centric programming ᶎ‫ݻ‬ᶮᅩጱࢶᖫᑕ Arijit Khan, "Vertex-Centric Graph Processing: Good, Bad, and the Ugly", in EDBT 2017

19.FLASH (FiLter, locAl, puSH) FLASH (main query language) oB = A.FlashOperator (arguments...), where A and B are both sets of vertices + control flow (e.g., while loop) Basic Operators oB = A.Filter(condition): Filter vertices in A which satisfy the given condition o B = A.Local(operation): Apply some local operations (aggregation e.g.) to A oB = A.Push(neighbors, attributes): For each vertex in A, push the given attributes to all its neighbors. The output set contains all vertices’ neighbors. Auxiliary Operators oSet (Union, Intersect, etc.), Order (Asc, Desc, TopK) and Group (by Key)

20.An example: Connected Component (CC) ᓒဩᐏֺғᬳ᭗‫ړ‬ᰁ

21.An example: Connected Component (CC) ᓒဩᐏֺғᬳ᭗‫ړ‬ᰁ A = V.local(cid = id); while (A.size>0) A = A.push("adj", "cid", "preid=min(cid)") .filter(preid < cid).local(cid = preid); return V.{id, cid};

22.FLASH VS vertex-centric programming Өᶎ‫ݻ‬ᶮᅩᖫᑕጱྲ᫾ (CC) A = V.local(cid = id); while (A.size>0) A = A.push("adj", "cid", "preid=min(cid)") .filter(preid < cid).local(cid = preid); return V.{id, cid}; How about other algorithms? Gremlin? Cypher? ‫ڦ‬ጱᓒဩԞᚆᤒᬡҘ޾Gremlin޾Cypher᧍᥺ྲ᫾ޫҘ

23. 4.50 HITS algorithm 4.51 Arithmetic Expression Evaluation Algorithms checked 4.52 Tree Function Computation 4.53 Truss Graph Decomposition 4.54 Undirected Basic Cycle Computation 4.104 Eulerian Circuit Computation 4.55 Personalized PageRank 4.105 Graph Summary ૪ḵᦤ‫ݢ‬ᤒᬡᓒဩ 4.56 k-Edge Connected Component 4.57 ECC Graph Decomposition 4.58 Steiner Component 4.106 Graph Coarsening 4.107 Bidirectional BFS 4.108 Graph Voronoi Diagram Partitioner 4.59 2-Hop Labelling for Reachability 4.109 Collaborative Filtering 4.60 2-Hop Labelling for Shortest Path 4.110 Maximal Bipartite Matching 4.61 k-Path Cover 4.111 Matrix Factorization 4.62 Maximum Bipartite Matching 4.112 Random Walk in All Vertices (Node2Vec) 4.9 Maximal Independent Set 4.63 Maximum Weight Bipartite Matching 4.113 k-means 4.10 Minimal Vertex Cover 4.64 Stable Bipartite Matching 4.114 Maximal bi-clique enumeration https://www.uts.edu.au/staff/lu.qin 4.11 Ego-Network Computation 4.12 Triangle Enumeration 4.65 Tree Graph Decomposition 4.66 Core-Tree Graph Decomposition 4.115 Belief Propagation 4.116 FM Sketch 4.13 Directed Triangle Enumeration 4.67 Double k-Core 4.117 Stochastic Gradient Descent (SGD) 4.14 Clustering Coefficient Computation 4.68 (k,r)-Core 5. Cypher to FLASH 4.15 Maximal Matching 4.69 Graph Diameter 5.1 Query Friends 4.16 Minimal Dominating Set 4.70 Subgraph Matching 5.2 Les Personnages 4.17 All-pair Shortest Path 4.71 Subgraph Enumeration 5.3 UEFA Euro 2016 4.18 Minimal Graph Coloring 4.72 Motif Enumeration 5.4 Winter Network 3. Gremlin to FLASH 4.19 Structural Clustering 4.73 Ear Decomposition 5.5 Fitness and Nutritional Recommendations 3.1 Between Vertices 4.20 Closeness Centrality 4.74 Densest Subgraph 5.6 The Cantina Bar 3.2 Duplicate Edge Detection 4.21 Betweenness Centrality 4.75 Locally Densest Subgraph 5.7 NBA Sneakers 3.3 Duplicate Vertex Detection 4.22 Graph Eccentricity 4.76 Frequent Pattern Mining 5.8 FIS Alpine Skiing Seasons 3.4 Recommendation 4.23 k-NN Friend Recommendation 4.77 SimRank 5.9 ClimbingDB (social network climbing database) 3.5 Traversal Induced Values 4.24 Depth of Each Vertex on a Tree 4.78 Vertex Connectivity 5.10 NHL Team Ranking Model 3.6 Lowest Common Ancestor on Trees 4.25 Euler-Path on a Tree 4.79 k-Vertex Connected Component 5.11 England Senior Squad 3.7 Maximum Depth on Trees 4.26 Rooting an Undirected Tree 4.80 Contraction Hierarchy Computation 5.12 Pokémon X & Y 3.8 Pagination 4.27 Depth-First Traversal (DFS) on a Tree 4.81 Core Graph Hierarchy Computation 5.13 Travel Helper 3.9 If-Then Based Grouping 4.28 Post-order Traversal on a Tree 4.82 Sparse Certification Computation 5.14 NBA Playoff Prediction 3.10 Connected Component 4.29 Number of Descendants on a Tree 4.83 Multi-Source Breadth-First Search using Bitmap 5.15 Running Competition 3.11 Cycle Detection 4.30 Reachability Querying on a Tree 4.84 Pushbox Game using Dynamic Programming 5.16 Beer & Breweries GraphGist 3.12 Shortest Path 4.31 Lowest Common Ancestor Querying on a Tree 4.85 A* based Shortest Path 5.17 OMA 3.13 Degree Centrality 4.32 Minimum Spanning Tree 4.86 Eight Number Game using A* Algorithm 5.18 Movie Recommendations with k-NN and Cosine Similarity 3.14 Betweenness Centrality 4.33 Maximal Clique Enumeration 4.87 Undirected Graph Cut 5.19 Interpreting Citation Patterns in Academic Publications 3.15 Closeness Centrality 4.34 Maximum Flow 4.88 Gomory-Hu Tree Construction 5.20 Organization Learning 3.16 Eigenvector Centrality 4.35 Directed Cycle Detection 4.89 Frequent Subgraph Mining in a Graph Database 5.21 Competency Management 3.17 Time-based Indexing 4.36 Degeneracy Graph Ordering 4.90 Influence Maximization 5.22 Drug repurposing by hetnet relationship prediction 3.18 Determine Followers in a Follow Graph 4.37 Topological Sort 4.91 Strong/Weak Tie Computation 5.23 Roads, Nodes and Automobiles 3.19 Determine Co-workers of Softwares for Each Person 4.38 Graph Simulation 4.92 Directed Minimum Spanning Tree 5.24 Restaurant Recommendations 3.20 Detect Students with Conflicting Schedule 4.39 Graph Dual-Simulation 4.93 Shortest Path in a Temporal Graph 5.25 Supply Chain Management 3.21 Group Neighbor Labels 4.40 Bipartiteness Testing 4.94 Semi-Clustering on a Graph 5.26 Bank Fraud Detection 4. Graph algorithms in FLASH 4.41 Minimum Steiner Path 4.95 Multi-Criteria Social Graph Partitioning 5.27 Rebels financial system 4.1 Induced Subgraph 4.42 Minimum Group Steiner Path 4.96 Perfect Matching Testing on a Bipartite Graph 5.28 Tor Network Graph 4.2 Breadth-First Graph Traversal (BFS) 4.43 Minimum Steiner Tree 4.97 Perfect Matching on a Bipartite Graph 5.29 Finding Influencers in a Social Network 4.3 PageRank Computation 4.44 Minimum Group Steiner Tree 4.98 Minimum Vertex Cover in a Bipartite Graph 5.29.1 Finding User Counts 4.4 Graph Keyword Search 4.45 Undirected Depth-First Traversal (DFS) 4.99 Planarity Testing 5.29.2 Looking at forwarded messages 4.5 Connected Component 4.46 Directed Depth-First Traversal (DFS) 4.100 Longest Path in a Directed Acyclic Graph 5.29.3 Finding Conversations 4.6 Single Source Shortest Path (Weighted Graph) 4.47 Bridge Detection 4.101 Minimum k Disjoint Path Computation 5.30 Network versions 4.7 Core Graph Decomposition 4.48 Bi-connected Component 4.102 Core Decomposition in an Uncertain Graph 5.31 Entitlements and Access Control 4.8 Strongly Connected Component 4.49 Cut Vertex Detection 4.103 Graph Pattern Matching (with Reachability Edges)

24.An optimized version of CC ᤒᬡ॔๥ᓒဩս۸ V.local(p = (min(adj,id) == id ? min(adj):min(adj,id))); do { star(); V.filter(s == 1).push(“adj”, “id”, “f”).push(“f”, “p”, “ap”).local(pm = min(p, ap)) .push(“p”, “pm”, “f”).local(p = min(f)); star(); V.filter(s == 1).push(“adj”, “id”, “f”).push(“f”, “p”, “ap”).local(pm = min(ap(it != p))) .filter(pm != null).push(“p”, “pm”, “f”).local(p = min(f)); A = V.push(“p”, “id”, “f”).push(“f”, “p”, “g”).filter(p != g).local(p = g); } while(A.size > 0) return V.{id, p}; procedure star() { V.local(s = 1).push(“p”, “id”, “f”).push(“f”, “p”, “g”).filter(p==g).local(s = 0).push(“g”).local(s = 0); V.push(“p”, “id”, “f”).filter(s == 0).push(“f”).local(s = 0); }

25.System optimizations made possible: example of push/pull ꧋ᦜๅग़ጱᔮᕹս۸ғpush/pullᐏֺ BFS(s) A = V.filter(id == s).local(dis = 0); while(A.size > 0) { A = A.push(“out”, “dis”, “predis”) .filter(!hasAttr(“dis”)) .local(dis = min(predis) + 1); } return V.{id, dis}; https://www.slideshare.net/yuichiroyasui/graph500-bigdata2013

26.System optimizations made possible: example of push/pull ꧋ᦜๅग़ጱᔮᕹս۸ғpush/pullᐏֺ BFS(s) A = V.filter(id == s).local(dis = 0); while(A.size > 0) { if (A.size < threshold) { // TODO: switch to pull mode continue; } A = A.push(“out”, “dis”, “predis”) .filter(!hasAttr(“dis”)) .local(dis = min(predis) + 1); } return V.{id, dis}; https://www.slideshare.net/yuichiroyasui/graph500-bigdata2013

27.System optimizations made possible: example of (vertex) reordering ꧋ᦜๅग़ጱᔮᕹս۸ғ(ᶮᅩ)ᶲଧս۸ᐏֺ Triangle enumeration A = V.local(t = both).local(t = append(t,id)) .push(both, t, tList) .local(t = append(tList,id)) .local(t=filter(t, it.0<it.1 && it.1 < it.2 && contain(both, it.0)); return {t};

28.System optimizations made possible: example of (vertex) reordering ꧋ᦜๅग़ጱᔮᕹս۸ғ(ᶮᅩ)ᶲଧս۸ᐏֺ Triangle enumeration A = V.local(t = filter(both,it.id > id)) // t.0>t.1 .local(t = append(t,id)) .local(tout = filter(both, id > it. id)) // t.1>t.2 .push(tout, t, tList) .local(t = append(tList,id)) .local(t=filter(contain(both, it.0)); return {t}; https://en.wikipedia.org/wiki/Greedy_coloring

29.Preliminary evaluation (VS Flink Gelly) ӨFlink Gellyᬰᤈ‫ྍڡ‬ጱฃᖫᑕ/௔ᚆྲ᫾ LOCs LOCs Soc Soc Twitter Twitter UK UK Friendster Friendster Algorithms (Gelly) (FLASH) (Gelly) (FLASH) (Gelly) (FLASH) (Gelly) (FLASH) (Gelly) (FLASH) CC 52 8 - - 9m49s 9m10s 45m57s 49m55s - - SSSP 53 7 - - 5m31s 5m34s 13m13s 13m14s - - PageRank 51 7 - - 11m13s 11m22s 31m20s 31m29s - - Community 93 11 - - 27m48s 26m19s 1h24m 1h41m - - Detection Triangle 194 5 - - - - - - 45m19s 55m28s Counting Jaccard 299 6 7m11s 10m43s - - - - - - Similarity Environment: 16 nodes with 20GB memory and 4 cores each