Building a Graphy Time Machine

图形数据库允许用户分析高度互联的数据集,并在这些关系中查找模式。社交网络、企业层次结构、欺诈检测、网络分析或构建整体知识图是图形数据库的重要用例。然而,这些节点和连接边缘的数据集随着时间的推移而变化。无论您是开发人员、架构师还是数据科学家,您都可能希望通过时间旅行来分析过去,甚至预测未来。
虽然您的图形数据库可能缺乏管理图形数据修订历史的内置支持,但本文将向您展示如何对一般图形类以性能方式进行管理。最重要的是,这不需要任何突破性的新想法。我们将简单地从现有的持久数据结构文献中借用一些工具和技巧,并对它们进行调整,以便在图形数据库软件中获得良好的性能。

展开查看详情

1. One Engine, one Query Language. Building a Graphy Time Machine Multiple Data Models. Michael Hackstein ArangoDB Inc. Copyright © ArangoDB Inc. , 2018

2.About Me 1. Michael Hackstein 2. ArangoDB Core Team a. Leading Graph Development Team i. Graph Features ii. Smart Graphs iii. UI 3. Masters Degree a. Spec. in Databases and Information Systems 4. Twitter/Github: @mchacki Copyright © ArangoDB Inc. , 2018

3.Motivation 1. Data evolves over time: a. Documents are added or removed b. Attributes are changed 2. We often need a full history or some kind of version control 3. This includes graph data Copyright © ArangoDB Inc. , 2018

4.Requirement: Property Graphs 1. Allow to store attributes on vertices 2. Allow to store attributes on edges Copyright © ArangoDB Inc. , 2018

5. Time-Slicing Copyright © ArangoDB Inc. , 2018

6.Use-Case 1. We have a graph-data model 2. The data is changing over time 3. We have a requirement that we can see how the data looked like at a certain point in time 4. We are able to compress, archive or delete data older than a certain point. a. If we cannot clean up we need to be prepared for "infinite" production of data. Copyright © ArangoDB Inc. , 2018

7.Example Step 1: Start Original Graph data, the starting point Add {created: ts1, expires: ts2} to every edge A A C B C B E D E D Copyright © ArangoDB Inc. , 2018

8.Example Step 2: Modify B -> B' Full Graph, all nodes are active A->B->D is obsolete in 1. A->B'->D is active at 1 A A Copy on C B C B B' Write E D E D Copyright © ArangoDB Inc. , 2018

9.Querying for Time-Slicing Query the neighbors of a vertex (A) at a certain time FOR n, e IN 1 OUTBOUND 'vertex/A' GRAPH 'timemachine' FILTER e.created >= @timestamp FILTER e.expires < @timestamp RETURN n Query a (sub-) graph starting at vertex (A) at a certain time FOR n, e, p IN 1..@depth OUTBOUND 'vertex/A' GRAPH 'timemachine' FILTER p.edges[*].created ALL >= @timestamp FILTER e.edges[*].expires < @timestamp RETURN n Copyright © ArangoDB Inc. , 2018

10.Example Step 3: Querying Timestamp == 0 Timestamp == 5 A A C B B' C B B' E D E D Copyright © ArangoDB Inc. , 2018

11. DEMO TIME Copyright © ArangoDB Inc. , 2018

12.Problem solved? 1. Solved: a. Store complete history b. Query for any point in time 2. New / Open: a. We would need infinite space b. What about complexity / cost of inserts? c. What about complexity of reads? Copyright © ArangoDB Inc. , 2018

13.Infinite space 1. Scale to more machines a. Delays the issue 2. In most use-cases: a. Most recent data needs full versions b. Old data can be compacted and archived c. => Use-case specific "garbage" collection Copyright © ArangoDB Inc. , 2018

14.Overhead of a modification 1. "Mental Model": a. Update one Attribute (1 operation) 2. "Real Data": a. Copy the Vertex and overwrite this attribute (1 operation) b. Copy all n connected edges with new created/expires (n operations) c. Modify expire of all n connected edges (n operations) Copyright © ArangoDB Inc. , 2018

15. Vertex-Proxies Copyright © ArangoDB Inc. , 2018

16.Vertex Proxies to the rescue Logical Vertex Stored Vertex IN A A OUT Copyright © ArangoDB Inc. , 2018

17.Modify A IN IN A A A' A A' OUT OUT Copyright © ArangoDB Inc. , 2018

18.Overhead of a modification (Proxy) 1. "Mental Model": a. Update one Attribute (1 operation) 2. "Real Data": a. Copy the Vertex and overwrite this attribute (1 operation) b. Create edges IN -> Vertex' -> OUT (2 operations) c. Modify expire IN -> Vertex -> OUT (2 operations) Copyright © ArangoDB Inc. , 2018

19.Replace A with new B D E D E D E A A B F F F Copyright © ArangoDB Inc. , 2018

20.Overhead of a replace (Proxy) 1. "Mental Model": a. Replace one Vertex (1 operation) b. Update k connected edges (k operations) 2. "Real Data": a. Create new Vertex (1 operation) b. Create In and Out Proxy (2 operations) c. Create k new connected edges (k operations) d. Expire k old connected edges (k operations) Copyright © ArangoDB Inc. , 2018

21. DEMO TIME Copyright © ArangoDB Inc. , 2018

22.Number of Edges created 1. It sounds like the number of edges can explode? a. Total Number of edges is high b. #Edges (Proxy) < #Edges (noProxy) if #Edges/Vertex > 3 2. Number of edges per Vertex: a. Assume (A) has k inbound edges, and n outbound edges and L versions b. IN-Proxy: k inbound, L outbound c. Out-Proxy: L inbound, n outbound d. A (each): 1 inbound, 1 outbound Copyright © ArangoDB Inc. , 2018

23.Influence on Query costs 1. For simplicity assume we only search with direction on edges a. For reverse-direction same complexity applies 2. And assume we only search for one specific timestamp a. Timestamp- Range-search is possible, just makes the formula more complicated ;) 3. "Mental-Model": a. Every vertex A has n outbound edges => (O(n)) for depth 1, (O(nd)) for depth d 4. "Proxy-Model": a. Specific version of A has 1 outbound edge to OUT => (O(1)) b. OUT has n outbound edges to IN => (O(n)) for depth 1, (O(n d)) for depth d c. IN has L outbound edges, only 1 is relevant => (O(L)), linear search 5. "No-Proxy-Model": a. Specific version of A has n * L outbound edges, only n are relevant => (O(n*L)) Copyright © ArangoDB Inc. , 2018

24. Vertex-Centric index Copyright © ArangoDB Inc. , 2018

25.Linear Scan does not scale 1. This step is optional, depends on use-case a. Small number of versions L => linear might be okay b. High number of versions L => you may need this optimization 2. The number of neighbors n does not matter in the proxy solution. a. We need all neighbors n in every query. b. Improving the time to find one specific neighbor is of no help. Copyright © ArangoDB Inc. , 2018

26. Vertex-Centric Index A A No index => scan all connected edges index (_from, color) => scan all edges of color Copyright © ArangoDB Inc. , 2018

27.Vertex-Centric Index applied 1. We need a Sorted index a. First sort by vertex (_from) b. Second sort by created c. In ArangoDB: Skiplist(["_from", "created"]) 2. Find the relevant version in O(log L) instead of O(L) 3. Drawbacks: a. Uses additional space b. Increase cost of write 4. Only relevant on OUT -> Vertex edges a. In ArangoDB use three edge collections, e.g.: outToV, vToIn, inToOut Copyright © ArangoDB Inc. , 2018

28. DEMO TIME Copyright © ArangoDB Inc. , 2018

29.Problem solved? 1. Solved: a. Store complete history b. Query for any point in time 2. New / Open: a. We would need infinite space b. What about complexity / cost of inserts? c. What about complexity of reads? Copyright © ArangoDB Inc. , 2018