OrpheusDB: Bolt-on Versioning for Relational Databases

OrpheusDB: Bolt-on Versioning for Relational Databases
展开查看详情

1.OrpheusDB : Bolt-on Versioning for Relational Databases Silu Huang 1 , Liqi Xu 1 , Jialin Liu 1 , Aaron J. Elmore 2 , Aditya Parameswaran 1 1 University of Illinois (UIUC) 2 University of Chicago 1

2.Motivation: Dataset Versioning Data Science is Popular Dataset Versioning Arises 2

3.Motivation: Example from Biological Domain           Add N ew Record/Column Correct and Curate   Merge Back Download and Analyze     Neighborhood Cooccurrence ENSP1 ENSP2 0 53 ENSP1 ENSP3 0 87 ENSP4 ENSP5 426 0 Protein1 Protein2 Non-Linear Process Hundred to Thousand of Versions 3

4.Versioned Dataset Management In the Wild Shared System/Folder: everyone can access Make Private Copies, Modify, Analyze, and Integrate back. 4 Massive Redundancy No True Collaboration N o Querying Capability

5.Existing Management Systems: Alternatives I . SC-VC, G itHub Cannot scale to large dataset Limited functionalities II. Temporal Databases Only support a linear chain of versions 5 III. Decibel [1], DEX [2] Build “from the ground up” No parser, query optimizer, transaction manager for now [1] Decibel: The relational dataset branching system. PVLDB 2016. [2] DEX: Query Execution in a Delta-based Storage System. SIGMOD 2017 “Bolt-on” Approach ?

6.How to Support Compact S torage? How to Support True Collaboration? A llow branching, checkout, commit, merge How to Support Advanced Querying? Issue query against versioning provenance Identify versions satisfying some property Show aggregate statistics across versions 6 Version Control Commands SQL Commands Unmodified DBMS Versioning Layer Client Issues to address: Bolt-on Versioning: “Reuse” Relational Databases

7.How to Support Compact S torage? Data Representation How to Support True Collaboration? A llow branching, checkout, commit, merge How to Support Advanced Querying? Issue query against versioning provenance Identify versions satisfying some property Show aggregate statistics across versions 7 Version Control Commands SQL Commands Unmodified DBMS Versioning Layer Client Issues to address: Bolt-on Versioning: “Reuse” Relational Databases

8.Outline 8 Strawman Approach Data Representation Access/Storage Optimization

9.Strawman Approach : Issues Neighborhood Cooccurrence ENSP1 ENSP2 0 53 V1 ENSP1 ENSP2 0 53 V2 ENSP1 ENSP2 225 53 V3 ENSP1 ENSP3 0 87 V1 ENSP1 ENSP3 0 87 V2 ENSP1 ENSP3 0 87 V3 ENSP4 ENSP5 426 0 V1 ENSP4 ENSP5 426 36 V3 Protein1 Protein2 VID Augment each tuple with a version ID (VID) 1. Storage is Huge Optimized Representation Scheme 9 Two Issues 2. Checkout is Costly Partition Scheme

10.Outline 10 Strawman Approach Data Representation Access/Storage Optimization

11.Data Representation: Alternatives Storage is Huge Neighborhood Cooccurrence ENSP1 ENSP2 0 53 V1 ENSP1 ENSP2 0 53 V2 ENSP1 ENSP2 225 53 V3 ENSP1 ENSP3 0 87 V1 ENSP1 ENSP3 0 87 V2 ENSP1 ENSP3 0 87 V3 ENSP4 ENSP5 426 0 V1 ENSP4 ENSP5 426 36 V3 Protein1 Protein2 VID (a) Table with Versioned Records Neighborhood Cooccurrence ENSP1 ENSP2 0 53 {V1,V2} ENSP1 ENSP2 225 53 {V3} ENSP1 ENSP3 0 87 {V1,V2,V3} ENSP4 ENSP5 426 0 {V1} ENSP4 ENSP5 426 36 {V3} Protein1 Protein2 Vlist (b) Combined Table 11 Example: Clone V3 as V4 Commit is Costly

12.Data Representation: Alternatives Neighborhood Cooccurrence ENSP1 ENSP2 0 53 V1 ENSP1 ENSP2 0 53 V2 ENSP1 ENSP2 225 53 V3 ENSP1 ENSP3 0 87 V1 ENSP1 ENSP3 0 87 V2 ENSP1 ENSP3 0 87 V3 ENSP4 ENSP5 426 0 V1 ENSP4 ENSP5 426 36 V3 Protein1 Protein2 VID (a) Table with Versioned Records Neighborhood Cooccurrence ENSP1 ENSP2 0 53 {V1,V2} ENSP1 ENSP2 225 53 {V3 ,V4 } ENSP1 ENSP3 0 87 {V1,V2,V3 ,V4 } ENSP4 ENSP5 426 0 {V1} ENSP4 ENSP5 426 36 {V3 ,V4 } Protein1 Protein2 Vlist (b) Combined Table 12 Example: Clone V3 as V4 Commit is Costly Storage is Huge

13.Data Representation: Alternatives Neighborhood Cooccurrence R1 ENSP1 ENSP2 0 53 R2 ENSP1 ENSP2 225 53 R3 ENSP1 ENSP3 0 87 R4 ENSP4 ENSP5 426 0 R5 ENSP4 ENSP5 426 36 (c) Data Table + Versioning Table Protein1 Protein2 RID R1 {V1,V2} R2 {V3} R3 {V1,V2,V3} R4 {V1} R5 {V3} Vlist RID V1 {R1,R3,R4} V2 {R1,R3} V3 {R2,R3,R5} R list V ID ( i ) Split-by- Vlist (ii) Split-by- Rlist Commit is Costly 13                 Mapping between Records and Versions Experiments in the Paper

14.Outline 14 Strawman Approach Data Representation Access/Storage Optimization

15.Access and Storage Trade Off: Partitioning Checkout Latency increases As “Irrelevant” records increase Records scatter across the table Neighborhood Cooccurrence R1 ENSP1 ENSP2 0 53 R2 ENSP1 ENSP2 225 53 R3 ENSP1 ENSP3 0 87 R4 ENSP4 ENSP5 426 0 R5 ENSP4 ENSP5 426 36 R6 ENSP1 ENSP3 169 87 R7 ENSP6 ENSP7 207 35 Protein1 Protein2 RID V1 {R1,R3,R4} V2 {R1,R3} V3 {R2,R3,R5} V4 {R2,R5,R6} V5 {R1,R3,R7} R list V ID Checkout V5 15 Access EntireTable (7 Records)

16.Access and Storage Trade Off: Partitioning Our A pproach : Partitioning Break up the table into partitions Neighborhood Cooccurrence R1 ENSP1 ENSP2 0 53 R3 ENSP1 ENSP3 0 87 R4 ENSP4 ENSP5 426 0 R7 ENSP6 ENSP7 207 35 Protein1 Protein2 RID V1 {R1,R3,R4} V2 {R1,R3} V5 {R1,R3,R7} R list V ID Checkout V5 Neighborhood Cooccurrence R2 ENSP1 ENSP2 225 53 R3 ENSP1 ENSP3 0 87 R5 ENSP4 ENSP5 426 36 R6 ENSP1 ENSP3 169 87 Protein1 Protein2 RID V3 {R2,R3,R5} V4 {R2,R5,R6} R list V ID     16 Access Smaller Table (4 Records) Reduce checkout latency with more storage cost   Repeated

17.Access and Storage Trade Off: Cost Model Average Checkout C ost   Checkout cost for version the size of the partition belongs to : checkout cost across all versions, divided by # of versions   Storage Cost   Total # of records in all partitions Problem Formulation: Given a storage threshold , find a partitioning scheme that minimizes such that   NP-Hard! (Reduction from 3-Partition) 17

18.Access and Storage Trade Off: T wo Extremes All in One Table Smallest S torage Cost   Each Version as One Table Smallest Average Checkout Cost   18

19.Access and Storage Trade Off: T wo Extremes All in One Table Smallest S torage Cost   Each Version as One Table Smallest Average Checkout Cost           Partition Per Version Single Partition   O ptimal Scheme 19 Intuition: Group Similar Versions

20.Partitioning: LyreSplit Clustering Approach Group versions with similar records together Need to examine every single record; Costly                 7 6 4 8 10 6 8 6 7 8 30 12 10 LyreSplit : Operate on Version Graph Encodes the derivation information Indicates similarity among versions 20 vs. ~1K Versions ~10M Records Light-Weight  

21.Partitioning: High-level Idea of LyreSplit               7 6 4 8 10 6 8 6 7 8 30 12 10 - coherent: Current         Single Partition -coherent ? NO Split     ( , )- Approximation G uarantee   21

22.              7 6 4 8 10 6 8 6 7 8 30 12 10               7 6 8 10 6 8 6 7 8 30 12 10               7 6 8 10 8 6 7 8 30 12 10 Partitioning: LyreSplit Illustration 22

23.Partitioning: Experimental Setting Dataset Versioning workload from Decibel[1] Algorithm Our proposed algorithm – LyreSplit LyreSplit (Varying – coherence parameter) Two clustering-based algorithm from Nscale [2] AGGLO (Varying – partition capacity) KMeans ( Varying – # of partitions)   [1] Decibel : The relational dataset branching system. PVLDB 2016 [2] NScale : neighborhood-centric large-scale graph analytics in the cloud . VLDB Journal 2016 23

24.Comparison of Partitioning Algorithms : Effectiveness Access and Storage Trade Off Storage: 2.3GB Checkout: 2.9s 24

25.Comparison of Partitioning Algorithms : Efficiency 10 3 X faster on average 25 Running Time Comparison 1M 5M 10M Dataset Size (# of Records)

26.Benefits of Partitioning: S torage vs. Access Comparison With and Without Partitioning Storage: 2X I ncrease Checkout: 21X Reduction 26 1M 5M 10M 1M 5M 10M

27.Summary 27 Bolt-on versioning for relational databases [ OrpheusDB ] Optimized data representation [Split-by- rlist ] D esigned partition scheme [ LyreSplit ]

28.http://orpheus-db.github.io/ [Fully Open-Source] 28

29.Proposal: “Reuse” Relational Database Traditional SQL Identify Version Aggregates Across Versions Checkout Commit Diff Translate and Rewrite Query Optimize Performance H ow data is stored 29