Traditional disks are mechanical HDDs, and memory's random access is thousands of times faster than an HDD's. As hardware has advanced and spread, fast random-access storage has become common: SSDs, PCIe SSDs, and even Intel's non-volatile devices that approach memory speed. Traditional database algorithms built around disk random access therefore leave much room for optimization. This talk presents many such optimization strategies, including ones built on the LSM-tree.


1. Optimizing Database Algorithms for Random-Access Block Devices
Risi Thonangi, PhD Defense Talk. Advisor: Jun Yang

2. Background: Hard-Disk Drives
- Hard Disk Drives (HDDs): magnetic platters for storage, mechanical moving parts for data access
- I/O characteristics: fast sequential access and slow random access; read/write symmetry
- Disadvantages: slow random access, high energy costs, bad shock absorption

3. Background: Newer Storage Technologies
- Flash memory, phase-change memory, ferroelectric RAM, and MRAM solve HDD's disadvantages of high energy usage, bad shock absorption, etc.
- Representative: flash memory. Floating-gate transistors for storage; electronic circuits for data access
- I/O characteristics: fast random access, but expensive writes
- Other technologies (phase-change memory, etc.) exhibit similar advantages and I/O characteristics
[Figure: USB flash drive]

4. Random-Access Block Devices
- Devices that are block based, support fast random access, but have costlier writes
- Popular example: Solid-State Drives (SSDs). They use flash memory for storage, inherit flash memory's advantages, and are quickly replacing HDDs in consumer & enterprise storage
- Other examples: cloud storage & key-value-store-based storage; phase-change memory

5. Motivation & Problem Statement
- Can't we use a random-access block device as a drop-in replacement for an HDD? It does lead to better performance, but can be suboptimal because existing algorithms are not optimized for it
- Problem statement: optimize database algorithms for the I/O characteristics of random-access block devices, namely fast random access and read/write asymmetry

6. Related Work
- For SSDs:
  - Indexes: BFTL, FlashDB, LA-tree, FD-tree, etc.
  - Query processing techniques: PAX-style storage & FlashJoin; B-File for maintaining samples
  - Transaction processing: In-Page Logging, log-structured storage & optimistic concurrency control, etc.
- We propose new techniques & algorithms that systematically exploit both random access and read/write asymmetry

7. Outline
- Permutation problem
- Merge policies for the Log-Structured Merge (LSM) tree
- Concurrency control in indexes
- Conclusion & future work

References:
- [VLDB 2013] "Permuting Data on Random-Access Block Storage". Risi Thonangi, Jun Yang
- [CIKM 2012] "A Practical Concurrent Index for Solid-State Drives". Risi Thonangi, Shivnath Babu, Jun Yang
- [Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage". Risi Thonangi, Jun Yang

8. Permutation Problem
- Permutation: reorder records such that the output address of a record is a function of its input address
- Examples: converting between matrix layouts, resorting multi-dimensional aggregates, etc.
- For HDDs, external merge sort is popular for permutation: augment each record with its output address, and sort by it
  - # passes is logarithmic in input size, roughly log_{M/B}(N/B)
  - # block I/Os is roughly (N/B) log_{M/B}(N/B), where N, M, B are the input, memory, and block sizes (all in # records)
- External merge sort's shortcomings: augmentation creates larger records, and it does not exploit the structure of the permutation
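To make the definition concrete, here is a small sketch (our own illustration; the function names are ours) of one permutation named on the slide, converting a matrix between row-major and column-major layouts, together with the HDD-era augment-and-sort approach:

```python
# A permutation reorders records so that each record's output address is a
# pure function of its input address.

def row_to_col_major(addr, rows, cols):
    """Output address of the record at input address `addr` when converting
    a rows x cols matrix from row-major to column-major layout."""
    r, c = divmod(addr, cols)      # position in the row-major layout
    return c * rows + r            # position in the column-major layout

def permute_by_sorting(records, out_addr):
    """External-merge-sort style permutation: augment each record with its
    output address, then sort by it (here, simply with in-memory sort)."""
    augmented = sorted((out_addr(a), rec) for a, rec in enumerate(records))
    return [rec for _, rec in augmented]

matrix = [1, 2, 3,
          4, 5, 6]                 # 2x3 matrix, row-major
col_major = permute_by_sorting(matrix, lambda a: row_to_col_major(a, 2, 3))
```

Note how the augmented pairs are larger than the records themselves, which is exactly the shortcoming the slide points out.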

9. A Naïve Algorithm
- Just write out records in output order, and read as needed
- Needs two blocks of memory: one for writing, one for reading
- Too many block reads (up to one per record), because the records needed for each output block are scattered across input blocks
- Much worse than sorting; in fact, general permutation is as hard as sorting [Vitter 2001]
[Figure: memory state while scanning the input and producing the output]

10. Address Space
- A record address is encoded by a k-digit number whose digits have radices n_1, …, n_k
- The (0-based) sequence # of the record within the file is the mixed-radix value of its address digits: sum over i of d_i · (n_{i+1} ⋯ n_k)
- Example: a 3-digit address space with radices (2, 4, 3), addressing 2·4·3 = 24 records
[Figure: the 24 records laid out with their 3-digit addresses]
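The mixed-radix encoding can be sketched as follows (our illustration; `seq_number` and `to_digits` are our names), using the slide's (2, 4, 3) example:

```python
def seq_number(digits, radices):
    """0-based sequence # of the record whose address has the given digits
    (most-significant first), under the given per-digit radices."""
    seq = 0
    for d, r in zip(digits, radices):
        assert 0 <= d < r
        seq = seq * r + d          # Horner-style mixed-radix evaluation
    return seq

def to_digits(seq, radices):
    """Inverse: recover the address digits from a sequence #."""
    digits = []
    for r in reversed(radices):
        seq, d = divmod(seq, r)
        digits.append(d)
    return digits[::-1]

# The slide's example: a 3-digit address space with radices (2, 4, 3),
# covering 2*4*3 = 24 records. Address (1, 2, 0) is record # 1*12 + 2*3 = 18.
```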

11. ADP (Address-Digit Permutation)
- A permutation of records is defined by a permutation of the address digits
- Example: in the address space above with radices (2, 4, 3), rewriting each input address d_1 d_2 d_3 as output address d_2 d_3 d_1 gives an output address space with radices (4, 3, 2), and induces a corresponding permutation in record space
[Figure: input and output address spaces, and the induced permutation in record space]
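A compact sketch of how an ADP maps sequence numbers (our illustration; `adp` is our name), checked against the slide's (2, 4, 3) → (4, 3, 2) example:

```python
def adp(seq, radices, perm):
    """Map a record's input sequence # to its output sequence # under the
    address-digit permutation `perm`, where output digit i is input digit
    perm[i] (0-based, most-significant first)."""
    # decompose seq into input digits
    digits = []
    for r in reversed(radices):
        seq, d = divmod(seq, r)
        digits.append(d)
    digits = digits[::-1]
    # rearrange digits, carrying their radices along
    out_digits = [digits[p] for p in perm]
    out_radices = [radices[p] for p in perm]
    out = 0
    for d, r in zip(out_digits, out_radices):
        out = out * r + d
    return out

# Slide's example: radices (2, 4, 3), digit permutation d1 d2 d3 -> d2 d3 d1,
# i.e., perm = [1, 2, 0], giving output radices (4, 3, 2).
```

Since `perm` only rearranges digits, the mapping is a bijection on the 24 sequence numbers, which the test below confirms.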

12. Single-Radix ADP
- For simplicity of this presentation, assume all digits have radix r, so the input holds N = r^k records
- Memory size M = r^m records; block size B = r^b records
- Presented in the thesis: the general case with mixed radices, and memory/block sizes that are not powers of r
[Figure: a k-digit address with the low-order b digits inside the block boundary]

13. A Really Simple ADP
- Suppose no digits cross the block boundary, i.e., no record needs to move outside its own block
- One pass, with one memory block, will do!
- Generalize? Carefully select a memory-full of "action records" at a time, so they are clustered in blocks in both input and output
- In the simple ADP above, the records in each block form a group of action records
[Figure: input and output files, permuted block by block]

14. Basic One-Pass ADP: Selecting "Action Digits"
- Choose the action digits:
  - In-digits: those within the block boundary in the input address space, plus
  - Entering digits: out-digits in the input that become in-digits in the output
- A group of action records = those whose input addresses share the same setting for the non-action digits

15. Basic One-Pass ADP: Algorithm
- For each possible setting of the non-action digits: read the group of action records sharing that setting; permute them in memory; write the action records to output blocks
- Each group is clustered in input blocks and, once permuted, clustered in output blocks
- Memory requirement: B · r^(# entering digits) records
[Figure: groups of addresses, clustered in input blocks and in output blocks]
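The one-pass algorithm can be sketched as follows for the single-radix case with r = 2 (our simplified illustration, not the thesis implementation; it models "read a group, permute in memory, write out" directly on an in-memory array):

```python
def bits(x, n):
    """n bits of x, most-significant first."""
    return [(x >> (n - 1 - i)) & 1 for i in range(n)]

def from_bits(bs):
    v = 0
    for b in bs:
        v = (v << 1) | b
    return v

def out_addr(a, k, perm):
    """Output address: output digit i is input digit perm[i]."""
    ds = bits(a, k)
    return from_bits([ds[p] for p in perm])

def one_pass_adp(data, k, b, perm):
    """One-pass ADP over 2^k radix-2 records with blocks of 2^b records."""
    n = 1 << k
    in_digits = set(range(k - b, k))                     # low-order b digits
    entering = {perm[i] for i in range(k - b, k)} - in_digits
    action = sorted(in_digits | entering)
    non_action = [i for i in range(k) if i not in in_digits | entering]
    out = [None] * n
    # one group of action records per setting of the non-action digits
    for setting in range(1 << len(non_action)):
        sbits = bits(setting, len(non_action))
        for combo in range(1 << len(action)):            # "read" the group
            cbits = bits(combo, len(action))
            ds = [0] * k
            for pos, d in enumerate(non_action):
                ds[d] = sbits[pos]
            for pos, d in enumerate(action):
                ds[d] = cbits[pos]
            a = from_bits(ds)
            out[out_addr(a, k, perm)] = data[a]          # permute & write
    return out
```

With k = 4 and b = 1, the digit permutation [0, 3, 2, 1] (swap digits 2 and 4) has one entering digit, so each group holds B · 2^1 = 4 records.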

16. More Entering Digits Than We Can Handle?
- Exploit the read/write asymmetry. Filtered reads: allow a block to be read multiple times in one "pass"
- Use multiple passes: perform a series of simpler permutations whose composition is the desired permutation
- Problem: given input data and a permutation, find a "plan", possibly involving multiple passes and/or filtered reads, that minimizes cost
- A cost-based optimizer determines the passes. Presented in the thesis: a provably optimal algorithm, which "balances" the aggressiveness of filtered reads across passes

17. Permuting vs. Sorting for ADP
- Sorting vs. our algorithm (without using filtered reads): in either case, each pass performs N/B block reads and N/B block writes
- Sorting takes roughly log_{M/B}(N/B) passes, which depends on N: bigger input means more passes
- We take a number of passes that depends on the permutation (through its # of entering digits), but does not depend on N
- Note that # entering digits ≤ b, so with practical configurations we can complete any ADP in fewer than two passes no matter how big the input is! (Filtered reads further exploit r/w asymmetry to lower cost)

18. Experiment: # of Passes
- 5-attribute dataset adapted from TPC-W; consider all possible ADPs of the 5-digit address space, i.e., all ways to resort the data
- Show the distribution of # passes needed as we increase data size
- Sorting's # passes increases with data size; ADP's never does!
[Figure: distribution of # passes for ADP vs. SORTING across increasing data sizes]

19. Conclusion
- Introduced address-digit permutations (ADPs), capturing many useful data reorganization tasks
- Designed algorithms for ADPs on random-access block storage, exploiting fast random accesses and read/write asymmetry, and beating sort!
- Results not covered in this talk: optimizations that read/write larger runs of blocks; mixed radices and memory/block sizes that are non-powers of the radices; more experiments, including permuting data stored on SSDs & Amazon S3

20. Outline
- Permutation problem
- Merge policies for the Log-Structured Merge (LSM) tree
- Concurrency control in indexes
- Conclusion & future work

References:
- [VLDB 2013] "Permuting Data on Random-Access Block Storage". Risi Thonangi, Jun Yang
- [CIKM 2012] "A Practical Concurrent Index for Solid-State Drives". Risi Thonangi, Shivnath Babu, Jun Yang
- [Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage". Risi Thonangi, Jun Yang

21. Background: LSM-tree
- Index structure: levels L0, L1, …, Ln-1 with geometrically increasing sizes
  - L0: always in memory
  - L1, …, Ln-1: tightly packed B-trees
- Processing updates: store them in L0; if there is no space in Li, invoke a merge between Li and Li+1, moving data from smaller to larger levels
- Features: high update throughput; fast access to recent data
[Figure: levels L0, L1, L2, with data moving from smaller to larger levels]
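The update path described above can be sketched with a toy LSM-tree (our own minimal model; class and method names are ours, and levels are plain sorted lists rather than B-trees):

```python
FANOUT = 4   # geometric growth factor between levels
L0_CAP = 4   # capacity of the in-memory level L0, in records

class TinyLSM:
    def __init__(self, num_levels=3):
        self.l0 = {}                                      # L0: in memory
        self.levels = [[] for _ in range(num_levels)]     # L1..Ln: sorted runs
        self.caps = [L0_CAP * FANOUT ** (i + 1) for i in range(num_levels)]

    def put(self, key, val):
        self.l0[key] = val
        if len(self.l0) > L0_CAP:                         # L0 full: merge down
            run = sorted(self.l0.items())
            self.l0 = {}
            self._merge(0, run)

    def _merge(self, i, run):
        merged = dict(self.levels[i])
        merged.update(run)                                # newer entries win
        self.levels[i] = sorted(merged.items())
        if len(self.levels[i]) > self.caps[i] and i + 1 < len(self.levels):
            spill = self.levels[i]                        # cascade to Li+1
            self.levels[i] = []
            self._merge(i + 1, spill)

    def get(self, key):
        if key in self.l0:                                # recent data first
            return self.l0[key]
        for lvl in self.levels:
            for k, v in lvl:
                if k == key:
                    return v
        return None
```

Note this toy merges whole levels at once (the "Full" behavior); the merge policies the talk studies next refine exactly this step.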

22. Motivation
- Existing LSM-tree implementations optimize for HDDs: minimize random access
- Variants are also popular for SSDs, as the LSM-tree avoids in-place updates
- Can we do better? Optimize the LSM-tree for random-access block devices: minimize writes; understand and improve merge policies

23. Step 1: Modify LSM-tree to Preserve Blocks
- Structure: L0 remains the same; each persistent level Li is a list of data blocks, with pointers to the blocks stored in a B-tree
- Data blocks are not required to be sequentially collocated, and are not necessarily full
- Operations: Block-Preserving Merge (BPM) preserves untouched blocks from its input
[Figure: merging Li into Li+1, saving untouched blocks and merging the rest]

24. Step 2: Understand Merge Policies
- Full policy: merge all blocks to Li+1
- Round-Robin (RR) policy: merge a fraction δ of Li's blocks to Li+1, selecting the blocks to merge in round-robin fashion
- It seems RR simply spreads the work of Full over time (a δ fraction at a time). Would it be better?
[Figure: a δ fraction of Li merged into the overlapping fraction of Li+1]

25. Full vs. RR: Performance
- Sample experiment to compare Full and RR: 3 levels with L0 = 1MB & fanout = 10; uniform workload; steady-state case
- RR is surprisingly better than Full: it always selects from a high-density region, and the merge process sustains this behavior for uniform workloads
- Worst-case behavior for RR could be bad if the distribution is not friendly

26. Merge Policies: Partial Policy
- Merge a portion of blocks to Li+1, selecting the best portion: the one that minimizes I/O cost
- The best portion is the key range of Li with the fewest overlapping blocks in Li+1, and can be found efficiently
[Figure: round-robin selection vs. selecting the range with the fewest block overlaps]
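The Partial policy's selection step can be sketched as follows (our simplified model, not the thesis algorithm: blocks are (lo, hi) key ranges sorted by lo, and we scan every candidate run rather than searching incrementally):

```python
def overlaps(span, blocks):
    """# of blocks whose key range intersects `span`."""
    lo, hi = span
    return sum(1 for (l, h) in blocks if l <= hi and lo <= h)

def pick_merge_range(li_blocks, li1_blocks, delta):
    """Return (start, count): the run of ~delta*|Li| consecutive Li blocks
    whose key range overlaps the fewest Li+1 blocks, i.e., the cheapest
    portion to merge down."""
    count = max(1, int(delta * len(li_blocks)))
    best_start, best_cost = 0, float("inf")
    for start in range(len(li_blocks) - count + 1):
        run = li_blocks[start:start + count]
        span = (run[0][0], run[-1][1])        # key range covered by the run
        cost = overlaps(span, li1_blocks)     # Li+1 blocks to be rewritten
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start, count
```

In contrast, RR would take the next run regardless of cost; Partial pays a small search cost to minimize the number of Li+1 blocks rewritten.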

27. Partial's Performance Guarantee
- In the worst case:
  - RR: each merge can write all blocks in Li+1
  - Partial: each merge can write no more than on the order of δ·S blocks in Li+1, where S is the maximum allowed size of Li+1

28. Step 3: An Even Better Merge Policy?
- Observation: internal levels are almost always full under the RR & Partial policies
- High-occupancy levels have more "resistance": they are costlier to merge into, leading to more overlapping blocks during a merge
- RR and Partial are too greedy. Idea: gain long-term savings by applying Full now and then?

29. Towards a Mixed Policy
- Feasibility study: LSM-tree with 3 levels, L0 = 1MB & fanout = 10; uniform distribution; index size = 20MB
- Merge policy `Test': L0-to-L1 merges are always Partial; L1-to-L2 merges are always Full
- Test beats Partial in some cases. Example (costs per level at index size = 20MB): at L2, Test is slightly worse than Partial; at L1, Test outperforms Partial by a large margin; the total cost is much better than Partial's
- Takeaway: apply the Full policy when Li+1 is small; the extra cost of a Full merge will be offset by future savings at upper levels
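The takeaway above amounts to a simple decision rule, which we can sketch as a toy (our illustration; the `threshold` knob is a hypothetical parameter, not a value from the talk):

```python
# Mixed-policy idea: use Full while the target level is small relative to its
# capacity, so the occasional Full merge keeps occupancy (and hence merge
# "resistance") low; otherwise fall back to Partial.

def choose_policy(level_size, level_capacity, threshold=0.5):
    """Return 'full' while Li+1 occupancy is low, else 'partial'.
    `threshold` is a hypothetical tuning knob."""
    return "full" if level_size < threshold * level_capacity else "partial"
```

A real Mixed policy would base this decision on estimated I/O costs rather than a fixed occupancy threshold; the sketch only captures the shape of the rule.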