Traditional disks are mechanical HDDs, and memory's random access is thousands of times faster than an HDD's. As hardware advances and becomes widespread, we now have storage with fast random access, such as SSDs and PCI-E SSDs, and even Intel's non-volatile devices that approach memory speed. This leaves the random-access algorithms of traditional databases with plenty of room for optimization. This article presents many such optimization strategies, including ones based on the LSM algorithm.

Transcript

1. Optimizing Database Algorithms for Random-Access Block Devices. Risi Thonangi, PhD defense talk; advisor: Jun Yang.

2. Background: Hard-Disk Drives. Hard disk drives (HDDs) use magnetic platters for storage and mechanical moving parts for data access. I/O characteristics: fast sequential access, slow random access, and read/write symmetry. Disadvantages: slow random access, high energy costs, and poor shock absorption.

3. Background: Newer Storage Technologies. Flash memory, phase-change memory, ferroelectric RAM, and MRAM solve HDDs' disadvantages of high energy usage, poor shock absorption, etc. Representative: flash memory, which uses floating-gate transistors for storage and electronic circuits for data access. I/O characteristics: fast random access but expensive writes. Other technologies (phase-change memory, etc.) exhibit similar advantages and I/O characteristics.

4. Random-Access Block Devices. Devices that are block based and support fast random access but have costlier writes. A popular example is solid-state drives (SSDs), which use flash memory for storage, inherit its advantages, and are quickly replacing HDDs in consumer and enterprise storage. Other examples: cloud storage, key-value-store-based storage, and phase-change memory.

5. Motivation & Problem Statement. Can't we use a random-access block device as a drop-in replacement for an HDD? It does lead to better performance, but can be suboptimal because existing algorithms are not optimized for it. Problem statement: optimize database algorithms for the I/O characteristics of random-access block devices, namely fast random access and read/write asymmetry.

6. Related Work. For SSDs: indexes (BFTL, FlashDB, LA-tree, FD-tree, etc.); query processing techniques (PAX-style storage, FlashJoin, B-File for maintaining samples); transaction processing (in-page logging, log-structured storage with optimistic concurrency control, etc.). We propose new techniques and algorithms that systematically exploit both fast random access and read/write asymmetry.

7. Outline: the permutation problem; merge policies for the Log-Structured Merge (LSM) tree; concurrency control in indexes; conclusion & future work. [VLDB 2013] "Permuting Data on Random-Access Block Storage", Risi Thonangi, Jun Yang. [CIKM 2012] "A Practical Concurrent Index for Solid-State Drives", Risi Thonangi, Shivnath Babu, Jun Yang. [Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage", Risi Thonangi, Jun Yang.

8. Permutation Problem. Permutation: reorder records such that the output address of a record is a function of its input address. Examples: converting between matrix layouts, resorting multi-dimensional aggregates, etc. For HDDs, external merge sort is the popular way to permute: augment each record with its output address, and sort by it. The # of passes is logarithmic in the input size, and the # of block I/Os is $O\big(\frac{N}{B}\log_{M/B}\frac{N}{B}\big)$, where $N$, $M$, $B$ are the input, memory, and block sizes (all in # of records). Shortcomings of external merge sort: augmentation creates larger records, and it does not exploit the structure of the permutation.
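
To make the sort-based baseline concrete, here is a minimal in-memory sketch in Python: each record is augmented with its output address, and the pairs are sorted by that address. In a real system the sort would be an external merge sort over disk blocks; `permute_by_sorting` and `out_addr` are hypothetical names used only for illustration.

```python
# In-memory sketch of permutation via sorting: augment each record with
# its output address, then sort. A real implementation would run an
# external merge sort that spills sorted runs to disk.

def permute_by_sorting(records, out_addr):
    """Reorder `records` so that record i lands at position out_addr(i)."""
    augmented = [(out_addr(i), rec) for i, rec in enumerate(records)]
    augmented.sort(key=lambda pair: pair[0])   # the (external) sort step
    return [rec for _, rec in augmented]

# Example: reverse a 4-record file (output address = N - 1 - i).
print(permute_by_sorting(list("abcd"), lambda i: 3 - i))  # ['d','c','b','a']
```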

9. A Naïve Algorithm. Just write out records in output order, and read input blocks as needed. This needs only two blocks of memory: one for writing, one for reading. But it incurs too many block reads, because the records needed next are scattered across input blocks. This is much worse than sorting; in fact, general permutation is as hard as sorting [Vitter 2001].

10. Address Space. A record address is encoded as an $n$-digit number with radices $\langle r_1, \ldots, r_n \rangle$; the (0-based) sequence # of the record within the file is the value of its digits in this mixed-radix system. Example: an address space with radices $\langle 2, 4, 3 \rangle$ has 3-digit addresses and $2 \times 4 \times 3 = 24$ records. (The slide lays out all 24 addresses digit by digit.)
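
A small sketch of the mixed-radix addressing described above, assuming the most significant digit comes first; `seq_to_digits` and `digits_to_seq` are illustrative helper names, not from the talk.

```python
# Convert a record's 0-based sequence number to its digit address and
# back, for mixed radices such as <2, 4, 3> (24 records, 3-digit addresses).

def seq_to_digits(seq, radices):
    """Most-significant digit first: seq = ((d1*r2)+d2)*r3 + d3, etc."""
    digits = []
    for r in reversed(radices):
        digits.append(seq % r)
        seq //= r
    return list(reversed(digits))

def digits_to_seq(digits, radices):
    seq = 0
    for d, r in zip(digits, radices):
        seq = seq * r + d
    return seq

assert seq_to_digits(17, [2, 4, 3]) == [1, 1, 2]
assert digits_to_seq([1, 1, 2], [2, 4, 3]) == 17
```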

11. ADP (Address-Digit Permutation). A permutation of records is defined by a permutation of the address digits. Example: in the address space with radices $\langle 2, 4, 3 \rangle$, the digit permutation that rearranges input digits $\langle d_1, d_2, d_3 \rangle$ into output digits $\langle d_2, d_3, d_1 \rangle$ yields an output address space with radices $\langle 4, 3, 2 \rangle$. (The slide shows the induced permutation in record space.)
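
A sketch of how an ADP acts on a single address, using the example above; the 0-based `perm` encoding (output digit j is input digit perm[j]) is an assumption made for illustration.

```python
# An ADP as a permutation of address digits. perm = [1, 2, 0] sends input
# digits <d1, d2, d3> to output digits <d2, d3, d1>, and input radices
# <2, 4, 3> to output radices <4, 3, 2>.

def adp_output_address(in_digits, perm):
    return [in_digits[k] for k in perm]

def adp_output_radices(in_radices, perm):
    return [in_radices[k] for k in perm]

# The record at input address (1, 3, 2) moves to output address (3, 2, 1).
assert adp_output_address([1, 3, 2], [1, 2, 0]) == [3, 2, 1]
assert adp_output_radices([2, 4, 3], [1, 2, 0]) == [4, 3, 2]
```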

12. Single-Radix ADP. For simplicity of this presentation, assume all digits have the same radix $r$, so the input size is $N = r^n$; the memory size is $M = r^m$ (# of records) and the block size is $B = r^b$ (# of records), so the $b$ low-order digits of an address lie within the block boundary. The thesis presents the general case, with mixed radices and sizes that are not powers of the radix.

13. A Really Simple ADP. Suppose no digits cross the block boundary, i.e., no record needs to move outside its own block. Then one pass, with one memory block, will do! How do we generalize? Carefully select a memory-full of "action records" at a time, so that they are clustered in blocks in both the input and the output. In this simple ADP, the records in each block form one group of action records.

14. Basic One-Pass ADP: Selecting "Action Digits". Choose a set $A$ of action digits: the in-digits (those within the block boundary in the input address space), plus the entering digits (out-digits in the input that become in-digits in the output). Denote the non-action digits by $\bar{A}$. A group of action records = those whose input addresses share the same setting for the digits in $\bar{A}$.

15. Basic One-Pass ADP: Algorithm. For each possible setting of $\bar{A}$: read the group of action records sharing that setting (they are clustered in input blocks); permute them in memory; write the action records to output blocks (where they are again clustered). Memory requirement: $r^{(\#\text{entering digits})}$ blocks. (The slide illustrates the grouping on binary addresses.)
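
The following Python sketch mimics the one-pass algorithm in memory, under the digit model above. It is illustrative only: a real implementation would issue whole-block reads and writes, whereas here the "blocks" are just index ranges, and all names (`one_pass_adp`, `action`, the `perm` encoding) are hypothetical.

```python
# One-pass ADP sketch: one outer iteration per setting of the non-action
# digits = one group of action records, read together, permuted in memory,
# and written together.
import itertools

def _seq(digits, radices):
    s = 0
    for d, r in zip(digits, radices):
        s = s * r + d
    return s

def one_pass_adp(records, radices, perm, action):
    """perm[j] = k: output digit j is input digit k. `action` lists the
    action-digit positions (in-digits plus entering digits)."""
    n = len(radices)
    rest = [j for j in range(n) if j not in action]   # non-action digits
    out = [None] * len(records)
    for setting in itertools.product(*(range(radices[j]) for j in rest)):
        # One group: all input addresses agreeing with `setting` on `rest`.
        for combo in itertools.product(*(range(radices[j]) for j in action)):
            addr = [0] * n
            for j, d in zip(rest, setting):
                addr[j] = d
            for j, d in zip(action, combo):
                addr[j] = d
            src = _seq(addr, radices)
            dst = _seq([addr[k] for k in perm], [radices[k] for k in perm])
            out[dst] = records[src]
    return out

# Example: radices <2,4,3>, digit permutation <d2,d3,d1> (0-based [1,2,0]).
# With B = 3 records, the in-digit is position 2 and the entering digit is
# position 0, so action = [0, 2].
result = one_pass_adp(list(range(24)), [2, 4, 3], [1, 2, 0], action=[0, 2])
```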

16. More Entering Digits Than We Can Handle? Exploit the read/write asymmetry. Filtered reads: allow a block to be read multiple times in one "pass". Or use multiple passes: perform a series of simpler permutations whose composition is the desired permutation. Problem: given the input data and a permutation, find a "plan", possibly involving multiple passes and/or filtered reads, that minimizes cost. A cost-based optimizer determines the passes; the thesis presents a provably optimal algorithm, which "balances" the aggressiveness of filtered reads across passes.

17. Permuting vs. Sorting for ADP. Comparing sorting with our algorithm (without filtered reads): in either case, each pass performs $N/B$ block reads and $N/B$ block writes. Sorting takes $O(\log_{M/B}(N/M))$ passes, which depends on $N$: bigger input means more passes. We take $\lceil e/(m-b) \rceil$ passes, where $e$ is the # of entering digits; this depends on the permutation, but not on $N$. Note that $e \le b$ (# of in-digits), so with practical configurations we can complete any ADP in fewer than two passes, no matter how big the input is! (Filtered reads further exploit read/write asymmetry to lower the cost.)
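
A back-of-envelope comparison of the two pass counts, under the assumptions above (single radix, $e$ entering digits), ignoring the initial run-formation pass and constant factors; the formulas as coded are my reading of the slide, not an exact transcription.

```python
# Rough pass counts: sorting needs ~ceil(log_{M/B}(N/M)) merge passes,
# growing with N, while the ADP algorithm needs ceil(e / (m - b)) passes,
# independent of N.
import math

def sort_passes(N, M, B):
    return max(1, math.ceil(math.log(N / M, M / B)))

def adp_passes(entering_digits, m, b):
    return math.ceil(entering_digits / (m - b))

# e.g. N = 2**30 records, M = 2**20, B = 2**10 (so m = 20, b = 10):
print(sort_passes(2**30, 2**20, 2**10))  # 1 merge pass at this size; grows with N
print(adp_passes(10, 20, 10))            # 1 pass even in the worst case e = b
```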

18. Experiment: # of Passes. A 5-attribute dataset adapted from TPC-W; we consider all possible ADPs of the 5-digit address space, i.e., all ways to resort the data, and show the distribution of the # of passes needed as the data size increases. Sorting's passes increase with data size; our ADP algorithm never takes more than two passes! (The slide plots the distribution of # passes for ADP vs. sorting at each data size.)

19. Conclusion. Introduced address-digit permutations (ADPs), which capture many useful data reorganization tasks. Designed algorithms for ADPs on random-access block storage that exploit fast random access and read/write asymmetry, beating sort! Results not covered in this talk: optimizations that read/write larger runs of blocks; mixed radices and memory/block sizes that are not powers of the radices; more experiments, including permuting data stored on SSDs and Amazon S3.

20. Outline: the permutation problem; merge policies for the Log-Structured Merge (LSM) tree; concurrency control in indexes; conclusion & future work. [VLDB 2013] "Permuting Data on Random-Access Block Storage", Risi Thonangi, Jun Yang. [CIKM 2012] "A Practical Concurrent Index for Solid-State Drives", Risi Thonangi, Shivnath Babu, Jun Yang. [Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage", Risi Thonangi, Jun Yang.

21. Background: LSM-tree. An index structure with levels L0, L1, ..., Ln-1 of geometrically increasing sizes: L0 is always in memory; L1, ..., Ln-1 are tightly packed B-trees. Processing updates: store them in L0; if there is no space in Li, invoke a merge between Li and Li+1, moving data from smaller to larger levels. Features: high update throughput and fast access to recent data.
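
For readers unfamiliar with the structure, here is a minimal, illustrative LSM-tree in Python: an in-memory L0 plus sorted levels with geometrically growing capacities, merging downward on overflow. It is a toy sketch (linear-scan lookups, whole-level merges), not the talk's implementation; all names are hypothetical.

```python
# Toy LSM-tree: L0 is an in-memory dict; lower levels are sorted lists of
# (key, value) pairs with capacities growing by `fanout`.

class TinyLSM:
    def __init__(self, l0_cap=4, fanout=10, levels=3):
        self.l0_cap = l0_cap
        self.caps = [l0_cap * fanout**i for i in range(1, levels)]
        self.memtable = {}                       # L0
        self.levels = [[] for _ in self.caps]    # L1 .. Ln-1

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.l0_cap:    # L0 full: merge into L1
            self._merge_down(0, sorted(self.memtable.items()))
            self.memtable = {}

    def _merge_down(self, i, run):
        merged = dict(self.levels[i])
        merged.update(run)                       # the incoming run is newer
        self.levels[i] = sorted(merged.items())
        if len(self.levels[i]) > self.caps[i] and i + 1 < len(self.levels):
            self._merge_down(i + 1, self.levels[i])
            self.levels[i] = []

    def get(self, key):
        # Newest data first: L0, then smaller (more recent) levels.
        if key in self.memtable:
            return self.memtable[key]
        for lvl in self.levels:
            for k, v in lvl:
                if k == key:
                    return v
        return None
```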

22. Motivation. Existing LSM-tree implementations optimize for HDDs: they minimize random access. Variants are also popular on SSDs, since the LSM-tree avoids in-place updates. Can we do better? Optimize the LSM-tree for random-access block devices: minimize writes, and understand and improve merge policies.

23. Step 1: Modify the LSM-tree to Preserve Blocks. Structure: L0 remains the same; each persistent level Li becomes a list of data blocks, with pointers to the blocks stored in a B-tree. The data blocks are not required to be sequentially collocated and are not necessarily full. Operations: Block-Preserving Merge (BPM) merges Li into Li+1 while preserving the untouched blocks of the input.
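
A rough sketch of the block-preserving idea, assuming each level is a list of sorted key blocks: only blocks whose key range overlaps the incoming run are rewritten; the rest are kept as-is (their pointers would simply stay in the level's B-tree). `BLOCK` and `bpm` are hypothetical names, and the overlap test is deliberately simplified.

```python
# Block-Preserving Merge sketch: rewrite only the blocks of L_{i+1} whose
# key range overlaps the incoming run; preserve all other blocks untouched.
BLOCK = 4  # hypothetical block capacity, in records

def bpm(incoming, level_blocks):
    """incoming: sorted list of keys; level_blocks: list of sorted key lists."""
    lo, hi = incoming[0], incoming[-1]
    keep, touch = [], []
    for blk in level_blocks:
        (touch if blk[0] <= hi and blk[-1] >= lo else keep).append(blk)
    merged = sorted(set(incoming) | {k for blk in touch for k in blk})
    rewritten = [merged[i:i + BLOCK] for i in range(0, len(merged), BLOCK)]
    return sorted(keep + rewritten, key=lambda blk: blk[0])

# Only the middle block is rewritten; [1,2] and [9,10] are preserved.
print(bpm([5, 6], [[1, 2], [4, 7], [9, 10]]))
```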

24. Step 2: Understand Merge Policies. Full policy: merge all of Li's blocks into Li+1. Round-Robin (RR) policy: merge a fraction $\delta$ of Li's blocks into Li+1, selecting the blocks to merge in round-robin fashion. It seems RR simply spreads the work of Full over time ($\delta$ fraction at a time); would it actually be better?
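
A sketch of the RR selection step, assuming a wrap-around cursor per level; the names are illustrative.

```python
# Round-Robin policy sketch: each merge takes the next delta-fraction of
# L_i's blocks, tracked by a cursor that wraps around the level.

def rr_pick(level_blocks, cursor, delta):
    k = max(1, int(delta * len(level_blocks)))
    picked = [level_blocks[(cursor + j) % len(level_blocks)] for j in range(k)]
    return picked, (cursor + k) % len(level_blocks)

blocks = ["b0", "b1", "b2", "b3", "b4"]
picked, cursor = rr_pick(blocks, 0, 0.4)   # -> ['b0', 'b1'], cursor 2
```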

25. Full vs. RR: Performance. A sample experiment compares Full and RR: 3 levels with L0 = 1MB and fanout = 10, a uniform workload, in the steady state. RR is surprisingly better than Full: it always selects from a high-density region, and the merge process sustains this behavior for uniform workloads. However, RR's worst-case behavior can be bad if the distribution is not friendly.

26. Merge Policies: Partial Policy. Merge a portion of Li's blocks into Li+1, selecting the best portion, i.e., the candidate range with the fewest overlapping blocks in Li+1, so as to minimize I/O cost; this selection can be computed efficiently. (The slide contrasts this choice with RR's round-robin selection.)
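
A brute-force sketch of Partial's selection, assuming blocks are summarized by their (min, max) key ranges; the actual policy computes this more cleverly, and all names here are illustrative.

```python
# Partial policy sketch: among candidate k-block windows of L_i, pick the
# one whose key range overlaps the fewest L_{i+1} blocks, since those are
# the blocks the merge would have to rewrite.

def overlaps(rng, blk):                 # key ranges as (lo, hi) tuples
    return blk[0] <= rng[1] and blk[1] >= rng[0]

def partial_pick(upper, lower, k):
    """upper/lower: lists of (lo, hi) block ranges; pick a k-block window."""
    best, best_cost = 0, float("inf")
    for s in range(len(upper) - k + 1):
        rng = (upper[s][0], upper[s + k - 1][1])
        cost = sum(overlaps(rng, blk) for blk in lower)
        if cost < best_cost:
            best, best_cost = s, cost
    return upper[best:best + k], best_cost
```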

27.Partial’s performance guarantee In the worst case : RR: each merge can write all blocks in L i+ 1 Partial: each merge can write no more than blocks in L i+1 is maximum allowed size of L i+1  

28. Step 3: An Even Better Merge Policy? Observation: internal levels are almost always full under the RR and Partial policies. High-occupancy levels have more "resistance": they are costlier to merge into, leading to more overlapping blocks during a merge. RR and Partial are too greedy. Idea: gain long-term savings by applying Full now and then?

29. Towards the Mixed Policy: Feasibility Study. An LSM-tree with 3 levels, L0 = 1MB, fanout = 10, and a merge policy 'Test' such that L0-to-L1 merges are always Partial and L1-to-L2 merges are always Full. Test beats Partial in some cases. Example: per-level costs for index size = 20MB under a uniform distribution. At L2, Test is slightly worse than Partial; at L1, Test outperforms Partial by a large margin; the total savings make Test much better than Partial. Lesson: apply the Full policy when Li+1 is small; the extra cost of the Full merge is offset by future savings at the upper levels.

30. Towards the Mixed Policy. The threshold for choosing between the Full and Partial policies depends on the workload. (The slide plots the thresholds under uniform and normal distributions.)

31. Mixed Policy (for the steady-state case only). Uppermost merge: always Partial, because L0 is always in memory. Internal-level merges: Full if Li+1 is smaller than a threshold $\theta_{i+1}$ (e.g., Full if $|L_2| < \theta_2$), Partial otherwise. Lowermost merge: Full if $C_{Full} < C_{Partial}$, Partial otherwise, where $C_{Partial}$ and $C_{Full}$ are the costs of index maintenance under the Partial and Full policies, respectively. The parameters are learned with an experiment-based learning algorithm.
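
The decision rule, as I read it from the slide, in a few lines of Python; `theta` and the two cost estimates stand in for the learned parameters, and the function name is hypothetical.

```python
# Mixed policy decision rule: internal merges go Full while the target
# level is still small (below a learned threshold), otherwise Partial;
# the bottom-most merge compares estimated maintenance costs.

def choose_policy(level, size, theta, is_bottom=False,
                  cost_full=None, cost_partial=None):
    if level == 0:
        return "partial"          # uppermost merge: L0 is in memory
    if is_bottom:
        return "full" if cost_full < cost_partial else "partial"
    return "full" if size < theta[level] else "partial"
```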

32. Results. Settings: L0 = 1MB, fanout = 10, uniform distribution. Experiment varying index size: Mixed is the overall winner; RR is close to Partial; there is a sharp improvement for the Full and Mixed policies when the number of levels increases (around 100MB, going from 3 to 4 levels). Experiment varying record size: Block-Preserving Merge can fetch considerable savings.

33. Conclusion. Optimized the LSM-tree structure for random-access block devices: Block-Preserving Merge saves blocks during merges. Studied the performance of the Full, RR, and Partial merge policies; RR is surprisingly good. Introduced the Mixed policy for more I/O savings.

34. Outline: the permutation problem; merge policies for the Log-Structured Merge (LSM) tree; concurrency control in indexes; conclusion & future work. [VLDB 2013] "Permuting Data on Random-Access Block Storage", Risi Thonangi, Jun Yang. [CIKM 2012] "A Practical Concurrent Index for Solid-State Drives", Risi Thonangi, Shivnath Babu, Jun Yang. [Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage", Risi Thonangi, Jun Yang.

35. Indexing Schemes for SSDs. The B+tree is suboptimal for SSDs: insertions/deletions require in-place updates. Rule of thumb for index design on SSDs: avoid small in-place updates; buffer insertions and deletions, and batch-reorganize the index when the buffer overflows. However, the indexing schemes proposed for SSDs (BFTL, LA-tree, FD-tree, ...) have bad response times during index reorganizations.

36. FD-tree. Index structure: the logarithmic method (levels with geometrically increasing sizes, top level cached in memory) plus fractional cascading (pointers between levels). Processing updates: store them in L0; if L0 is full, merge L0 into L1; continue merging into lower levels until all levels are within their size limits, stopping at the first level that is within its limit. The FD-tree is not designed for efficient concurrent access during a merge!
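
A toy sketch of the cascading merge that FD+tree later improves on, assuming distinct keys and one sorted list per level; the names are illustrative.

```python
# Cascading merge sketch: when a level overflows, merge it into the next
# level, and continue downward until every level is within its size limit.
# (FD+tree instead computes the number of levels to merge up front and
# merges them in a single shot.)

def cascade_merge(levels, limits):
    """levels: list of sorted key lists; limits[i]: max size of level i."""
    for i in range(len(levels) - 1):
        if len(levels[i]) <= limits[i]:
            break                                # within limit: stop merging
        levels[i + 1] = sorted(levels[i + 1] + levels[i])
        levels[i] = []
    return levels
```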

37. FD+tree. A modified and improved version of the FD-tree that allows designing an efficient concurrency scheme, with other advantages: deletion support, level skipping, and tighter performance guarantees. FD+tree's merge calculates, in advance, the number of levels to merge, and merges all of them in a single shot. Advantages: fewer write I/Os than the FD-tree, and a valid index structure is maintained throughout the merge, which is useful for concurrent access.

38. FD+FC: Full Concurrency for FD+tree. Goals: don't use extra space, and avoid coarse-grained locking of the tree. Idea: maintain a wavefront that tracks the progress of the merge; delete blocks in the old levels that have already been merged; lookups check the wavefront to determine which levels to search.
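
A sketch of how a lookup might consult the wavefront, assuming the merge processes keys in ascending order; `search` stands in for the usual top-down level search, and all naming here is hypothetical.

```python
# Wavefront sketch: keys below the wavefront have already been moved to
# the new levels; keys at or above it are still valid in the old levels.

def lookup(key, wavefront, new_levels, old_levels, search):
    if key < wavefront:
        return search(key, new_levels)    # already merged
    return search(key, old_levels)        # the merge hasn't reached it yet
```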

39. FD+FC (contd.). Technical challenges: the merge implementation ("record at a time" vs. "block at a time") and the reclamation of blocks from the old levels. A block may be reclaimed only after the next block has started being processed; this ensures all children of the block are done being processed by the merge. (The slide's example: over fence keys 11 13 18 25 29 39 and child keys 20 22 27 28 30 42, reclaiming too early would orphan key 28.)

40. FD+tree & FD+FC: More Details. Level skipping utilizes main memory more efficiently. Proper deletion support plus stronger performance guarantees. Merges do not read-lock blocks.

41. Experimental Results. FD+FC vs. FD+XM (global lock) and FD+DS (space doubling), on synthetic workloads: FD+FC is better for both inserts and lookups; FD+DS fares poorly for inserts.

42. Conclusion. Concurrency control is important for SSD indexes. Good concurrency control requires both carefully rethinking the index operations (FD+tree) and designing fine-grained but low-overhead protocols (FD+FC).

43. Outline: the permutation problem; merge policies for the Log-Structured Merge (LSM) tree; concurrency control in indexes; conclusion & future work. [VLDB 2013] "Permuting Data on Random-Access Block Storage", Risi Thonangi, Jun Yang. [CIKM 2012] "A Practical Concurrent Index for Solid-State Drives", Risi Thonangi, Shivnath Babu, Jun Yang. [Tech report 2015] "Optimizing LSM-tree for Random-Access Block Storage", Risi Thonangi, Jun Yang.

44. Conclusion & Future Work. Studied optimizations to database algorithms for random-access block devices: using them as drop-in replacements is good but suboptimal; optimizing database algorithms for fast random access and read/write asymmetry can fetch considerably more savings, in both cost and performance. Future work: optimizing for multi-channel parallelism in random-access block devices; utilizing on-disk computing resources for data-processing tasks; optimizing for phase-change RAM (where should PC-RAM fit in the storage architecture?); and exploring how high to push these specializations in the system architecture. We've shown the benefit of specializing access methods and query-processing algorithms; what about query optimization?

45. Thank you.

46. Concurrency Schemes: Some Proposals. Exclusive merge (FD+XM): lock the complete tree for every operation (including a long merge); simple, but offers little concurrency benefit. Doubling space (FD+DS): don't delete the old levels until the merge completes; incurs twice the space cost and inefficient main-memory utilization, and some concurrency control is still needed.

47. Experimental Results. Synthetic workloads: FD+FC is better for both inserts and lookups; FD+DS fares poorly for inserts. TPC-C-like workloads: FD+FC's worst-case response time is less than a second; FD+DS's worst-case response time increases as the size of the database grows.

48. Results. As the index size grows, the Mixed policy is at least as good as Partial: an extra level with lower occupancy is better, even when the index size is larger (shown for uniform and normal distributions). Block-Preserving Merge is very efficient as the index record size increases (uniform distribution).

49. (Recap) Modified the LSM-tree to save blocks during merges; merge policies: Full, Round-Robin, Partial; the Mixed merge policy, combining Full and Partial for increased I/O savings; results; conclusion.

50. Indexing Schemes for SSDs. The B+tree is suboptimal for SSDs: insertions/deletions require in-place updates (the slide illustrates inserting key 7 into a node with keys 4 and 8 over leaves 1 3 / 4 6 / 8 9). Rule of thumb for index design on SSDs: avoid small in-place updates; buffer insertions and deletions, and batch-reorganize the index. Indexing schemes proposed for SSDs: BFTL, LA-tree, FD-tree, PIO B-tree, ...
