A Study of LSM-Tree

This talk covers: 1. The basics of LSM-Tree, including its file organization, point queries, range queries, and compaction. 2. Problems with LSM-Tree and the corresponding optimizations, in two main areas: • Read amplification. The current approach is to use filters, such as the bloom filter and SuRF, to cut unnecessary I/O. • Load jitter and write amplification caused by compaction. One direction is hardware/software co-design: offload compaction to a dedicated coprocessor so that its negative impact is isolated as much as possible. The other is to reduce write amplification at the data-structure and algorithm level, as in PebblesDB and WiscKey. 3. A brief summary: LSM-Tree optimizations are essentially trade-offs among read amplification, write amplification, and space amplification; in theory all three cannot be optimal at once (somewhat like the CAP theorem in distributed systems).

1.A Study of LSM-Tree 林金河

2.Applications of LSM-Tree ❏ NoSQL ❏ Bigtable ❏ HBase ❏ Cassandra ❏ Scylla ❏ MongoDB ❏ Storage Engine ❏ LevelDB ❏ RocksDB ❏ MyRocks ❏ NewSQL ❏ TiDB(TiKV) ❏ CockroachDB

3.Beginning of LSM-Tree ❏ 1996: The Log-Structured Merge-Tree (LSM-Tree) ❏ Memory is small ❏ Disk is slow for random access (r/w) ❏ Originally designed for the fast-growing History table ❏ Data and indexes ❏ Write heavy ❏ Read sparse

4.Beginning of LSM-Tree ❏ Out-of-place update ❏ Optimized for write ❏ Sacrifice read ❏ Not optimized for space ❏ Require data reorganization (merge/compaction)

5.Modern Structure of LSM-Tree ❏ Read ❏ Point Query ❏ Range Query ❏ Write (Insert/Delete/Update) ❏ Compaction ❏ Leveled ❏ Tiered

6.Point Query ❏ Returns immediately once a match is found ❏ Read amplification, worst-case I/O: 2 * (N - 1 + number of level-0 files), where N is the number of levels ❏ Optimization ❏ Page cache/Block cache ❏ Bloom filter
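As a rough illustration of that read path: a point lookup probes the memtable first, then every level-0 file (they may overlap), then at most one file per deeper level. A minimal sketch with toy stand-in types, not the LevelDB/RocksDB implementation:

    #include <algorithm>
    #include <map>
    #include <optional>
    #include <string>
    #include <vector>

    // Toy stand-in for an SSTable: a sorted map plus its key range.
    struct SSTable {
        std::string min_key, max_key;
        std::map<std::string, std::string> data;
        std::optional<std::string> Get(const std::string& key) const {
            auto it = data.find(key);
            if (it == data.end()) return std::nullopt;
            return it->second;
        }
    };

    struct LSMTree {
        std::map<std::string, std::string> memtable;  // newest writes
        std::vector<SSTable> level0;                  // overlapping, newest first
        std::vector<std::vector<SSTable>> levels;     // L1..Ln: sorted, disjoint

        std::optional<std::string> Get(const std::string& key) const {
            auto m = memtable.find(key);
            if (m != memtable.end()) return m->second;
            // Level-0 files may overlap, so every one must be probed.
            for (const SSTable& t : level0)
                if (auto v = t.Get(key)) return v;    // newest value wins
            // Each deeper level is one sorted run: binary-search for the single
            // file whose [min_key, max_key] range can contain the key.
            for (const auto& level : levels) {
                auto it = std::lower_bound(
                    level.begin(), level.end(), key,
                    [](const SSTable& t, const std::string& k) { return t.max_key < k; });
                if (it != level.end() && key >= it->min_key)
                    if (auto v = it->Get(key)) return v;
            }
            return std::nullopt;                      // not found at any level
        }
    };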

7.Bloom Filter ❏ Simple ❏ Space-efficient ❏ False positives are possible, but false negatives are impossible ❏ Range queries are not supported
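A minimal Bloom filter sketch; the sizes are illustrative, not tuned (real implementations such as LevelDB's derive the probe count from a bits-per-key setting):

    #include <functional>
    #include <string>
    #include <vector>

    // Toy Bloom filter: k hash probes into a bit array. Bit patterns can
    // collide (false positive), but a stored key always finds all of its
    // bits set (no false negative).
    class BloomFilter {
        std::vector<bool> bits_;
        size_t k_;
    public:
        explicit BloomFilter(size_t nbits = 1 << 20, size_t k = 7)
            : bits_(nbits), k_(k) {}
        void Add(const std::string& key) {
            for (size_t i = 0; i < k_; ++i)
                bits_[Hash(key, i) % bits_.size()] = true;
        }
        bool MayContain(const std::string& key) const {
            for (size_t i = 0; i < k_; ++i)
                if (!bits_[Hash(key, i) % bits_.size()]) return false;  // definitely absent
            return true;  // probably present (false positive possible)
        }
    private:
        // Double hashing to simulate k independent hash functions.
        static size_t Hash(const std::string& key, size_t i) {
            size_t h1 = std::hash<std::string>{}(key);
            size_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 0x7f4a7c15;
            return h1 + i * h2;
        }
    };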

8.Range Query SELECT * FROM t WHERE key >= 23 AND key < 40; for (itr->Seek(23); itr->Valid() && itr->key() < 40; itr->Next()) { /* process the row */ } ❏ Must seek every sorted run ❏ Bloom filters do not support range queries ❏ Optimization ❏ Parallel seeks ❏ Prefix bloom filter (RocksDB) ❏ SuRF (SIGMOD 2018)
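Spelled out against the real RocksDB C++ API that the loop above abbreviates, one possible version; keys in RocksDB are byte strings, so numeric keys are assumed here to be zero-padded so that lexicographic order matches numeric order:

    #include <rocksdb/db.h>

    int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        rocksdb::DB::Open(options, "/tmp/lsm_demo", &db);

        rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
        // Scan ["023", "040"): zero-padded keys keep byte order == numeric order.
        for (it->Seek("023"); it->Valid() && it->key().ToString() < "040"; it->Next()) {
            // process it->key() / it->value()
        }
        delete it;
        delete db;
    }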

9.SuRF: Succinct Range Filter ❏ SuRF: Practical Range Query Filtering with Fast Succinct Tries ❏ Fast Succinct Tries ❏ Level-Ordered Unary Degree Sequence ❏ LOUDS-Dense ❏ The upper levels use a fast bitmap-based encoding scheme. ❏ Chooses performance over space. ❏ LOUDS-Sparse ❏ The lower levels use a space-efficient encoding scheme. ❏ Chooses space over performance. ❏ LOUDS-DS

10.SuRF: Succinct Range Filter ❏ Trie => SuRF-Base => SuRF-Hash/SuRF-Real ❏ Example: (“SIGAI”, “SIGMOD”, “SIGOPS”)
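To make the idea concrete with these three keys: SuRF-Base keeps only the shortest prefix that distinguishes each key (SIGA, SIGM, SIGO). Below is a toy filter that captures just this truncation behavior in a plain sorted set, not the LOUDS-encoded trie; because the stored prefixes are prefix-free, one predecessor check plus one lower_bound answers a range probe with possible false positives but no false negatives:

    #include <algorithm>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Toy range filter in the spirit of SuRF-Base: store each key's shortest
    // unique prefix in a sorted set (SuRF stores them in a succinct trie).
    class ToyRangeFilter {
        std::set<std::string> prefixes_;
        static size_t Lcp(const std::string& a, const std::string& b) {
            size_t i = 0;
            while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
            return i;
        }
    public:
        explicit ToyRangeFilter(std::vector<std::string> keys) {
            std::sort(keys.begin(), keys.end());
            for (size_t i = 0; i < keys.size(); ++i) {
                size_t lcp = 0;  // longest common prefix with either neighbor
                if (i > 0) lcp = std::max(lcp, Lcp(keys[i], keys[i - 1]));
                if (i + 1 < keys.size()) lcp = std::max(lcp, Lcp(keys[i], keys[i + 1]));
                prefixes_.insert(keys[i].substr(0, std::min(lcp + 1, keys[i].size())));
            }
        }
        // May any stored key fall in [lo, hi]?
        bool MayContainRange(const std::string& lo, const std::string& hi) const {
            auto it = prefixes_.lower_bound(lo);
            // A stored prefix of lo sorts just before lo yet may represent a
            // key inside the range, so the predecessor must be checked too.
            if (it != prefixes_.begin()) {
                auto prev = std::prev(it);
                if (lo.compare(0, prev->size(), *prev) == 0) return true;
            }
            return it != prefixes_.end() && *it <= hi;
        }
    };

    int main() {
        ToyRangeFilter f({"SIGAI", "SIGMOD", "SIGOPS"});  // stores SIGA, SIGM, SIGO
        std::cout << std::boolalpha
                  << f.MayContainRange("SIGK", "SIGN") << "\n"   // true  (SIGMOD)
                  << f.MayContainRange("SIGP", "SIGZ") << "\n";  // false (definitely empty)
    }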

11.SuRF: Succinct Range Filter

12.Read Operation Summary ❏ Read amplification ⇒ Use filters to reduce unnecessary I/O ❏ Space-efficient ❏ Lookup performance ❏ Low false-positive rate, no false negatives ❏ Bloom filter/Prefix bloom filter ❏ SuRF ❏ Others: Cuckoo Filter...

13.Compaction - Tiered vs Leveled ❏ Tiered ❏ Each level has N sorted runs (overlapped). ❏ Compaction merges all sorted runs in one level to create a new sorted run in the next level. ❏ Minimizes write amplification at the cost of read and space amplification. ❏ Leveled ❏ Each level is one sorted run. ❏ Compaction into Ln merges data from Ln-1 into Ln. ❏ Compaction into Ln rewrites data that was previously merged into Ln. ❏ Minimizes space amplification and read amplification at the cost of write amplification.
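A rough worked example of the trade-off, assuming an illustrative size ratio of 10 between adjacent levels and 5 levels: under leveled compaction an entry is rewritten up to ~10 times at each level it passes through, for a write amplification around 10 * 5 = 50, but each level stays a single run to read. Under tiered compaction the entry is written roughly once per level, around 5 in total, but a level can accumulate up to 10 overlapping runs that a lookup may have to check.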

14.Tiered + Leveled ❏ Less write amplification than leveled and less space amplification than tiered. ❏ More read amplification than leveled and more write amplification than tiered. ❏ It is flexible about the level at which the LSM tree switches from tiered to leveled.

15.Tiered + Leveled + Partition ❏ The default compaction strategy of LevelDB and RocksDB.

16.Problems of compaction ❏ Write amplification ❏ System jitter (CPU, I/O) ❏ SSD wear-out ❏ Invalidates block cache/page cache ❏ Compaction rate is a problem ❏ Too fast - write amplification ❏ Too slow - read amplification and space amplification.
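One practical knob for pacing compaction in RocksDB is its generic rate limiter, which caps the I/O that flushes and compactions may consume; a minimal configuration sketch (the 16 MB/s figure is an arbitrary placeholder to tune per deployment):

    #include <rocksdb/db.h>
    #include <rocksdb/rate_limiter.h>

    // Cap background flush + compaction writes to smooth out I/O jitter.
    rocksdb::Options MakeThrottledOptions() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // 16 MB/s is a placeholder; tune against the disk and the write rate.
        options.rate_limiter.reset(
            rocksdb::NewGenericRateLimiter(16 * 1024 * 1024 /* bytes per sec */));
        return options;
    }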

17.Multi-Thread Compaction
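RocksDB exposes multi-threaded compaction through its background thread pools; a minimal configuration sketch (the thread count of 8 is illustrative, often set near the core count):

    #include <rocksdb/db.h>

    rocksdb::Options MakeParallelOptions() {
        rocksdb::Options options;
        // Sizes the background thread pools for flushes and compactions.
        options.IncreaseParallelism(8);
        // Upper bound on concurrent background flush + compaction jobs.
        options.max_background_jobs = 8;
        return options;
    }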

18.Pipelined Compaction ❏ Pipelined Compaction for the LSM-tree ❏ The compaction procedure: ❏ Step 1: Read data blocks. ❏ Step 2: Checksum. ❏ Step 3: Decompress. ❏ Step 4: Merge sort. ❏ Step 5: Compress. ❏ Step 6: Re-checksum. ❏ Step 7: Write to disk. ❏ Read -> Compute -> Write ❏ It is difficult to divide the stages evenly. ❏ If data blocks must flow through multiple processors, CPU cache performance suffers; treating S2~S6 as one stage is more CPU-cache friendly.
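A minimal sketch of the resulting three-stage read -> compute -> write pipeline, with stages S2~S6 fused into the compute thread as suggested above (toy blocking queues and stub stage bodies, not the paper's implementation):

    #include <condition_variable>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    // Tiny blocking queue connecting pipeline stages.
    template <typename T>
    class Channel {
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool closed_ = false;
    public:
        void Push(T v) {
            { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        void Close() {
            { std::lock_guard<std::mutex> l(m_); closed_ = true; }
            cv_.notify_all();
        }
        std::optional<T> Pop() {  // blocks; empty optional means "stream ended"
            std::unique_lock<std::mutex> l(m_);
            cv_.wait(l, [&] { return !q_.empty() || closed_; });
            if (q_.empty()) return std::nullopt;
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    };

    using Block = std::vector<std::string>;
    Block ReadBlock(int i) { return Block{std::to_string(i)}; }  // S1 stub: disk read
    Block ComputeStages(Block b) { return b; }  // S2~S6 stub: checksum..re-checksum
    void WriteBlock(const Block&) {}            // S7 stub: disk write

    void PipelinedCompaction(int nblocks) {
        Channel<Block> to_compute, to_write;
        std::thread reader([&] {
            for (int i = 0; i < nblocks; ++i) to_compute.Push(ReadBlock(i));
            to_compute.Close();
        });
        // One compute thread keeps each block on one core: CPU-cache friendly.
        std::thread computer([&] {
            while (auto b = to_compute.Pop()) to_write.Push(ComputeStages(std::move(*b)));
            to_write.Close();
        });
        std::thread writer([&] {
            while (auto b = to_write.Pop()) WriteBlock(*b);
        });
        reader.join(); computer.join(); writer.join();
    }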

19.Compaction Buffer ❏ Re-enabling high-speed caching for LSM-trees ❏ LSbM-tree: Re-enabling Buffer Caching in Data Management for Mixed Reads and Writes

20.Coprocessor ❏ Co-KV: A Collaborative Key-Value Store Using Near-Data Processing to Improve Compaction for the LSM-tree ❏ When Databases Meet FPGA: How X-DB's Heterogeneous Computing Achieves Million-Level TPS

21.PebblesDB ❏ PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees ❏ Guards divide the key space (for that level) into disjoint units ❏ Less write amplification, more read amplification and space amplification.
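To illustrate just the guard idea (a hypothetical routing helper, not PebblesDB's code): guards are sorted keys, and an incoming key belongs to the fragment after the last guard not exceeding it; within a fragment, sstables may overlap, so data can be appended without rewriting neighbors:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Guards split a level's key space into disjoint fragments. Fragment 0 is
    // the implicit sentinel range before the first guard; fragment i covers
    // [guard[i-1], guard[i]).
    size_t FragmentFor(const std::vector<std::string>& sorted_guards,
                       const std::string& key) {
        // upper_bound returns the first guard > key, so the distance from the
        // start equals the number of guards <= key, i.e. the fragment index.
        auto it = std::upper_bound(sorted_guards.begin(), sorted_guards.end(), key);
        return static_cast<size_t>(it - sorted_guards.begin());
    }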

22.PebblesDB Compaction ❏ Similar to Tiered + Partition

23.WiscKey ❏ WiscKey: Separating Keys from Values in SSD-conscious Storage ❏ Compaction = sorting + garbage collection ❏ Only keys are required to be sorted ❏ Keys are usually smaller than values ❏ Key-Value separation ⇒ Decouple sorting and garbage collection
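A minimal sketch of the separation, with toy stand-ins: a std::map plays the small keys-only LSM tree and a byte string plays the append-only value log (the real system appends to a vLog file and stores <offset, size> records):

    #include <cstddef>
    #include <map>
    #include <optional>
    #include <string>

    // Toy WiscKey-style store: the "LSM tree" holds only key -> (offset, size)
    // pointers, so it stays small; values live in an append-only value log.
    class KVSeparatedStore {
        struct ValuePtr { size_t offset, size; };
        std::map<std::string, ValuePtr> lsm_;  // stand-in for the real LSM tree
        std::string vlog_;                     // stand-in for the on-disk value log
    public:
        void Put(const std::string& key, const std::string& value) {
            ValuePtr p{vlog_.size(), value.size()};
            vlog_.append(value);               // 1. append the value to the log
            lsm_[key] = p;                     // 2. insert the small pointer record
        }
        std::optional<std::string> Get(const std::string& key) const {
            auto it = lsm_.find(key);
            if (it == lsm_.end()) return std::nullopt;
            // Second step: one extra (possibly random) read into the value log.
            return vlog_.substr(it->second.offset, it->second.size);
        }
    };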

24.WiscKey ❏ Key-Value separation’s challenges and optimizations ❏ An additional I/O may be required for each query ❏ WiscKey’s LSM-Tree is small and can be easily cached in memory ❏ Range queries require random I/O ❏ Prefetch: leverages the parallel I/O characteristic of SSD ❏ Garbage Collection ❏ Invalid keys are reclaimed by compaction ❏ Value log needs a special garbage collector ❏ Crash Consistency ❏ Atomicity of inserted key-value pairs is complicated ❏ Combine value log and WAL

25.Summary ❏ Designing Access Methods: The RUM Conjecture ❏ Read, Update, Memory – Optimize Two at the Expense of the Third ❏ Read Amplification vs Write Amplification vs Space Amplification

26.Q&A
