20_06 rocksandra update

This deck covers the latest improvements to Rocksandra, the RocksDB-based storage engine for Cassandra.

1.ROCKSANDRA UPDATE Pengchao Wang @ Instagram

2.AGENDA
1 Pluggable Storage Engine and Rocksandra
2 Highlights of 2018~2019 improvements
3 Learnings from production

3.ROCKSANDRA

4.MOTIVATIONS
[Chart; values shown: 60ms, 25ms, 5ms]

5.ROCKSANDRA
Rocksandra is a new storage engine implemented on top of RocksDB for:
• Less JVM GC
• Low tail latency
• High throughput

6.ROCKSANDRA
[Chart; values shown: 0.9%, 0.5%, 0.1%]

7.P99 LATENCY
[Chart; values shown: 60ms, 20ms]

8.CURRENT STATUS IN INSTAGRAM
• 70% of C* QPS is on Rocksandra (10s of millions)
• 100% of CPU-bound and IO-bound clusters
• Disk-bound clusters are under migration

9.FEATURE PARITY
Features supported:
• Most non-nested data types
• Table schema
• Point query
• Range query
• Mutations
• Timestamp
• TTL
• Deletions/Cell tombstones
• Multi-partition query
• Snapshot
• Cleanup
• Truncate
• Partition deletion *
• SSTableloader *
• Secondary indexes *
• Streaming
Features to be supported later:
• Nested data types
• Counters
• Range tombstone
• Materialized views
• SASI
• Row level tombstone
• Anti-entropy repair

10.IMPROVEMENTS HIGHLIGHTS

11.IMPROVEMENTS HIGHLIGHTS
1 High density storage support
2 Fast Streaming
3 Fast Cleanup
4 Space amplification improvement

12.HIGH DENSITY STORAGE

13.HIGH DENSITY STORAGE Partitioned Index and Filter
[Diagram labels: INDEX/FILTER, DATA]

14.HIGH DENSITY STORAGE Partitioned Index and Filter
[Diagram labels: TOP INDEX, INDEX/FILTER, DATA]
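The two-level layout on this slide (a small top index pointing at index/filter partitions, which in turn point at data blocks) can be illustrated with a toy in-memory model. This is a minimal sketch, not RocksDB's on-disk format: the class `PartitionedIndex`, its fields, and the sample keys are all hypothetical. It only shows why a lookup needs the small top index plus a single partition instead of one monolithic index.

```python
from bisect import bisect_left


class PartitionedIndex:
    """Toy two-level index: a small top index over fixed-size index partitions."""

    def __init__(self, entries, partition_size):
        # entries: sorted list of (last_key_in_data_block, data_block_id)
        self.partitions = [entries[i:i + partition_size]
                           for i in range(0, len(entries), partition_size)]
        # The top index keeps only the last key of each partition.
        self.top_keys = [p[-1][0] for p in self.partitions]
        self.partitions_read = 0  # counts how many partitions lookups touched

    def find_block(self, key):
        # Step 1: binary search the small top index.
        p = bisect_left(self.top_keys, key)
        if p == len(self.partitions):
            return None  # key sorts after the last data block
        # Step 2: binary search inside the one partition that was selected.
        self.partitions_read += 1
        partition = self.partitions[p]
        keys = [k for k, _ in partition]
        return partition[bisect_left(keys, key)][1]


# Hypothetical sample: six data blocks, index split into partitions of two.
entries = [("b", 0), ("d", 1), ("f", 2), ("h", 3), ("j", 4), ("l", 5)]
index = PartitionedIndex(entries, partition_size=2)
block_for_e = index.find_block("e")  # "e" falls in the block whose last key is "f"
```

In this model only the top index has to stay resident; each partition could be fetched on demand, which is the property that matters when a node holds far more data than memory.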

15.HIGH DENSITY STORAGE
[Chart: CPU utilization (%); values shown: 56%, 13%]

16.HIGH DENSITY STORAGE
[Chart: P99 Read Latency (ms); values shown: 30ms, 2ms]

17.FAST STREAMING Rocksandra Streaming
Sender (dc1-node13):
• Iterate the key range using RocksDB API
• Send key-value pairs through network
Receiver (dc2-node37):
• Serialize the key-value pairs into SST files
• Ingest into RocksDB
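The four streaming steps on this slide can be sketched as a small pipeline. This is a toy model under stated assumptions: a plain dict stands in for the sender's RocksDB instance, lists of pairs stand in for SST files, and the function names `iterate_range`, `build_sst_files`, and `ingest` are hypothetical rather than Rocksandra or RocksDB APIs.

```python
from bisect import bisect_left
from itertools import islice


def iterate_range(store, start, end):
    """Sender: yield (key, value) pairs with start <= key < end, in key order."""
    keys = sorted(store)
    for k in keys[bisect_left(keys, start):bisect_left(keys, end)]:
        yield k, store[k]


def build_sst_files(pairs, max_entries):
    """Receiver: cut the incoming sorted stream into fixed-size sorted 'files'."""
    it = iter(pairs)
    while True:
        chunk = list(islice(it, max_entries))
        if not chunk:
            return
        yield chunk


def ingest(db_files, sst_files):
    """Receiver: hand each finished file over whole, with no per-key writes."""
    for f in sst_files:
        db_files.append(f)


# Hypothetical sample: stream the range ["c", "i") in files of two entries.
sender_store = {"a": 1, "c": 3, "e": 5, "g": 7, "i": 9}
receiver_files = []
ingest(receiver_files, build_sst_files(iterate_range(sender_store, "c", "i"), 2))
# receiver_files is now [[("c", 3), ("e", 5)], [("g", 7)]]
```

Because the sender iterates in key order, the receiver can write each file sequentially without sorting, and whole files are added to the store rather than individual keys.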

18.FAST STREAMING Ingest Behind
[Diagram labels: ONLINE WRITE, STREAMING, Normal Ingest]

19.FAST STREAMING Ingest Behind
[Diagram labels: ONLINE WRITE, STREAMING, Ingest Behind]
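The difference between the two ingest modes comes down to where the streamed file lands relative to online writes. RocksDB's ingest-behind mode places ingested files at the bottom of the tree so they never overwrite existing keys, whereas a normal ingest makes the ingested keys the newest version. The toy model below illustrates only that ordering; `ToyLSM` and its methods are hypothetical and ignore levels, sequence numbers, and compaction.

```python
class ToyLSM:
    """Toy LSM tree: a list of key->value layers, searched newest first."""

    def __init__(self):
        self.layers = []  # index 0 is the newest layer

    def write(self, key, value):
        # An online write always lands in a new top layer.
        self.layers.insert(0, {key: value})

    def ingest(self, sst, behind=False):
        if behind:
            self.layers.append(dict(sst))     # bottom: older than everything
        else:
            self.layers.insert(0, dict(sst))  # top: newer than everything

    def get(self, key):
        # The first layer that contains the key wins.
        for layer in self.layers:
            if key in layer:
                return layer[key]
        return None


# Hypothetical sample: an online write arrives before the streamed file.
normal = ToyLSM()
normal.write("k", "online")
normal.ingest({"k": "streamed", "j": "streamed"})

behind = ToyLSM()
behind.write("k", "online")
behind.ingest({"k": "streamed", "j": "streamed"}, behind=True)

# normal.get("k") returns "streamed"; behind.get("k") returns "online"
```

In the ingest-behind case the online write stays visible while keys that exist only in the streamed file (`"j"` here) are still readable.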

20.FAST STREAMING
[Chart: Streaming Incoming Bytes; value shown: 1.1GB/s]

21.FAST STREAMING
[Chart; values shown: >10 hours, 3~4 hours]

22.FAST CLEANUP
• nodetool cleanup
• Calculate lost ranges
• rocksdb.DeleteFilesInRange: drop 70% data
• rocksdb.RangeDelete, then compaction: drop rest 30% data
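The two-phase cleanup above works because a file that lies entirely inside a lost range can be dropped without reading it, while a file that straddles a range boundary has to be handled by a range delete followed by compaction. A minimal sketch of that classification follows; the `SstFile` type, the `plan_cleanup` function, the integer tokens, and the inclusive range bounds are all assumptions for illustration, not Rocksandra code.

```python
from typing import NamedTuple


class SstFile(NamedTuple):
    name: str
    smallest: int  # smallest token stored in the file
    largest: int   # largest token stored in the file


def plan_cleanup(files, begin, end):
    """Split files by how they relate to the lost range [begin, end]."""
    whole, partial = [], []
    for f in files:
        if f.largest < begin or f.smallest > end:
            continue                # no overlap: file is untouched
        if begin <= f.smallest and f.largest <= end:
            whole.append(f.name)    # fully covered: drop the file outright
        else:
            partial.append(f.name)  # straddles an edge: range delete + compaction
    return whole, partial


# Hypothetical sample: the node lost ownership of tokens 100..300.
files = [
    SstFile("a.sst", 0, 99),
    SstFile("b.sst", 100, 199),
    SstFile("c.sst", 200, 299),
    SstFile("d.sst", 250, 400),
]
whole, partial = plan_cleanup(files, 100, 300)
# whole == ["b.sst", "c.sst"], partial == ["d.sst"]
```

Only the files in `partial` need to be rewritten, which is consistent with the slide's split between data dropped by file deletion and data dropped by compaction.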

23.FAST CLEANUP DeleteFilesInRange

24.FAST CLEANUP DeleteFilesInRange
[Diagram: three items marked X]

25.FAST CLEANUP
[Chart; values shown: > 30 minutes, < 20 seconds]

26.SPACE AMPLIFICATION Closing the Gap
[Chart comparing C* LZ4, Rocksandra LZ4, and Rocksandra ZSTD; annotations: Before +25%, Block Size, After +11%, -17%]

27.TWO YEARS OF LEARNINGS IN PRODUCTION

28.LEARNING: THROUGHPUT
[Chart: P99 Read Latency (ms) vs. QPS (0 to 350,000) for Cassandra 3.0.15 and Rocksandra; data labels shown: 5.722, 2.759, 1.916, 1.331, 1.109, 1.109, 1.331, 0.924]

29.LEARNING: HIGH THROUGHPUT = LESS $$$
[Bar chart comparing Cassandra and Rocksandra for Usecase A, Usecase B, and Usecase C; values shown: 450, 225, 174, 150, 100, 50]