Using High Capacity Flash-Based Storage in Extremely Large Databases

Extremely large database systems have long benefited from steady improvements in flash-based storage subsystems. As a result, these devices are gradually displacing HDD-based deployments in very large database systems, which can scale to tens of thousands of devices. Along the way, many design tradeoffs have been made that accelerate the decline of the HDD. Examples include flash-based SSD vendors treating $/GB and GB per unit as key design metrics. We examine the cumulative impact of several of these design tradeoffs and illustrate some of the challenges we face in performance continuity, data protection, cost optimization, and capacity optimization.


1. Using High Capacity Flash Storage In Extremely Large Database Systems
Keith Muller
Halıcıoğlu Data Science Institute, UCSD
Technology, Research and Innovation, Teradata
April 3, 2019

2. Agenda
• Targeted market review
• Some basic background
• Relevant system trends
• Where we are today – approach and measurements
• NVMe 1.4 Sets and Endurance Groups overview
• Does it make sense to implement storage tiering using NVMe sets and endurance groups on large capacity flash devices?

3. Large Systems: Effective Size Scaling Tradeoffs – Some Obvious Examples
• Minimize stranded resources – inefficiencies are very significant $ at scale
• Focus is on the role of large capacity SSDs in a single database system starting at around:
  • 500 SSD storage devices and up
  • 60 2-socket servers (2,160 CPUs) and up
• Touch on various optimizations, with an emphasis on:
  • Rack space density (Performance/U, Capacity/U, …)
  • Stranded performance
  • Impact of technology implementations
  • Minimizing degraded performance
(Diagram: tradeoff axes – various costs (+OPEX), storage capacity, availability, and performance & capacity density)

4. Why Focus On High Capacity Storage? Costs!
(Charts: relative server vs. storage costs)

5. Quick Refresh: Flash SSD Architecture & OP
• Inherent FTL overhead; reserved capacity = base reserved + unmapped reserved capacity
• Write performance is paced by available flash: OP%, plus flash block/page and other overhead operations (wear leveling, etc.)
• Larger advertised capacity from the same raw flash means lower $/advertised-capacity, but lower write endurance, higher write amplification, and lower write performance
(Diagram: representative flash architecture – host interface, NVMe, FTL cores, flash dies – and representative capacity distribution between advertised and reserved capacity)
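A minimal sketch of the OP relationship described above, assuming the usual definition OP% = (raw flash capacity − advertised capacity) / advertised capacity. The raw capacity and the advertised points are illustrative values, not taken from any specific device:

```python
# Over-provisioning (OP) as a function of how much of the raw NAND a
# vendor chooses to advertise. Illustrative numbers only.

def op_percent(raw_tb: float, advertised_tb: float) -> float:
    """OP expressed as a percentage of advertised capacity."""
    return (raw_tb - advertised_tb) / advertised_tb * 100.0

# The same 16 TB of raw NAND could be advertised several ways:
for advertised in (6.4, 12.8, 15.36):
    print(f"advertised {advertised:6.2f} TB -> OP {op_percent(16.0, advertised):6.1f}%")
```

Note how the three illustrative points land near the 140%, 25%, and 7% OP classes discussed later in the deck: the same silicon, priced very differently per advertised GB.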

6. Large Cap Enterprise SSDs – All Looks Good… Right?
(Chart; averaged 4K IOPS and 256 KB sequential performance)

7. Large Cap Enterprise SSDs – All Looks Good… Right? (continued)
(Chart; averaged 4K IOPS and 256 KB sequential performance)

8. Looking Forward: Enterprise Tiering Estimates

Tier                                 Latency (R/W)             Capacity   R/W IOPS (each)   R/W Ratio
On-core L1/L2 cache                  ~1 / ~1 ns                KBs        —                 1:1
On-die L3 cache                      ~10 / ~10 ns              MBs        —                 1:1
Main memory                          ~100 / ~100 ns            TBs        —                 1:1
NVDIMM                               ~500 / ~500 ns            TBs        —                 1:1
Performance NVMe, PCIe Gen 4         ~200,000 ns (~200 us)     ~1-3+ TB   ~1,xxxK / ~5xxK   ~1.5:1 to ~3:1
  (5 DWPD and greater)
Capacity NVMe (~1 DWPD and less)     —                         ~30 TB     ~1,xxxK / ~1xxK   ~6:1 to ~12:1
Capacity SAS HDD                     ~15,000,000 ns (~15 ms)   >12 TB     ~0.12K / ~0.11K   ~1:1

(Processor tiers: cost/capacity fixed. Memory tiers: capacity fixed.)
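To make the IOPS rows concrete, here is a hedged back-of-the-envelope sizing sketch: how many devices of each tier would be needed to sustain a target random-write rate. The per-device write IOPS below (500K, 100K, 110) are midpoints read off the table's order-of-magnitude figures, not measurements:

```python
import math

# Hypothetical per-device random-write IOPS, taken as midpoints of the
# table's order-of-magnitude ranges.
PER_DEVICE_WRITE_IOPS = {
    "performance NVMe (5+ DWPD)": 500_000,
    "capacity NVMe (~1 DWPD)": 100_000,
    "capacity SAS HDD": 110,
}

def devices_needed(target_write_iops: float) -> dict:
    """Devices per tier required to sustain the target write rate."""
    return {tier: math.ceil(target_write_iops / iops)
            for tier, iops in PER_DEVICE_WRITE_IOPS.items()}

# e.g. a system that must sustain 2M random-write IOPS:
for tier, n in devices_needed(2_000_000).items():
    print(f"{tier}: {n} devices")
```

The gap between the capacity-NVMe and HDD rows is the core of the deck's argument: write performance, not capacity, is what strands resources at scale.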

9. The Past: HDD Performance – R/W Ratio By Disk Extent
(Chart: normalized IO rate vs. % of capacity accessed and read %; single disk, 96 KB random, QD=4. Approximate points: 100% read at 10% capacity = 1.0; 100% write at 10% capacity = 0.91; 100% read at full capacity = 0.68; 100% write at full capacity = 0.65.)
Many filesystems were designed assuming this model.

10. High Capacity HDDs in this market segment? …PAIN
(Chart; ~35 °C vs. ~50 °C)

11. Current NVMe SSD Performance: R/W Ratio By OP%
(Chart; single disk, 32 KB random, QD=16)

12. OP% and Endurance Class: 10 DWPD, 3 DWPD, 1 DWPD
(Chart. Note: OP% to DWPD varies by supplier)
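As a rough illustration of why OP% tracks DWPD, the sketch below uses a commonly cited rule-of-thumb write-amplification model for sustained uniform random writes under greedy garbage collection, WA ≈ (1 + OP) / (2 · OP). The P/E-cycle rating and warranty period are assumed illustrative values; as the slide notes, real devices vary by supplier and controller:

```python
# Hedged model: translate an OP fraction into an approximate DWPD class.
# WA rule of thumb and the constants below are illustrative assumptions,
# not a supplier's actual endurance math.

PE_CYCLES = 5000          # assumed NAND program/erase rating
WARRANTY_DAYS = 5 * 365   # assumed 5-year endurance warranty

def est_dwpd(op_fraction: float) -> float:
    # Rule-of-thumb write amplification; clipped at 1.0 (can't go lower).
    wa = max((1 + op_fraction) / (2 * op_fraction), 1.0)
    # Lifetime NAND writes, expressed in user-capacity drive writes:
    total_drive_writes = PE_CYCLES * (1 + op_fraction) / wa
    return total_drive_writes / WARRANTY_DAYS

for op in (0.07, 0.10, 0.28, 1.40):
    print(f"OP {op:5.0%} -> ~{est_dwpd(op):4.1f} DWPD")
```

Even this crude model reproduces the qualitative pattern: low-OP capacity drives land well under 1 DWPD, while 140% OP drives reach the multi-DWPD write-optimized class.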

13. (Charts comparing drives at 140% OP, 10% OP, and 7% OP)

14. Did Storage Tiering Mitigate Performance & Capacity Tradeoffs?
(Slide from 2002, looking at HDDs)

15. Basic Tiered Storage Example – Shared 24-Drive Tray
• Capacity tier: distributed RAID 8+2 with 7% OP drives – optimized for lower $/GB (+ capacity density)
  • Protected: 15.36 TB drives -> 145.2 TB; 7.68 TB drives -> 74.2 TB
  • Challenged write performance when not full-stripe and not flash-block aligned
  • Significant degradation in performance with drive loss
• Performance tier: RAID 1 with 10 DWPD drives – optimized for write traffic
  • Protected: 1.6 TB drives -> 9.6 TB; 3.2 TB drives -> 19.2 TB
  • Low degradation in write performance with drive loss
(Diagram: drives 1-24 in the shared tray, split between the two tiers)
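A minimal sketch of the protected-capacity arithmetic behind the tray example, assuming an illustrative 12/12 drive split between the tiers. The slide's exact figures also fold in spare space and formatting overhead, so this will not reproduce them to the decimal:

```python
# Protected (usable) capacity under the two RAID layouts in the tray.
# The drive counts per tier are assumptions for illustration.

def raid_8p2_usable(drives: int, tb_per_drive: float) -> float:
    """Distributed RAID 8+2: 8 data strips out of every 10 written."""
    return drives * tb_per_drive * 8 / 10

def raid1_usable(drives: int, tb_per_drive: float) -> float:
    """RAID 1 mirroring halves raw capacity."""
    return drives * tb_per_drive / 2

print(f"capacity tier:    {raid_8p2_usable(12, 15.36):6.1f} TB usable")
print(f"performance tier: {raid1_usable(12, 1.6):6.1f} TB usable")
```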

16. R/W IOPS Ratios Under RAID
(Chart; 96 KB random, 32 KB aligned, QD = ~8/device)

17. (Chart)

18. Aside: Mitigating Degraded Write Performance Impact
• Stripe- and flash-block-aligned I/O helps performance (see the sketch below)
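A minimal sketch of the alignment test implied above, assuming an illustrative 8+2 geometry with 32 KB strips: a write avoids a parity read-modify-write only when it covers whole stripes and lands on flash-block boundaries. All geometry values are assumptions, not the deck's configuration:

```python
# Full-stripe / flash-block alignment check for an 8+2 layout.
# STRIP_KB and FLASH_BLOCK_KB are illustrative assumptions.

STRIP_KB = 32                        # per-drive strip size
DATA_STRIPS = 8                      # 8+2: eight data strips per stripe
STRIPE_KB = STRIP_KB * DATA_STRIPS   # 256 KB of user data per full stripe
FLASH_BLOCK_KB = 32                  # device alignment target

def is_full_stripe_aligned(offset_kb: int, length_kb: int) -> bool:
    """True when a write needs no read-modify-write under this geometry."""
    stripe_ok = offset_kb % STRIPE_KB == 0 and length_kb % STRIPE_KB == 0
    block_ok = offset_kb % FLASH_BLOCK_KB == 0
    return stripe_ok and block_ok

print(is_full_stripe_aligned(0, 256))   # True: exactly one full stripe
print(is_full_stripe_aligned(96, 96))   # False: partial stripe -> RMW
```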

19. Motivation: Multi-Tenant – Noisy Neighbor Example
• Users 1-4 share one drive's host interface, FTL cores, and flash dies
• Collisions on flash dies, uneven distributions of work, read/write mixes, etc.
(Diagram: representative flash architecture)
Reference: Solving Latency Challenges with NVM Express SSDs at Scale; Petersen & Huffman; FMS 2017

20. NVMe 1.4 Sets and Endurance Groups
• Defined in the NVMe 1.4 spec (2H 2019)
• NVM Set:
  • NVM that is physically and logically isolated from NVM in other NVM Sets
  • Dedicated NAND resources, channels, FTL, etc. (device architecture dependent)
  • Workload isolation: one set has no impact on other sets (hopefully)
  • Carries out its own writes and background operations independently
  • Drive appears like several smaller drives
• Endurance Group: wear-level management
  • Set independent levels of OP and usable capacity (hopefully)
  • May contain one or more NVMe Sets
(Diagram: representative flash architecture – example of four uniform NVMe Sets and Endurance Groups, Set 1/Group A through Set 4/Group D)

21. Can We Do Storage Tiering Using NVMe Sets and Groups?
• Write-optimized tier – example: OP in the group yields 5-10 DWPD
• Capacity-optimized tier – example: OP in the group yields 1-3 DWPD
• How many sets/groups? Measurements and future advisories suggest the NVMe interface may be over-subscribed
• Tradeoffs (architecture dependent):
  • Endurance for workloads
  • Write IOPS/capacity
  • Tier capacity ratios (see the sketch below)
• Will the implementation allow partial failures to be isolated within an endurance group? Example: loss of a flash die or FTL core
(Diagram: representative flash architecture with four NVMe Sets / endurance groups)
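A hedged sketch of the tier-capacity-ratio tradeoff: carving one drive's raw NAND into a write-optimized endurance group (high OP) and a capacity-optimized group (low OP). The raw capacity, split fraction, and OP levels are hypothetical, and whether a device exposes per-group OP at all is, as noted above, architecture dependent:

```python
# Carve one drive's raw NAND into two endurance groups with different OP.
# All parameters are hypothetical illustrations.

RAW_TB = 16.0   # assumed raw NAND on the drive

def carve(raw_write_fraction: float, write_op: float, cap_op: float):
    """Usable TB of (write tier, capacity tier) for a given raw split."""
    raw_write = RAW_TB * raw_write_fraction
    raw_cap = RAW_TB - raw_write
    # advertised = raw / (1 + OP), the inverse of the OP definition
    return raw_write / (1 + write_op), raw_cap / (1 + cap_op)

w, c = carve(0.25, write_op=1.40, cap_op=0.07)
print(f"write tier: {w:.2f} TB usable, capacity tier: {c:.2f} TB usable")
```

Sweeping `raw_write_fraction` makes the tradeoff visible: every point of write endurance bought with OP comes directly out of the advertised capacity of the same drive.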

22. Example: 24 drives × 4 sets/drive = 48 performance + 48 capacity set/group members; a single drive type to stock
• Opportunity for lower set/group loss impact
  • Whole-drive loss impacts two storage tiers, but with likely lower impact on each
  • Less capacity to rebuild
• Opportunity for more efficient utilization of the NVMe interface on many workloads, if the NVMe interface QoS allows stranded bandwidth to float among NVMe sets
(Diagram: sets 1-A through 4-D striped across drives 1-24)
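A small sketch of the rebuild-exposure argument: a whole-drive loss now touches two tiers, but each affected set holds only a quarter of the drive, so each rebuild moves far less data. The drive size and rebuild bandwidth are assumed constants for illustration:

```python
# Compare rebuild time for a monolithic drive vs. a quarter-drive set.
# DRIVE_TB and REBUILD_GBPS are illustrative assumptions.

DRIVE_TB = 15.36
REBUILD_GBPS = 0.5   # assumed sustained rebuild bandwidth, GB/s

def rebuild_hours(tb: float) -> float:
    """Hours to rebuild `tb` terabytes at the assumed bandwidth."""
    return tb * 1000 / REBUILD_GBPS / 3600

print(f"monolithic drive:       {rebuild_hours(DRIVE_TB):.1f} h")
print(f"one quarter-drive set:  {rebuild_hours(DRIVE_TB / 4):.1f} h")
```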

23. Thank You! Questions? Suggestions? Things I got wrong or missed?