Pulsar Storage on BookKeeper: Joe Francis & Rajan Dhabalia


1. Pulsar Storage on BookKeeper: Seamless Evolution
June 17, 2020
Joe Francis (joef@verizonmedia.com), Rajan Dhabalia (rdhabalia@verizonmedia.com)

2. Speakers
● Joe Francis: Director, Verizon Media
● Rajan Dhabalia: Principal Software Engineer, Verizon Media

3. Agenda
● Pulsar in Verizon Media
● Benchmarking for production use
● Pulsar IO Isolation
● BookKeeper with different storage devices
● Case-study: Kafka use case on Pulsar
● Future

4. Verizon Media & Pulsar
● Developed as a hosted pub-sub service within Yahoo/VMG
  ○ Open-sourced in 2016
● Global deployment
  ○ 6 data centers (Asia, Europe, US)
  ○ Full-mesh replication
● Mission-critical use cases
  ○ Serving applications
  ○ Lower latency bus for use by other low latency services
  ○ Write availability

5. Benchmarking for production
● Most benchmark numbers do not test production scenarios
  ○ Messaging systems work well when
    ■ data fits in memory
    ■ there is no disk I/O in the critical path (write or read)
● Pulsar was designed to work well under real-world workloads
  ○ Lagging consumers, replay
    ■ Backlog reads from disk will occur
  ○ Disks and brokers crash or fail
    ■ Pulsar ack guarantee: data is synced to disk on 2+ hosts
  ○ Latencies remain unaffected by load variations
    ■ backlog reads (I/O isolation)
    ■ failures (instantaneous recovery)
● Cost matters
  ○ Compute ($) vs. storage ($$)
● Benchmark for production use!
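
The ack guarantee on this slide (data synced to disk on 2+ hosts) can be pictured as a quorum wait: the producer is acknowledged once enough bookies have synced the entry, so one slow bookie does not set the publish latency. The following is a minimal sketch under stated assumptions, not BookKeeper client code: the function name, the choice of three bookies with an ack quorum of two, and the timing values are all made up for illustration.

```python
def ack_latency_ms(bookie_sync_ms, ack_quorum):
    """Time at which the producer can be acked: the moment the
    ack_quorum-th fastest bookie has synced the entry to disk."""
    if not 1 <= ack_quorum <= len(bookie_sync_ms):
        raise ValueError("ack_quorum must be between 1 and the number of bookies")
    return sorted(bookie_sync_ms)[ack_quorum - 1]


# Hypothetical per-bookie sync times (ms) for one entry sent to 3 bookies.
# The third bookie is slow, for example because it is serving backlog reads.
sync_times = [1.2, 1.5, 40.0]

print(ack_latency_ms(sync_times, ack_quorum=2))  # 1.5
print(ack_latency_ms(sync_times, ack_quorum=3))  # 40.0
```

With a quorum of two, the slow bookie is off the critical path; waiting for all three would expose the producer to the slowest disk.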

6. Data paths
[Diagram. Labels: three applications (one producer, two consumers); broker (cache: RAM); two bookies, each with RAM, journal, and data; ack arrows; cold reads.]
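
The labels on this diagram suggest two read paths: recent entries served from the broker's RAM cache, and older entries fetched from a bookie's data device as cold reads, while writes go to the journal. The toy model below is only a sketch of that idea. The class names, the eviction rule, and the choice to wait for every bookie's ack are simplifying assumptions, not how the Pulsar broker or bookie is implemented.

```python
class Bookie:
    """Toy bookie: a journal for writes and a separate data store for reads."""

    def __init__(self):
        self.journal = []  # append-only list, stands in for the journal device
        self.data = {}     # dict, stands in for the data (ledger) device

    def add_entry(self, entry_id, payload):
        self.journal.append((entry_id, payload))  # write path: sequential append
        self.data[entry_id] = payload             # kept for later cold reads
        return True                               # ack back to the broker

    def read_entry(self, entry_id):
        return self.data[entry_id]                # read path: journal is not touched


class Broker:
    """Toy broker: caches the most recent entries in RAM."""

    def __init__(self, bookies, cache_size):
        self.bookies = bookies
        self.cache_size = cache_size
        self.cache = {}
        self.cold_reads = 0

    def publish(self, entry_id, payload):
        acks = [b.add_entry(entry_id, payload) for b in self.bookies]
        self.cache[entry_id] = payload
        # Entry ids only grow here, so the smallest id is the oldest entry.
        while len(self.cache) > self.cache_size:
            del self.cache[min(self.cache)]
        return all(acks)

    def read(self, entry_id):
        if entry_id in self.cache:
            return self.cache[entry_id]           # tailing consumer: served from RAM
        self.cold_reads += 1                      # lagging consumer: goes to a bookie
        return self.bookies[0].read_entry(entry_id)


broker = Broker([Bookie(), Bookie()], cache_size=2)
for i in range(5):
    broker.publish(i, f"msg-{i}")

tail = broker.read(4)  # still in the broker cache
old = broker.read(0)   # evicted from the cache, read from a bookie
print(tail, old, broker.cold_reads)  # msg-4 msg-0 1
```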

7. BookKeeper IO Isolation

8. Pulsar Journey

9. First Generation Storage - HDD
- JOURNAL-Device: HDD with RAID10
- DATA-Device: HDD with RAID10
- Index: Interleaved index files
- HDD
  - Fast, low-latency sequential writes on HDD with a battery-backed RAID controller
  - Random seek time is much longer for HDD
  - Economical
- Journal Device
  - Fast sequential writes
- Ledger Device
  - Sequential writes to a single entry-log data file for multiple streams
  - Most IOPS are used for
    - Backlog draining (cold reads)
    - Reads and writes on index files

10. Optimizing random IOs for Indexing
- Index on interleaved file
  - One index file for each topic
  - Random IO while updating the index
  - Scaling the number of topics increases random IOs and file handles
- Index on RocksDB
  - LSM-based embedded key-value store
  - Used as a library within the bookie process; no additional operational effort
  - Less write amplification and better compression
  - Drastically reduces random IOPS for indexing
  - Small footprint (< 10 GB); mostly in RAM
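
The contrast on this slide can be sketched in a few lines: entries from many streams are appended to one entry log, and a single key-value index records where each entry landed, so no per-topic index file is needed. This is an illustration only. A plain Python dict stands in for RocksDB, and the key layout `(ledger_id, entry_id) -> (offset, length)` and the class name are assumptions, not the bookie's actual on-disk format.

```python
class EntryLogWithIndex:
    """Toy ledger storage: one shared entry log plus one key-value index."""

    def __init__(self):
        self.entry_log = bytearray()  # one sequential log shared by all ledgers
        self.index = {}               # (ledger_id, entry_id) -> (offset, length)

    def add(self, ledger_id, entry_id, payload):
        offset = len(self.entry_log)
        self.entry_log += payload     # sequential append, whatever the ledger
        self.index[(ledger_id, entry_id)] = (offset, len(payload))

    def read(self, ledger_id, entry_id):
        offset, length = self.index[(ledger_id, entry_id)]
        return bytes(self.entry_log[offset:offset + length])


store = EntryLogWithIndex()
store.add(1, 0, b"a0")  # ledger 1
store.add(2, 0, b"b0")  # ledger 2, interleaved in the same log
store.add(1, 1, b"a1")  # ledger 1 again

print(bytes(store.entry_log))  # b'a0b0a1'
print(store.index[(1, 1)])     # (4, 2)
print(store.read(1, 1))        # b'a1'
```

Adding more ledgers grows the number of keys in one store rather than the number of open index files.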

11. Second Generation: SSD/NVMe
- JOURNAL-Device: NVMe/SSD
- DATA-Device: NVMe/SSD
- Index: RocksDB
- SSD/NVMe
  - SSD provides better performance for sequential and random I/O
  - NVMe supports a large command queue (64K) with parallel IO
- Journal Device
  - Bookie can use multiple journal directories to utilize parallel writes on NVMe
  - Achieves 3x Pulsar throughput with low latency, compared to HDD
- Ledger Device
  - Significantly faster random reads than HDD
  - Faster backlog draining while doing cold reads for multiple topics
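
One way to picture "multiple journal directories" is a fixed mapping from each ledger to one journal, so that writes for different ledgers can proceed in parallel on the same NVMe device. The sketch below assumes a simple modulo assignment; the function name, the paths, and the policy itself are illustrative assumptions rather than a description of the bookie's internals.

```python
from collections import Counter


def journal_for_ledger(ledger_id, journal_dirs):
    """Pick a journal directory for a ledger (assumed modulo policy)."""
    return journal_dirs[ledger_id % len(journal_dirs)]


# Hypothetical journal directories on one NVMe device.
journal_dirs = [
    "/mnt/nvme/journal-0",
    "/mnt/nvme/journal-1",
    "/mnt/nvme/journal-2",
]

# Twelve ledgers spread evenly: four per journal directory.
spread = Counter(journal_for_ledger(ledger_id, journal_dirs) for ledger_id in range(12))
for directory, count in sorted(spread.items()):
    print(directory, count)
```

Keeping a ledger pinned to one journal preserves the append order of its entries within that journal.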

12. Storage Device: Sequential vs. Random IO

13. Storage Device: Performance vs. Cost

14. Storage Evolution & Pulsar Adaptation: PMEM
PMEM
● Highest performing block storage device
● Ultra fast, very high throughput with consistent low latency
● Expensive; well suited as a small device for write-intensive use cases
Journal Device
● WAL/journal is a proven design in databases
  ○ transactional storage and recovery
  ○ high throughput
● Write-optimized, append-only structure
● Does not require much storage and keeps short-lived transactional data
● Using PMEM for the journal device
  ○ adds < 5% cost for each bookie
  ○ increases Pulsar throughput 5x, with low publish latency

15. Pulsar Performance with Different BK-Journal Device
Performance configuration
● Enabled fsync on every published message
● Publish throughput with backlog draining
● SLA: 5 ms (99%-ile latency)
  ○ HDD: 120 MB/s
  ○ SSD: 200 MB/s
  ○ NVMe: 350 MB/s
  ○ PMEM: 600 MB/s
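
The throughput figures on this slide can be checked against the speedups quoted on the earlier slides (about 3x for the SSD/NVMe generation and 5x for PMEM, both relative to HDD). The snippet below only divides the slide's own numbers by the HDD baseline; the variable names are illustrative.

```python
# Publish throughput per journal device, as listed on the slide.
throughput = {"HDD": 120, "SSD": 200, "NVMe": 350, "PMEM": 600}

baseline = throughput["HDD"]
speedup = {device: round(value / baseline, 2) for device, value in throughput.items()}

print(speedup)  # {'HDD': 1.0, 'SSD': 1.67, 'NVMe': 2.92, 'PMEM': 5.0}
```

NVMe comes out at roughly 3x and PMEM at exactly 5x the HDD figure, which matches the claims made for those devices.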

16. Case-study: Migrate Kafka Use Case to Pulsar
● Cost and throughput
  ■ Using PMEM for the journal adds < 5% cost per host but reduces overall cost and cluster footprint
  ■ Achieves 5x more throughput with 99%-ile write latency < 5 ms
● Cluster footprint
  ■ Kafka cluster: 33 Kafka brokers
  ■ Pulsar cluster: 10 bookies and 16 brokers
    ○ A Pulsar broker is a stateless component and costs about 1/4 as much as a bookie
  ■ Overall, the Pulsar cluster uses about ½ the resources of the Kafka cluster
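
The "½ of the Kafka cluster" figure can be roughly reproduced from the numbers on this slide. The calculation below is a back-of-the-envelope sketch with two assumptions that the slide does not state: a Kafka broker is taken to cost the same as one bookie without PMEM, and the PMEM overhead is taken at its upper bound of 5%. The function and parameter names are illustrative.

```python
def bookie_equivalents(bookies, brokers, broker_cost_ratio=0.25, pmem_overhead=0.05):
    """Cluster cost expressed in units of one bookie without PMEM."""
    return bookies * (1 + pmem_overhead) + brokers * broker_cost_ratio


pulsar_cost = bookie_equivalents(bookies=10, brokers=16)  # 10.5 + 4.0
kafka_cost = 33  # assumption: one Kafka broker costs as much as one bookie

ratio = pulsar_cost / kafka_cost
print(round(pulsar_cost, 2), round(ratio, 2))  # 14.5 0.44
```

Under these assumptions the Pulsar cluster costs about 0.44 of the Kafka cluster, in line with the slide's "½".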

17. Case-study: Migrate Kafka Use Case to Pulsar
[Comparison table. Columns: Use cases, Apache Pulsar, Apache Kafka. Rows: Throughput with low latency; Cost; Geo-replication; Queuing; Committing messages. Cell contents were not preserved.]

18. Future
● Use the PMDK API to access persistent memory
  ○ bypass the file system
  ○ better throughput
● Tiered storage for historical data use cases
  ○ relaxed latency requirements
  ○ lower cost
  ○ Use cases
    ■ ML model training
    ■ audit, forensics

19. Thank you
Joe Francis (joef@verizonmedia.com), Rajan Dhabalia (rdhabalia@verizonmedia.com)

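StreamNative is an open-source infrastructure software company building a next-generation streaming data platform around Apache Pulsar and Apache BookKeeper. Guided by two beliefs, that event streaming is the future cornerstone of big data and that open source is the future of infrastructure software, it focuses on building the open-source ecosystem and community and is committed to cutting-edge technology.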