Five years of operating a large scale globally replicated Pulsar installation— Francis&Ludwig Pu


1.Five Years of Operating a Large Scale Globally Replicated Pulsar Installation June 18, 2020 Ludwig Pummer Joe Francis

2.Who are we Ludwig Pummer Principal Production Engineer Verizon Media Joe Francis Director, Core Platforms Verizon Media 2

3.What do we do ● Operate a hosted pub-sub service within VMG ○ open-sourced as Pulsar ● Global presence ○ 6 DC (Asia, Europe, US) ○ full mesh replication ● Business critical use cases ○ Serving use cases ○ Lower latency bus for other low latency service 3

4.Use cases ● Application integration ○ Server-to-server control, status, notification messages ● Persistent queue ○ Buffering, feed ingestion, task distribution ● Message bus for large scale data stores ○ Durable log ○ Replication within and across geo-locations 4

5.Trajectory 2015 2016 2020 1 tenant 20 tenants 100+ tenants 2 clusters, 2 DC 12 clusters, 6 DC 18 clusters, 6 DC 60K wps @ 2KB 500K wps avg 1.3M wps avg / 3M peak 60K rps 1.1M rps avg 2M rps avg / 6M peak <100 topics 1.4M topics 2.8M topics 5

6.Scaling up a Cluster More deliveries (increase fanout) Add Brokers More publishes Add Bookies, Brokers More storage Add Bookies Massively more topics Add Clusters (SuperCluster) SuperCluster (peer cluster) 6

7.Storage & I/O auto-balancing wps by namespace wps by bookkeeper 7

8.SuperCluster PIP 8: peer clusters dc1 dc2 dc3 dc1 peer dc1b dc2 peer dc2b dc1b dc2b ns1: {dc1, dc2, dc3} ns2: {dc1b, dc2b, dc3} ns3: {dc1, dc2b} ns4: {dc1, dc1b, dc2} 8

9.Provisioning Model New Tenant: ● Average message size ● Peak publishes per second ● Steady-state deliveries per second (fan out) ● Per cluster/DC ● x509 principal Calculate: ● Broker messages/sec ● Broker bandwidth ● Bookie MB/sec 9

10.Hardware Evolution: Brokers Pre-2015 2015 2020 Dual Xeon E5620 (Gulftown) Dual Xeon E5-2620 (Ivy Dual Xeon E5-2620 (Ivy 8-core Bridge) 12-core Bridge) 12-core 24GB RAM 48GB RAM 96GB RAM 1G NIC 10G NIC 10G or 25G NIC 10

11.Hardware Evolution: BookKeepers Pre-2015 2015 2020 Dual Xeon E5-2630 (Sandy Dual Xeon E5-2620 (Ivy Dual Xeon Gold 6240 Bridge) 12-core Bridge) 12-core 36-core 32GB RAM 64GB RAM 192GB RAM 1G NIC 10G NIC 25G NIC 12 x 300GB 15K RPM SAS 10 x 4TB 7.2k RPM SATA 4x 4TB NVMe drives 2 x 120GB SSD 2x 128GB Optane Persistent (1 x HW RAID-10 of 10 drives Memory (2 x HW RAID-10 of 6 drives) 1 x HW RAID-1 of 2 SSDs) Variants: Variants: 4x 8TB NVMe HW RAID-1 of 2 add’l 4TB SW RAID-1 of 2 larger SSD 11

12.Hardware Evolution: ZooKeepers Pre-2015 2015 2020 Dual Xeon E5620 (Gulftown) Dual Xeon E5-2620 (Ivy Dual Xeon E5-2620 (Ivy 8-core Bridge) 12-core Bridge) 12-core 24GB RAM 64GB RAM 64GB RAM 1G NIC 10G NIC 10G NIC 240 GB SSD 2x 240 GB SSD 2x 960 GB SSD 12

13.Footprint Brokers: Mix of 2015+ config 550 brokers Heterogeneous even within cluster Bookies: Mostly 2015 config 450 bookies Heterogeneous across clusters. Homogeneous within cluster (or isolation group) 13

14.JVM Garbage Collector: G1GC Pulsar 1.x heap usage gc pause 14

15.JVM Garbage Collector: G1GC Pulsar 2.x heap usage gc pause 15

16.JVM Garbage Collector: ZGC heap usage gc pause 16

17.Statistics and Monitoring 17

18.Statistics and Monitoring: Too much data 18

19. Statistics and Monitoring: Tenant metrics collector Brokers /admin/destinations topic monitor 19

20.Statistics and Monitoring: Tenant metrics 20

21.Deployment ● No downtime. Manage risk. ● Staged. Sequenced. ● Low parallelism ● Managed by Screwdriver jobs ○ ● Screwdriver launches Ansible-like tool 21

22.Deployment Sequence Hosts 1. ZooKeeper (not prod) 2. BookKeeper 3. Broker 22

23.Broker Isolation Policies { "namespaces" : [ "tenant-one/.*" ], "primary" : [ "broker6[7-9]" ], "secondary" : [ "none" ], "auto_failover_policy" : { "policy_type" : "min_available", "parameters" : { "min_limit" : "0", "usage_threshold" : "100" } } } 23

24.Broker Isolation Policies Uses ● High Profile/Reserved capacity ● Misbehaving tenants ● Debugging 24

25.BookKeeper Storage Utilization Factors and Configuration impacting cluster storage utilization ● Number of topics x write throughput ● MinLedgerRolloverTime, MaxLedgerRolloverTime, CursorRolloverTime ● Over-replication ● Increased write-quorum ● Increased topic TTL ● Increased retention period ● Crossing BookKeeper compaction threshold ● Compaction thresholds and intervals 25

26.BookKeeper Rack Awareness Rack as a failure domain Historical Rack Aware implementation 2 Rack vs N Racks 26

27.Host Failure Response BookKeeper down: handled automatically, no problem slow (hardware issue): enable ForceReadOnly, wait for repair Broker down: handled automatically, no problem GC too busy: restart topic issue: unload topic, unload bundle, restart broker 27

28.Operational Tools pulsar-admin ● primary troubleshooting & investigative tool orphan ledger cleanup tool ● delete ledgers not referenced by any topic ad-hoc python scripts using kazoo and ML protobuf ● abandoned namespace/topic cleanup ● storage utilization analysis 28

29.Thank you.

StreamNative 是一家围绕 Apache Pulsar 和 Apache BookKeeper 打造下一代流数据平台的开源基础软件公司。秉承 Event Streaming 是大数据的未来基石、开源是基础软件的未来这两个理念,专注于开源生态和社区的构建,致力于前沿技术。