Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar
1.Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar Vincent Xie (Bestpay), Jia Zhai (StreamNative)
2.About us Vincent (Weisheng) Xie ❏ Current Director @ Orange Financial ❏ Previous Tech lead of ML engineering team @ Intel Jia Zhai ❏ Co-Founder of StreamNative ❏ Apache Pulsar PMC Member ❏ Apache BookKeeper PMC Member
3.Agenda ❏ Background ❏ Apache Pulsar ❏ Unified Data Processing ❏ Our Practices ❏ Q&A
4.Background Intro
5.Orange Financial Orange Financial Services Group (Chinese: 甜橙金融), formerly known as Bestpay, is an affiliate company of China Telecom. It reached 1.13 trillion CNY ($18.37 Billion) transaction volume in 2018, with 500 million registered users and 41.9 million active users. Subsidiaries: ❏ Bestpay - a mobile wallet and payment app ❏ Jieqian - a consumer loan service ❏ Orange Wealth ❏ Orange Insurance ❏ Orange Credit ❏ Orange Financial Cloud
7.Source: iiMedia Research Inc.
8.High Industry Penetration Rate Source: China Unionpay
9.Source: RSA
10.Challenges ❏ High concurrency ❏ > 50M transactions, 1 billion events a day (peak: 35K/s) ❏ Low latency demand ❏ response < 200ms ❏ Large number of batch jobs and streaming jobs
11.“A merchant’s total transaction volume ($) within the past month (30 days) (current transaction included)” = sum($past_29days) [batch] + sum($today_upto_current) [streaming]
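The split in this formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Txn` and `monthly_volume` are hypothetical names, and two in-memory lists stand in for the batch store (history) and the stream (today's events).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Txn:
    # Hypothetical transaction record, not a schema from the talk.
    merchant: str
    amount: float
    ts: datetime


def monthly_volume(history, today_stream, merchant, now):
    """30-day volume = batch part (past 29 full days) + streaming part (today so far)."""
    today_start = now.replace(hour=0, minute=0, second=0, microsecond=0)
    window_start = today_start - timedelta(days=29)

    # Batch part: closed days, typically precomputed by an offline job.
    batch_part = sum(
        t.amount for t in history
        if t.merchant == merchant and window_start <= t.ts < today_start
    )
    # Streaming part: today's events up to and including the current one.
    streaming_part = sum(
        t.amount for t in today_stream
        if t.merchant == merchant and today_start <= t.ts <= now
    )
    return batch_part + streaming_part
```

In a Lambda-style setup the two sums are usually produced by different systems, so the feature logic ends up written twice.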
12.Architecture V1 API Gateway
13.Architecture V1 - Lambda ❏ Speed/Streaming Layer ❏ Batch Layer ❏ Serving Layer ❏ API Gateway
14.Drawbacks ❏ Software stack complexity ❏ Realtime / Offline / Serving stacks ❏ Multiple clusters to maintain (Kafka / Hive / Spark / Flink) ❏ Different skill sets required (Scala / Java / SQL) ❏ Segmented logic ❏ Historical / Current ❏ Data redundancy ❏ Multiple copies of the data to move between systems
15.Introduce Apache Pulsar
16.What is Apache Pulsar?
17. “Flexible Pub/Sub Messaging Backed by durable log storage”
18.Pulsar - A cloud-native architecture Stateless Serving Durable Storage
19.Pulsar - Segment Centric Storage ❏ Topic Partition (Managed Ledger) ❏ The storage layer for a single topic partition ❏ Segment (Ledger) ❏ Single writer, append-only ❏ Replicated to multiple bookies
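The relationship between a topic partition and its segments can be modelled with a toy example. This is a sketch, not Pulsar or BookKeeper code: the class names, the entry-count rollover rule, and the round-robin replica placement are all assumptions made for illustration.

```python
class Segment:
    """Toy stand-in for a ledger: append-only, read-only once sealed."""

    def __init__(self, segment_id, replicas):
        self.segment_id = segment_id
        self.replicas = tuple(replicas)  # storage nodes holding a copy
        self.entries = []
        self.sealed = False

    def append(self, entry):
        if self.sealed:
            raise ValueError("segment is sealed")
        self.entries.append(entry)


class TopicPartition:
    """Toy stand-in for a managed ledger: an ordered list of segments."""

    def __init__(self, storage_nodes, replication=2, max_entries=3):
        self.storage_nodes = list(storage_nodes)
        self.replication = replication
        self.max_entries = max_entries  # assumed rollover threshold
        self.segments = []
        self._roll()

    def _roll(self):
        # Seal the current segment and open a new one on another set of nodes.
        if self.segments:
            self.segments[-1].sealed = True
        sid = len(self.segments)
        n = len(self.storage_nodes)
        # Round-robin placement is an assumption, not the real placement policy.
        replicas = [self.storage_nodes[(sid + i) % n] for i in range(self.replication)]
        self.segments.append(Segment(sid, replicas))

    def append(self, entry):
        # Only the last (open) segment ever receives writes.
        if len(self.segments[-1].entries) >= self.max_entries:
            self._roll()
        self.segments[-1].append(entry)

    def read_all(self):
        return [e for s in self.segments for e in s.entries]
```

Because only the last segment is open for writes, older segments are immutable and can be read, copied, or offloaded independently.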
20.Pulsar - Pub/Sub
21.Pulsar - Topic Partitions
22.Pulsar - Segments
23.Pulsar - Stream
24.Pulsar - Stream as a unified view on data
25.Pulsar - Two levels of reading API ❏ Pub/Sub (Streaming) ❏ Read data from brokers ❏ Consume / Seek / Receive ❏ Subscription Mode - Failover, Shared, Key_Shared ❏ Reprocessing data by rewinding (seeking) the cursors ❏ Segment (Batch) ❏ Read data from storage (BookKeeper or tiered storage) ❏ Fine-grained Parallelism ❏ Predicate pushdown (publish timestamp)
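The two read paths can be contrasted with a small in-memory model. This is a sketch only: `Cursor` and `read_segments` are hypothetical names rather than Pulsar client calls, and each segment is assumed to be a list of `(publish_time, payload)` tuples sorted by publish time.

```python
from concurrent.futures import ThreadPoolExecutor


class Cursor:
    """Streaming-style read: one position that advances and can be rewound."""

    def __init__(self, messages):
        self.messages = messages
        self.position = 0

    def receive(self):
        if self.position >= len(self.messages):
            return None
        msg = self.messages[self.position]
        self.position += 1
        return msg

    def seek(self, position):
        # Rewinding the cursor makes earlier messages available again.
        self.position = position


def read_segments(segments, min_publish_time):
    """Batch-style read: scan segments in parallel, skipping old ones entirely."""
    # Pushdown on publish time: a segment whose newest message is too old
    # is never scanned.
    candidates = [s for s in segments if s and s[-1][0] >= min_publish_time]

    def scan(segment):
        return [m for m in segment if m[0] >= min_publish_time]

    # One task per segment; pool.map keeps results in segment order.
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(scan, candidates))
    return [m for part in parts for m in part]
```

The cursor path preserves order through a single reader, while the segment path trades that for parallelism across immutable segments.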
26.Unified Data Processing on Pulsar
27.Architecture V2 Spark Structured Streaming Spark SQL API Gateway
28.Architecture V2 Spark Structured Streaming Spark SQL API Gateway ❏ Single Data Store (Pulsar) ❏ Single Computing Engine (Spark) ❏ Unified API
29.Pulsar-Spark ❏ Deeply integrated with Pulsar schema ❏ Pulsar topics as Structured Streams ❏ Pulsar Connectors for Spark Structured Streaming ❏ Pulsar Connectors for Spark SQL https://github.com/streamnative/pulsar-spark
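A rough sketch of how one topic might be read both as a stream and as a batch table through the connector. The data source name `pulsar` and the option keys `service.url`, `admin.url`, and `topic` are assumptions based on the connector's README and should be checked against the version in use. The helper function names are hypothetical, and `spark` is expected to be an existing `SparkSession`.

```python
def pulsar_options(service_url, admin_url, topic):
    # Option keys are assumed from the connector README; verify for your version.
    return {"service.url": service_url, "admin.url": admin_url, "topic": topic}


def _load(reader, options):
    # Works for both spark.read and spark.readStream, which share this builder API.
    reader = reader.format("pulsar")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()


def read_topic_as_stream(spark, options):
    """Structured Streaming view of the topic (unbounded DataFrame)."""
    return _load(spark.readStream, options)


def read_topic_as_batch(spark, options):
    """Spark SQL view of the same topic (bounded DataFrame)."""
    return _load(spark.read, options)
```

With a real session, both functions would return DataFrames over the same topic, so the same transformation code could be applied to either.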