Stream&Segment - best way to access events in Pulsar


1.Stream/Segment - best way to access events in Pulsar ❏ Neng Lu ❏ streamnative.io

2.Who Am I ❏ StreamNative Software Engineer ❏ Ex-Twitter ❏ Contributed to Apache Projects - Heron, Pulsar ❏ Interested in event streaming technologies

3.Pulsar 1.X

4.Apache Pulsar “Flexible Pub/Sub Messaging Backed by Durable Log Storage”

5.Pulsar 2.X

6.Apache Pulsar “Cloud-native Messaging and Event Streaming Platform”

7.Pulsar Use Cases ❏ Unified Event Center/Bus (Queuing + Streaming) ❏ Billing Service ❏ Push Notification ❏ Worker Queue ❏ Logging Pipeline ❏ IoT ❏ Streaming-first, unified data processing

8.Data Processing with Apache Pulsar

9.Data Processing Categories ❏ Batch: the amount of data is huge; can run on a huge cluster; fine-grained fault tolerance

10.Data Processing Categories ❏ Batch: the amount of data is huge; can run on a huge cluster; fine-grained fault tolerance ❏ Streaming: long-running jobs; time critical; scalability as well as fault tolerance

11.Data Processing Categories ❏ Batch: the amount of data is huge; can run on a huge cluster; fine-grained fault tolerance ❏ Streaming: long-running jobs; time critical; scalability as well as fault tolerance ❏ Interactive: time critical; medium data size; rerun on failures

12.Data Processing Categories ❏ Batch: the amount of data is huge; can run on a huge cluster; fine-grained fault tolerance ❏ Streaming: long-running jobs; time critical; scalability as well as fault tolerance ❏ Interactive: time critical; medium data size; rerun on failures ❏ Serverless: simple, light-weight processing; processing data with high velocity

13.Apache Pulsar Layered Architecture ❏ Stateless Serving ❏ Durable Storage

14.Pulsar Messaging API ❏ Read data from brokers with different Subscription Modes ❏ Consume / Seek / Receive ❏ Reprocessing data by rewinding (seeking) the cursors
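
The idea of reprocessing by rewinding a cursor can be illustrated with a small in-memory model. This is only a sketch: `ToyTopic`, `ToyCursor`, and `drain` are hypothetical names invented here and are not part of the Pulsar client API; a real consumer would talk to a broker instead of a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """Hypothetical stand-in for a topic: an append-only list of (publish_ts, payload)."""
    log: list = field(default_factory=list)

    def publish(self, publish_ts, payload):
        self.log.append((publish_ts, payload))


@dataclass
class ToyCursor:
    """Tracks the next position a subscription will read from."""
    topic: ToyTopic
    position: int = 0

    def receive(self):
        """Return the next payload, or None when the cursor has caught up."""
        if self.position >= len(self.topic.log):
            return None
        _, payload = self.topic.log[self.position]
        self.position += 1
        return payload

    def seek_to_timestamp(self, ts):
        """Move to the first message published at or after ts."""
        self.position = next(
            (i for i, (t, _) in enumerate(self.topic.log) if t >= ts),
            len(self.topic.log),
        )


def drain(cursor):
    """Read until the cursor is caught up and return the payloads in order."""
    out = []
    while (msg := cursor.receive()) is not None:
        out.append(msg)
    return out
```

Draining the cursor, seeking back to an earlier publish timestamp, and draining again yields the tail of the log a second time, which is the reprocessing pattern the slide refers to.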

15.Subscription Mode ❏ Exclusive ❏ Failover ❏ Shared ❏ Key_Shared
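
How the four modes differ in who receives which message can be sketched with a toy dispatcher. This is a simplified model under stated assumptions, not broker code: the `dispatch` function is hypothetical, Failover is reduced to "the first listed consumer is active", Shared is plain round-robin, and Key_Shared uses a CRC32 hash of the key modulo the consumer count.

```python
import zlib
from collections import defaultdict


def dispatch(messages, consumers, mode):
    """Assign (key, payload) messages to consumer names under a toy model of each mode."""
    if not consumers:
        return {}
    if mode == "Exclusive" and len(consumers) > 1:
        # Only a single consumer may attach to an exclusive subscription.
        raise ValueError("Exclusive allows exactly one consumer")

    assigned = defaultdict(list)
    for i, (key, payload) in enumerate(messages):
        if mode in ("Exclusive", "Failover"):
            # One active consumer receives everything; in Failover the rest are standbys.
            target = consumers[0]
        elif mode == "Shared":
            # Round-robin across consumers; per-key ordering is not preserved.
            target = consumers[i % len(consumers)]
        elif mode == "Key_Shared":
            # Messages with the same key always land on the same consumer.
            target = consumers[zlib.crc32(key.encode()) % len(consumers)]
        else:
            raise ValueError(f"unknown mode: {mode}")
        assigned[target].append(payload)
    return dict(assigned)
```

The practical trade-off the model shows: Shared spreads load evenly but interleaves keys across consumers, while Key_Shared keeps each key on one consumer at the cost of possibly uneven load.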

16.Pulsar Segment API ❏ Read data from storage (bookkeeper or tiered storage) ❏ Fine-grained Parallelism ❏ Predicate pushdown (publish timestamp)
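
Predicate pushdown on publish timestamp and fine-grained parallelism can be sketched as two steps: drop segments whose time range cannot match, then spread the survivors over workers. The names below (`SegmentMeta`, `prune_segments`, `assign_to_workers`) and the per-segment min/max timestamp metadata are assumptions made for illustration, not the actual Segment API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMeta:
    """Hypothetical per-segment metadata used for pruning."""
    segment_id: int
    min_publish_ts: int  # earliest publish timestamp in the segment
    max_publish_ts: int  # latest publish timestamp in the segment


def prune_segments(segments, start_ts, end_ts):
    """Keep only segments whose timestamp range overlaps [start_ts, end_ts]."""
    return [
        s for s in segments
        if s.max_publish_ts >= start_ts and s.min_publish_ts <= end_ts
    ]


def assign_to_workers(segments, num_workers):
    """Spread segments round-robin so each worker can read its share in parallel."""
    plan = [[] for _ in range(num_workers)]
    for i, seg in enumerate(segments):
        plan[i % num_workers].append(seg.segment_id)
    return plan
```

Because pruning uses only metadata, segments outside the requested time window are never read from storage at all.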

17.Segment Centric Storage ❏ Topic Partition (Managed Ledger) ❏ The storage layer for a single topic partition ❏ Segment (Ledger) ❏ Single writer, append-only ❏ Replicated to multiple bookies
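
The relationship between a managed ledger and its segments can be sketched as an append-only list of ledgers that rolls over when the current one fills up. This is a toy model with invented names (`ToyLedger`, `ToyManagedLedger`); the rollover rule (a fixed entry count) and the replica placement (rotating through the bookie list) are simplifying assumptions, not BookKeeper's actual policies.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLedger:
    """One segment: append-only while open, immutable once closed."""
    ledger_id: int
    bookies: tuple  # bookies assumed to hold a replica of every entry
    entries: list = field(default_factory=list)
    closed: bool = False

    def append(self, entry):
        if self.closed:
            raise RuntimeError("closed ledgers are immutable")
        self.entries.append(entry)


class ToyManagedLedger:
    """Storage for one topic partition: an ordered list of ledgers (segments)."""

    def __init__(self, bookies, replicas=2, max_entries_per_ledger=3):
        self.bookies = list(bookies)
        self.replicas = replicas
        self.max_entries = max_entries_per_ledger
        self.ledgers = []
        self._open_new_ledger()

    def _open_new_ledger(self):
        ledger_id = len(self.ledgers)
        # Pick a replica set by rotating through the bookie list.
        chosen = tuple(
            self.bookies[(ledger_id + i) % len(self.bookies)]
            for i in range(self.replicas)
        )
        self.ledgers.append(ToyLedger(ledger_id, chosen))

    def append(self, entry):
        """Append to the single open ledger and return (ledger_id, entry_id)."""
        current = self.ledgers[-1]
        if len(current.entries) >= self.max_entries:
            current.closed = True    # seal the full segment
            self._open_new_ledger()  # roll over to a fresh one
            current = self.ledgers[-1]
        current.append(entry)
        return current.ledger_id, len(current.entries) - 1
```

Since each new segment can be placed on a different set of bookies, a partition's data is spread across the cluster without moving older segments.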

18.Tiered Storage ❏ Long retention ❏ Low cost ❏ Easy to access
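
One way to picture how sealed segments move to cheaper storage is a size-based offload plan. The sketch below is an assumption-laden toy: `plan_offload` is a hypothetical helper, segments are plain dictionaries, and the rule (offload the oldest closed segments until the data left on bookies fits a threshold) is a simplification chosen for illustration.

```python
def plan_offload(segments, threshold_bytes):
    """Return ids of closed segments to move to long-term storage.

    segments: list of dicts with 'id', 'size', and 'closed', ordered oldest first.
    Segments are offloaded oldest first until the size remaining on bookies
    is at most threshold_bytes. The open segment is never offloaded.
    """
    on_bookies = sum(s["size"] for s in segments)
    to_offload = []
    for seg in segments:
        if on_bookies <= threshold_bytes:
            break
        if not seg["closed"]:
            continue  # still being written, must stay on bookies
        to_offload.append(seg["id"])
        on_bookies -= seg["size"]
    return to_offload
```

Only immutable, closed segments are candidates, which is what makes copying them to an object store safe while the topic keeps accepting writes.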

19.Apache Pulsar Data APIs ❏ Messaging API: Producer and Consumer, served by brokers (Broker 1, Broker 2, Broker 3) ❏ Segment API: reads from bookies (Bookie1 to Bookie5) and tiered storage (S3, GCS, Hadoop)

20.Pulsar - Infinite Event Stream Storage

21.Pulsar - Topic

22.Pulsar - Topic Partitions

23.Pulsar - Segments

24.Pulsar - Stream

25.Pulsar - Infinite Event Stream Storage

26.Benefits ❏ Unlimited Topic Partition Storage ❏ Instant Scaling without Data Rebalancing ❏ Broker Failure Recovery ❏ Bookie Failure Recovery ❏ Cluster Expansion ❏ Low latency reading for messaging data ❏ High throughput reading for batch data ❏ Reduced cost for whole data storage

27.Pulsar SQL Case

28.Pulsar Flink Case [Figure: Pulsar data read by two Flink jobs, Job1 and Job2]

29.Conclusion ❏ Apache Pulsar is a cloud-native messaging and streaming system ❏ Multi-layered architecture ❏ Segment-centric storage ❏ Two levels of reading API: Pub/Sub + Segment ❏ Apache Pulsar provides a unified view of data

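StreamNative is an open-source infrastructure software company building a next-generation streaming data platform around Apache Pulsar and Apache BookKeeper. Guided by two beliefs, that event streaming is the future cornerstone of big data and that open source is the future of infrastructure software, it focuses on building the open-source ecosystem and community and is committed to cutting-edge technology.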