Apache Pulsar下一代实时消息引擎,基于Apache BookKeeper,以segment为粒度做数据存储,既保证了数据存储的性能,也确保消息数据持久性,架构的弹性伸缩,无论Broker还是Bookie意外退出,都不会让数据访问和一致性出现问题。作为一个消息引擎,它统一了发布-订阅的主题模式和队列两种语意模型,99%的消息延迟都能在5毫秒以内,单个分区可以支持高达180万消息/秒;此外,它还原生支持轻量级的消息处理函数,让一些简单的消息处理逻辑不再依赖外部处理工具。

1. Apache Pulsar 实时数据处理理中 消息,计算和存储的统⼀一 演讲者/streamlio 翟佳

2. 4 What’s the state of the art

3. 5 What’s the state of the art

4. 6 Apache Pulsar — Unify Messaging Computing Pulsar Broker Pulsar Functions Segment Store Stream Table BookKeeper

5. 7 Why Apache Pulsar? Durability Ordering Delivery Guarantees Data replicated and Guaranteed ordering At least once, at most synced to disk once and effectively once Geo-replication Multi-tenancy Low Latency Out of box support for A single cluster can Low publish latency of geographically support many tenants 5ms at 99pct distributed and use cases applications Unified messaging High throughput Highly scalable model Can reach 1.8 M Can support millions of Support both messages/s in a topics Streaming and single partition Queuing in a single model

6. 8 Pulsar Architecture Producer Consumer Separate layers between brokers bookies • Broker and bookies can Pulsar Broker 1 Pulsar Broker 1 Pulsar Broker 1 be added independently • Traffic can be shifted very quickly across Bookie 1 Bookie 2 Bookie 4 Bookie 5 Bookie 3 brokers Apache BookKeeper • New bookies will ramp up on traffic quickly Apache Pulsar

7. 9 Messaging Messaging Computing Pulsar Broker Pulsar Functions Segment Store Stream Table BookKeeper

8. 10 Messaging - Concepts

9. 11 Messaging - Namespace

10. 12 Messaging - Queuing & Streaming (kafka, kinesis, …) (SQS, ActiveMQ, RabbitMQ, …)

11. 13 Messaging - ACK Cumulative Individual

12. 14 Messaging - Retention

13. 15 Storage Messaging Computing Pulsar Broker Pulsar Functions Segment Store Stream Table BookKeeper

14. 16 Storage - Apache BookKeeper • A replicated log storage • Low-latency durable writes • Simple repeatable read consistency • Highly available • Store many logs per node • I/O Isolation

15. 17 Storage - Apache BookKeeper

16. 18 Storage - Segment Centric

17. 19 Storage - Segment/Stream/Table Messaging Computing Pulsar Broker Pulsar Functions Segment Store Stream Table BookKeeper

18. 20 Compute Messaging Computing Pulsar Broker Pulsar Functions Segment Store Stream Table BookKeeper

19. 21 Compute Representation Abstract View f(x) Incoming Messages Output Messages

20. 22 Lessons learnt A significant percentage of transformations are simple ETL/Reactive Services/Classification/Real-time Aggregation Event Routing/Microservices The emergence of Serverless Simple Function API Run per event Composition APIs to do complex things Wildly popular

21. 23 Whats needed: Stream-Native Compute Insight gained from serverless Simplest possible API Method/Procedure/Function Multi Language API Scale developers Stream native concepts Input/Output/Log as topics Flexible runtime Simple standalone applications vs system managed applications

22. 24 Pulsar Functions — API SDK less API import java.util.function.Function; public class ExclamationFunction implements Function<String, String> { @Override public String apply(String input) { return input + "!"; } } SDK API import org.apache.pulsar.functions.api.PulsarFunction; import org.apache.pulsar.functions.api.Context; public class ExclamationFunction implements PulsarFunction<String, String> { @Override public String process(String input, Context context) { return input + "!"; } }

23. 25 Pulsar Functions Running as a standalone application bin/pulsar-admin functions localrun \ --input persistent://sample/standalone/ns1/test_input \ --output persistent://sample/standalone/ns1/test_result \ --className org.mycompany.ExclamationFunction \ --jar myjar.jar Runs as a standalone process Run as many instances as you want. Framework automatically balances data Run and manage via Mesos/K8/Nomad/your favorite tool

24. 26 Pulsar Functions: Use Cases Sensor devices generate tons of data Lot of local actions Edge Computing Simple filtering, threshold detection, regex matching, etc Resource Constrained Limited scope for Full blown schedulers/Job Managers Models computed via offline analysis Model Incoming requests should be classified using the model Serving Function is a natural representation for the classification action Model itself can be stored in Bookkeeper

25. 27 Pulsar Functions Unify Messaging and Compute cluster into one Function executed for every message of input topic Supports multiple topics as inputs Runtime User Controlled Guarantees: ATMOST_ONCE / ATLEAST_ONCE / EFFECTIVE_ONCE Built-in State Management: Unified Stream & State Store with BookKeeper. Simplified application development

26. 28 Unified Streaming Solution Messaging Computing Spark DATA Pulsar Broker Pulsar Functions Flink DATA DATA HDFS DATA 。。。 Segment Store Stream Table DATA BookKeeper

27. 29 Messaging Benchmark https://github.com/openmessaging/openmessaging-benchmark

28. 30 Benchmark • Testing goals • Throughput & latency under different conditions • Min 2 guaranteed copies • Running on 3 EC2 VMs with local SSDs

29. 31 Kafka settings • Topic settings replicationFactor=3 min.insync.replicas=2 log.flush.interval.ms= # Using default: means no fsyncs • Kafka producer config acks=all linger.ms=1 batch.size=131072