Interactive querying of streams using Apache Pulsar--Jerry Peng


1. © 2020 SPLUNK INC. Interactive Querying of Streams Using Apache Pulsar™ Pulsar Summit | June 2020 Jerry Peng Principal Software Engineer | Apache {Pulsar, Heron, Storm} committer and PMC member

2. © 2020 SPLUNK INC. Agenda 1) General use cases 2) Existing architectures 3) Apache Pulsar overview 4) Pulsar SQL 5) Concrete use case ( 6) Demo! 7) Questions?

3. © 2020 SPLUNK INC. What are Streams? Continuous flows of data… Almost all data originate in this form

4. © 2020 SPLUNK INC. Interactive Querying of Streams? Querying both latest and historical data

5. © 2020 SPLUNK INC. How is it useful? ● Speed (i.e. data-driven processing) ○ Act faster ● Accuracy ○ In many contexts the wrong decision may be made if you do not have visibility that includes the most current data ○ For example, historical data is useful to predict a user is interested in buying a particular item, but if my analytics don’t also know that the user just purchased that item two minutes ago they’re going to make the wrong recommendation ● Simplification ○ Single place to go to access current and historical data

6. © 2020 SPLUNK INC. Debugging General use cases ● Errors and Exception ● Troubleshooting systems and networks ● Have we seen these errors before?

7. © 2020 SPLUNK INC. Monitoring (Audit logs) General use cases ● Answering the “What, When, Who, Why” ● Suspicious access patterns ● Example ○ Auditing CDC logs in financial institutions

8. © 2020 SPLUNK INC. Exploring General use cases ● Raw or enriched data ● Really simplifies access if data is all in one location

9. © 2020 SPLUNK INC. Lots of use cases General use cases ● Data analytics ● Business Intelligence ● Real-time dashboards ● etc…

10. © 2020 SPLUNK INC. Stream processing patterns Data Ingestion Data Processing / Querying Messaging Compute Data Data Storage Storage Results Storage Serving

11. © 2020 SPLUNK INC. Existing Solutions Messaging Real-time compute Data Stream Cloud Pub/Sub Apache Storm, Apache Flink, Apache Heron, etc. Storage HDFS Cloud Storage Querying Apache Hadoop MR, Apache Spark, Presto, etc.

12. © 2020 SPLUNK INC. Problems with existing solutions ● Multiple Systems ● Duplication of data ○ Data consistency. Where is the source of truth? ● Latency between data ingestion and when data is queryable


14. © 2020 SPLUNK INC. Apache Pulsar™ Flexible Messaging + Streaming System backed by a durable log storage

15. © 2020 SPLUNK INC. Apache Pulsar as a Event Store 1 5

16. © 2020 SPLUNK INC. Apache Pulsar Overview

17. © 2020 SPLUNK INC. Architecture Function Processing Worker Worker Consumer Producer Messaging Consumer Producer Broker Broker Broker Consumer Producer Consumer Bookie Bookie Bookie Bookie Bookie Event storage Multi-layer, scalable architecture Independent layers for processing, serving and storage Messaging and processing built on Apache Pulsar Storage built on Apache BookKeeper

18. © 2020 SPLUNK INC. Segment Centric Storage ● In addition to partitioning, messages are stored in segments (based on time and size) ● Segments are independent from each others and spread across all storage nodes ● What this means for Pulsar SQL? ○ Allows SQL engine to read multiple bookies and leverage disk I/O and bandwidth of multiple machines even if the data is in one partition

19. © 2020 SPLUNK INC. Writes ● Every segment/ledger has an ensemble ● Each entry in ledger has a ○ Write quorum ■ Nodes of the ensemble to which it is written (usually all) ○ Ack quorum ■ Nodes of the write quorum that must respond for that entry to be acknowledged (usually a majority) ● What this means for Pulsar SQL? ○ Allows users to configure the number of replicas SQL engine can read from ○ Trade off between read bandwidth and storage cost

20. © 2020 SPLUNK INC. Apache Bookkeeper™ Internals ● Separate IO path for reads and writes ● Optimized for writing, tailing reads, catch-up reads ● What this means for Pulsar SQL? ○ Queries often involving scanning the data. ○ Read-a-head cache in BK allows for fast sequential reads

21. © 2020 SPLUNK INC. Tiered Storage Unlimited topic storage capacity Achieves the true “stream-storage”: keep the raw data forever in stream form

22. © 2020 SPLUNK INC. Tiered Storage ● Leverage cloud storage services to offload cold data — Completely transparent to clients ● Extremely cost effective — Backends (S3) (Coming GCS, HDFS) ● Example: Retain all data for 1 month — Offload all messages older than 1 day to S3 ● What this means for Pulsar SQL? ○ Pulsar SQL can query not only data in store in Bookies but also offloaded into a cloud storage service 2 2

23. © 2020 SPLUNK INC. Schema Registry ● Store information on the data structure — Stored in BookKeeper ● Enforce data types on topic ● Allow for compatible schema evolutions ● JSON, Avro, and Protobuf supported ● What this means for Pulsar SQL? ○ Allows data to be structured so that it becomes queryable by a SQL language

24. © 2020 SPLUNK INC. Pulsar SQL Interactive SQL queries over data stored in Pulsar Query old and real-time data 2 4

25. © 2020 SPLUNK INC. Pulsar SQL / 2 ● Based on Presto by Facebook — ● Presto is a distributed query execution engine ● Fetches the data from multiple sources (HDFS, S3, MySQL, …) ● Full SQL compatibility 2 5

26. © 2020 SPLUNK INC. Pulsar SQL / 3 ● Pulsar connector for Presto ○ Read data directly from BookKeeper — bypass Pulsar Broker ■ Can also read data offloaded to Tiered Storage (S3, GCS, etc.) ○ Many-to-many data reads ■ Data is split even on a single partition — multiple workers can read data in parallel from single Pulsar partition ■ Time based indexing — Use “publishTime” in predicates to reduce data being read from disk 2 6

27. © 2020 SPLUNK INC. Pulsar SQL Architecture

28. © 2020 SPLUNK INC. Benefits ● Do not need to move data into another system for querying ● Read data in parallel ○ Performance not impacted by partitioning ○ Increase throughput by increasing write quorum ● Newly arrived data able to be queried immediately

29. © 2020 SPLUNK INC. Compared to other message buses? ● Other messaging platforms have Presto integrations ● Typically uses a consumer to read data from brokers ● Topic/partition served by a single broker (limiting disk IO and network bandwidth)

StreamNative 是一家围绕 Apache Pulsar 和 Apache BookKeeper 打造下一代流数据平台的开源基础软件公司。秉承 Event Streaming 是大数据的未来基石、开源是基础软件的未来这两个理念,专注于开源生态和社区的构建,致力于前沿技术。