Elastic Data Processing with Apache Pulsar and Apache Flink

展开查看详情

1. Elastic Data Processing with Apache Pulsar and Apache Flink Kaixing Zhao, Apache Flink Committer Sijie Guo, Apache Pulsar Committer 2019-03-23 Hangzhou

2. Agenda • What is Apache Flink? • Flink use cases and ecosystem • Pulsar - Segmented Stream Architecture • Pulsar - Tiered Storage • Pulsar + Flink Integration

3.Pulsar == Segmented Streams

4.Batch - HDFS

5.Stream - Pub/Sub

6.A Flink View on Computing “Batch processing is a special case of Stream processing”

7.A Pulsar View on Data

8. Topic Producers Topic Consumers Time

9.Partitions P0 Producers P1 P2 Consumers P3 Time

10. Segments P0 Segment 1 Segment 2 Segment 3 Producers P1 Segment 1 Segment 2 Segment 3 Segment 4 P2 Segment 1 Segment 2 Segment 3 Consumers P3 Segment 1 Segment 2 Segment 3 Time

11. Stream P0 Segment 1 Segment 2 Segment 3 Producers P1 Segment 1 Segment 2 Segment 3 Segment 4 P2 Segment 1 Segment 2 Segment 3 Consumers P3 Segment 1 Segment 2 Segment 3 Time

12. Stream Producers Stream Segment 1 Segment 2 Segment 3 Segment 4 Consumers Time

13. Access Patterns ✓ Write
 写 Stream Segment 1 Segment 2 Segment 3 Segment 4 ✓ Tailing Read
 追尾 Time ✓ Catchup Read
 追赶读

14. Access Patterns ✓ Write
 Write 写 Stream Segment 1 Segment 2 Segment 3 Segment 4 ✓ Tailing Read
 追尾 Time ✓ Catchup Read
 追赶读

15. Access Patterns Tailing Read ✓ Write
 Write 写 Stream Segment 1 Segment 2 Segment 3 Segment 4 ✓ Tailing Read
 追尾 Time ✓ Catchup Read
 追赶读

16. Access Patterns Tailing Read ✓ Write
 Write 写 Stream Segment 1 Segment 2 Segment 3 Segment 4 ✓ Tailing Read
 追尾 Time Catchup Read ✓ Catchup Read
 追赶读

17. Write Kafka Pulsar Broker Broker Broker Broker Broker Broker Partition Partition Partition (Leader) (Follower) (Follower)

18. Tailing Read Kafka Pulsar Broker Broker Broker Broker Broker Broker Partition Partition Partition (Leader) (Follower) (Follower)

19. Catchup Read Kafka Pulsar Broker Broker Broker Broker Broker Broker Partition Partition Partition (Leader) (Follower) (Follower)

20. Infinite Stream ✓ Write
 写 Bookies Brokers Stream Segment 1 Segment 2 Segment 3 Segment 4 ✓ Tailing Read
 追尾 Time ✓ Catchup Read
 追赶读

21. Infinite Stream ✓ Write
 写 Tiered Storage Bookies Brokers Stream Segment 1 Segment 2 Segment 3 Segment 4 ✓ Tailing Read
 追尾 Time ✓ Catchup Read
 追赶读

22. Tiered Storage • Offloader • When: size-based, time-based, or triggered by pulsar-admin • How: copy a segment to tiered storage, and delete it from bookkeeper • Access: broker knows how to read the data back, or bypass read 
 the offloaded segments directly • Available Offloaders • Cloud Offloder : AWS, GCS, Azure, … • HDFS, Ceph, …

23. Stream as a Unified View on Data Segment Readers Producers Stream Segment 1 Segment 2 Segment 3 Segment 4 Segment 5 Segment 6 Consumers Time

24. Data Processing on Pulsar Bounded Stream Bounded Stream Stream Segment 1 Segment 2 Segment 3 Segment 4 Segment 5 Segment 6 Time Unbounded Stream Unbounded Stream

25.Pulsar + Flink

26. Goals • Flink + Pulsar • Streaming Connectors • Source Connectors • PulsarCatalog: Schema Integration • PulsarStateBackend • Pulsar for the unified view of Data, Flink for the unified view of Computing

27.Done

28.Streaming Source -> Streaming Sink

29.Streaming Source -> Streaming Table Sink