Apache Flink® 1.7 and Beyond Part II

Apache Flink® 1.7 and Beyond Part II

1. End-to-end SQL Only Pipelines Hive Meta Store • Support for external catalogs (Confluent Schema Registry, Hive Meta Store) Input schema information Output schema information • Data definition language (DDL) SQL Table Source Table Sink Query

2. Capability Spectrum Batch Streaming analytics Event-driven applications offline real time Flink

3. Flink as a Library • Deploying Flink applications should be as easy as starting a process • Bundle application code and Flink into a single image • Process connects to other application processes and figures out its role • Expose Flink’s managed state to the embedding application P1 P2 P3 P4 New process

4. Reactive vs. Active • Active mode • Flink is aware of underlying cluster framework • Flink allocate resources • E.g. existing YARN and Mesos integration • Reactive mode • Flink is oblivious to its runtime environment • External system allocates and releases resources • Flink scales with respect to available resources • Relevant for environments: Kubernetes, Docker, as a library

5. Dynamic Scaling • Latency • Throughput • Resource utilization • Connector signals

6. Batch-Streaming Unification • No fundamental difference between batch and stream processing • Batch allows optimizations because data is bounded and ”complete” • Batch and streaming still separately treated from task level upwards • Working toward a single runtime for batch and streaming workloads

7. Flink Scheduler • Lazy scheduling (batch case) src build side • Deploy tasks starting from the sources build side join join • Whenever data is produced start consumers probe probe side • Scheduling of idling tasks → resource under- src side src utilization

8. Batch Scheduler (1) • More efficient scheduling by taking dependencies into account src build side (2) • E.g. probe side is only scheduled after build side has been join build side join processed probe probe side side src src (2) (3)

9. Extendable Scheduler Scheduler • Make Flink’s scheduler extendable & pluggable • Scheduler considers dependencies and reacts to signals from Streaming Scheduler Batch Scheduler ExecutionGraph • Specialized scheduler for different use cases Speculative Scheduler

10. Flink’s Shuffle Service • Tasks own produced result partitions • Containers cannot be freed until result is consumed • One implementation for streaming and batch loads Container Result partition

11. External & Persistent Shuffle Service • Result partitions are written to an external shuffle service • Containers can be freed early • Different implementations based on use case External shuffle service (e.g. Yarn, DFS)

12. TL;DL • Flink 1.7.0 added many new features around SQL, connectors and state evolution • A lot of new features in the pipeline • Join the community! • Subscribe to mailing lists • Participate in Flink development • Become active


14.Flink Forward San Francisco 2019