Database Streaming with Apache Pulsar

以Apache Pulsar为例,分享一下如何基于Pulsar打造流式的数据Pipeline,包括其中三个重要的生态组件:IO Connectors,Schema,以及Pulsar SQL。让用户和开发人员可以用更实时的方式进行数据分析。
展开查看详情

1.Database Streaming with Apache Pulsar Sijie Guo — twitter: @sijieg wechat: guosijie_

2.Who am I • Sijie Guo • Pulsar, BookKeeper, Hive, HBase, … • Streamlio Cofounder • Yahoo -> Twitter -> Streamlio • 华中科⼤ -> 中科院计算所

3.Agenda 1. The Beauty of Change Data Capture 2. Streaming MySQL - Connectors/Functions, Schema 3. Interactive Queries - Presto SQL

4.Agenda 1. The Beauty of Change Data Capture 2. Streaming MySQL - Connectors/Functions, Schema 3. Interactive Queries - Presto SQL

5.Traditional Batch ETL Pipeline

6.Traditional Batch ETL Pipeline High Latency Huge number of Jobs Error prone Manual Schema Updates

7.Current Mess

8.Pulsar as the backbone

9.Getting data to Pulsar

10.Option 1 - Dural Writes

11.Option 1 - Dural Writes Data consistency??? :(

12.Option 2 - Event Sourcing

13.Option 2 - Event Sourcing Violates read-your-writes consistency :(

14.Option 3 - Change Data Capture

15. “In databases, a design pattern that captures individual database changes to a stream of changes”

16.Option 3 - Change Data Capture Yeah ! :)

17.Agenda 1. The Beauty of Change Data Capture 2. Streaming MySQL - Connectors/Functions, Schema 3. Interactive Queries - Presto SQL

18.Pulsar IO Framework

19.Mysql —> HDFS

20.Mysql —> Pulsar

21.Function Overview Light weight Compute Framework Language Native (Java, Python) Flexible deployment (Thread, Process, Container, K8S) Pulsar Functions Managed State ETL, Filtering, Routing, …

22.Function in Action

23.Function Example public class WordCountFunction implements Function<String, Void> { @Override public Void process(String input, Context context) { Arrays.asList(input.split(“\\s+”)) .forEach(word -> context.incrCounter(word, 1)); return null; } }

24.Function Submission $ bin/pulsar-admin functions create \ --jar target/my-functions.jar \ --classname org.example.functions.MyFunction \ Submission --cpu 8 \ --ram 8589934592 \ --disk 10737418240 \ —-inputs input_topic —-output output_topic $ bin/pulsar-admin functions local run \ Local Run --jar target/my-functions.jar \ --classname org.example.functions.MyFunction \ —-inputs input_topic —-output output_topic

25.Function Deployment Function Function Function wordcount-1 wordcount-1 transform-2 Function Function Pod 1 wordcount-1 transform-2 Worker Function dataroute-1 Worker Node 1 Pod 2 Broker 1 Broker 1 Broker 1 Node 1 Node 2 Pod 3 Thread Process K8S

26.IO Connectors A light weight connector framework built on functions runtime

27.Running Connectors $ ./bin/pulsar-admin source create \ --tenant test \ --namespace ns1 \ Source --name twitter-source \ --destinationTopicName twitter_data \ --sourceConfigFile examples/twitter.yml \ --source-type twitter $ ./bin/pulsar-admin sink create \ Sink --tenant public \ --namespace default \ --name cassandra-test-sink \ --sink-type cassandra \ --sinkConfigFile examples/cassandra-sink.yml \ --inputs test_cassandra

28.Connector Config configs: roots: "localhost:9042" keyspace: "pulsar_test_keyspace" columnFamily: "pulsar_test_table" keyname: "key" columnName: "col"

29.Debezium Overview Open source CDC Records row-level changes via WAL Supports MYSQL, MongoDB, PostgreSQL, Oracle, 
 Debezium SQL Server

StreamNative 是一家围绕 Apache Pulsar 和 Apache BookKeeper 打造下一代流数据平台的开源基础软件公司。秉承 Event Streaming 是大数据的未来基石、开源是基础软件的未来这两个理念,专注于开源生态和社区的构建,致力于前沿技术。
关注他