Database Streaming with Apache Pulsar
展开查看详情
1.Database Streaming with Apache Pulsar Sijie Guo — twitter: @sijieg wechat: guosijie_
2.Who am I • Sijie Guo • Pulsar, BookKeeper, Hive, HBase, … • Streamlio Cofounder • Yahoo -> Twitter -> Streamlio • 华中科⼤ -> 中科院计算所
3.Agenda 1. The Beauty of Change Data Capture 2. Streaming MySQL - Connectors/Functions, Schema 3. Interactive Queries - Presto SQL
4.Agenda 1. The Beauty of Change Data Capture 2. Streaming MySQL - Connectors/Functions, Schema 3. Interactive Queries - Presto SQL
5.Traditional Batch ETL Pipeline
6.Traditional Batch ETL Pipeline High Latency Huge number of Jobs Error prone Manual Schema Updates
7.Current Mess
8.Pulsar as the backbone
9.Getting data to Pulsar
10.Option 1 - Dural Writes
11.Option 1 - Dural Writes Data consistency??? :(
12.Option 2 - Event Sourcing
13.Option 2 - Event Sourcing Violates read-your-writes consistency :(
14.Option 3 - Change Data Capture
15. “In databases, a design pattern that captures individual database changes to a stream of changes”
16.Option 3 - Change Data Capture Yeah ! :)
17.Agenda 1. The Beauty of Change Data Capture 2. Streaming MySQL - Connectors/Functions, Schema 3. Interactive Queries - Presto SQL
18.Pulsar IO Framework
19.Mysql —> HDFS
20.Mysql —> Pulsar
21.Function Overview Light weight Compute Framework Language Native (Java, Python) Flexible deployment (Thread, Process, Container, K8S) Pulsar Functions Managed State ETL, Filtering, Routing, …
22.Function in Action
23.Function Example public class WordCountFunction implements Function<String, Void> { @Override public Void process(String input, Context context) { Arrays.asList(input.split(“\\s+”)) .forEach(word -> context.incrCounter(word, 1)); return null; } }
24.Function Submission $ bin/pulsar-admin functions create \ --jar target/my-functions.jar \ --classname org.example.functions.MyFunction \ Submission --cpu 8 \ --ram 8589934592 \ --disk 10737418240 \ —-inputs input_topic —-output output_topic $ bin/pulsar-admin functions local run \ Local Run --jar target/my-functions.jar \ --classname org.example.functions.MyFunction \ —-inputs input_topic —-output output_topic
25.Function Deployment Function Function Function wordcount-1 wordcount-1 transform-2 Function Function Pod 1 wordcount-1 transform-2 Worker Function dataroute-1 Worker Node 1 Pod 2 Broker 1 Broker 1 Broker 1 Node 1 Node 2 Pod 3 Thread Process K8S
26.IO Connectors A light weight connector framework built on functions runtime
27.Running Connectors $ ./bin/pulsar-admin source create \ --tenant test \ --namespace ns1 \ Source --name twitter-source \ --destinationTopicName twitter_data \ --sourceConfigFile examples/twitter.yml \ --source-type twitter $ ./bin/pulsar-admin sink create \ Sink --tenant public \ --namespace default \ --name cassandra-test-sink \ --sink-type cassandra \ --sinkConfigFile examples/cassandra-sink.yml \ --inputs test_cassandra
28.Connector Config configs: roots: "localhost:9042" keyspace: "pulsar_test_keyspace" columnFamily: "pulsar_test_table" keyname: "key" columnName: "col"
29.Debezium Overview Open source CDC Records row-level changes via WAL Supports MYSQL, MongoDB, PostgreSQL, Oracle, Debezium SQL Server