Introducing HerdDB - a distributed JVM embeddable database built upon Apache BookKeeper - Enrico Olivelli


1. HerdDB: A Distributed JVM Embeddable Database built upon Apache BookKeeper
Enrico Olivelli, Senior Software Engineer @MagNews.com @EmailSuccess.com
Pulsar Summit 2020

2. Agenda
• Why HerdDB?
• Embedded vs Replicated
• HerdDB: Data Model
• HerdDB: How it works
• Using Apache BookKeeper as WAL
• Q&A

3. Why HerdDB?
HerdDB is a distributed database system, designed from the ground up to run inside the same process as client applications.
The project started inside EmailSuccess.com, a Mail Transfer Agent (like sendmail). EmailSuccess is a Java application and it initially used MySQL to store message state. In 2016 we started the development of EmailSuccess Cluster and we needed a better solution for storing message state.
Requirements:
• Embedded: the database should run inside the same Java process (like SQLite)
• Replicated: the database should scale out and offer high availability automatically
Solutions were already on the market, but none could meet both requirements.

4. Some HerdDB users
HerdDB is currently used in production:
- As primary SQL DB for EmailSuccess.com standalone
- As replicated DB for EmailSuccess.com cluster
- As metadata service in BlobIt, a binary large object storage built over Apache BookKeeper (https://github.com/diennea/blobit)
- As configuration and certificate store for the CarapaceProxy HTTP server (https://github.com/diennea/carapaceproxy)
- As SQL database for Apache Pulsar Manager (https://github.com/apache/pulsar-manager)
- In the HerdDB Collections Framework flavour at https://magnews.com
- As a MySQL replacement in a few other non open source internal products at https://diennea.com

5. Another Embedded Database for the JVM
Embedding the database in the application brings these advantages:
- No additional deployment and management tools for users.
- The database is hidden from the user: you have to manage «only» the application.
- The application is tested with only one version of the database (the same as in production).
- Easier to test for developers (run the DB in-memory for unit tests).
Challenges:
- Memory management: the DB cannot use all of the heap of the client application.
- Have sensible defaults.
- Enable automatic management routines by default.
- Handle upgrade procedures automatically, without manual intervention.
- Pay close attention to wire protocol changes and to disk formats.

6. Replication for an Embedded Database
Replication brings high availability and scalability. Benefits:
- Work even in case of temporary or permanent loss of machines.
- Keep data closer to the clients.
- No shared disks or network file systems.
- Scale out by adding machines to the application.
Challenges:
- Handle temporary failures gracefully: long GC pauses, application restarts.
- Understand the storage topology: WAL and cold data have separate replication mechanisms.
- Need an external source of truth: Apache ZooKeeper.

7. HerdDB - Data model: TableSpaces
The database is partitioned into TableSpaces; a TableSpace is a group of Tables. You can run transactions and joins that span tables of the same TableSpace. A TableSpace must reside entirely on every machine assigned to it.
TableSpace assets:
- Tables
- Indexes
- Open Transactions
- Write-ahead log
- Checkpoint data
[Diagram: a TableSpace containing Table 1 and Table 2, each with a PK index and secondary indexes, plus the shared transactions, write-ahead log and checkpoint data.]

8. HerdDB - Data Model: Tables
A Table is a simple key-value dictionary (byte[] -> byte[]). Clients use the JDBC API and the SQL language.
On top of the key-value structure we have a SQL layer:
- Key = Primary Key
- Value = all of the other columns
We are using Apache Calcite for SQL work: parser and planner.
Core data structures (TableManager):
- Data Page: a bunch of records, indexed by PK (hashmap)
- Primary Key Index: a map that binds a PK value to the id of a data page
- Dirty Page Buffer: a data page that is open for writes
- Secondary indexes: they map a value to a set of PKs
The HerdDB Collections Framework uses the key-value structure directly.
[Diagram: the TableManager with the Primary Key index (KEY1 -> PAGE 1, KEY2 -> PAGE 2, KEY3 -> PAGE 1), an immutable Page 1 holding RECORD1v1 and RECORD3v1, a writable Page 2 holding RECORD2v2, and a secondary index mapping values back to PKs.]
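The core data structures above can be sketched in a few lines of plain Java. This is an illustrative model only, with invented names (`TableSketch`, `primaryKeyIndex`, `pages`), not HerdDB's actual classes: the PK index maps each key to the id of the page holding the latest version of the record, and each page is itself a map from key to record.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the TableManager's structures (names are illustrative,
// not HerdDB's real API).
public class TableSketch {
    // Primary Key index: key -> id of the data page holding the record.
    final Map<String, Long> primaryKeyIndex = new HashMap<>();
    // Data pages: page id -> (key -> record), each page indexed by PK.
    final Map<Long, Map<String, byte[]>> pages = new HashMap<>();

    // Store a record on the given page and point the PK index at it.
    void put(String key, byte[] record, long pageId) {
        pages.computeIfAbsent(pageId, id -> new HashMap<>()).put(key, record);
        primaryKeyIndex.put(key, pageId);
    }

    // Lookup by PK: first find the page, then the record inside it.
    byte[] lookup(String key) {
        Long pageId = primaryKeyIndex.get(key);
        if (pageId == null) {
            return null; // key not present
        }
        return pages.get(pageId).get(key);
    }
}
```

A secondary index would simply be another map from a column value to a set of PKs, resolved through the same `lookup` path.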

9. HerdDB - How it works: the Write Path
Write path flow for INSERT/UPDATE operations:
1) Validate the operation.
2) Write to the WAL.
3) If the record is new or it was on another page, write the new version to the Dirty Page Buffer and update the PK index.
3b) If the record was already on the Dirty Page Buffer, simply replace it.
You can have multiple versions of the same record; old versions are discarded during maintenance operations, usually during checkpoints.
When you are working inside a Transaction, all of the writes are applied to an internal buffer local to the Transaction and they are not visible to other Transactions. In this case writes to the Dirty Page Buffer are deferred to the «commit» operation.
[Diagram: the TableManager before the write - the PK index points KEY1, KEY2 and KEY3 to the immutable Page 1, which holds RECORD1v1, RECORD2v1 and RECORD3v1; the writable Page 2 is empty.]
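The numbered steps above can be sketched as follows. This is a simplified, single-threaded model with invented names (`WritePathSketch`, `upsert`), not HerdDB code: every mutation is appended to the WAL first, then applied to the in-memory dirty page, and the PK index is updated only when the record moves to a different page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the INSERT/UPDATE write path (illustrative names only).
public class WritePathSketch {
    static final long DIRTY_PAGE = -1L; // id of the single writable page

    final List<String> wal = new ArrayList<>();            // durable log of intents
    final Map<String, Long> pkIndex = new HashMap<>();     // key -> page id
    final Map<String, byte[]> dirtyPage = new HashMap<>(); // the writable page

    void upsert(String key, byte[] value) {
        // 2) Write to the WAL before touching any in-memory structure.
        wal.add("UPSERT " + key);
        Long current = pkIndex.get(key);
        if (current == null || current != DIRTY_PAGE) {
            // 3) New record, or latest version lives on another (immutable)
            // page: write to the dirty page and repoint the PK index.
            dirtyPage.put(key, value);
            pkIndex.put(key, DIRTY_PAGE);
        } else {
            // 3b) Already on the dirty page: replace it in place,
            // no PK index update needed.
            dirtyPage.put(key, value);
        }
    }
}
```

The old version left behind on the immutable page is exactly the stale copy that gets discarded at the next checkpoint.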

10. HerdDB - How it works: the Write Path
[Diagram: the same write path after an UPDATE of KEY2 - the PK index now points KEY2 to the writable Page 2, which holds RECORD2v2, while the immutable Page 1 still holds the old RECORD2v1.]

11. HerdDB - How it works: the Write Path
If the latest version of the record is already on a writable page, then we can simply replace the content of the record. There is no need to update the Primary Key index.
[Diagram: Page 1 is writable and holds RECORD1v1, RECORD2v1 and RECORD3v1; the PK index points all three keys to Page 1.]

12. HerdDB - How it works: the Write Path
[Diagram: before and after an in-place UPDATE of KEY2 on the writable Page 1 - RECORD2v1 is replaced by RECORD2v2 and the PK index is unchanged.]

13. HerdDB - How it works: Replication
An HerdDB cluster is made of:
- HerdDB nodes
- BookKeeper servers (Bookies)
- ZooKeeper servers
It is common to run the Bookie inside the same process as the HerdDB node. It is also common to run the HerdDB node inside the same process as the client application.
[Diagram: three applications, each talking to a HerdDB server; the servers use a cluster of Bookies and ZooKeeper.]

14. HerdDB - How it works: Replication
[Diagram: the same cluster with each Bookie co-located in the HerdDB server process (HerdDB + Bookie), plus ZooKeeper.]

15. HerdDB - How it works: Replication
[Diagram: the fully collapsed deployment - each process runs App + HerdDB + Bookie, plus ZooKeeper.]

16. HerdDB - How it works: Replication
A TableSpace is a Replicated State Machine:
- The state of the machine is the set of tables and their contents.
- We have a list of replicas: one node is the leader node, the others are the followers.
- Clients only talk to the leader node.
- The leader node writes state changes to BookKeeper.
- Followers read the tail of the log and apply every operation to the local copy of the tablespace, in the same order.
- Nodes never talk to each other for write/read operations: only for the initial bootstrap of a new replica.
Apache BookKeeper guarantees the overall consistency:
- Fencing mechanism.
- Last Add Confirmed protocol.
Apache ZooKeeper is the glue and the source of truth:
- Service discovery.
- Metadata management.
- Coordination.
[Diagram: a client reads and writes through the leader Node1, which writes data changes to Bookie1, Bookie2 and Bookie3; followers Node2 and Node3 tail the log and apply the changes locally; ZooKeeper coordinates.]
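The replicated-state-machine idea can be shown in a few lines of plain Java. This is a deliberately naive, in-process sketch with invented names (`ReplicatedTableSpace`, `replay`), not HerdDB code: the leader appends each state change to a shared log (standing in for BookKeeper), and any follower that applies the entries in log order converges to the same state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy replicated state machine: same log + same apply order = same state.
public class ReplicatedTableSpace {
    // The ordered log of state changes (stands in for BookKeeper ledgers).
    final List<String[]> log = new ArrayList<>();

    // Leader side: append the change to the log, then apply it locally.
    final Map<String, String> leaderState = new HashMap<>();

    void leaderWrite(String key, String value) {
        log.add(new String[]{key, value});
        leaderState.put(key, value);
    }

    // Follower side: tail the log and apply every entry, in the same order.
    Map<String, String> newFollower() {
        Map<String, String> state = new HashMap<>();
        for (String[] entry : log) {
            state.put(entry[0], entry[1]);
        }
        return state;
    }
}
```

The hard problems the real system solves - making the log durable and consistent across machines, and making sure only one node may append to it - are exactly what BookKeeper and ZooKeeper provide.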

17. HerdDB - How it works: Replication
Two levels of replication:
- Write-ahead log: Apache BookKeeper.
- TableSpace data structures: HerdDB runtime.
The BookKeeper client on the leader node:
- Writes directly to each Bookie in the chosen ensemble.
- Spreads data among Bookies.
- Deals with temporary and permanent failures of Bookies.
HerdDB writes to the local disk in these cases:
- Checkpoint: the server consolidates the local copy of the data at a given point in time (log position).
- Low memory: swap out current dirty pages to temporary files.
WAL truncation happens with a time-based policy. Followers will only eventually store a durable local copy of the whole tablespace.
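The relationship between checkpoints and WAL truncation can be sketched as follows. Again the names (`CheckpointSketch`, `walEntryStillNeeded`) are invented for illustration, not HerdDB's API: a checkpoint consolidates the current dirty records into an immutable page and remembers the log position it covers, so every WAL entry at or before that position becomes redundant and can be truncated later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of checkpointing: consolidate dirty data at a log position,
// then the WAL is only needed for entries written after that position.
public class CheckpointSketch {
    final Map<String, byte[]> dirtyPage = new HashMap<>();
    final List<Map<String, byte[]>> immutablePages = new ArrayList<>();
    long lastCheckpointLogPosition = -1;

    // Consolidate the dirty page to durable, immutable storage at the
    // given log position.
    void checkpoint(long logPosition) {
        immutablePages.add(new HashMap<>(dirtyPage));
        dirtyPage.clear();
        lastCheckpointLogPosition = logPosition;
    }

    // A WAL entry can be truncated once it is covered by a checkpoint.
    boolean walEntryStillNeeded(long entryPosition) {
        return entryPosition > lastCheckpointLogPosition;
    }
}
```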

18. HerdDB - How it works: Follower promotion and Fencing
Apache BookKeeper guarantees the consistency of TableSpace data by means of the fencing mechanism.
What happens when a follower node N2 is promoted to be leader instead of N1?
- N1 is the current leader.
- The client is sending all of the requests to N1.
- But N1 has been partitioned from the cluster; some network paths are not available any more:
  - The client is still connected to it.
  - N1 is still able to talk to Bookies.
  - N1 is not receiving notifications from ZK.
- Someone updates the TableSpace metadata on ZK: the leader is now officially N2.
- Please note that N1 does not receive any notification from ZK about the change of role.
[Diagram: the client writes to the partitioned leader N1, which still writes data changes to Bookie1, Bookie2 and Bookie3, while follower N2 reads the changes and applies them locally.]

19. HerdDB - How it works: Follower promotion and Fencing
- N2 starts recovery and opens all of the ledgers with the 'recovery' flag.
- Bookies fence out N1.
- N1 receives the «fenced» error response at the first write operation and stops.
- Any client still connected to N1 receives a «not leader» error from N1 and discovers the new leader using ZK.
- Now new writes go to N2.
- When N1 is back to its normal operational state, it recovers from the log and starts being a follower.
Please note that operations on the hot paths, reads and writes, do not access the metadata service, but the overall consistency is still guaranteed.
[Diagram: the client's write to the old leader N1 fails with «Fenced» / «No more leader» errors and is re-sent to the new leader N2, which has opened the log with fencing.]
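The fencing interaction above boils down to a small state change on the Bookie side. The sketch below is a toy model with invented names (`FencingSketch`, `openForRecovery`, `tryWrite`), not BookKeeper's real client API: once the new leader opens the ledger in recovery mode, every later write from the old leader is rejected, which is how the old leader discovers it has lost leadership without ever consulting ZooKeeper on the write path.

```java
// Toy model of BookKeeper-style ledger fencing (illustrative only).
public class FencingSketch {
    private boolean fenced = false;

    // Called when the new leader opens the ledger with the 'recovery' flag:
    // from now on the ledger refuses further appends from the old writer.
    void openForRecovery() {
        fenced = true;
    }

    // Called by the (possibly stale) old leader on every write attempt.
    // Returns false to model the «fenced» error response.
    boolean tryWrite(String entry) {
        if (fenced) {
            return false; // old leader gets «fenced» and must step down
        }
        return true; // write accepted while the ledger is still open
    }
}
```

Because the rejection comes from the Bookies themselves, the stale leader is stopped even while it is partitioned away from ZooKeeper.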

20. Wrap up
HerdDB is a good option for you:
- If you want an open source SQL database with built-in automatic replication.
- If you want to embed the SQL database into your Java application.
Benefits of embedding a replicated database:
- Run and manage only the application; hide the distributed database complexity from system administrators.
- Run and test the application with the same database you use in production.
HerdDB replication is based on Apache BookKeeper and Apache ZooKeeper:
- BookKeeper implements replication for the WAL and guarantees the consistency of your data.
- ZooKeeper is the glue and the base for coordination, metadata management and service discovery.

21. Q&A

22. Thank you.
https://herddb.com
https://bookkeeper.apache.org
Twitter: @eolivelli
LinkedIn: https://www.linkedin.com/in/enrico-olivelli-984b7874/
We can chat on https://bookkeeper.apache.org/community/slack/
