- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
20_05 Transaction Management On Cassandra
文章介绍了什么是标量数据库、Cassandra的事务管理、基准和校验的结果
展开查看详情
1 .Transaction Management on Cassandra 10 Sep, 2019 at ApacheCon NA 2019 Hiroyuki Yamada CTO/CEO at Scalar, Inc. 1
2 . Who am I ? • Hiroyuki Yamada – Passionate about Database Systems and Distributed Systems – Ph.D. in Computer Science, the University of Tokyo – IIS the University of Tokyo, Yahoo! Japan, IBM Japan © 2019 Scalar, inc. 2
3 . Agenda • What is Scalar DB • Transaction Management on Cassandra • Benchmark and Verification Results © 2019 Scalar, inc. 3
4 . Maybe You Don’t Need ACID Transactions • ACID transactions are heavy – Especially when data is distributed • One of other solutions: – Make operations idempotent and retry them if atomicity is not required © 2019 Scalar, inc. 4
5 . What is Scalar DB https://github.com/scalar-labs/scalardb • A library that makes non-ACID distributed databases ACID-compliant – Cassandra is the first supported distributed database Transaction management, Recovery management, Java API © 2019 Scalar, inc. 5
6 . System Architecture with Scalar DB and Cassandra Web Applications HTTP End Users Client Command programs execution Achieves one-copy Serializable Transactions Scalar DB Key key = new Key( Scalar DB new TextValue(“id”, ”1”)); Result result = db.get(new Get(key)); // do something with result Put put = new Put(key).(…); db.put(put) Cassandra nodes DataStax © 2019 Scalar, inc. Java Driver 6
7 . Key Characteristics • Non-invasive approach – Any modifications to the underlying database are not required • High availability – Available as long as quorum of replicas are up – C* high availability is fully sustained by the client-coordinated approach • Horizontal scalability – Throughput scales linearly – C* high scalability is fully sustained by the client-coordinated approach • Strong Consistency – Replicas updated by transactions are always consistent and up-to-date © 2019 Scalar, inc. 7
8 . Transaction Management on Cassandra - Introduction • Based on Cherry Garcia protocol [ICDE’15] – Requires minimum set of features such as linearizable conditional update and the ability to store metadata • Scalar DB is one of the applications of the protocol – Use LWT for Linearizability – Manage transaction metadata in user record space • Implement enhancements – Protocol correction – No use of TrueTime API – Serializable support (SI is the default isolation level) © 2019 Scalar, inc. 8
9 . Transaction Metadata Management • WAL (Write-Ahead Logging) records are distributed Application data Transaction metadata Application data Transaction metadata (Before) (Before) User/Application Status Version TxID Record Status Version TxID (before) (before) (before) in user tables After image Before image Status Record Application data in coordinator TxID Status Other metadata (managed by users) table Transaction metadata (managed by Scalar DB) © 2019 Scalar, inc. 9
10 . Transaction Protocol - Overview • Optimistic concurrency control • Similar to 2 phase commit protocol – Prepare phase: prepare records – Commit phase 1: commit status – This is where a transaction is regarded as committed or aborted in normal cases – (Commit phase 2: commit records) • Lazy recovery – Uncommitted records will be rollforwarded or rollbacked based on the status of a transaction when the records are read © 2019 Scalar, inc. 10
11 . Transaction Protocol By Examples – Prepare Phase Client1’s memory space Cassandra UserID Balance Status TxID Version 1 100 C XXX 5 2 100 C YYY 4 Client1 Read Tx1: Transfer 20 from 1 to 2 1 80 P Tx1 6 2 120 P Tx1 5 UserID Balance Status TxID Version 1 100 C XXX 5 Atomic conditional 2 100 C YYY 4 Client2’s memory space update (LWT) UserID Balance Status TxID Version Update only if 1 100 C XXX 5 the version is the version I read Fail due to 2 100 C YYY 4 the condition mismatch Client2 Tx2: Transfer 10 from 1 to 2 1 90 P Tx2 6 2 110 P Tx2 5 © 2019 Scalar, inc. 11
12 . Transaction Protocol By Examples – Commit Phase 1 Cassandra Client1 with Tx1 UserID Balance Status TxID Version 1 80 P Tx1 6 2 120 P Tx1 5 Atomic conditional update (LWT) Update if the TxID TxID Status does not exist XXX C YYY C ZZZ A Tx1 C © 2019 Scalar, inc. 12
13 . Transaction Protocol By Examples – Commit Phase 2 Cassandra Client1 with Tx1 UserID Balance Status TxID Version 1 80 C Tx1 6 Atomic conditional update (LWT) 2 120 C Tx1 5 Update status if the record is prepared by the TxID TxID Status XXX C YYY C ZZZ A Tx1 C © 2019 Scalar, inc. 13
14 . Failure Handling by Examples • If TX1 fails before prepare phase – Just clear the memory space for TX1 • If TX1 fails after prepare phase and before commit phase 1 (no status is written in Status table) – Another transaction (TX3) reads the records and notices that the records are prepared and there is no status for it – TX3 tries to abort TX1 (TX3 tries to write ABORTED to Status with TX1’s TXID and rolls back the records) – TX1 might be on it’s way to commit status, but only one can win, not both • If TX1 fails (right) after commit phase 1 – Another transaction (TX3) tries to commit the records (rollforward) on behalf of TX1 when TX3 reads the same records as TX1 – TX1 might be on it’s way to commit records, but only one can win, not both © 2019 Scalar, inc. 14
15 . Benchmark Results • Achieved 90 % scalability in 100-node cluster (Compared to the Ideal TPS based on the performance of 3-node cluster) Workload1 (Payment) Workload2 (Evidence) Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3 © 2019 Scalar, inc. 15
16 . Verification Results • Scalar DB has been heavily tested with Jepsen and our destructive tools – Note that Jepsen tests are created and conducted by Scalar • It has passed both tests for a long time • See https://github.com/scalar-labs/scalar-jepsen for more detail s en Jep sed Pas © 2019 Scalar, inc. 16
17 . Other Contributions for Apache Cassandra from Scalar • GroupCommitlogService – Yuji Ito – Group multiple commitlog writes at once – CASSANDRA-13530 • Jepsen tests for Cassandra – Yuji Ito, Craig Pastro – Maintain with the latest Jepsen – Rewrite with Alia clojure driver – https://github.com/scalar-labs/scalar-jepsen • Cassy – A simple and integrated backup tool – Just released under Apache 2 – https://github.com/scalar-labs/cassy © 2019 Scalar, inc. 17
18 . Cassy: A simple and integrated backup tool • Required to take a transactionally consistent backup © 2019 Scalar, inc. 18
19 . Future Work • DataStax driver 4.x support • Cassandra 4.x support – Hopefully nothing needs to be done • Other C* compatible databases integration – Scylla DB (waiting for LWT) – Cosmos DB (waiting for LWT) • (HBase adapter) © 2019 Scalar, inc. 19
20 . Questions ? © 2019 Scalar, inc. 20
21 . Optimization • Prepare in deterministic order – => First prepare always wins TX1: updating K1 and K2 and K3 TX2: updating K2 and K3 and K4 K1 K2 K3 K2 K3 K4 H: Consistent hashing H(K2) H(K1) H(K3) H(K4) Always prepare in this order !!! (otherwise, all Txs might abort. Ex. TX1 prepares K1,K2 and TX2 prepares K4,K3 in this order) © 2019 Scalar, inc. 21
22 . Snapshot Isolation • Strong isolation level but weaker than Serializable – Similar to “MVCC” – Oracle’s most strict isolation level (it’s called “Serializable”) • Read only sees a snapshot (=> non blocking reads) • Mostly strong enough but there are still some anomalies © 2019 Scalar, inc. 22
23 . Anomalies in Snapshot Isolation • Write Skew, Read-Only Transaction • Write skew example: – Account balances: X and Y (assume family account) – Initial state: X=70, Y=80 – Constraint: X + Y > 0 – TX1: X = X – 100, TX2: Y = Y - 100 – H: R1(X0, 70) R2(X0, 70) R1(Y0, 80) R2(Y0, 80)W1(X1, −30)C1 W2(Y2, −20)C2 X Y TX1 Ok for the 70-100 -> -30 80 constraint X0 Y0 Update X 70 80 -30 -20 TX2 X Y Ok for the 70 80-100 -> -20 constraint Update succeeds without conflict Update Y in Snapshot Isolation © 2019 Scalar, inc. 23
24 . Serializable Support • Convert all reads into writes (writing the same value) in a transaction © 2019 Scalar, inc. 24
25 . Protocol Correction • Commit records by non-atomically – => NO!!! • Someone else have already did it and have started and prepared a new transaction Start and Commit Commit A (without knowing A with TX1 is already Prepare A Status committed, and A is overwritten by a new TX2) TX1 TX2 Read A and commit A Start and Commit Commit A on behalf of TX1 Prepare A Status © 2019 Scalar, inc. 25