CEDAR: A Distributed Database for Scalable OLTP

CEDAR Design Goals

  • Separating read and write operations on different nodes
  • Scalable reading on multiple nodes
  • Writing to memory in one node only
  • No expensive distributed concurrency control or synchronization
  • “Deep” optimization for transactions
  • To optimize data transmissions, query execution plans, and executions
  • High-availability is guaranteed by log synchronization

1. CEDAR: A Distributed Database for Scalable OLTP (Weining Qian, wnqian@dase.ecnu.edu.cn)

2. The Internet economy
  • Content/traffic => money
  • O2O: Online => Offline, or Offline => Online
  • O2O of mission-critical apps: 互联网+ (Internet+)
  • OLTP is inevitable

3. Phenomenal applications
  • Phenomenal: very remarkable; extraordinary.
  • 180,000 tps in year 2016

4. Phenomenal is common
  • 12306 during Spring Festival
  • Black Friday promotions, flash sales ("seckill"), ...
  • The pressure is on backend (transaction/payment) systems
  • Inevitable, and day-to-day more common
  • All pressure finally goes to mission-critical systems
  • Essentially a high-throughput, scalable transaction processing problem

5. A brief history of DBMS (https://maxkanaskar.files.wordpress.com/2014/04/database-platform-history.png)

6. One size fits all => One size fits none!

7. DBMS = file systems + data model abstraction + data processing abstraction

8. NoSQL = distributed file systems + data model abstraction + weak consistency

9. NewSQL = in-memory computing + relational model + fast networking + transaction processing

10. OldSQL vs. NoSQL vs. NewSQL

                                     OldSQL      NoSQL    NewSQL
    Data model                       Relational  ---      Relational
    Interface                        SQL         Varies   SQL
    Consistency/Concurrency control  Strong      Weak     Strong
    Fault tolerance                  Strong      Fine     Strong
    Performance                      Poor        Good     Very good
    Scalability                      Poor        Good     Fine

11. How to scale a database system?
  • Scaling up
  • Sharding/partitioning
  • Replication
  • Caching
  (https://www.cs.cmu.edu/~christos/courses/dbms.S14/slides/29scaling.pdf)

12. The open-source OceanBase (0.4)

13. OceanBase 0.4 is not enough
  • Simple transactions only
  • Weak availability support
  • Single-point transaction processing
  • More optimization needed for query/storage
  • Interface adaptability
  Not enough for mission-critical apps in banks, which require strong consistency, high availability, and high-throughput complex transaction processing.

14. Features we need
  • Complex transactions
  • High performance
    ◦ High throughput
    ◦ Low latency
  • High availability
  • Scalability
  • Elasticity

15. Overview (CEDAR core)
  [Architecture diagram: a Master and a T-Node, each with two backups, on top of a hybrid storage layer of multiple S-Nodes]

16. Design choices
  • Separating read and write operations on different nodes
  • Scalable reading on multiple nodes
  • Writing to memory in one node only
    ◦ No expensive distributed concurrency control or synchronization
  • “Deep” optimization for transactions
    ◦ To optimize data transmissions, query execution plans, and executions
  • High availability is guaranteed by log synchronization

17. Status = Baseline + Delta
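The "Status = Baseline + Delta" idea can be sketched in a few lines: S-Nodes hold a static baseline, the single T-Node holds recent changes (the delta) in memory, and the visible state is their combination. This is a minimal illustration with invented names, not CEDAR's actual storage code.

```python
class BaselineDeltaStore:
    """Toy model: state = static baseline (S-Nodes) + in-memory delta (T-Node)."""

    def __init__(self, baseline):
        self.baseline = dict(baseline)  # static data held by S-Nodes (read-only)
        self.delta = {}                 # recent writes, in one node's memory only

    def write(self, key, value):
        # Writes touch only the delta -- no distributed coordination needed.
        self.delta[key] = value

    def read(self, key):
        # A read consults the delta first, then falls back to the baseline.
        if key in self.delta:
            return self.delta[key]
        return self.baseline.get(key)

    def merge(self):
        # Periodic compaction: fold the delta into a new baseline.
        self.baseline.update(self.delta)
        self.delta.clear()


store = BaselineDeltaStore({"a": 1, "b": 2})
store.write("a", 10)
print(store.read("a"))  # 10 (from the delta)
print(store.read("b"))  # 2  (from the baseline)
```

The merge step mirrors the usual LSM-style compaction: after it runs, the delta is empty and subsequent reads are served from the baseline alone.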

18. Reading
  • All reads need to access an S-Node as well as the T-Node

19. Writing
  • All writes need to access an S-Node as well as the T-Node

20. Transaction management
  • Transactions are processed only on the T-Node
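Because every transaction executes on the single T-Node, atomicity can be enforced with purely local means instead of a distributed commit protocol. A hypothetical sketch (names invented), using one process-wide lock over the in-memory delta:

```python
import threading

delta = {}                      # the T-Node's in-memory delta
delta_lock = threading.Lock()   # local mutual exclusion, no 2PC across machines

def run_transaction(writes):
    """Apply a set of writes atomically to the delta (toy single-node commit)."""
    with delta_lock:
        delta.update(writes)

# Example: transfer-like transaction updating two accounts at once.
run_transaction({"acct:1": 90, "acct:2": 110})
print(delta)  # {'acct:1': 90, 'acct:2': 110}
```

A real system would add logging and finer-grained concurrency control; the point here is only that single-node commit needs no cross-node synchronization.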

21. Pros and cons
  • Pros
    ◦ Massive storage
    ◦ Scalable reads
    ◦ Efficient transaction management
  • Cons
    ◦ Expensive data transmission

22. Performance is affected by
  • The lengths of locks, which affect
    ◦ Degrees of parallelization
    ◦ Latency
  • Capability of S-Nodes
    ◦ Throughput of reads
  • Capability of the T-Node
    ◦ Throughput of writes
  [Plot: time used per write (µs) vs. transaction size in # writes]
  • Most cost is for short/simple transactions
  • They are easier to schedule

23. S-Node optimizations
  • Static data caching
  • Parallel readings
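The two S-Node optimizations above can be illustrated together: baseline data is static between merges, so caching it is safe, and independent reads can be issued concurrently. This is a hedged sketch with invented names, not CEDAR's S-Node code.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Stand-in for baseline data served by an S-Node.
S_NODE_DATA = {"k1": "v1", "k2": "v2", "k3": "v3"}

@lru_cache(maxsize=1024)
def read_static(key):
    # Static (baseline) data cannot change until the next merge,
    # so repeated reads can be answered from the cache.
    return S_NODE_DATA.get(key)

def parallel_read(keys):
    # Reads with no dependencies between them can be served concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(read_static, keys))

print(parallel_read(["k1", "k2", "k3"]))  # ['v1', 'v2', 'v3']
```

`ThreadPoolExecutor.map` preserves input order, so results line up with the requested keys even though the reads run in parallel.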

24. T-Node
  • Storage node?
    ◦ All computation resources are for TP
    ◦ High communication cost
  • Computation node?
    ◦ Low communication cost
    ◦ How much work should it do?
  • Balancing the T-Node's computation between queries and transaction processing

25. Transaction compilation

  Procedure Order(p_itemType int, p_custId int, p_orderAmount int)
    declare v_price, v_value double;
    declare v_itemId, v_stock, v_orderAmount int;
    select itemId, price, stock into v_itemId, v_price, v_stock
      from item where itemType = p_itemType
      order by price desc limit 1;                          -- S1
    if (v_stock > p_orderAmount)                            -- L1
      update item set stock = stock - p_orderAmount
        where itemId = v_itemId;                            -- S2/T2
      v_orderAmount = p_orderAmount;                        -- L2
    else
      update item set stock = stock - v_stock
        where itemId = v_itemId;                            -- S3/T3
      v_orderAmount = v_stock;                              -- L3
    end
    v_value = v_price * v_orderAmount;                      -- L4
    update customer set balance -= v_value
      where custId = p_custId;                              -- S4/T4

  Execution plan operations (RD: read, UP: update, RS: read static data):
    S1 RD(itemType, ...)    L1 if (v_stock > p_orderAmount)
    S2 RS(itemId, ...)      T2 UP(itemId, ...)
    L2 v_orderAmount = ...  L3 v_orderAmount = ...
    L4 v_value = v_price * ...
    S4 RS(custId, balance)  T4 UP(custId, balance)
  The dependency graph connects these operations by control dependencies and data dependencies.
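The dependency graph on this slide can be derived mechanically: give each operation a read set and a write set, and add an edge whenever a later operation reads or overwrites something an earlier one produced. The sketch below uses an invented representation of a subset of the Order procedure's operations, purely for illustration.

```python
# (label, reads, writes) for a few of the Order procedure's operations.
OPS = [
    ("S1", {"itemType"}, {"v_itemId", "v_price", "v_stock"}),
    ("L1", {"v_stock", "p_orderAmount"}, set()),   # branch condition
    ("T2", {"v_itemId"}, {"stock"}),
    ("L2", {"p_orderAmount"}, {"v_orderAmount"}),
    ("L4", {"v_price", "v_orderAmount"}, {"v_value"}),
    ("T4", {"custId", "v_value"}, {"balance"}),
]

def build_deps(ops):
    """Add an edge a -> b when b reads or overwrites data written by a."""
    edges = set()
    for i, (a, _, writes_a) in enumerate(ops):
        for b, reads_b, writes_b in ops[i + 1:]:
            if writes_a & (reads_b | writes_b):
                edges.add((a, b))
    return sorted(edges)

for src, dst in build_deps(OPS):
    print(f"{src} -> {dst}")
```

Only data dependencies are modeled here; control dependencies (e.g. L1 guarding the branch bodies) would be added from the procedure's control-flow graph.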

26. T-Node optimization
  • Re-order T-Node ops
  • Postpone conflicting ops
  Example op stream: T RWDelta(A), S ReadStatic(A), T ReadDelta(A), T RWDelta(B), S ReadStatic(B), T RWDelta(C), S ReadStatic(A), T RWDelta(A), T RWDelta(D)
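One simple re-ordering policy consistent with the idea above: static reads touch only S-Node baseline data and cannot conflict with the T-Node's delta, so they can be hoisted ahead, while delta operations keep their relative order so per-key conflicts are preserved. This is a hypothetical illustration, not CEDAR's actual scheduler.

```python
def reorder(ops):
    """ops: list of (node, op, key) tuples, e.g. ("T", "RWDelta", "A").
    Hoist static reads; keep delta ops in their original relative order."""
    static = [o for o in ops if o[1] == "ReadStatic"]
    delta = [o for o in ops if o[1] != "ReadStatic"]
    return static + delta

stream = [
    ("T", "RWDelta", "A"), ("S", "ReadStatic", "A"), ("T", "ReadDelta", "A"),
    ("T", "RWDelta", "B"), ("S", "ReadStatic", "B"), ("T", "RWDelta", "C"),
]
print(reorder(stream))
```

After the reorder, the two S-Node reads can be issued (even in parallel) before the T-Node starts mutating its delta, shortening the window in which delta keys are held.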

27. TPC-C, Smallbank, TATP Benchmarks
  • Better than PG
  • Better than VoltDB under complex workload
  • With significant advantage when data cannot be naturally partitioned
  [Charts: throughput (tps) of PG, VD, SD, and CSD for NewOrder/Payment at 8 and 24 partitions, and normalized throughput (CSD/VD, VD/CSD) vs. cross-warehouse transaction ratio (%)]

28. Indexing
  • The index is organized as a table.
  • No distributed transactions.
  • Takes advantage of the system's load balancing and availability.

29. Indexing overview
  • Initialization: preparing for the start of index construction.
  • Bulk loading
    ◦ Local processing: collecting statistical information
    ◦ Global processing: achieving load balancing based on an equi-depth histogram
  • Termination: scheduling the replication of the index for high availability.
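The global-processing step balances index-building work with an equi-depth histogram: split the sorted key space into ranges that each hold roughly the same number of entries, so every server gets a similar share. A minimal sketch of computing the bucket boundaries (the function name is invented):

```python
def equi_depth_boundaries(keys, buckets):
    """Return bucket-boundary keys so each of `buckets` ranges holds
    roughly len(keys) / buckets entries (equi-depth, not equi-width)."""
    s = sorted(keys)
    n = len(s)
    # A boundary after every n/buckets entries; the last bucket takes the rest.
    return [s[(i * n) // buckets] for i in range(1, buckets)]

keys = list(range(100))                # stand-in for sampled index keys
print(equi_depth_boundaries(keys, 4))  # [25, 50, 75]
```

Unlike an equi-width histogram, this stays balanced even when keys are skewed, which is exactly what load balancing across index-building servers needs.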