PingCAP Infra Meetup #89: What's New in Latest TiKV

Qu Peng opened by introducing three new optimizations in the latest version of TiKV:
● The batch gRPC/Raft messages feature collects messages into a batch before sending, reducing the number of network-related system calls and improving performance.
● The threaded raftstore/apply feature replaces two previously single-threaded components with multi-threaded ones, while avoiding data skew and starvation, eliminating TiKV's write bottleneck.
● Distributed GC heavily refactors the GC-related code: GC is now driven by TiKV itself rather than by the client, which simplifies client implementations and speeds up GC by 3x.
Finally, Qu Peng shared several optimizations still under development, including committing transactions without fetching a timestamp.

1. What's new in latest TiKV (qupeng@pingcap.com)

2. Summary
● Overview of TiDB and TiKV
● New features and evaluation
○ batch gRPC/Raft messages
○ threaded raftstore & threaded asynchronous apply
○ distributed garbage collection
● Lots of groundwork for future features

3. Overview of TiDB
● Stateless SQL layer: multiple TiDB instances, which send metadata / timestamp requests to PD
● Placement Driver (PD): control flow for balance / failover
● Distributed storage layer: multiple TiKV instances, replicated via Raft
[Diagram: all components communicate over gRPC.]

4. Overview of TiKV (layers, top to bottom)
● Server: send/recv messages over gRPC
● KV API: handle KV requests; Coprocessor: handle pushdown read requests
● Transaction: 2PC / distributed transactions
● MVCC: rowkey_ts -> value
● Raft KV: provide a KV API on top of Raft
● RocksDB: data is stored in RocksDB

5. Concepts
● Region
○ A continuous range of data
● Peer
○ A replica of a Region
● ts
○ A logical timestamp, generated by PD

6. TiKV Write Flow
1) begin to write
3) transaction constraint check
4) async call RaftKV APIs
5) replicate by Raft
6) write to RocksDB
8) return from RaftKV APIs
9) write success
[Diagram: the request flows from Server through KV API / Coprocessor, Transaction, MVCC, and Raft KV down to RocksDB on three replicas; steps 2 and 7 appear only as arrows in the diagram.]

7. TiKV Read Flow
1) recv read request
2) dispatch to KV API or Coprocessor
3) send get-snapshot request
4) check leader and get snapshot from RocksDB
5) return snapshot
6) read and return result
7) finish read
[Diagram: Server → KV API / Coprocessor → Transaction / MVCC → Raft KV → RocksDB]

8. Bottlenecks before TiKV 3.0-beta
● Server (gRPC send/recv messages): high CPU usage
● Transaction / MVCC: GC isn't quick enough
● Raft KV: two single-threaded components
[Same layered diagram as slide 4, annotated with the bottlenecks above.]

9. gRPC CPU usage issue - the old C/S model
● TiDBs communicate with TiKVs via unary calls
● Every client goroutine can issue a unary call
● Too many calls cause high CPU usage
[Diagram: each TiDB sends a unary request to a TiKV and receives a unary response.]

10. gRPC CPU usage issue - the new C/S model
● TiDBs communicate with TiKVs via bi-directional streaming calls
● One goroutine collects all requests and batches them (sketched below)
● Another goroutine receives batched responses and extracts them
● A map[req_id]Request tracks in-flight requests
● Wait a short while, if needed, when collecting requests so batches can grow
[Diagram: a batch request stream and a batch response stream between TiDB and each TiKV.]
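The batching pattern looks roughly like the sketch below. This is a minimal Rust rendering of the model described above, with made-up `Request`/`Response` types and thresholds; the real implementation is the Go tikv-client, which uses goroutines over a bi-directional gRPC stream.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

// Illustrative stand-ins for the real request/response protos.
struct Request { id: u64 }
struct Response { id: u64 }

// Collector loop: drain pending requests into one batch, waiting a short
// while so more requests can join the batch under light load.
fn collect_loop(rx: Receiver<Request>, send_batch: impl Fn(Vec<Request>)) {
    loop {
        // Block for the first request of the next batch.
        let first = match rx.recv() {
            Ok(r) => r,
            Err(_) => return,
        };
        let mut batch = vec![first];
        while batch.len() < 128 {
            match rx.recv_timeout(Duration::from_millis(1)) {
                Ok(r) => batch.push(r),
                Err(RecvTimeoutError::Timeout) => break,
                Err(RecvTimeoutError::Disconnected) => return,
            }
        }
        // One streaming write instead of many unary calls.
        send_batch(batch);
    }
}

// Receiver side: extract batched responses and route each one back to the
// caller that issued the request, via a map keyed by request id.
fn dispatch(responses: Vec<Response>, inflight: &mut HashMap<u64, Sender<Response>>) {
    for resp in responses {
        if let Some(tx) = inflight.remove(&resp.id) {
            let _ = tx.send(resp);
        }
    }
}
```

The short wait trades a little latency for fewer network-related system calls; under heavy load, batches fill up without any waiting.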

11. Batch gRPC message evaluation
● X axis is concurrency and Y axis is QPS

12. Batch gRPC message evaluation
● X axis is time and Y axis is the batch size

13. gRPC CPU usage issue - raft transport
● raftstore calls `send` to buffer a message into the raft client
● raftstore calls `flush` to send the buffered messages to gRPC
[Diagram: ① send caches the message in the raft client's message buffer; ② flush pushes the buffered messages to the gRPC thread; ③ flush also informs the gRPC thread.]

14. gRPC CPU usage issue - lots of flushes
● Pseudo code (reconstructed below) shows how raftstore calls `send` and `flush`
● Every time a leader becomes active, it calls a `flush`
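The slide's original pseudo code is not preserved; the following Rust sketch (with illustrative `Peer` and `RaftClient` types, not TiKV's actual definitions) shows the behavior it described: every peer with ready messages buffers them with `send` and then triggers its own `flush`.

```rust
struct RaftMessage;

struct Peer {
    pending: Vec<RaftMessage>,
}

impl Peer {
    fn has_ready(&self) -> bool {
        !self.pending.is_empty()
    }
    fn take_messages(&mut self) -> Vec<RaftMessage> {
        std::mem::take(&mut self.pending)
    }
}

struct RaftClient {
    buffer: Vec<RaftMessage>,
}

impl RaftClient {
    // `send` only buffers the message inside the raft client.
    fn send(&mut self, msg: RaftMessage) {
        self.buffer.push(msg);
    }
    // `flush` hands every buffered message to gRPC.
    fn flush(&mut self) {
        for _msg in self.buffer.drain(..) {
            // write to the gRPC stream (omitted)
        }
    }
}

// Old behavior: each active leader triggers its own flush, so one event
// loop iteration can issue many small writes to gRPC.
fn handle_raft_ready(peers: &mut [Peer], client: &mut RaftClient) {
    for peer in peers.iter_mut().filter(|p| p.has_ready()) {
        for msg in peer.take_messages() {
            client.send(msg);
        }
        client.flush(); // one flush per active peer: lots of flushes
    }
}
```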

15. Batch raft messages - new raft transport
● raftstore calls `send` to push a message into the gRPC worker directly
● raftstore calls `flush` to inform the gRPC threads to do their work
● gRPC threads collect messages into a batch before sending them to the network
[Diagram: ① send puts the message directly into the gRPC thread's message buffer; ② flush ③ informs the gRPC thread if needed.]

16. Batch raft messages - new raft transport
● Pseudo code (reconstructed below) shows the new internal logic of `flush`
● If the downstream gRPC thread is already awake, skip the notification
● If the downstream gRPC thread is busy with other things, delay the notification to collect a larger batch
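A minimal sketch of the new `flush` logic, assuming an illustrative three-state flag for the gRPC thread; the real code lives in TiKV's raft client.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const IDLE: usize = 0; // gRPC thread is sleeping and must be woken
const AWAKE: usize = 1; // already woken up, the notify can be skipped
const BUSY: usize = 2; // busy with other work, delay the notification

struct GrpcThreadHandle {
    state: AtomicUsize,
}

impl GrpcThreadHandle {
    fn wake(&self) {
        // e.g. signal a condvar or an event fd (omitted)
    }

    // `flush` only informs the gRPC thread; `send` already pushed the
    // messages into its queue, so larger batches form while it is busy.
    fn flush(&self) {
        match self.state.load(Ordering::Acquire) {
            AWAKE => { /* skip: it will pick up the new messages anyway */ }
            BUSY => { /* delay: let the queue grow into a larger batch */ }
            _ => self.wake(), // IDLE: wake it up to drain the queue
        }
    }
}
```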

17. Batch raft messages evaluation
● X axis is time and Y axis is the batch size

18. Single-threaded raftstore
● Pseudo code (reconstructed below) shows the old single-threaded raftstore
● `on_base_tick` is called periodically, or after `on_message`
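The slide's original pseudo code is not preserved; this Rust sketch (illustrative names) shows the shape of the old loop: one thread drains a single channel for every region, calling `on_base_tick` after each message and periodically on timeout.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

struct RaftMessage {
    region_id: u64,
}

struct Peer;
impl Peer {
    fn on_message(&mut self, _msg: RaftMessage) {
        // step the raft state machine (omitted)
    }
    fn on_base_tick(&mut self) {
        // drive elections, heartbeats, ... (omitted)
    }
}

fn raftstore_loop(rx: Receiver<RaftMessage>, peers: &mut HashMap<u64, Peer>) {
    loop {
        // One thread handles the messages of *every* region in turn,
        // which is the write bottleneck removed by threaded raftstore.
        match rx.recv_timeout(Duration::from_millis(100)) {
            Ok(msg) => {
                if let Some(peer) = peers.get_mut(&msg.region_id) {
                    peer.on_message(msg);
                    peer.on_base_tick(); // also called after a message
                }
            }
            // Called periodically when no messages arrive.
            Err(RecvTimeoutError::Timeout) => {
                for peer in peers.values_mut() {
                    peer.on_base_tick();
                }
            }
            Err(RecvTimeoutError::Disconnected) => return,
        }
    }
}
```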

19. Threaded raftstore/apply design principles
● Balance load between threads
○ grouping regions by id is not a good idea
● Avoid letting regions starve
○ interrupt and resume processing of one busy region
● Cooperate with region split/merge

20. Threaded raftstore/apply core design
● A gRPC thread calls `RaftRouter::send` to send raft messages to a PeerFsm (sketched below)
● If the peer is not created yet, the message is redirected to the StoreFsm, which creates the PeerFsm
● Following messages can be sent to the PeerFsm directly
● The router then notifies the `BatchSystem` to fetch and process the PeerFsm
[Diagram: gRPC thread → RaftRouter → send → PeerFsm (or StoreFsm → create → PeerFsm); RaftRouter notifies the BatchSystem, whose raftstore threads process the PeerFsm.]
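A minimal sketch of the routing step, with illustrative channel mailboxes standing in for TiKV's actual FSM machinery:

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

struct RaftMessage {
    region_id: u64,
}

struct RaftRouter {
    peers: HashMap<u64, Sender<RaftMessage>>, // one mailbox per PeerFsm
    store: Sender<RaftMessage>,               // mailbox of the StoreFsm
}

impl RaftRouter {
    fn send(&self, msg: RaftMessage) {
        match self.peers.get(&msg.region_id) {
            // Following messages go straight to the existing PeerFsm.
            Some(mailbox) => {
                let _ = mailbox.send(msg);
            }
            // The first message is redirected to the StoreFsm, which
            // creates the PeerFsm and registers its mailbox.
            None => {
                let _ = self.store.send(msg);
            }
        }
        // Finally, notify the BatchSystem so a raftstore thread fetches
        // and processes the fsm (notification omitted in this sketch).
    }
}
```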

21. Threaded raftstore/apply core design
● Pseudo code (reconstructed below) shows the loop in every raftstore thread
● Things in the asynchronous apply threads are similar
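The per-thread loop might be sketched as follows, with illustrative `PeerFsm` and `BatchSystem` types: each raftstore thread fetches a batch of notified peers, gives each a bounded work quota, and re-queues peers that still have pending messages so no region starves.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

struct PeerFsm;
impl PeerFsm {
    // Handle at most `quota` pending messages; return true if more remain.
    fn handle_messages(&mut self, _quota: usize) -> bool {
        false
    }
}

struct BatchSystem {
    ready: Mutex<VecDeque<PeerFsm>>, // peers notified via RaftRouter::send
    cv: Condvar,
}

fn poll_loop(system: Arc<BatchSystem>) {
    loop {
        let mut ready = system.ready.lock().unwrap();
        while ready.is_empty() {
            ready = system.cv.wait(ready).unwrap();
        }
        // Fetch a bounded batch; any idle thread can pick up any peer,
        // which balances load without grouping regions by id.
        let n = ready.len().min(256);
        let batch: Vec<PeerFsm> = ready.drain(..n).collect();
        drop(ready);

        for mut fsm in batch {
            // A still-busy peer is interrupted and re-queued instead of
            // monopolizing the thread, so other regions never starve.
            if fsm.handle_messages(1024) {
                let mut ready = system.ready.lock().unwrap();
                ready.push_back(fsm);
                system.cv.notify_one();
            }
        }
    }
}
```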

22. Threaded raftstore & apply evaluation

23. What's GC?
● Cleans up all stale MVCC versions
● TiKV uses Percolator to implement MVCC
● Garbage data lives in the default CF and the write CF
● Locks live in the lock CF
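For illustration, here is a simplified sketch of how the three column families could be keyed under Percolator-style MVCC; the real TiKV key encodings differ (e.g. memcomparable encoding and inverted timestamps), so treat the functions below as assumptions.

```rust
// default CF: (key, start_ts) -> value. Large values live here and become
// garbage once no reader can see that version anymore.
fn default_cf_key(user_key: &[u8], start_ts: u64) -> Vec<u8> {
    let mut k = user_key.to_vec();
    k.extend_from_slice(&start_ts.to_be_bytes());
    k
}

// write CF: (key, commit_ts) -> { start_ts, op }. Stale versions below the
// GC safe point are the garbage that GC removes.
fn write_cf_key(user_key: &[u8], commit_ts: u64) -> Vec<u8> {
    let mut k = user_key.to_vec();
    k.extend_from_slice(&commit_ts.to_be_bytes());
    k
}

// lock CF: key -> lock info of an in-flight transaction. Locks below the
// safe point must be resolved before GC can advance.
```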

24. GC driven by clients
● TiDB calculates a safe point
● TiDB resolves locks first, to clean all locks before the safe point
● For every region, TiDB sends a GC RPC to the TiKVs
● Cons
○ Can't utilize all TiKVs
○ Has to be implemented in every tikv-client

25. Distributed GC
● TiDB calculates a safe point
● TiDB resolves locks first, to clean all locks before the safe point
● TiDB pushes the safe point to PD
● Each TiKV fetches the latest safe point, then runs GC on the regions whose leader is on that store
● TiKVs collect region information with an observer registered in raftstore
● Evaluation: reduces GC time from 1 hour to 15 minutes on a 500G cluster
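A TiKV-side GC worker under this scheme could be sketched as follows, with a hypothetical `Pd` trait and `Region` type standing in for the real interfaces:

```rust
use std::time::Duration;

// Hypothetical PD interface: returns the safe point published by TiDB.
trait Pd {
    fn get_gc_safe_point(&self) -> u64;
}

struct Region {
    leader_on_this_store: bool,
}

fn gc_worker(
    pd: &dyn Pd,
    local_regions: impl Fn() -> Vec<Region>, // fed by the raftstore observer
    gc_region: impl Fn(&Region, u64),        // deletes stale versions
) {
    let mut last_safe_point = 0;
    loop {
        // Each TiKV polls the latest safe point instead of waiting for a
        // GC RPC from the client, so every store does its own share.
        let safe_point = pd.get_gc_safe_point();
        if safe_point > last_safe_point {
            // Only GC the regions whose leader is on this store.
            for region in local_regions().iter().filter(|r| r.leader_on_this_store) {
                gc_region(region, safe_point);
            }
            last_safe_point = safe_point;
        }
        std::thread::sleep(Duration::from_secs(60));
    }
}
```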

26. More improvements on the way
● Joint consensus to speed up configuration changes
● Avoid fetching the commit timestamp from PD
● Remove the scheduler thread

27. Thanks! We are hiring: hire@pingcap.com

TiDB is a converged database product targeting both online transactional and online analytical processing (HTAP: Hybrid Transactional/Analytical Processing), featuring one-click horizontal scaling, strongly consistent multi-replica data safety, distributed transactions, and real-time OLAP.