PingCAP Infra Meetup #89: What's new in latest TiKV
1. What's new in latest TiKV
qupeng@pingcap.com
2. Summary
● Overview of TiDB and TiKV
● New features and evaluation
  ○ batch gRPC / Raft messages
  ○ threaded raftstore & threaded asynchronous apply
  ○ distributed garbage collection
● Lots of groundwork for future features
3. Overview of TiDB
[Architecture diagram: a stateless SQL layer of TiDB nodes; the Placement Driver (PD) serving metadata / timestamp requests and driving the balance / failover control flow; and a distributed storage layer of TiKV nodes. Components communicate over gRPC, and TiKV replicates data with Raft.]
4. Overview of TiKV
[Layered architecture diagram:]
● Server: send/recv gRPC messages
● KV API: handle KV requests
● Coprocessor: handle pushdown read requests
● Transaction: 2PC / distributed transactions
● MVCC: rowkey_ts -> value
● Raft KV: provide the KV API on Raft
● RocksDB: data stored in RocksDB
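To make the layering concrete, here is a minimal, purely illustrative Rust sketch of the stack on this slide. All type names are hypothetical stand-ins (not the real TiKV types), and RocksDB is stubbed with an in-memory map so the example is self-contained.

```rust
use std::collections::BTreeMap;

// Bottom layer: the data store (RocksDB in TiKV, a BTreeMap here).
struct Engine(BTreeMap<Vec<u8>, Vec<u8>>);

// Raft KV: provides a KV API on top of Raft replication (replication omitted).
struct RaftKv(Engine);
// MVCC: encodes rowkey_ts -> value on top of Raft KV.
struct Mvcc(RaftKv);
// Transaction: 2PC / distributed transactions on top of MVCC.
struct Txn(Mvcc);
// Server: receives gRPC requests and dispatches them to the KV API or Coprocessor.
struct Server(Txn);

fn main() {
    // Build the stack bottom-up, mirroring the diagram.
    let _server = Server(Txn(Mvcc(RaftKv(Engine(BTreeMap::new())))));
}
```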
5. Concepts
● Region: a continuous range of data
● Peer: a replica of a Region
● ts: logical timestamp, generated by PD
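A minimal Rust sketch of these three concepts, with hypothetical field names (the real definitions live in TiKV's protobuf metadata):

```rust
/// A Region covers a continuous key range [start_key, end_key).
struct Region {
    id: u64,
    start_key: Vec<u8>,
    end_key: Vec<u8>,
    peers: Vec<Peer>,
}

/// A Peer is one replica of a Region, placed on one store (TiKV instance).
struct Peer {
    id: u64,
    store_id: u64,
}

/// `ts` is a logical timestamp allocated by PD; transactions use it for
/// ordering (e.g. start_ts / commit_ts in the MVCC layer).
type Ts = u64;

fn main() {
    let region = Region {
        id: 1,
        start_key: b"a".to_vec(),
        end_key: b"z".to_vec(),
        peers: vec![Peer { id: 101, store_id: 1 }, Peer { id: 102, store_id: 2 }],
    };
    let start_ts: Ts = 42;
    println!("region {} has {} peers, start_ts={}", region.id, region.peers.len(), start_ts);
}
```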
6. TiKV Write Flow
[Flow diagram through Server, KV API / Coprocessor, Transaction, MVCC, Raft KV and RocksDB; numbered steps (a combined sketch of the write and read flows follows the next slide):]
1) begin to write
3) transaction constraint check
4) async call RaftKV APIs
5) replicate by Raft
6) write to RocksDB
8) return from RaftKV APIs
9) write success
7. TiKV Read Flow
[Flow diagram through Server, KV API / Coprocessor, Transaction, MVCC, Raft KV and RocksDB; numbered steps:]
1) recv read request
2) dispatch to KV API or Coprocessor
3) send get snapshot request
4) check leader and get snapshot from RocksDB
5) return snapshot
6) read and return result
7) finish read
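The following is a hedged Rust sketch of the write and read flows on the two slides above, compressed into plain function calls. All helper names (`check_constraints`, `raftkv_async_write`, `raftkv_get_snapshot`, `mvcc_get`) are hypothetical stubs; the real paths are asynchronous and cross the Transaction, MVCC and Raft KV layers.

```rust
struct Snapshot; // a consistent view of the local RocksDB

fn write(key: Vec<u8>, value: Vec<u8>, start_ts: u64) {
    // transaction constraint check (e.g. write conflict detection)
    check_constraints(&key, start_ts);
    // async call into the RaftKV API, which replicates the proposal through
    // Raft; each peer then applies the write to its local RocksDB, and the
    // callback fires once the write is committed and applied
    raftkv_async_write(key, value, start_ts);
}

fn read(key: &[u8], read_ts: u64) -> Option<Vec<u8>> {
    // request a snapshot from the RaftKV layer; the leader checks its
    // leadership/lease and takes a RocksDB snapshot
    let snap = raftkv_get_snapshot();
    // MVCC reads the newest version visible at read_ts from the snapshot
    mvcc_get(&snap, key, read_ts)
}

// Stubs so the sketch is self-contained.
fn check_constraints(_key: &[u8], _start_ts: u64) {}
fn raftkv_async_write(_key: Vec<u8>, _value: Vec<u8>, _start_ts: u64) {}
fn raftkv_get_snapshot() -> Snapshot { Snapshot }
fn mvcc_get(_snap: &Snapshot, _key: &[u8], _read_ts: u64) -> Option<Vec<u8>> { None }

fn main() {
    write(b"k".to_vec(), b"v".to_vec(), 10);
    let _ = read(b"k", 11);
}
```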
8. Bottlenecks before TiKV 3.0-beta
[The slide-4 architecture diagram, annotated with the bottlenecks:]
● gRPC / Server: high CPU usage
● MVCC: GC isn't quick enough
● Raft KV: 2 single-threaded components
9. gRPC CPU Usage issue - the old C/S model
● TiDB instances communicate with TiKV using unary calls
● Every client goroutine can issue the unary call
● Too many calls introduce high CPU usage
[Diagram: one TiDB sends unary requests to, and receives unary responses from, each TiKV]
10. gRPC CPU Usage issue - the new C/S model
● TiDB instances communicate with TiKV using bi-directional streaming calls
● One goroutine collects all requests and batches them
● Another goroutine receives the batched responses and unpacks them
● A map (map[req_id]Request) is used to track in-flight requests
● Wait a short while, if needed, when collecting requests
[Diagram: TiDB exchanges a batch request stream and a batch response stream with each TiKV]
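The actual batching lives in TiDB's Go client (goroutines plus a map[req_id]Request); the sketch below only illustrates the same idea in Rust, assuming a channel as the request source and a short wait to let a batch accumulate before one streaming send.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

struct Request { id: u64, payload: Vec<u8> }

fn run_batcher(rx: Receiver<Request>) {
    let mut pending: HashMap<u64, Request> = HashMap::new();
    loop {
        // Block for the first request of the next batch.
        let first = match rx.recv() {
            Ok(r) => r,
            Err(_) => return, // all senders dropped
        };
        pending.insert(first.id, first);
        // Wait a little while for more requests so the batch grows.
        loop {
            match rx.recv_timeout(Duration::from_millis(1)) {
                Ok(r) => { pending.insert(r.id, r); }
                Err(RecvTimeoutError::Timeout) => break,
                Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        // One streaming send for the whole batch; responses would be matched
        // back to the callers by request id using the same map.
        println!("sending batch of {} request(s)", pending.len());
        pending.clear();
    }
}

fn main() {
    let (tx, rx): (Sender<Request>, Receiver<Request>) = channel();
    let handle = thread::spawn(move || run_batcher(rx));
    for id in 0..100 {
        tx.send(Request { id, payload: vec![0u8; 16] }).unwrap();
    }
    drop(tx); // close the channel so the batcher exits
    handle.join().unwrap();
}
```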
11. Batch gRPC message evaluation
[Benchmark chart: X axis is concurrency, Y axis is QPS]
12. Batch gRPC message evaluation
[Chart: X axis is time, Y axis is the batch size]
13. gRPC CPU Usage issue - raft transport
● call `send` to buffer a message into the raft client
● call `flush` to send the buffered messages to gRPC
[Diagram: raftstore -> ① send -> raft client (messages cached in a buffer) -> ② flush -> ③ flush & inform the gRPC thread]
14. gRPC CPU Usage issue - lots of flushes
● Pseudo code shows how raftstore calls `send` and `flush` (see the sketch below)
● Every time a leader becomes active, it calls a `flush`
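The slide's pseudo code did not survive extraction, so the following is a hedged Rust reconstruction of the idea with hypothetical names. The key point is that every active leader triggers its own `flush`, so batches stay tiny and gRPC is poked very frequently.

```rust
struct RaftMessage { to_store: u64 }

struct RaftClient { buffer: Vec<RaftMessage> }

impl RaftClient {
    // `send` only buffers the message inside the raft client.
    fn send(&mut self, msg: RaftMessage) { self.buffer.push(msg); }

    // `flush` hands the buffered messages over to the gRPC layer.
    fn flush(&mut self) {
        println!("flushing {} message(s) to gRPC", self.buffer.len());
        self.buffer.clear();
    }
}

fn raftstore_loop(active_leaders: &[u64], client: &mut RaftClient) {
    for &region_id in active_leaders {
        // Handle this region's Raft "ready" and produce outgoing messages.
        client.send(RaftMessage { to_store: region_id % 3 });
        // Old behaviour: every active leader calls its own flush,
        // so gRPC sees many small, per-region flushes.
        client.flush();
    }
}

fn main() {
    let mut client = RaftClient { buffer: Vec::new() };
    raftstore_loop(&[1, 2, 3, 4, 5], &mut client);
}
```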
15. Batch raft messages - new raft transport
● raftstore calls `send` to push a message into the gRPC worker directly
● raftstore calls `flush` to inform the gRPC threads to do their work
● gRPC threads collect messages into a batch before sending them to the network
[Diagram: raftstore -> raft client (① send passes the message straight into the gRPC thread's message buffer); ② flush -> ③ inform the gRPC thread if needed]
16. Batch raft messages - new raft transport
● Pseudo code shows the new internal logic of `flush` (see the sketch below)
● If the downstream gRPC thread is already woken up, skip the notify
● If the downstream gRPC thread is busy with other things, delay the notification to collect a larger batch
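Again the original pseudo code is not in the extracted text; this is a hedged Rust sketch of the described `flush` behaviour, with a hypothetical worker-state flag deciding whether a notification is needed at all.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

// Hypothetical states of the downstream gRPC worker.
const IDLE: u8 = 0;  // sleeping, must be woken up
const WOKEN: u8 = 1; // already notified, another notify is redundant
const BUSY: u8 = 2;  // working on something else, delay the notification

struct GrpcWorkerHandle { state: Arc<AtomicU8> }

impl GrpcWorkerHandle {
    fn notify(&self) { println!("wake up the gRPC thread"); }

    // New `flush`: raftstore has already pushed messages into the worker's
    // buffer via `send`; flush only decides whether a wake-up is needed.
    fn flush(&self) {
        match self.state.load(Ordering::Acquire) {
            WOKEN => { /* already woken up: skip the notify */ }
            BUSY => { /* busy: delay notification so the batch keeps growing */ }
            _ => {
                // Idle: mark it woken and notify once.
                self.state.store(WOKEN, Ordering::Release);
                self.notify();
            }
        }
    }
}

fn main() {
    let worker = GrpcWorkerHandle { state: Arc::new(AtomicU8::new(IDLE)) };
    worker.flush(); // wakes the worker
    worker.flush(); // skipped: already woken
    worker.state.store(BUSY, Ordering::Release);
    worker.flush(); // delayed: worker busy, batch grows in the meantime
}
```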
17. Batch raft messages evaluation
[Chart: X axis is time, Y axis is the batch size]
18. Single thread raftstore
● Pseudo code shows the old single-threaded raftstore (see the sketch below)
● `on_base_tick` is called periodically, or after `on_message`
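A hedged Rust sketch of the old single-threaded event loop, with hypothetical types. One thread owns every peer, so a busy region delays all the others; `on_base_tick` runs after a message and on a periodic timeout.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, RecvTimeoutError};
use std::time::Duration;

struct RaftMessage { region_id: u64 }
struct Peer; // state machine of one region's replica

impl Peer {
    fn on_message(&mut self, _msg: RaftMessage) { /* step the raft state machine */ }
    fn on_base_tick(&mut self) { /* drive elections, heartbeats, ready handling */ }
}

fn main() {
    let (tx, rx) = channel::<RaftMessage>();
    let mut peers: HashMap<u64, Peer> = HashMap::new();
    peers.insert(1, Peer);
    tx.send(RaftMessage { region_id: 1 }).unwrap();
    drop(tx); // close the channel so the demo loop terminates

    let base_tick = Duration::from_millis(100);
    // The single raftstore thread: every peer is handled here, in order.
    loop {
        match rx.recv_timeout(base_tick) {
            Ok(msg) => {
                if let Some(peer) = peers.get_mut(&msg.region_id) {
                    peer.on_message(msg);
                    // `on_base_tick` also runs right after handling a message.
                    peer.on_base_tick();
                }
            }
            Err(RecvTimeoutError::Timeout) => {
                // ... and periodically for every peer, even idle ones.
                for peer in peers.values_mut() {
                    peer.on_base_tick();
                }
            }
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
}
```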
19. Threaded raftstore/apply design principles
● Balance the load between threads
  ○ grouping regions by id is not a good idea
● Avoid letting regions become hungry
  ○ interrupt/resume processing of one busy region
● Cooperate with region split/merge
20. Threaded raftstore/apply core design
● A gRPC thread calls `RaftRouter::send` to send raft messages to the PeerFsm
● If the peer is not created yet, the message is redirected to the StoreFsm, which creates the PeerFsm
● Following messages can be sent to the PeerFsm directly
● The router then notifies the `BatchSystem` to fetch and process the PeerFsm
[Diagram: gRPC thread -> RaftRouter -> PeerFsm (or StoreFsm, to create the PeerFsm); RaftRouter notifies the BatchSystem, which schedules raftstore threads]
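A hedged Rust sketch of the routing step, using simplified stand-ins for RaftRouter, PeerFsm, StoreFsm and the BatchSystem notification:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct RaftMessage { region_id: u64 }

#[derive(Default)]
struct PeerFsm { mailbox: Vec<RaftMessage> } // per-region state machine + mailbox

#[derive(Default)]
struct Router {
    peers: Mutex<HashMap<u64, PeerFsm>>,    // created PeerFsms
    store_mailbox: Mutex<Vec<RaftMessage>>, // StoreFsm mailbox
    ready: Mutex<Vec<u64>>,                 // regions the BatchSystem should poll
}

impl Router {
    // Called from gRPC threads.
    fn send(&self, msg: RaftMessage) {
        let region_id = msg.region_id;
        {
            let mut peers = self.peers.lock().unwrap();
            if let Some(peer) = peers.get_mut(&region_id) {
                // Following messages go to the PeerFsm directly.
                peer.mailbox.push(msg);
            } else {
                // Peer not created yet: redirect to the StoreFsm.
                self.store_mailbox.lock().unwrap().push(msg);
            }
        }
        // Notify the BatchSystem that this region has work to fetch and process.
        self.ready.lock().unwrap().push(region_id);
    }

    // Simplified StoreFsm handling: create the missing PeerFsm and deliver
    // the redirected messages to it.
    fn handle_store_msgs(&self) {
        let msgs: Vec<RaftMessage> = self.store_mailbox.lock().unwrap().drain(..).collect();
        let mut peers = self.peers.lock().unwrap();
        for msg in msgs {
            peers.entry(msg.region_id).or_default().mailbox.push(msg);
        }
    }
}

fn main() {
    let router = Router::default();
    router.send(RaftMessage { region_id: 7 }); // redirected to the StoreFsm mailbox
    router.handle_store_msgs();                // StoreFsm creates the PeerFsm
    router.send(RaftMessage { region_id: 7 }); // now delivered to the PeerFsm directly
    println!("{} notification(s) queued for the BatchSystem",
             router.ready.lock().unwrap().len());
}
```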
21. Threaded raftstore/apply core design
● Pseudo code shows the loop in every raftstore thread (see the sketch below)
● The asynchronous apply threads work in a similar way
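A hedged Rust sketch of the per-thread loop, with hypothetical names and a plain mutex-protected queue instead of the real scheduling machinery. Processing a bounded amount of work per peer and rescheduling busy peers is what keeps regions from starving:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct PeerFsm { region_id: u64, pending_msgs: usize }

struct BatchSystem {
    queue: Mutex<VecDeque<PeerFsm>>, // ready PeerFsms, shared by all raftstore threads
}

const MAX_BATCH: usize = 256;      // max fsms fetched per batch
const MSGS_PER_ROUND: usize = 128; // bound per peer before yielding

impl BatchSystem {
    // One iteration of the loop every raftstore thread runs. Returns false
    // when there is currently nothing to do (the real threads would block).
    fn poll_once(&self) -> bool {
        // 1) fetch a batch of ready PeerFsms from the shared queue.
        let mut batch = Vec::new();
        {
            let mut q = self.queue.lock().unwrap();
            while batch.len() < MAX_BATCH {
                match q.pop_front() {
                    Some(fsm) => batch.push(fsm),
                    None => break,
                }
            }
        }
        if batch.is_empty() {
            return false;
        }
        // 2) process each fsm: a bounded number of messages plus its raft ready.
        for mut fsm in batch {
            let n = fsm.pending_msgs.min(MSGS_PER_ROUND);
            fsm.pending_msgs -= n;
            println!("region {}: handled {} message(s)", fsm.region_id, n);
            // 3) a still-busy region is rescheduled instead of hogging the
            //    thread, so other regions never become hungry.
            if fsm.pending_msgs > 0 {
                self.queue.lock().unwrap().push_back(fsm);
            }
        }
        true
    }
}

fn main() {
    let system = BatchSystem { queue: Mutex::new(VecDeque::new()) };
    {
        let mut q = system.queue.lock().unwrap();
        q.push_back(PeerFsm { region_id: 1, pending_msgs: 300 });
        q.push_back(PeerFsm { region_id: 2, pending_msgs: 10 });
    }
    // TiKV runs this loop in several raftstore threads; one thread shows the shape.
    while system.poll_once() {}
}
```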
22. Threaded raftstore & apply evaluation
23. What's GC
● Clean up all stale MVCC versions
● TiKV uses Percolator to implement MVCC
● Garbage data lives in the default CF and the write CF
● Locks live in the lock CF
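To make the CF layout concrete, here is a simplified Rust sketch using in-memory maps in place of RocksDB column families. The key encodings and the GC rule (drop committed versions older than the safe point while keeping the newest one at or before it) follow the slide, but real TiKV GC also has to handle delete and rollback records.

```rust
use std::collections::BTreeMap;

type Key = (Vec<u8>, u64); // (row key, timestamp)

struct Store {
    default_cf: BTreeMap<Key, Vec<u8>>, // rowkey_start_ts -> value
    write_cf: BTreeMap<Key, u64>,       // rowkey_commit_ts -> start_ts of that write
    lock_cf: BTreeMap<Vec<u8>, u64>,    // rowkey -> lock; cleared by resolve-lock before GC
}

impl Store {
    // Remove committed versions whose commit_ts is older than the safe point,
    // keeping the newest version at or before it so reads still succeed.
    fn gc(&mut self, safe_point: u64) {
        let mut latest_kept: BTreeMap<Vec<u8>, u64> = BTreeMap::new();
        for ((row, commit_ts), _) in self.write_cf.iter().rev() {
            if *commit_ts <= safe_point && !latest_kept.contains_key(row) {
                latest_kept.insert(row.clone(), *commit_ts);
            }
        }
        self.write_cf.retain(|(row, commit_ts), _| {
            *commit_ts > safe_point || latest_kept.get(row) == Some(commit_ts)
        });
        // Garbage values live in the default CF as well: keep only values that
        // are still referenced by a surviving write record.
        let live_starts: std::collections::BTreeSet<(Vec<u8>, u64)> = self
            .write_cf
            .iter()
            .map(|((row, _), start_ts)| (row.clone(), *start_ts))
            .collect();
        self.default_cf.retain(|k, _| live_starts.contains(k));
    }
}

fn main() {
    let mut s = Store {
        default_cf: BTreeMap::new(),
        write_cf: BTreeMap::new(),
        lock_cf: BTreeMap::new(),
    };
    // Two versions of row "a": (start_ts, commit_ts) = (1, 2) and (5, 6).
    s.default_cf.insert((b"a".to_vec(), 1), b"v1".to_vec());
    s.default_cf.insert((b"a".to_vec(), 5), b"v2".to_vec());
    s.write_cf.insert((b"a".to_vec(), 2), 1);
    s.write_cf.insert((b"a".to_vec(), 6), 5);
    s.gc(10); // safe point 10: only the newest version (commit_ts 6) survives
    assert_eq!(s.write_cf.len(), 1);
    assert_eq!(s.default_cf.len(), 1);
    println!("after GC: {} write record(s), {} value(s)", s.write_cf.len(), s.default_cf.len());
}
```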
24. GC driven by clients
● TiDB calculates a safe point
● TiDB resolves locks first, to clean all locks before the safe point
● For all regions, TiDB sends GC RPCs to the TiKVs
● Cons
  ○ Can't utilize all TiKVs
  ○ Needs to be implemented in every tikv-client
25. Distributed GC
● TiDB calculates a safe point
● TiDB resolves locks first, to clean all locks before the safe point
● TiDB puts the safe point into PD
● Each TiKV fetches the latest safe point, then GCs the regions whose leader is on that store (see the sketch below)
● TiKVs collect region information with an observer registered in raftstore
● Evaluation: GC time dropped from 1 hour to 15 minutes on a 500 GB cluster
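A hedged Rust sketch of the TiKV-side GC worker described above; `fetch_safe_point_from_pd` and `local_regions` are hypothetical stand-ins for the PD client call and the raftstore observer's region cache.

```rust
use std::thread;
use std::time::Duration;

struct RegionInfo { id: u64, is_leader: bool }

// Stand-in for fetching the safe point that TiDB uploaded to PD.
fn fetch_safe_point_from_pd() -> u64 { 42 }

// Stand-in for the region cache maintained by the raftstore observer.
fn local_regions() -> Vec<RegionInfo> {
    vec![RegionInfo { id: 1, is_leader: true }, RegionInfo { id: 2, is_leader: false }]
}

fn gc_region(region_id: u64, safe_point: u64) {
    println!("GC region {} up to safe point {}", region_id, safe_point);
}

fn main() {
    let mut last_safe_point = 0;
    for _ in 0..1 { // the real worker loops forever; one round is enough here
        let safe_point = fetch_safe_point_from_pd();
        if safe_point > last_safe_point {
            last_safe_point = safe_point;
            // Only GC the regions whose leader lives on this store, so the
            // work is naturally spread across every TiKV in the cluster.
            for region in local_regions().into_iter().filter(|r| r.is_leader) {
                gc_region(region.id, safe_point);
            }
        }
        thread::sleep(Duration::from_millis(10)); // poll interval (shortened for the demo)
    }
}
```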
26. More improvements on the way
● Joint consensus to speed up configuration changes
● Avoid fetching the commit timestamp from PD
● Remove the scheduler thread
27. Thanks! We are hiring: hire@pingcap.com