PingCAP Infra Meetup #89: What's New in Latest TiKV

Qu Peng opened by introducing three new optimizations in the latest version of TiKV:
● The batch gRPC/Raft messages feature collects messages into a batch before sending, reducing the number of network-related system calls and improving performance.
● The threaded raftstore/apply feature replaces two previously single-threaded components with multi-threaded ones, while avoiding data skew and starvation, eliminating TiKV's write bottleneck.
● Distributed GC heavily refactors the GC-related code: GC is now driven by TiKV itself rather than by the client, which simplifies client implementations and speeds up GC by 3x.
Finally, Qu Peng shared several optimizations still under development, including committing transactions without fetching a timestamp.

1. What's new in latest TiKV (qupeng@pingcap.com)

2. Summary
● Overview of TiDB and TiKV
● New features and evaluation
○ batch gRPC/Raft messages
○ threaded raftstore & threaded asynchronous apply
○ distributed garbage collection
● Lots of groundwork for future features

3. Overview of TiDB
● Stateless SQL layer: multiple TiDB instances, which send metadata / timestamp requests to PD
● Placement Driver (PD): control flow for balance / failover
● Distributed storage layer: multiple TiKV instances, replicated via Raft
[Diagram: all components communicate over gRPC.]

4. Overview of TiKV (layers, top to bottom)
● Server: send/recv messages over gRPC
● KV API: handle KV requests; Coprocessor: handle pushdown read requests
● Transaction: 2PC / distributed transactions
● MVCC: rowkey_ts -> value
● Raft KV: provide a KV API on top of Raft
● RocksDB: data is stored in RocksDB

5. Concepts
● Region
○ A continuous range of data
● Peer
○ A replica of a Region
● ts
○ A logical timestamp, generated by PD

6. TiKV Write Flow
1) begin to write
3) transaction constraint check
4) async call RaftKV APIs
5) replicate by Raft
6) write to RocksDB
8) return from RaftKV APIs
9) write success
[Diagram: the request flows from Server through KV API / Coprocessor, Transaction, MVCC, and Raft KV down to RocksDB on three replicas; steps 2 and 7 appear only as arrows in the diagram.]

7. TiKV Read Flow
1) recv read request
2) dispatch to KV API or Coprocessor
3) send get-snapshot request
4) check leader and get snapshot from RocksDB
5) return snapshot
6) read and return result
7) finish read
[Diagram: Server → KV API / Coprocessor → Transaction / MVCC → Raft KV → RocksDB]

8. Bottlenecks before TiKV 3.0-beta
● Server (gRPC send/recv messages): high CPU usage
● Transaction / MVCC: GC isn't quick enough
● Raft KV: two single-threaded components
[Same layered diagram as slide 4, annotated with the bottlenecks above.]

9. gRPC CPU usage issue - the old C/S model
● TiDBs communicate with TiKVs via unary calls
● Every client goroutine can issue a unary call
● Too many calls cause high CPU usage
[Diagram: each TiDB sends a unary request to a TiKV and receives a unary response.]

10. gRPC CPU usage issue - the new C/S model
● TiDBs communicate with TiKVs via bi-directional streaming calls
● One goroutine collects all requests and batches them (sketched below)
● Another goroutine receives batched responses and extracts them
● A map[req_id]Request tracks in-flight requests
● Wait a short while, if needed, when collecting requests so batches can grow
[Diagram: a batch request stream and a batch response stream between TiDB and each TiKV.]
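The batching pattern looks roughly like the sketch below. This is a minimal Rust rendering of the model described above, with made-up `Request`/`Response` types and thresholds; the real implementation is the Go tikv-client, which uses goroutines over a bi-directional gRPC stream.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

// Illustrative stand-ins for the real request/response protos.
struct Request { id: u64 }
struct Response { id: u64 }

// Collector loop: drain pending requests into one batch, waiting a short
// while so more requests can join the batch under light load.
fn collect_loop(rx: Receiver<Request>, send_batch: impl Fn(Vec<Request>)) {
    loop {
        // Block for the first request of the next batch.
        let first = match rx.recv() {
            Ok(r) => r,
            Err(_) => return,
        };
        let mut batch = vec![first];
        while batch.len() < 128 {
            match rx.recv_timeout(Duration::from_millis(1)) {
                Ok(r) => batch.push(r),
                Err(RecvTimeoutError::Timeout) => break,
                Err(RecvTimeoutError::Disconnected) => return,
            }
        }
        // One streaming write instead of many unary calls.
        send_batch(batch);
    }
}

// Receiver side: extract batched responses and route each one back to the
// caller that issued the request, via a map keyed by request id.
fn dispatch(responses: Vec<Response>, inflight: &mut HashMap<u64, Sender<Response>>) {
    for resp in responses {
        if let Some(tx) = inflight.remove(&resp.id) {
            let _ = tx.send(resp);
        }
    }
}
```

The short wait trades a little latency for fewer network-related system calls; under heavy load, batches fill up without any waiting.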

11. Batch gRPC message evaluation
● X axis is concurrency and Y axis is QPS

12. Batch gRPC message evaluation
● X axis is time and Y axis is the batch size

13. gRPC CPU usage issue - raft transport
● raftstore calls `send` to buffer a message into the raft client
● raftstore calls `flush` to send the buffered messages to gRPC
[Diagram: ① send caches the message in the raft client's message buffer; ② flush pushes the buffered messages to the gRPC thread; ③ flush also informs the gRPC thread.]

14. gRPC CPU usage issue - lots of flushes
● Pseudo code (reconstructed below) shows how raftstore calls `send` and `flush`
● Every time a leader becomes active, it calls a `flush`
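The slide's original pseudo code is not preserved; the following Rust sketch (with illustrative `Peer` and `RaftClient` types, not TiKV's actual definitions) shows the behavior it described: every peer with ready messages buffers them with `send` and then triggers its own `flush`.

```rust
struct RaftMessage;

struct Peer {
    pending: Vec<RaftMessage>,
}

impl Peer {
    fn has_ready(&self) -> bool {
        !self.pending.is_empty()
    }
    fn take_messages(&mut self) -> Vec<RaftMessage> {
        std::mem::take(&mut self.pending)
    }
}

struct RaftClient {
    buffer: Vec<RaftMessage>,
}

impl RaftClient {
    // `send` only buffers the message inside the raft client.
    fn send(&mut self, msg: RaftMessage) {
        self.buffer.push(msg);
    }
    // `flush` hands every buffered message to gRPC.
    fn flush(&mut self) {
        for _msg in self.buffer.drain(..) {
            // write to the gRPC stream (omitted)
        }
    }
}

// Old behavior: each active leader triggers its own flush, so one event
// loop iteration can issue many small writes to gRPC.
fn handle_raft_ready(peers: &mut [Peer], client: &mut RaftClient) {
    for peer in peers.iter_mut().filter(|p| p.has_ready()) {
        for msg in peer.take_messages() {
            client.send(msg);
        }
        client.flush(); // one flush per active peer: lots of flushes
    }
}
```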

15. Batch raft messages - new raft transport
● raftstore calls `send` to push a message into the gRPC worker directly
● raftstore calls `flush` to inform the gRPC threads to do their work
● gRPC threads collect messages into a batch before sending them to the network
[Diagram: ① send puts the message directly into the gRPC thread's message buffer; ② flush ③ informs the gRPC thread if needed.]

16. Batch raft messages - new raft transport
● Pseudo code (reconstructed below) shows the new internal logic of `flush`
● If the downstream gRPC thread is already awake, skip the notification
● If the downstream gRPC thread is busy with other things, delay the notification to collect a larger batch
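A minimal sketch of the new `flush` logic, assuming an illustrative three-state flag for the gRPC thread; the real code lives in TiKV's raft client.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const IDLE: usize = 0; // gRPC thread is sleeping and must be woken
const AWAKE: usize = 1; // already woken up, the notify can be skipped
const BUSY: usize = 2; // busy with other work, delay the notification

struct GrpcThreadHandle {
    state: AtomicUsize,
}

impl GrpcThreadHandle {
    fn wake(&self) {
        // e.g. signal a condvar or an event fd (omitted)
    }

    // `flush` only informs the gRPC thread; `send` already pushed the
    // messages into its queue, so larger batches form while it is busy.
    fn flush(&self) {
        match self.state.load(Ordering::Acquire) {
            AWAKE => { /* skip: it will pick up the new messages anyway */ }
            BUSY => { /* delay: let the queue grow into a larger batch */ }
            _ => self.wake(), // IDLE: wake it up to drain the queue
        }
    }
}
```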

17. Batch raft messages evaluation
● X axis is time and Y axis is the batch size

18. Single-threaded raftstore
● Pseudo code (reconstructed below) shows the old single-threaded raftstore
● `on_base_tick` is called periodically, or after `on_message`
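The slide's original pseudo code is not preserved; this Rust sketch (illustrative names) shows the shape of the old loop: one thread drains a single channel for every region, calling `on_base_tick` after each message and periodically on timeout.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

struct RaftMessage {
    region_id: u64,
}

struct Peer;
impl Peer {
    fn on_message(&mut self, _msg: RaftMessage) {
        // step the raft state machine (omitted)
    }
    fn on_base_tick(&mut self) {
        // drive elections, heartbeats, ... (omitted)
    }
}

fn raftstore_loop(rx: Receiver<RaftMessage>, peers: &mut HashMap<u64, Peer>) {
    loop {
        // One thread handles the messages of *every* region in turn,
        // which is the write bottleneck removed by threaded raftstore.
        match rx.recv_timeout(Duration::from_millis(100)) {
            Ok(msg) => {
                if let Some(peer) = peers.get_mut(&msg.region_id) {
                    peer.on_message(msg);
                    peer.on_base_tick(); // also called after a message
                }
            }
            // Called periodically when no messages arrive.
            Err(RecvTimeoutError::Timeout) => {
                for peer in peers.values_mut() {
                    peer.on_base_tick();
                }
            }
            Err(RecvTimeoutError::Disconnected) => return,
        }
    }
}
```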

19. Threaded raftstore/apply design principles
● Balance load between threads
○ grouping regions by id is not a good idea
● Avoid letting regions starve
○ interrupt and resume processing of one busy region
● Cooperate with region split/merge

20. Threaded raftstore/apply core design
● A gRPC thread calls `RaftRouter::send` to send raft messages to a PeerFsm (sketched below)
● If the peer is not created yet, the message is redirected to the StoreFsm, which creates the PeerFsm
● Following messages can be sent to the PeerFsm directly
● The router then notifies the `BatchSystem` to fetch and process the PeerFsm
[Diagram: gRPC thread → RaftRouter → send → PeerFsm (or StoreFsm → create → PeerFsm); RaftRouter notifies the BatchSystem, whose raftstore threads process the PeerFsm.]
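A minimal sketch of the routing step, with illustrative channel mailboxes standing in for TiKV's actual FSM machinery:

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

struct RaftMessage {
    region_id: u64,
}

struct RaftRouter {
    peers: HashMap<u64, Sender<RaftMessage>>, // one mailbox per PeerFsm
    store: Sender<RaftMessage>,               // mailbox of the StoreFsm
}

impl RaftRouter {
    fn send(&self, msg: RaftMessage) {
        match self.peers.get(&msg.region_id) {
            // Following messages go straight to the existing PeerFsm.
            Some(mailbox) => {
                let _ = mailbox.send(msg);
            }
            // The first message is redirected to the StoreFsm, which
            // creates the PeerFsm and registers its mailbox.
            None => {
                let _ = self.store.send(msg);
            }
        }
        // Finally, notify the BatchSystem so a raftstore thread fetches
        // and processes the fsm (notification omitted in this sketch).
    }
}
```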

21. Threaded raftstore/apply core design
● Pseudo code (reconstructed below) shows the loop in every raftstore thread
● Things in the asynchronous apply threads are similar
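The per-thread loop might be sketched as follows, with illustrative `PeerFsm` and `BatchSystem` types: each raftstore thread fetches a batch of notified peers, gives each a bounded work quota, and re-queues peers that still have pending messages so no region starves.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

struct PeerFsm;
impl PeerFsm {
    // Handle at most `quota` pending messages; return true if more remain.
    fn handle_messages(&mut self, _quota: usize) -> bool {
        false
    }
}

struct BatchSystem {
    ready: Mutex<VecDeque<PeerFsm>>, // peers notified via RaftRouter::send
    cv: Condvar,
}

fn poll_loop(system: Arc<BatchSystem>) {
    loop {
        let mut ready = system.ready.lock().unwrap();
        while ready.is_empty() {
            ready = system.cv.wait(ready).unwrap();
        }
        // Fetch a bounded batch; any idle thread can pick up any peer,
        // which balances load without grouping regions by id.
        let n = ready.len().min(256);
        let batch: Vec<PeerFsm> = ready.drain(..n).collect();
        drop(ready);

        for mut fsm in batch {
            // A still-busy peer is interrupted and re-queued instead of
            // monopolizing the thread, so other regions never starve.
            if fsm.handle_messages(1024) {
                let mut ready = system.ready.lock().unwrap();
                ready.push_back(fsm);
                system.cv.notify_one();
            }
        }
    }
}
```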

22. Threaded raftstore & apply evaluation

23. What's GC?
● Cleans up all stale MVCC versions
● TiKV uses Percolator to implement MVCC
● Garbage data lives in the default CF and the write CF
● Locks live in the lock CF
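For illustration, here is a simplified sketch of how the three column families could be keyed under Percolator-style MVCC; the real TiKV key encodings differ (e.g. memcomparable encoding and inverted timestamps), so treat the functions below as assumptions.

```rust
// default CF: (key, start_ts) -> value. Large values live here and become
// garbage once no reader can see that version anymore.
fn default_cf_key(user_key: &[u8], start_ts: u64) -> Vec<u8> {
    let mut k = user_key.to_vec();
    k.extend_from_slice(&start_ts.to_be_bytes());
    k
}

// write CF: (key, commit_ts) -> { start_ts, op }. Stale versions below the
// GC safe point are the garbage that GC removes.
fn write_cf_key(user_key: &[u8], commit_ts: u64) -> Vec<u8> {
    let mut k = user_key.to_vec();
    k.extend_from_slice(&commit_ts.to_be_bytes());
    k
}

// lock CF: key -> lock info of an in-flight transaction. Locks below the
// safe point must be resolved before GC can advance.
```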

24. GC driven by clients
● TiDB calculates a safe point
● TiDB resolves locks first, to clean all locks before the safe point
● For every region, TiDB sends a GC RPC to the TiKVs
● Cons
○ Can't utilize all TiKVs
○ Has to be implemented in every tikv-client

25. Distributed GC
● TiDB calculates a safe point
● TiDB resolves locks first, to clean all locks before the safe point
● TiDB pushes the safe point to PD
● Each TiKV fetches the latest safe point, then runs GC on the regions whose leader is on that store
● TiKVs collect region information with an observer registered in raftstore
● Evaluation: reduces GC time from 1 hour to 15 minutes on a 500G cluster
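A TiKV-side GC worker under this scheme could be sketched as follows, with a hypothetical `Pd` trait and `Region` type standing in for the real interfaces:

```rust
use std::time::Duration;

// Hypothetical PD interface: returns the safe point published by TiDB.
trait Pd {
    fn get_gc_safe_point(&self) -> u64;
}

struct Region {
    leader_on_this_store: bool,
}

fn gc_worker(
    pd: &dyn Pd,
    local_regions: impl Fn() -> Vec<Region>, // fed by the raftstore observer
    gc_region: impl Fn(&Region, u64),        // deletes stale versions
) {
    let mut last_safe_point = 0;
    loop {
        // Each TiKV polls the latest safe point instead of waiting for a
        // GC RPC from the client, so every store does its own share.
        let safe_point = pd.get_gc_safe_point();
        if safe_point > last_safe_point {
            // Only GC the regions whose leader is on this store.
            for region in local_regions().iter().filter(|r| r.leader_on_this_store) {
                gc_region(region, safe_point);
            }
            last_safe_point = safe_point;
        }
        std::thread::sleep(Duration::from_secs(60));
    }
}
```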

26. More improvements on the way
● Joint consensus to speed up configuration changes
● Avoid fetching the commit timestamp from PD
● Remove the scheduler thread

27. Thanks! We are hiring: hire@pingcap.com

TiDB is a converged database product targeting both online transactional and online analytical processing (HTAP: Hybrid Transactional/Analytical Processing), featuring one-click horizontal scaling, strongly consistent multi-replica data safety, distributed transactions, and real-time OLAP.