Huawei engineers describe problems they encountered while running HBase and the solutions they built, such as accelerating HMaster startup, handling replication timeouts, and recovering regions stuck in transition (RIT), as well as improvements to OpenTSDB running on HBase.


1. HBaseConAsia2018: HBase and OpenTSDB Practices at Huawei. Pankaj Kumar, Zhongchaoqiang, Guoyijun, Zhiwei {Pankaj.kr, zhongchaoqiang, guoyijun, wei.zhi}@huawei.com

2. $ whoami: HBase Tech Lead @ Huawei India; Apache HBase Contributor; 5 years of experience in Big Data related projects

3. HBase @ Huawei
- Migrated from version 1.0.2
- Currently on version 1.3.1 plus Secondary Index, MOB, and Multi Split
- Migrating to a 2.1.x cluster this year

4. Content: 01 HBase Practices; 02 OpenTSDB Practices

5. HBase Practices: Accelerate HMaster Startup, Enhanced Replication, Reliable Region Assignment

6. 1.1 Accelerate HMaster Startup

7. Accelerate HMaster Startup
Problem: HMaster is not available for a long duration on failover/restart.
Deployment scenario:
- Large cluster with 500+ nodes
- 5000+ tables and 120,000+ regions
- 10 namespaces
Problems were discovered in multiple areas of Master startup:
- Slow region locality computation on startup
  - Region locality is calculated serially
  - Too much time is spent in region locality calculation
- HMaster aborting due to namespace initialization failure
  - Slow SSH/SCP (similar to HBASE-14190)
- Table info loading taking too much time
  - High NameNode latency
  - Many other services creating a lot of load on the NameNode

8. Accelerate HMaster Startup
- Slow region locality computation on startup
  - Accelerate region locality computation by computing it in parallel (see the sketch after this list)
  - Detach region locality computation from startup
  - Similar solution: HBASE-16570
- HMaster aborting due to namespace initialization failure
  - Assign system table regions ahead of user table regions
  - Assign system tables to HMaster (configure all system tables in hbase.balancer.tablesOnMaster)
  - On cluster/master startup, process the old HMaster's SSH/SCP ahead of the other RegionServers
  - SSH/SCP will replay the WAL and assign the system table regions first
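To make the parallel-locality idea concrete, here is a minimal Java sketch (not the actual Huawei patch): regions are fanned out to a thread pool and the results collected, with computeLocality() standing in for the per-region HDFS block-location lookup.

```java
import java.util.*;
import java.util.concurrent.*;

/** Sketch: compute HDFS locality for many regions in parallel instead of serially. */
public class ParallelLocalityCalculator {
  private final ExecutorService pool =
      Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

  /** Stand-in for the per-region HDFS block-location lookup. */
  private float computeLocality(String regionName) {
    // ... query the NameNode for the region's store file block locations ...
    return 1.0f;
  }

  public Map<String, Float> computeAll(List<String> regionNames) throws Exception {
    Map<String, Future<Float>> futures = new HashMap<>();
    for (String region : regionNames) {
      futures.put(region, pool.submit(() -> computeLocality(region)));
    }
    Map<String, Float> localities = new HashMap<>();
    for (Map.Entry<String, Future<Float>> e : futures.entrySet()) {
      localities.put(e.getKey(), e.getValue().get()); // blocks until that region is done
    }
    return localities;
  }
}
```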

9. Accelerate HMaster Startup: Table Info Loading on Master Startup
Example: suppose there are two namespaces (default and hbase) and 5 tables in total, with the following path structure in HDFS:

default namespace:
/hbase/data/default/t1/.tabledesc/.tableinfo.0000000001
/hbase/data/default/t1/.tabledesc/.tableinfo.0000000002
/hbase/data/default/t1/.tabledesc/.tableinfo.0000000003
/hbase/data/default/t2/.tabledesc/.tableinfo.0000000001

hbase namespace:
/hbase/data/hbase/acl/.tabledesc/.tableinfo.0000000001
/hbase/data/hbase/meta/.tabledesc/.tableinfo.0000000001
/hbase/data/hbase/namespace/.tabledesc/.tableinfo.0000000001

RPC calls made to the NameNode while loading table info:
- List operation to get all the namespaces: path /hbase/data/*, returns the file status of all namespaces. 1 RPC.
- List operation on each namespace to get all its tables: e.g. /hbase/data/default, returns the file status of all tables in that namespace. 2 RPCs (= number of namespaces in the cluster).
- List operation on each table to get all its tableinfo files: e.g. /hbase/data/default/t1/.tabledesc, returns the file status of all tableinfo files of the table. 5 RPCs (= number of tables in the cluster).
- Open call for each table's latest tableinfo file: e.g. /hbase/data/default/t1/.tabledesc/.tableinfo.0000000003, returns the stream to the tableinfo file. 5 RPCs (= number of tables in the cluster).

Total RPCs to the NameNode = 1 + namespace count + 2 * table count = 13 in this example.

10. Accelerate HMaster Startup: Table Info Loading on Master Startup
- 2011 RPC calls for 10 namespaces and 1000 tables in a cluster
- If the NameNode is busy, this hugely impacts HMaster startup
Solution: reduce the number of RPC calls to the NameNode
- HMaster makes a single call to get the tableinfo paths
- Get the LocatedFileStatus of all tableinfo paths based on the pattern /hbase/data/*/*/.tabledesc/.tableinfo*
- LocatedFileStatus also contains the block locations of the tableinfo file along with the FileStatus details
- Using the LocatedFileStatus, the DFS client connects directly to the DataNode in FileSystem#open(), avoiding the NameNode RPC to get the block locations of the tableinfo file
Improvement: on a highly overloaded HDFS cluster, loading the info of 5000 tables took around 97 seconds, compared to 224 seconds earlier.
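As a rough illustration of the pattern-based listing (not the actual HBase change), the per-namespace and per-table listings can be collapsed into one glob call; the real optimization additionally keeps the block locations returned with the listing so the subsequent open() can go straight to the DataNode:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TableInfoLister {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // List all tableinfo files with one glob pattern instead of explicit
    // per-namespace and per-table list calls.
    FileStatus[] tableInfos =
        fs.globStatus(new Path("/hbase/data/*/*/.tabledesc/.tableinfo*"));

    for (FileStatus status : tableInfos) {
      // Pick the latest .tableinfo file per table and open it here; the real
      // optimization also reuses the block locations returned with the
      // listing so open() can go straight to the DataNode.
      System.out.println(status.getPath());
    }
  }
}
```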

11. 1.2 Enhanced Replication

12. Adaptive Call Timeout
Problem:
- Replication may time out when the peer cluster is not able to replicate the entries
- This can be worked around by increasing hbase.rpc.timeout at the source cluster, but that impacts other RPC requests
- In the bulk-load replication scenario, a fixed RPC timeout may not be enough for copying bigger HFiles (refer to HBASE-14937)
Solution:
- The source cluster should wait longer
- New configuration parameter hbase.replication.rpc.timeout, whose default value is the same as hbase.rpc.timeout
- On each CallTimeoutException, increase this replication timeout by a fixed multiplier
- Increase the replication timeout only up to a configured number of times (see the sketch below)
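A minimal sketch of the adaptive timeout logic described above; the class, field, and method names are illustrative rather than the actual HBase implementation:

```java
import org.apache.hadoop.conf.Configuration;

/** Sketch of an adaptive replication RPC timeout; names are illustrative. */
public class AdaptiveReplicationTimeout {
  private final long baseTimeoutMs;
  private final long multiplier;
  private final int maxIncreases;
  private int increases = 0;
  private long currentTimeoutMs;

  public AdaptiveReplicationTimeout(Configuration conf) {
    // Falls back to hbase.rpc.timeout when the replication-specific key is absent.
    this.baseTimeoutMs = conf.getLong("hbase.replication.rpc.timeout",
        conf.getLong("hbase.rpc.timeout", 60000));
    this.multiplier = 2;       // assumed fixed multiplier
    this.maxIncreases = 3;     // assumed cap on how many times we back off
    this.currentTimeoutMs = baseTimeoutMs;
  }

  /** Called whenever the replication RPC throws a CallTimeoutException. */
  public void onCallTimeout() {
    if (increases < maxIncreases) {
      currentTimeoutMs *= multiplier;
      increases++;
    }
  }

  /** Reset once a batch of entries has been shipped successfully. */
  public void onSuccess() {
    currentTimeoutMs = baseTimeoutMs;
    increases = 0;
  }

  public long currentTimeoutMs() {
    return currentTimeoutMs;
  }
}
```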

13. Cross-Realm Support
Problem:
- Replication doesn't work with Kerberos cross-realm trust when the principal domain name is not the machine hostname
- On adding a new host:
  - Add a principal for the newly added host in the KDC
  - Generate a new keytab file
  - Update it across the other hosts
- Creating and replacing new keytab files is a rigorous task for the user
Solution:
- HBASE-14866: configure the peer cluster principal in the replication peer config (see the example below)
- HBASE-15254 (open): no need to configure it in advance, fetch it at runtime
  - Make an RPC call to the peer HBase cluster and fetch the principal
  - Make the RPC connection based on this server principal
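A hedged example of putting the peer principal into the replication peer config using the HBase 1.x client API; the property name used here is only a placeholder, since the actual key is defined by the HBASE-14866 patch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.replication.ReplicationAdmin;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddCrossRealmPeer {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (ReplicationAdmin replicationAdmin = new ReplicationAdmin(conf)) {
      ReplicationPeerConfig peerConfig = new ReplicationPeerConfig();
      peerConfig.setClusterKey("peer-zk1,peer-zk2,peer-zk3:2181:/hbase");
      // Placeholder property name: the real key comes from the HBASE-14866 patch.
      peerConfig.getConfiguration()
          .put("hbase.regionserver.kerberos.principal", "hbase/_HOST@PEER.REALM.COM");
      replicationAdmin.addPeer("1", peerConfig, null);
    }
  }
}
```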

14. 1.3 Reliable Region Assignment

15. RIT Problem
Problem:
- Regions get stuck in transition for a long duration due to faults in the cluster:
  - ZooKeeper node version mismatch
  - Slow RegionServer response
  - Unstable network
- Clients cannot perform read/write operations on regions that are in transition
- The balancer will not run
- The regions cannot be recovered until the cluster is restarted
Solution:
- Recover the regions by reassigning them
- Schedule a chore service that runs periodically, identifies regions stuck in transition for longer than a configurable threshold, and recovers them by reassignment (see the sketch below)
- New HBCK command to recover regions that have been in transition for a long duration
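A minimal sketch of such a chore built on HBase's ScheduledChore; the two helper methods stand in for AssignmentManager internals and are not real HBase APIs:

```java
import org.apache.hadoop.hbase.ScheduledChore;
import org.apache.hadoop.hbase.Stoppable;

/** Sketch of a chore that reassigns regions stuck in transition too long. */
public class RitRecoveryChore extends ScheduledChore {
  private final long ritThresholdMs;

  public RitRecoveryChore(Stoppable stopper, int periodMs, long ritThresholdMs) {
    super("RitRecoveryChore", stopper, periodMs);
    this.ritThresholdMs = ritThresholdMs;
  }

  @Override
  protected void chore() {
    // Identify regions in transition longer than the threshold and reassign them.
    for (String regionName : regionsInTransitionOlderThan(ritThresholdMs)) {
      reassign(regionName);
    }
  }

  private Iterable<String> regionsInTransitionOlderThan(long thresholdMs) {
    // ... read the AssignmentManager's in-memory region-in-transition states ...
    return java.util.Collections.emptyList();
  }

  private void reassign(String regionName) {
    // ... unassign and assign the region via the AssignmentManager ...
  }
}
```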

16. Double Assignment Problem
Problem:
- HMaster may assign a region to multiple RegionServers in a faulty cluster environment (e.g. a call timeout from an overloaded RegionServer)
- Old or new clients may receive inconsistent data
- It cannot be recovered until the cluster is restarted
Solution:
- Multiply-assigned regions should be closed and then assigned uniquely
- RegionServers send their server load details to HMaster through heartbeats
- Schedule a chore service that runs periodically and recovers the regions (see the sketch after the example):
  - Collect each RegionServer's load from HMaster memory
  - Identify duplicate regions from the region lists
  - Validate the duplicate regions against the HMaster Assignment Manager's in-memory region state
  - Close the region on the old RegionServer
  - Assign the region

17. Double Assignment Problem: Example
HMaster AM's in-memory region state: r1:RS1, r2:RS3, r3:RS3, r4:RS1, r5:RS2, r6:RS3, r7:RS1, r8:RS2
Heartbeats report: RegionServer RS1 holds r1, r4, r7; RS2 holds r2, r5, r8; RS3 holds r3, r6, r2.
The Double Assignment Recovery Chore finds that region r2 is multiply assigned to RS2 and RS3, so r2 is closed on RS2 as per the AM's in-memory state.
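The detection step of the recovery chore could look roughly like the following sketch (illustrative only; the maps stand in for the heartbeat-reported server loads and the AM's in-memory state):

```java
import java.util.*;

/** Sketch of finding regions reported by more than one RegionServer. */
public class DoubleAssignmentDetector {

  /**
   * regionsByServer: regions each RegionServer reported in its heartbeat.
   * amRegionToServer: the owning server per region according to the AM state.
   * Returns, for each duplicated region, the servers it should be closed on.
   */
  public static Map<String, List<String>> findDuplicates(
      Map<String, Set<String>> regionsByServer,
      Map<String, String> amRegionToServer) {

    // Invert the heartbeat view: region -> list of servers that reported it.
    Map<String, List<String>> regionToServers = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : regionsByServer.entrySet()) {
      for (String region : e.getValue()) {
        regionToServers.computeIfAbsent(region, r -> new ArrayList<>()).add(e.getKey());
      }
    }

    // Keep only regions reported by more than one server; close the region on
    // every server except the owner recorded in the AM's in-memory state.
    Map<String, List<String>> toClose = new HashMap<>();
    for (Map.Entry<String, List<String>> e : regionToServers.entrySet()) {
      if (e.getValue().size() > 1) {
        String owner = amRegionToServer.get(e.getKey());   // assumes AM knows the owner
        List<String> staleServers = new ArrayList<>(e.getValue());
        staleServers.remove(owner);
        toClose.put(e.getKey(), staleServers);
      }
    }
    return toClose;
  }
}
```

With the state from the example above, findDuplicates would return {r2=[RS2]}, matching the chore's decision to close r2 on RS2.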

18. OpenTSDB Improvement: OpenTSDB Basics; OpenTSDB Improvement

19. 2.1 TSDB Basics

20. Time Series: [chart showing example curves of wind force, temperature, and water level over time]

21. Time Series: "A time series is a series of numeric data points of some particular metric over time." (OpenTSDB documentation)

22. OpenTSDB Schema
sys.cpu.user host=webserver01,cpu=0 1533640130 20
sys.cpu.user host=webserver01,cpu=0 1533640140 25
sys.cpu.user host=webserver01,cpu=0 1533640150 30
sys.cpu.user host=webserver01,cpu=0 1533640160 32
sys.cpu.user host=webserver01,cpu=0 1533640170 35
sys.cpu.user host=webserver01,cpu=0 1533640180 40
Each line consists of a metric name, tags, a timestamp, and a value. OpenTSDB identifies a time series by a metric name and a group of tags; the tags distinguish different data sources.
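For reference, one of the data points above can be written through OpenTSDB's standard HTTP /api/put endpoint; the sketch below assumes a TSD listening on localhost with the default port 4242:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PutDataPoint {
  public static void main(String[] args) throws Exception {
    // One data point from the schema above, written through OpenTSDB's HTTP API.
    String json = "{\"metric\":\"sys.cpu.user\",\"timestamp\":1533640130,"
        + "\"value\":20,\"tags\":{\"host\":\"webserver01\",\"cpu\":\"0\"}}";

    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://localhost:4242/api/put").openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(json.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP status: " + conn.getResponseCode()); // 204 on success
  }
}
```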

23. TSDB Characteristics
- Write-dominated: the read rate is usually a couple of orders of magnitude lower
- Most queries are on the latest data
- Most queries are for aggregate analysis rather than individual data points
- Primarily inserts; updates and deletions rarely happen

24. Basic Functionality of a TSDB
- Rollups and Downsampling
- Pre-aggregates and Aggregates
- Interpolation
- Data Life Cycle Management

25. Single-Value Model vs. Multi-Value Model

Multi-value model (metric: Engine; schema: Metric | Timestamp | Tags | Fields):
Timestamp         | DeviceID | DeviceType | ZoneId | Temperature | Pressure | WaterLine
20180101 12:00:00 | ID001    | TypeA      | 1      | 66.9        | 1.33     | 42.5
20180101 12:00:00 | ID002    | TypeA      | 1      | 68.8        | 1.28     | 42.0
20180101 12:00:00 | ID003    | TypeA      | 1      | 67.3        | 1.35     | 41.7
20180101 12:01:00 | ID001    | TypeA      | 1      | 67.5        | 1.30     | 42.2

Single-value model (schema: Metric | Timestamp | Tags | Metric Value):
Metric      | Timestamp         | DeviceID | DeviceType | ZoneId | Value
Temperature | 20180101 12:00:00 | ID001    | TypeA      | 1      | 66.9
Pressure    | 20180101 12:00:00 | ID001    | TypeA      | 1      | 1.33
WaterLine   | 20180101 12:00:00 | ID001    | TypeA      | 1      | 42.5
Temperature | 20180101 12:01:00 | ID002    | TypeA      | 1      | 68.8
Pressure    | 20180101 12:01:00 | ID002    | TypeA      | 1      | 1.28
WaterLine   | 20180101 12:01:00 | ID002    | TypeA      | 1      | 42.0

26. Time Series Storage in HBase
A time series is separated into multiple blocks; each block holds one hour of data points.
- Writing block, e.g. Time Series A (20180808-10): one KeyValue per data point (T1, T2, ..., T7)
- Closed blocks, e.g. Time Series A (20180808-09), (20180808-08), (20180808-07): a single KeyValue of compacted data points per block

27. OpenTSDB Table Design
In the writing block, e.g. Time Series A (20180808-10), each data point (T1 ... T6) is stored as one KeyValue.
RowKey format: SALT (1 byte) + Metric ID (3 bytes) + Timestamp (4 bytes) + Tag Name ID (3 bytes) + Tag Value ID (3 bytes), with 1 <= N <= 8 tag name/value ID pairs.
Qualifier format (2 bytes): timestamp offset, value type, and value length.
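A simplified sketch of assembling a row key with this layout; UID resolution and salting are stubbed out and the code is not taken from OpenTSDB:

```java
import java.nio.ByteBuffer;

/** Sketch: build an OpenTSDB-style row key (salt + metric UID + base time + tag UIDs). */
public class RowKeyBuilder {
  private static final int SALT_BYTES = 1;
  private static final int UID_BYTES = 3;
  private static final int TIMESTAMP_BYTES = 4;

  public static byte[] buildRowKey(byte salt, byte[] metricUid,
                                   long epochSeconds, byte[][] tagUidPairs) {
    // The base timestamp is aligned down to the hour so all points of the same
    // hour land in the same row (block).
    int baseTime = (int) (epochSeconds - (epochSeconds % 3600));

    ByteBuffer key = ByteBuffer.allocate(
        SALT_BYTES + UID_BYTES + TIMESTAMP_BYTES + tagUidPairs.length * UID_BYTES);
    key.put(salt);
    key.put(metricUid);          // 3-byte metric UID
    key.putInt(baseTime);        // 4-byte base timestamp
    for (byte[] uid : tagUidPairs) {
      key.put(uid);              // alternating 3-byte tag name / tag value UIDs
    }
    return key.array();
  }
}
```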

28. OpenTSDB Compaction
1. Read all data points from the block of the last hour (individual KeyValues for T1 ... TX).
2. Compact them locally into a single KeyValue holding the data points of the whole block.
3. Write the compact row and delete all existing individual data points (the compact KeyValue plus a delete marker replace the individual KeyValues).

29. 2.2 OpenTSDB Improvement

30. Existing OpenTSDB Compaction Flow
OpenTSDB compaction is helpful for reads and decreases the total data volume, but it has the following side effects. In the existing flow, the TSD's HTTP handler adds row keys (Metric1_Hour1 ... Metric3_Hour3) to a compaction queue; a compaction thread removes them, reads each row back from HBase (Get), runs the OpenTSDB compact logic, and writes the result back (Put and Delete).
1. OpenTSDB compaction requires a read/compact/write cycle, causing extremely high traffic to the RegionServers.
2. Writing the compact row and deleting the existing individual data points amplify write I/O.

31. Understanding Write Amplification
1. The TSD client reads the time series data from the RegionServer.
2. The TSD writes the compacted row and a delete marker back to the RegionServer.
3. HBase internal compaction then rewrites the data yet again (the MemStore and HFiles hold single data points, delete markers, and compact rows).

32. New OpenTSDB Compaction Flow
A new HBase compaction implementation for OpenTSDB compacts data points while HBase compaction is running: inside the RegionServer, the compaction thread reads one row at a time from the HFiles, applies the OpenTSDB compact logic, and writes the compacted rows directly into the new HFile. The TSD only puts individual data points and can focus on handling user read/write requests.
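One way to express this idea is a coprocessor that wraps the compaction scanner, sketched below against the HBase 1.x API; mergeRow() is a placeholder for the OpenTSDB compact logic, and this is an illustration of the approach rather than the actual Huawei implementation:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.ScanType;
import org.apache.hadoop.hbase.regionserver.ScannerContext;
import org.apache.hadoop.hbase.regionserver.Store;

/** Sketch: compact OpenTSDB data points while HBase compaction rewrites rows. */
public class TsdbCompactingObserver extends BaseRegionObserver {

  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> c,
      Store store, final InternalScanner scanner, ScanType scanType) throws IOException {
    // Wrap the compaction scanner so every row is rewritten as one compact cell.
    return new InternalScanner() {
      @Override
      public boolean next(List<Cell> results) throws IOException {
        List<Cell> row = new ArrayList<>();
        boolean more = scanner.next(row);   // one row of individual data points
        if (!row.isEmpty()) {
          results.add(mergeRow(row));       // emit a single compacted cell instead
        }
        return more;
      }

      @Override
      public boolean next(List<Cell> results, ScannerContext context) throws IOException {
        return next(results);               // ignore scan limits in this sketch
      }

      @Override
      public void close() throws IOException {
        scanner.close();
      }
    };
  }

  /** Placeholder: the real logic concatenates the row's (qualifier, value) pairs
   *  into one new cell, OpenTSDB style; here we just return the first cell. */
  private Cell mergeRow(List<Cell> rowCells) {
    return rowCells.get(0);
  }
}
```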

33. No More Extra Write Amplification
There is no more extra write amplification caused by OpenTSDB data point compaction: the TSD only writes single data points, and rows are compacted as part of the normal HBase flush and HFile compaction.

34. Benchmark – Throughput Comparison
Note: TSDs were limited to 300,000 data points per second. After the optimization, write throughput improved significantly.

35. Benchmark – CPU and I/O Comparison

36. Data Life Cycle Management Per Metric
- Delete old data automatically to reduce the data volume.
- HBase table-level TTL is a coarse-grained mechanism, but different metrics may have different TTL requirements.
- A new HBase compaction implementation provides per-metric data life cycle management: during compaction, each row is read and only unexpired rows are written out, according to the TTL of the row's metric (see the sketch below).
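A minimal sketch of the per-metric TTL check that such a compaction could apply to each row; the TTL map and the metricIdOf() helper are illustrative assumptions:

```java
import java.util.Map;

/** Sketch: decide per row, based on its metric's TTL, whether to keep it during compaction. */
public class PerMetricTtlFilter {
  private final Map<String, Long> ttlMsByMetric;   // metric id (hex) -> TTL in milliseconds

  public PerMetricTtlFilter(Map<String, Long> ttlMsByMetric) {
    this.ttlMsByMetric = ttlMsByMetric;
  }

  /** Returns true if the row should be kept (not yet expired for its metric). */
  public boolean keepRow(byte[] rowKey, long rowBaseTimeMs, long nowMs) {
    Long ttlMs = ttlMsByMetric.get(metricIdOf(rowKey));
    if (ttlMs == null) {
      return true;                     // no per-metric TTL configured: keep the row
    }
    return nowMs - rowBaseTimeMs <= ttlMs;
  }

  private String metricIdOf(byte[] rowKey) {
    // The 3-byte metric UID sits right after the 1-byte salt (see slide 27).
    return String.format("%02x%02x%02x", rowKey[1], rowKey[2], rowKey[3]);
  }
}
```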

37. OpenTSDB RPC Thread Model
The TSD uses a two-level thread model: a boss thread accepts connections, and receiving a message, processing it, and responding to the client are all handled by the same worker thread, which also talks to the RegionServers. This causes low CPU usage.

38. RPC Thread Model Improvement
The thread model is changed to a three-level design: boss threads accept connections, read threads receive messages into a queue, and worker threads process the messages and talk to the RegionServers. Receiving and handling a message are now done in different threads, giving better CPU usage. A Netty-style sketch follows the benchmark numbers.
Benchmark: query latency improved by at least 3x for concurrent queries.
- 1 query: 60 ms before, 59 ms after
- 10 concurrent queries: 476 ms before, 135 ms after
- 50 concurrent queries: 2356 ms before, 680 ms after
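A Netty 4 style sketch of the three-level model (illustrative only; OpenTSDB's actual code differs): boss threads accept connections, read threads do socket I/O, and a separate executor group runs the request handling so slow HBase calls do not block the I/O threads.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.util.concurrent.DefaultEventExecutorGroup;

public class ThreeLevelRpcServer {
  public static void main(String[] args) throws Exception {
    NioEventLoopGroup bossGroup = new NioEventLoopGroup(1);                   // accept connections
    NioEventLoopGroup readGroup = new NioEventLoopGroup(8);                   // socket read/write
    DefaultEventExecutorGroup workerGroup = new DefaultEventExecutorGroup(64); // business logic

    try {
      ServerBootstrap bootstrap = new ServerBootstrap()
          .group(bossGroup, readGroup)
          .channel(NioServerSocketChannel.class)
          .childHandler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
              // Handlers registered with workerGroup run on worker threads,
              // not on the read (I/O) threads.
              ch.pipeline().addLast(workerGroup, new ChannelInboundHandlerAdapter() {
                @Override
                public void channelRead(ChannelHandlerContext ctx, Object msg) {
                  // ... process the request (e.g. query HBase) and write the response ...
                  ctx.writeAndFlush(msg);
                }
              });
            }
          });
      bootstrap.bind(4242).sync().channel().closeFuture().sync();
    } finally {
      bossGroup.shutdownGracefully();
      readGroup.shutdownGracefully();
      workerGroup.shutdownGracefully();
    }
  }
}
```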

39. Follow Us: the Huawei Cloud CloudTable (public-cloud HBase service) and NoSQL漫谈 WeChat official accounts.

40. Thank You. Copyright © 2018 Huawei Technologies Co., Ltd. All Rights Reserved. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purposes only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.

To give HBase practitioners and enthusiasts a community for freely exchanging HBase-related technology, HBase engineers from Alibaba, Xiaomi, Huawei, NetEase, JD, Didi, Zhihu, and other companies jointly founded the China HBase Technology Community.
