HBCK2: Concepts, trends and recipes for fixing issues within HBase 2

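Wellington Chevreuil, an engineer at Cloudera, presented the latest progress on HBCK2.
HBCK1 is a relatively mature tool: it can check whether every Region in the cluster is healthy, and it repairs most common problems well. Because HBase 2.x redesigned almost every operational flow on top of Procedure-V2, the probability of inconsistent state should in theory be much lower. Implementation bugs are still possible, however, so HBCK2 was designed to repair these abnormal states.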

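HBCK2 has now become a very lightweight repair tool, and its code lives in a separate repository called hbase-operator-tools. You first build it to get the JAR, then run the repair operations through the hbase command. The core repair operations are: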

  • assign and unassign regions:

    hbase hbck -j ../hbase-hbck2-1.0.0-SNAPSHOT.jar assigns 1588230740

  • When the tableState is found to be inconsistent, setTableState can be used to fix it.

  • The bypass option can skip certain stuck Procedures.

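Besides repair operations, a cluster also needs a tool that supports global checks. HBCK1 can still be used for global checks, but its repair functions have been disabled; use HBCK2 when a repair is needed.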



2.HBCK2: Concepts, trends and recipes for fixing issues within HBase 2 Wellington Chevreuil HBase Committer Cloudera HBase SW Engineer

3.HBCK (1) - Little bit of history
  • Main tool for general inconsistencies in hbase-1.x
  • The Swiss Army knife for operators
  • Packaged together with hbase main project
  • Provides both diagnosing and fixing commands
  • Some reports may be misleading, e.g., "holes in the region chain"
  • Some options can cause damage if not well understood, e.g., "-sidelineBigOverlaps", "-removeParents"
  • Commands often work independent of Master
  • Can introduce conflicts on meta information maintained by Master
  • Lack of implementation details on documentation/help guide

4.HBCK1 Commands user guide:

5.HBCK2 in a nutshell
  • Simpler tool
      • Fewer fix commands
      • No diagnosis command
      • Requires deeper knowledge of HBase internal workings from operators
  • Shipped independently from hbase
      • Packaged with hbase-operator-tools project
      • https://github.com/apache/hbase-operator-tools
      • Can evolve at its own pace
      • New versions can be run without needing whole hbase upgrade
  • Master oriented (more later)
  • More detailed documentation about each command
  • Still a WIP
      • By the time of this presentation, there's still no official release for HBCK2

6.HBCK2 Concepts
  • AMv2 compliant
      • HBCK1 does not work with HBase 2 AssignmentManager re-implementation
  • Thinner, but more interactive commands
      • No such thing as hbck1 -fix command
      • Operators required to fix one issue at a time
  • Master oriented
      • Master must be online
      • Command implementations should use Master HbckService as much as possible
      • However, new commands may initially require a client side implementation, then get ported to Master's HbckService facade
  • Fix only, requires other tools for issue diagnosing
  • Available only for 2.0.3 onwards, and 2.1.1 onwards

7.HBCK2 Commands user guide:

8.HBCK2 Usage trends
  • Master not completing initialisation
  • Meta/Namespace table "NOT online" issues
  • Table RIT issues
  • Procedures stuck
  • Table in wrong state
  • Missing regions in META
      • User induced via incompatible OfflineMetaRepair tool

9.HBCK2 for Operators: How do I get and run it?
  • Not released so far, requires local build
  • Requirements
      • JDK 1.8 or higher
      • Git
      • Maven
  • Checkout related apache github repository:

    $ git clone https://github.com/apache/hbase-operator-tools.git

  • Build HBCK2 upon desired hbase version:

    $ mvn -Dhbase.version=2.1.5 clean install

  • Above command will produce HBCK2 jar file under ./hbase-hbck2/target/, named hbase-hbck2-1.0.0-SNAPSHOT.jar (assuming current version is 1.0.0-SNAPSHOT)
  • Upload generated jar to the given hbase cluster and run it as below:

    $ hbase hbck -j ../hbase-hbck2-1.0.0-SNAPSHOT.jar
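The build steps above can be sketched as two small shell functions, assuming a POSIX-compatible shell with git and Maven on the PATH. The function names hbck2_jar_path and build_hbck2 are hypothetical helpers for illustration and are not part of hbase-operator-tools; the repository URL, the -Dhbase.version property and the jar location come from the slide.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the jar path the build is expected to produce for a given
# hbase-operator-tools version (defaults to 1.0.0-SNAPSHOT, as on the slide).
hbck2_jar_path() {
  tools_version="${1:-1.0.0-SNAPSHOT}"
  printf './hbase-hbck2/target/hbase-hbck2-%s.jar\n' "$tools_version"
}

# Clone the repository and build HBCK2 against the given HBase version.
# Runs in a subshell so the caller's working directory is left unchanged.
build_hbck2() {
  hbase_version="$1"
  (
    git clone https://github.com/apache/hbase-operator-tools.git &&
      cd hbase-operator-tools &&
      mvn -Dhbase.version="$hbase_version" clean install
  )
}
```

Calling build_hbck2 2.1.5 would mirror the slide; the path printed by hbck2_jar_path is relative to the cloned hbase-operator-tools directory.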

10.HBCK2 for Operators: Recipes
  • Meta/Namespace table regions "NOT online"
      • Due to corruption or manual deletion of /hbase/MasterProcWALs files
      • Meta may miss info about RS assignment
      • Master logs show regions assigned to an old RS start code:

    WARN org.apache.hadoop.hbase.master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=OPENING, ts=1550754721289, server=regionserver01.example.com,16020,1550676598448}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.

  • Run HBCK2 assigns command for META region 1588230740:

    $ hbase hbck -j ../hbase-hbck2-1.0.0-SNAPSHOT.jar assigns 1588230740

  • Similar issue may affect namespace and user tables regions
      • Affected regions names would be mentioned on log messages similar to above

11.HBCK2 for Operators: Recipes
  • Table RIT issues
      • Usually due to several RS crashes/slowness while regions are transitioning:

    WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=regionserver01.example.com,16020,1542314816394, table=hbase:acl, region=11bf6b18ddacdd864728e6cf1199b2a7
    ...
    WARN org.apache.hadoop.hbase.ipc.RpcServer: Dropping timed out call: callId: 702 service: ClientService methodName: Mutate size: 272 connection: deadline: 1542316740911 param: region= hbase:meta,,1, row=hbase:acl,,1404406671604.11bf6b18ddacdd864728e6cf1199b2a7. connection:

  • Run HBCK2 assigns command for the given region encoded name 11bf6b18ddacdd864728e6cf1199b2a7:

    $ hbase hbck -j ../hbase-hbck2-1.0.0-SNAPSHOT.jar assigns 11bf6b18ddacdd864728e6cf1199b2a7
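When many regions are stuck, collecting the encoded names by hand is tedious. The following is a minimal sketch, assuming the Master log lines follow the "STUCK Region-In-Transition ... region=<32 hex characters>" layout shown in the recipe above. The function name stuck_rit_regions is hypothetical and is not an HBCK2 command.

```shell
# Hypothetical helper: reads Master log lines on stdin and prints the
# unique encoded region names found in STUCK Region-In-Transition warnings.
# Assumes the encoded name is a 32-character lowercase hex string after "region=".
stuck_rit_regions() {
  grep -F 'STUCK Region-In-Transition' |
    sed -n 's/.*region=\([0-9a-f]\{32\}\).*/\1/p' |
    sort -u
}
```

The output can be reviewed and then passed as arguments to the assigns command from the recipe. It is worth confirming each region really is stuck before reassigning it, since the log may contain warnings that have since resolved.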

12.HBCK2 for Operators: Recipes
  • Procedures stuck
      • While troubleshooting causes for RITs, check for procedures attempting to transition regions states:

    $ echo "list_procedures" | hbase shell

      • Output for list_procedures shows WAITING_TIMEOUT and/or procedures running for days:

    PID Name State Submitted Last_Update Parameters
    6 org.apache.hadoop.hbase.master.assignment.UnassignProcedure WAITING_TIMEOUT 2019-03-29 11:15:06 2019-04-08 06:33:35 ...
    7 org.apache.hadoop.hbase.master.procedure.DeleteTableProcedure RUNNABLE 2019-03-29 11:24:39 2019-03-29 11:24:39 ...

      • Other procedures fail to acquire lock owned by one of the stuck procedures:

    ERROR: org.apache.hadoop.hbase.procedure2.ProcedureAbortedException: f7910bfc9c9... owned by pid=6, CANNOT run 'this' (pid=347).

  • Run HBCK2 bypass command to get rid of stuck procedures:

    $ hbase hbck -j ../hbase-hbck2-1.0.0-SNAPSHOT.jar bypass 6 7
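Picking candidate PIDs out of a long list_procedures listing can be sketched with awk. This assumes the whitespace-separated column layout shown in the recipe above (PID first, State third); the function name stuck_procedure_pids is hypothetical. It only flags WAITING_TIMEOUT rows. Procedures that are RUNNABLE but have been running for days, like pid 7 above, still need a manual look at the Submitted and Last_Update columns.

```shell
# Hypothetical helper: reads list_procedures output on stdin and prints
# the PID of every row whose State column is WAITING_TIMEOUT.
# Rows whose first field is not numeric (headers, shell banners) are skipped.
stuck_procedure_pids() {
  awk '$1 ~ /^[0-9]+$/ && $3 == "WAITING_TIMEOUT" { print $1 }'
}
```

One way to use it would be to pipe the output of the list_procedures command from the recipe into stuck_procedure_pids and review the resulting PIDs before passing them to bypass.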

13.HBCK2 for Operators: Recipes
  • Table in wrong state
      • Can happen after hanging enable/disable table procedures, or related sub-procedures
      • Bypassing procedures can lead to this as well
      • Table indefinitely in temporary states ENABLING/DISABLING:

    scan 'hbase:meta', {COLUMN => "table:state"}
    usertable column=table:state, timestamp=1555406568751, value=\x08\x03.

    enable 'usertable'
    ERROR: Table tableName=usertable, state=ENABLING should be disabled!

  • Run HBCK2 setTableState to manually bring table state to one of the final ones ENABLED/DISABLED:

    $ hbase hbck -j ../hbase-hbck2-1.0.0-SNAPSHOT.jar setTableState usertable DISABLED
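The table:state value printed by the scan is a small protobuf message: \x08 is the tag for field 1 and the next byte is the state number. The sketch below maps the printed value to a state name. The \x08\x03 to ENABLING pairing is consistent with the scan output and error message above; the other three values are an assumption based on the TableState enum ordering in HBase's protobuf definitions (ENABLED=0, DISABLED=1, DISABLING=2, ENABLING=3) and should be checked against your HBase version. The function name table_state_name is hypothetical.

```shell
# Hypothetical helper: map the table:state cell value, as printed by the
# hbase shell, to a table state name.
# Assumed enum order: ENABLED=0, DISABLED=1, DISABLING=2, ENABLING=3.
table_state_name() {
  case "$1" in
    '\x08\x00') echo ENABLED ;;
    '\x08\x01') echo DISABLED ;;
    '\x08\x02') echo DISABLING ;;
    '\x08\x03') echo ENABLING ;;
    *) echo UNKNOWN ;;
  esac
}
```

For example, table_state_name '\x08\x03' prints ENABLING, one of the temporary states that setTableState is meant to clear.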

14.HBCK2 for Operators: Recipes
  • Missing regions in META
      • Operator induced when running incompatible tool OfflineMetaRepair (HBASE-21665)
      • Typically manifests as holes in the region chain, or in the case of namespace region missing, master fails initialisation:

    scan 'hbase:meta', {COLUMN => "table:state", ROWPREFIXFILTER => 'hbase:namespace'}
    ROW COLUMN+CELL
    0 row(s)

  • Still under development through HBASE-22567, HBCK2 addMissingRegionsInMeta can be used to re-add missing regions:

    $ hbase hbck -j ../hbase-hbck2-1.0.0-SNAPSHOT.jar addMissingRegionsInMeta hbase:namespace

  • Still WIP, so syntax might change.
  • Check HBASE-22567 for latest developments

15.HBCK2 for Contributors
  • Apache github repository: https://github.com/apache/hbase-operator-tools
  • HBCK2 defined as sub-module hbase-hbck2 of hbase-operator-tools
  • HBASE-21745
      • Umbrella jira for tracking potential new HBCK2 features
  • Faced a new issue in HBase 2? Have a new idea for HBCK2 command?
      • Great! Contributions are welcome!
      • Start a [DISCUSS] mail thread on dev@hbase.apache.org
      • Post a comment on HBASE-21745 describing your idea