Improving HBase availability in a multi-tenant environment

1.Improving HBase availability in a multi-tenant environment James Moore (jcmoore@hubspot.com) Kahlil Oppenheimer (kahlil@hubspot.com)

2.Outline ● Who are we? ● What problems are we solving? ● Reducing the cost of Region Server failure ● Improving the stochastic load balancer ● Eliminating workload driven failure ● Reducing the impact of hardware failure

3.Who are we? ● HubSpot’s Big Data Team provides HBase as a Service to our 50+ Product and Engineering teams. ● HBase is HubSpot’s canonical data store and reliably stores all of our customers’ data

4.What problems are we solving? ● In a typical month we lose ~3.5% of our Region Servers to total network failure ● 3500+ microservices interacting with 350+ tables ● 500+ TBs of raw data served from HBase ● We run an internally patched version of CDH 5.9.0 ● We’ve reached 99.99% availability in every cluster.

5.Reducing the cost of failure

6.What does a Region Server crash look like?

7.Distributed log splitting performance ● Minimum of 5 seconds ● Scales linearly with the number of HLog files ● Performance grows erratic with the number of HLogs
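Because split time scales with the number of HLog files, one common mitigation is to cap WALs per Region Server so memstores are force-flushed before the log count grows. An illustrative hbase-site.xml fragment (the property is standard HBase; the value shown is the stock default, not necessarily HubSpot's setting):

```xml
<!-- hbase-site.xml: cap the number of WAL files per Region Server.
     When the cap is hit, HBase force-flushes memstores and archives
     old logs, bounding the work distributed log splitting must do. -->
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value> <!-- default; lower it to shrink recovery time -->
</property>
```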

8.Region assignment & log replay ● This process is non-linear ● Scales at least with Regions × HLogs

9.Improving mean time to recovery (MTTR) ● Running more and smaller instances can improve overall availability ● Migrating from d2.8xls to d2.2xls reduced our MTTR from 90 seconds to 30 seconds

d2.8xl: 244GB RAM, 24×2TB HDD, 36 cores
d2.2xl: 61GB RAM, 7×2TB HDD, 8 cores

10.… But the AsyncProcess can get in the way ● AsyncProcess & AsyncRequestFutureImpl didn’t respect operation timeout ● Meta fetches didn’t respect operation timeout ● Open connections to an offline meta server could cause the client to block until socket failure far into the future.
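With the timeout fixes in place, the client-side budget can be bounded through configuration. An illustrative client hbase-site.xml fragment (the property names are standard HBase; the millisecond values are assumptions, not HubSpot's production settings):

```xml
<!-- Client-side hbase-site.xml: bound how long callers can block. -->
<property>
  <name>hbase.client.operation.timeout</name>
  <value>10000</value> <!-- ms: total budget for one operation, retries included -->
</property>
<property>
  <name>hbase.client.meta.operation.timeout</name>
  <value>10000</value> <!-- ms: same budget applied to meta lookups -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>5000</value> <!-- ms: per-RPC cap, so a dead socket fails fast -->
</property>
```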

11.What about stalled instances?

12.What is the cause? ● Degraded hardware performance? ○ Slow I/O ○ Stuck kernel threads ○ High packet loss ● Long GC pauses? ● Hot regions? ● Noisy user?

13.Strategy ● Eliminate usage related failures ● Tune GC bit.ly/hbasegc ● Monitor and proactively remove misbehaving hardware

14.Eliminating usage failures 1. Keep regions less than 30GB 2. Use the Normalizer to ensure tables do not have unevenly sized regions 3. Optimize the balancer for multi-tenancy 4. Usage limits and Guardrails
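Steps 1 and 2 map to stock HBase knobs. A hedged hbase-site.xml sketch (real properties; the 30GB figure follows the slide, but treat the fragment as illustrative rather than HubSpot's exact config):

```xml
<!-- hbase-site.xml: split regions before they exceed ~30GB. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>32212254720</value> <!-- 30GB in bytes -->
</property>
<!-- The region normalizer itself is toggled from the HBase shell:
     normalizer_switch true -->
```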

15.Improving the balancer

16.Problem: HBase is unstable, unbalanced

17.We investigated and found one issue... 1. Balancer stops if servers host same number of regions

18.We investigated and found two issues... 1. Balancer stops if servers host same number of regions 2. Requests for tables aren’t distributed across cluster

19.We investigated and found three issues... 1. Balancer stops if servers host same number of regions 2. Requests for tables aren’t distributed across cluster 3. Balancer performance doesn’t scale well with cluster size

20. Quick refresher on stochastic load balancer There will be pictures...

21.–24. (Diagram slides illustrating the stochastic load balancer)
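The refresher above boils down to a greedy stochastic search: propose a random region move, score the resulting cluster state with a cost function, and keep the move only if cost drops. A minimal sketch in Python (illustrative only — HBase's StochasticLoadBalancer combines many weighted cost terms such as read load, locality, and table skew; this model scores only region-count skew, and all names here are hypothetical):

```python
import random

def cost(assignment, num_servers):
    """Cost of a cluster state; here, only region-count skew.
    HBase sums many weighted terms (load, locality, table skew, ...)."""
    counts = [0] * num_servers
    for server in assignment.values():
        counts[server] += 1
    mean = len(assignment) / num_servers
    return sum((c - mean) ** 2 for c in counts)

def balance(assignment, num_servers, steps=10_000, seed=0):
    """Propose random region moves; keep each one only if it lowers cost."""
    rng = random.Random(seed)
    current = cost(assignment, num_servers)
    regions = list(assignment)
    for _ in range(steps):
        region = rng.choice(regions)
        old, new = assignment[region], rng.randrange(num_servers)
        assignment[region] = new
        candidate = cost(assignment, num_servers)
        if candidate < current:
            current = candidate          # accept the improving move
        else:
            assignment[region] = old     # revert worse or equal moves
    return assignment

# Example: 12 regions piled on server 0 of a 4-server cluster.
plan = balance({f"region-{i}": 0 for i in range(12)}, num_servers=4)
```

The real balancer also caps how long it searches, which is why slide 19's third issue (poor scaling with cluster size) matters: more regions mean each random step explores a vanishingly small fraction of the move space.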

25.The first issue 1. Balancer stops if servers host same number of regions

26.The first issue...turned out to be intentional 1. Balancer stops if servers host same number of regions Theory: design choice when balancer was simpler

27.So our solution undoes that design choice 1. Balancer stops if servers host same number of regions Theory: design choice when balancer was simpler Solution: disable the “slop” logic for the stochastic load balancer

28.The second issue 2. Requests for tables aren’t distributed across cluster

29.The second issue involved some theorizing 2. Requests for tables aren’t distributed across cluster Theory: cluster has high table skew
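If table skew is the culprit, one stock knob is the weight of the stochastic balancer's table-skew cost term. An illustrative hbase-site.xml fragment (the property is standard HBase; the raised value is an assumption, not the talk's actual fix):

```xml
<!-- hbase-site.xml: weight the stochastic balancer's table-skew cost
     more heavily so each table's regions spread across the cluster. -->
<property>
  <name>hbase.master.balancer.stochastic.tableSkewCost</name>
  <value>100</value> <!-- default is 35; the value here is illustrative -->
</property>
```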
