New Journey of HBase in Alibaba & Cloud


1.New Journey of HBase in Alibaba and Cloud (Eight Years Honing a Sword: HBase's New Journey at Alibaba and on the Cloud) Chunhui Shen and Long Cao, August 17, 2018

2.Content
01 AliHB-Introduction of Alibaba HBase
• History, Tech Overview, Open Source, Core Scenarios
02 Recent Key Challenges & Improvements
• GC Trouble, Separation of Computing & Storage, Cold-Hot Data, Diagnostic System, Migration & Backup
03 HBase Ecosystem & Multi-model DB & Cloud
• KV, Tabular, SQL, Graph, Time Series, Geospatial, Search, Mixed Workloads, Cloud

3.01 AliHB-Introduction of Alibaba HBase

4.HBase History in Alibaba
[Diagram labels: Data Burst, Big Data Store System, Commercial, Open Source, Cassandra, Develop New, MySQL, Oracle, Our Choice in 2010]
• Used Version
– 0.20->0.90->0.92->0.94->0.98->1.1->2.0
– In use since 2010
• The earliest cases in 2010-2011
– Search Store
– Taobao History Order
– Alipay Risk Management
• Internal branch AliHB
• Why HBase
– Active community
– Hadoop ecosystem
– Facebook successful case
– Google famous paper: Bigtable

5.Overview of AliHB
• Performance
– High-Performance Data Structure, Lock-Free, Group IO
• Feature
– SQL, Secondary Index
– Multi-Tenancy, Cold-Hot Separation, Async API
• Stability
– High Availability Architecture
– Faster MTTR
– Verification in Double 11 Shopping Day
• Efficient Maintenance
– Effective Monitoring
– Full Path Trace
– No-pause migration
• 12000+ Nodes, 100+ Clusters, 200+ Million OPS, 100+ PB Data
• 20+ BU, 6000+ Users, 100+ Production Changes per Day

6.Open Source and Community
• Contributing to open source since 2011
• 3 PMC members, 6 Committers in Alibaba
• Sponsor the Chinese HBase Technology Community
• Already organized 2 HBase meetups
• At least one HBase-related tech article per day
• Tens of thousands of readers now, and more are coming
• Hosting HBaseCon Asia 2018
• Promote the use of HBase through several conference talks
• We hope more people will join the HBase community

7.Core Scenarios in Alibaba
• Scenarios: Monitor, Log, AI Storage Recommendation, Message, Orders, Feeds …, Tracking, IoT Data …, Search, BI Report …
• Applications: Ant Intelligent Security, Wangwang (IM), Intelligent Customer Service, Log, Alipay Bills, Cainiao Logistics
• Platform: Ali-HBase

8.02 Recent Key Challenges & Improvements

9.GC Trouble: GC Problems Under 100GB Memory [Diagram labels: Frequent; Very Slow; Service; Slow Request; Request Unavailable]

10.GC Trouble: Exploring a Thorough Solution [Slide labels: Only for offline application; Rewriting with C++]

11.GC Trouble
Before:
Type   Pause Time   Frequency
YGC    100ms+       Once per 5 secs
CMS    100~500ms    Once per 5 mins
FGC    20s-180s     Once per 7~60 days
Improvements:
• Allocate and reclaim the major memory by HBase itself, rather than the JVM: CCSMap, BucketCacheV2
• New GC algorithm in AJDK: ZenGC
• Try best to reuse objects (in the core path) when programming
After:
Type   Pause Time   Frequency
YGC    5ms          Once per 5 secs
CMS    100ms        Once per 5 hours
FGC    N/A          N/A
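As a rough worked example of what the figures on the GC slide above imply, the sketch below estimates the fraction of wall-clock time spent in GC pauses before and after the improvements. It is only an illustration: it assumes the lower bound of each quoted range (100ms for "100ms+" and "100~500ms"), ignores FGC, and the function name `pause_fraction` is made up for this example.

```python
def pause_fraction(pause_ms: float, interval_s: float) -> float:
    """Fraction of wall-clock time paused, given one pause of
    pause_ms milliseconds every interval_s seconds."""
    return (pause_ms / 1000.0) / interval_s


# Before: YGC 100ms+ once per 5 secs, CMS 100~500ms once per 5 mins
# (lower bounds assumed).
before = pause_fraction(100, 5) + pause_fraction(100, 5 * 60)

# After: YGC 5ms once per 5 secs, CMS 100ms once per 5 hours.
after = pause_fraction(5, 5) + pause_fraction(100, 5 * 3600)

print(f"before: {before:.3%} of time paused")
print(f"after:  {after:.3%} of time paused")
```

Under these assumptions the young-generation collections dominate both totals, which is consistent with the slide's focus on keeping large allocations out of the JVM heap.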

12.GC Trouble New BucketCache in HBase-2.0 CCSMap in HBase-3.0

13.Separation of Computing & Storage: Localized Deployment
– Low IO latency with Short-Circuit Read
– Unbalanced storage space, especially between clusters
– Difficult to raise the utilization of both CPU and disk, especially when there are many scenarios
– Cluster scaling is slow because of datanode decommission

14.Separation of Computing & Storage: Shared-Storage Deployment
– Big shared storage, more balanced
– Compute nodes can scale independently
– Storage nodes can scale independently
– Auto-scaling becomes feasible
– Based on load statistics, smart scheduling between clusters
– Share compute resources with other applications
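The utilization argument behind the two deployment styles can be illustrated with a small sizing sketch. All numbers and function names here are hypothetical and are not taken from the talk: in a localized deployment one node type must cover both CPU and disk demand, so the scarcer resource dictates the node count, while with shared storage each tier is sized on its own.

```python
import math


def coupled_nodes(cores_needed: int, tb_needed: int,
                  cores_per_node: int, tb_per_node: int) -> int:
    """Localized deployment: every node carries both CPU and disk,
    so the larger of the two requirements decides the node count."""
    return max(math.ceil(cores_needed / cores_per_node),
               math.ceil(tb_needed / tb_per_node))


def separated_nodes(cores_needed: int, tb_needed: int,
                    cores_per_compute_node: int,
                    tb_per_storage_node: int) -> tuple:
    """Shared-storage deployment: compute and storage tiers scale
    independently. Returns (compute_nodes, storage_nodes)."""
    return (math.ceil(cores_needed / cores_per_compute_node),
            math.ceil(tb_needed / tb_per_storage_node))


# Hypothetical storage-heavy workload: 320 cores, 2400 TB.
# Hypothetical node shape: 32 cores and 48 TB per node.
n_coupled = coupled_nodes(320, 2400, 32, 48)
cpu_utilization = 320 / (n_coupled * 32)
compute, storage = separated_nodes(320, 2400, 32, 48)

print(n_coupled, f"{cpu_utilization:.0%}", compute, storage)
```

In this made-up case the coupled cluster is sized by disk and leaves most of its CPU idle, whereas the separated layout needs far fewer compute nodes for the same storage capacity.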

15.Heterogeneous Cold-Hot Storage
• HBase is able to hold data for its whole life cycle
• But in most cases, like monitor, trace, order, logistics:
• Recently generated data is accessed often, but occupies very little storage space
• History data is rarely visited, but occupies a lot of storage space
• Common solution
• Cold storage system for history data
• Hot storage system for recent data
• Move the data from the hot storage system to the cold storage system periodically
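The "common solution" described above (two separate systems plus a periodic move) can be sketched as follows. This is a toy model under stated assumptions, not AliHB's implementation: the two stores are plain dictionaries, and `Row`, `migrate_cold`, and `read` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Row:
    key: str
    written_at: float  # epoch seconds when the row was written
    value: bytes


def migrate_cold(hot: Dict[str, Row], cold: Dict[str, Row],
                 max_age_s: float, now: float) -> int:
    """Periodic job: move rows older than max_age_s from the hot
    store to the cold store. Returns the number of rows moved."""
    old_keys = [k for k, r in hot.items() if now - r.written_at > max_age_s]
    for k in old_keys:
        cold[k] = hot.pop(k)
    return len(old_keys)


def read(key: str, hot: Dict[str, Row], cold: Dict[str, Row]) -> Optional[Row]:
    """The application has to check the hot store first, then the cold one."""
    row = hot.get(key)
    return row if row is not None else cold.get(key)


DAY = 86400.0
now = 100 * DAY
hot = {
    "order-1": Row("order-1", now - 90 * DAY, b"old order"),
    "order-2": Row("order-2", now - 1 * DAY, b"recent order"),
}
cold: Dict[str, Row] = {}

moved = migrate_cold(hot, cold, max_age_s=30 * DAY, now=now)
print(moved, sorted(hot), sorted(cold))
```

Note that the reader must know about both stores and the migration job must be run and monitored separately, which is the operational burden a single auto-tiered store aims to remove.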

16.Heterogeneous Cold-Hot Storage • Easy To Use • Auto Tiered • Heterogeneous • Read Optimization

17.Diagnostic System
12000+ Nodes, 100+ Clusters, 6000+ Users
• “Request Rush?” -> Monitor
• “Big Region?” -> Web UI
• “Full Disk?” -> df
• “Bad Disk?” -> tsar, dmesg
• ……
HBase Diagnostic Center
1. The unified entrance for troubleshooting
2. Experience/Solution => Function of Diagnostic System

18.Diagnostic System
• One extra server for all
• No Agent
• Adding rules dynamically
• Runtime information
• Check all components
• Only 10 seconds for a diagnosis

19.Diagnostic System
Shared on Apsara HBase: 50+ Rules, 80%+ Accuracy
HBase
• Compaction Stuck
• Balance Abnormal
• Table Abnormal
• Region Offline
• Replication Delay
• Too many files
• High Meta Load
• Multi Assign
• ……
ZK/HDFS
• ZK Unavailable
• Block Miss
• NameNode Abnormal
• Full capacity of datanode
• Inconsistent state between two namenodes
• Too many Xceivers
• Port is unreachable
• ……
Hardware
• Insufficient disk space
• Slow Disk
• Bad Disk
• Too many TCP errors
• Slow ping
• CPU hang
• Load too high
• Disk not mounted
• ……
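The idea of turning troubleshooting experience into rules that can be added dynamically and evaluated against runtime information can be sketched as below. This is a minimal illustration, not the HBase Diagnostic Center's actual design: the class `DiagnosticCenter`, the snapshot keys, and the 90% disk threshold are all assumptions made for the example; only the rule names come from the slides.

```python
from typing import Callable, Dict, List, Optional

# A snapshot of runtime information collected from the cluster
# (the key names used below are hypothetical).
Snapshot = Dict[str, float]
# A rule inspects a snapshot and returns a finding, or None if healthy.
Rule = Callable[[Snapshot], Optional[str]]


class DiagnosticCenter:
    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}

    def add_rule(self, name: str, rule: Rule) -> None:
        """Register (or replace) a rule at runtime, without a restart."""
        self._rules[name] = rule

    def diagnose(self, snapshot: Snapshot) -> List[str]:
        """Run every registered rule and collect the findings."""
        findings = []
        for name, rule in self._rules.items():
            message = rule(snapshot)
            if message is not None:
                findings.append(f"{name}: {message}")
        return findings


def insufficient_disk_space(s: Snapshot) -> Optional[str]:
    used = s.get("disk_used_ratio", 0.0)
    return f"disk {used:.0%} full" if used > 0.90 else None  # assumed threshold


def region_offline(s: Snapshot) -> Optional[str]:
    n = int(s.get("offline_regions", 0))
    return f"{n} region(s) offline" if n > 0 else None


center = DiagnosticCenter()
center.add_rule("Insufficient disk space", insufficient_disk_space)
first = center.diagnose({"disk_used_ratio": 0.95, "offline_regions": 2})

# A new rule is added later, while the center keeps running.
center.add_rule("Region Offline", region_offline)
second = center.diagnose({"disk_used_ratio": 0.95, "offline_regions": 2})
print(first, second)
```

Keeping each rule as an independent function over a shared snapshot is one simple way to let operators encode a new lesson learned without touching the collection side.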

20.Migration & Backup

21.Migration & Backup
Independent of HBase
• Almost no impact on service
• Easy to upgrade
• Supports multiple versions
• Supports non-HBase targets
Second-level RPO, minute-level RTO

22.03 HBase Ecosystem & Multi-model DB & Cloud

23.Popularity changes per DB category

24.Ranking scores per category in percent

25.Data size per day

26.All in one: Key Value, Relational, Document, Graph, Time Series, Geospatial, Tabular, NoSQL

27.All in one
• Data models: Key Value, Relational, Document, Graph, Time Series, Geospatial, Tabular, NoSQL
• Components: HBase, Phoenix/AntsDB, JanusGraph, OpenTSDB, GeoMesa

28.Multi-model: Native or Layer [Diagram labels: HBase Ecosystem, DataStax, CosmosDB, Neo4j, InfluxDB, CockroachDB, PG; layers: Multi-model, KV\Index, Storage]

29.HBase Meets Cloud – Benefits
• Cloud Native
• New Hardware: RDMA, Flash, GPU, Non-volatile memory
• Flexibility: Fast Add/Remove Resource
• Cost Savings (TCO): End up paying for features
[Additional slide labels: Flexibility, Insight, self-driven, Fix bugs in time, Reduce human, Self-driven ……]