HBase Throughput Improvement in Practice

1. Lift the Ceiling of Throughputs
   Yu Li, Lijin Bin
   {jueding.ly, tianzhao.blj}@alibaba-inc.com

2. Agenda
   - What/Where/When
     - History of HBase in Alibaba Search
   - Why
     - Throughput means a lot
   - How
     - Lift the ceiling of read throughput
     - Lift the ceiling of write throughput
   - About the future

3. HBase in Alibaba Search
   - HBase has been the core storage in the Alibaba search system since 2010
   - History of versions used online
     - 2010~2014: 0.20.6 → 0.90.3 → 0.92.1 → 0.94.1 → 0.94.2 → 0.94.5
     - 2014~2015: 0.94 → 0.98.1 → 0.98.4 → 0.98.8 → 0.98.12
     - 2016: 0.98.12 → 1.1.2
   - Cluster scale and use case
     - Multiple clusters, the largest with more than 1,500 nodes
     - Co-located with Flink/YARN, serving over 40 million ops/s throughout the day
     - Main source/sink for the search and machine learning platforms

4. Throughput means a lot
   - Machine learning generates huge workloads
     - Both read and write, with no upper limit
     - Both IO and CPU bound
   - Throughput decides the speed of ML processing
     - Higher throughput means more iterations per unit of time
   - The speed of processing decides the accuracy of the decisions made
     - Recommendation quality
     - Fraud detection accuracy

5. Lift the ceiling of read throughput
   - NettyRpcServer (HBASE-17263)
     - Why Netty?
       - Enlightened by real-world suffering (HBASE-11297)
       - Better thread model and performance (see the sketch below)
     - Effect
       - Online RT under high pressure: 0.92 ms → 0.25 ms
       - Throughput almost doubled
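For context, here is a minimal, self-contained sketch of the Netty boss/worker event-loop model that a Netty-based RPC server builds on: a small boss group accepts connections while a fixed worker group multiplexes the IO of all connections on its event loops. This is illustrative only and is not taken from NettyRpcServer; the class name, port number, and thread counts are assumptions.

```java
// Illustrative sketch of the Netty event-loop thread model (not HBase's NettyRpcServer code).
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class NettyThreadModelSketch {
  public static void main(String[] args) throws Exception {
    EventLoopGroup boss = new NioEventLoopGroup(1);    // accepts new connections
    EventLoopGroup workers = new NioEventLoopGroup(8); // handles IO for all accepted channels
    try {
      ServerBootstrap b = new ServerBootstrap()
          .group(boss, workers)
          .channel(NioServerSocketChannel.class)
          .childHandler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
              // A real RPC server would add request decoders and hand decoded calls
              // to a handler pool here; the pipeline is left empty in this sketch.
            }
          });
      ChannelFuture f = b.bind(16020).sync();          // port chosen arbitrarily for the sketch
      f.channel().closeFuture().sync();
    } finally {
      boss.shutdownGracefully();
      workers.shutdownGracefully();
    }
  }
}
```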

7. Lift the ceiling of read throughput (cont'd)
   - RowIndexDBE (HBASE-16213)
     - Why
       - Seeking to the target row during random reads is one of the main CPU consumers
       - All DBEs except Prefix Tree use sequential search
     - How
       - Add a row index to each HFileBlock so seeks can use binary search (sketch below)
     - Effect
       - Less CPU and higher throughput; for KeyValues smaller than 64 B, throughput increased by more than 10%
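To make the "row index for binary search" idea concrete, a minimal sketch follows. It assumes a hypothetical in-memory index of per-row offsets inside a block; the class and field names are illustrative and are not the actual RowIndex encoder/seeker classes (Arrays.compareUnsigned requires Java 9+).

```java
// Sketch of the RowIndexDBE idea: binary-search an index of row offsets inside a block
// instead of scanning cells sequentially from the block start. Illustrative only.
import java.util.Arrays;

public class RowIndexedBlockSketch {
  private final byte[][] rowKeys; // first row key of each indexed entry, sorted
  private final int[] offsets;    // offset of that row's first cell within the block

  public RowIndexedBlockSketch(byte[][] rowKeys, int[] offsets) {
    this.rowKeys = rowKeys;
    this.offsets = offsets;
  }

  /** Returns the block offset at which to start scanning for the given row. */
  public int seekOffset(byte[] row) {
    int lo = 0, hi = rowKeys.length - 1, ans = 0; // default: scan from the block start
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      // Lexicographic byte comparison, matching HBase row ordering.
      int cmp = Arrays.compareUnsigned(rowKeys[mid], row);
      if (cmp <= 0) {
        ans = offsets[mid]; // candidate: last indexed row <= target so far
        lo = mid + 1;       // keep looking to the right for a closer candidate
      } else {
        hi = mid - 1;
      }
    }
    return ans; // caller scans forward from here instead of from the block start
  }
}
```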

8. Lift the ceiling of read throughput (cont'd)
   - End-to-end read path offheap (see the sketch after this slide)
     - Why
       - Advanced disk IO capability causes quicker cache eviction
       - Suffering from GC caused by on-heap copies
     - How
       - Backport the end-to-end read-path offheap work to branch-1 (HBASE-17138)
       - For more details, see Anoop/Ram's session
     - Effect
       - Throughput increased by 30%
       - Much more stable, fewer spikes
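As a rough illustration of why avoiding on-heap copies helps, the hypothetical sketch below serves a slice of a cached block straight from a direct (off-heap) ByteBuffer to a channel. It shows the underlying NIO idea only and is not HBase's actual read path; the class name and buffer size are assumptions.

```java
// Sketch of the off-heap serving idea: cached block bytes stay in a direct ByteBuffer
// and are written to the client channel without ever materializing a byte[] on the heap.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

public class OffheapServeSketch {
  // Block bytes live outside the Java heap; only the small ByteBuffer object is on-heap.
  private final ByteBuffer offheapBlock = ByteBuffer.allocateDirect(64 * 1024);

  /** Writes [offset, offset + length) of the cached block straight to the client channel. */
  public void serve(WritableByteChannel clientChannel, int offset, int length) throws IOException {
    ByteBuffer slice = offheapBlock.duplicate(); // independent position/limit, shared memory
    slice.position(offset).limit(offset + length);
    while (slice.hasRemaining()) {
      clientChannel.write(slice);                // no intermediate on-heap copy, nothing for GC to trace
    }
  }
}
```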

9. Lift the ceiling of read throughput (cont'd)
   - End-to-end read path offheap: before/after comparison charts

10. Lift the ceiling of write throughput
   - MVCC pre-assign (HBASE-16698, HBASE-17509/17471)
     - Why
       - Issue located from real-world suffering: no more active handlers
       - MVCC is assigned after the WAL append
       - WAL append is designed to be sequential at the region-server level, so throughput is limited
     - How
       - Assign the mvcc before the WAL append while still guaranteeing the append order (sketch below)
         - Originally designed to use a lock inside FSHLog (HBASE-16698)
         - Improved by generating the sequence id inside the existing MVCC lock (HBASE-17471)
     - Effect
       - SYNC_WAL throughput improved by 30%; ASYNC_WAL even more (>70%)
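A minimal sketch of the pre-assign idea, under the assumption that one shared lock orders both sequence-id generation and queuing for the WAL append; the class and method names are hypothetical and this is not the HBASE-17471 patch.

```java
// Sketch: assign the sequence id before the WAL append, inside the same critical section
// that enqueues the edit, so queue (append) order can never diverge from sequence-id order.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MvccPreAssignSketch {
  private long nextSeqId = 1;
  private final Object mvccLock = new Object();
  // Consumed in order by the single WAL append thread.
  private final BlockingQueue<Long> walAppendQueue = new LinkedBlockingQueue<>();

  /** Called by a write handler before the WAL append. */
  public long beginWrite() {
    synchronized (mvccLock) {
      long seqId = nextSeqId++;   // sequence id assigned up front, not after the append
      walAppendQueue.add(seqId);  // enqueued under the same lock: queue order == seq id order
      return seqId;
    }
  }

  /** Called by the WAL thread: takes edits strictly in sequence-id order. */
  public long nextToAppend() throws InterruptedException {
    return walAppendQueue.take();
  }
}
```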

11. Lift the ceiling of write throughput (cont'd)
   - Refine the write path (experimenting)
     - Why
       - Far from fully using the IO capacity of new hardware such as PCIe-SSD
       - WAL sync is IO-bound, while RPC handling is CPU-bound
         - Write handlers should be non-blocking: do not wait for the sync
         - Respond asynchronously
       - WAL append is sequential, while region puts are parallel
         - Unnecessary context switches
       - WAL append is IO-bound, while MemStore insertion is CPU-bound
         - Possible to parallelize?

12. Lift the ceiling of write throughput (cont'd)
   - Refine the write path (experimenting)
     - How (see the pipeline sketch below)
       - Break the write path into 3 stages
         - Pre-append, sync, post-sync
         - Buffer/queue between stages
       - Handlers only handle the pre-append stage; responses are sent in the post-sync stage
       - Bind regions to specific handlers
         - Reduce unnecessary context switches
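A compact sketch of the three-stage split described above, using plain BlockingQueues between stages; the class names, batching policy, and callback shape are assumptions, not the experimental patch itself.

```java
// Sketch of a 3-stage write pipeline: handler threads only do pre-append work and enqueue,
// a single sync thread batches WAL syncs, and a responder thread acknowledges clients,
// so no handler ever blocks on disk.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StagedWritePathSketch {
  static final class WriteOp {
    final Runnable respond;            // callback that sends the RPC response
    WriteOp(Runnable respond) { this.respond = respond; }
  }

  private final BlockingQueue<WriteOp> toSync = new LinkedBlockingQueue<>();
  private final BlockingQueue<WriteOp> toRespond = new LinkedBlockingQueue<>();

  /** Stage 1 (handler thread): build the WAL edit, insert into MemStore, enqueue, return immediately. */
  public void preAppend(WriteOp op) {
    toSync.add(op);
  }

  /** Stage 2 (single sync thread): drain a batch and issue one WAL sync for all of it. */
  public void syncLoop() throws InterruptedException {
    List<WriteOp> batch = new ArrayList<>();
    while (true) {
      batch.add(toSync.take());        // block until at least one op is pending
      toSync.drainTo(batch);           // group whatever else is queued into the same sync
      // wal.sync() would go here in a real implementation
      toRespond.addAll(batch);
      batch.clear();
    }
  }

  /** Stage 3 (responder thread): acknowledge clients once their edits are durable. */
  public void respondLoop() throws InterruptedException {
    while (true) {
      toRespond.take().respond.run();
    }
  }
}
```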

13. Lift the ceiling of write throughput (cont'd)
   - Refine the write path (experimenting)
     - Effect (lab data)
       - Throughput tripled: 140K → 420K with PCIe-SSD
     - TODO
       - PCIe-SSD IO utilization currently only reaches 20%, so there is much more room to improve
       - Integration with write-path offheap: more to expect
       - Upstream the work after it is verified online

14. About the Future
   - HBase is still a kid: only 10 years old
     - More ceilings to break
       - Improving, but still a long way to go
       - Far from fully utilizing the hardware capability, whether CPU or IO
     - More scenarios to try
       - Embedded mode (HBASE-17743)
     - More to expect
       - 2.0 coming, 3.0 in the plan
       - Hopefully more community involvement from Asia
         - More upstream, less private

15. Q & A
    Thank you!