1. Lift the Ceiling of Throughputs
Yu Li, Lijin Bin
{jueding.ly, tianzhao.blj}@alibaba-inc.com
2. Agenda
- What/Where/When: history of HBase in Alibaba Search
- Why: throughput means a lot
- How: lift the ceiling of read throughput; lift the ceiling of write throughput
- About the future
3. HBase in Alibaba Search
- HBase has been the core storage in the Alibaba search system since 2010
- History of versions used online:
  - 2010~2014: 0.20.6 → 0.90.3 → 0.92.1 → 0.94.1 → 0.94.2 → 0.94.5
  - 2014~2015: 0.94 → 0.98.1 → 0.98.4 → 0.98.8 → 0.98.12
  - 2016: 0.98.12 → 1.1.2
- Cluster scale and use cases:
  - Multiple clusters, the largest with more than 1,500 nodes
  - Co-located with Flink/YARN, serving over 40 million ops/sec throughout the day
  - Main source/sink for the search and machine learning platforms
4. Throughput means a lot
- Machine learning generates huge workloads
  - Both read and write, with no upper limit
  - Both IO- and CPU-bound
- Throughput decides the speed of ML processing
  - Higher throughput means more iterations per unit of time
- Processing speed decides the accuracy of the decisions made
  - Recommendation quality
  - Fraud detection accuracy
5. Lift the ceiling of read throughput
- NettyRpcServer (HBASE-17263)
  - Why Netty?
    - Enlightened by real-world suffering (HBASE-11297)
    - Better thread model and performance
  - Effect:
    - Online RT under high pressure: 0.92 ms → 0.25 ms
    - Throughput almost doubled
  - A configuration sketch follows below.
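A minimal sketch of how one might select the Netty-based RPC server once the HBASE-17263 patch is in place. It assumes the "hbase.rpc.server.impl" property (the key consulted when the RPC server is constructed in later HBase versions) is honored by the build in use:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class NettyRpcConfig {
      // Assumption: "hbase.rpc.server.impl" is the key the RPC server factory
      // reads; with HBASE-17263 applied, NettyRpcServer is a valid value.
      public static Configuration withNettyRpc() {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rpc.server.impl",
            "org.apache.hadoop.hbase.ipc.NettyRpcServer");
        return conf;
      }
    }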
7. Lift the ceiling of read throughput (cont'd)
- RowIndexDBE (HBASE-16213)
  - Why:
    - Seeking within a row during random reads is one of the main CPU consumers
    - All data block encodings (DBE) except Prefix Tree use sequential search
  - How:
    - Add a row index to each HFileBlock to enable binary search (see the sketch below)
  - Effect:
    - Uses less CPU and improves throughput; for KeyValues < 64B, throughput increased by more than 10%
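A hedged sketch of the row-index idea: store the block offset of each row's first cell alongside the encoded cells, then binary-search that index rather than scanning cells sequentially. The class and field names below are illustrative, not HBase's actual encoder API:

    import java.nio.ByteBuffer;

    class RowIndexedBlock {
      private final ByteBuffer encodedCells; // the encoded cell data
      private final byte[][] rowKeys;        // sorted row keys in this block
      private final int[] rowOffsets;        // offset of each row's first cell

      RowIndexedBlock(ByteBuffer cells, byte[][] rowKeys, int[] rowOffsets) {
        this.encodedCells = cells;
        this.rowKeys = rowKeys;
        this.rowOffsets = rowOffsets;
      }

      /** Binary search: O(log n) per seek instead of a sequential O(n) scan. */
      int seekToRow(byte[] row) {
        int lo = 0, hi = rowKeys.length - 1;
        while (lo <= hi) {
          int mid = (lo + hi) >>> 1;
          int cmp = compare(rowKeys[mid], row);
          if (cmp == 0) return rowOffsets[mid];
          if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1; // row not present in this block
      }

      // Lexicographic comparison of unsigned bytes, as HBase row keys use.
      private static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
          int d = (a[i] & 0xff) - (b[i] & 0xff);
          if (d != 0) return d;
        }
        return a.length - b.length;
      }
    }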
8. Lift the ceiling of read throughput (cont'd)
- End-to-end read-path offheap
  - Why:
    - The advanced IO capability of modern disks causes quicker cache eviction
    - Suffering from GC caused by on-heap copies
  - How:
    - Backport the end-to-end read-path offheap work to branch-1 (HBASE-17138)
    - For more details, please refer to Anoop/Ram's session
  - Effect:
    - Throughput increased 30%
    - Much more stable, with fewer latency spikes
  - A configuration sketch follows below.
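A minimal sketch of the off-heap BucketCache setup this work builds on, so cached blocks live outside the Java heap. The property names are the standard BucketCache ones; the size is illustrative, and the JVM must be granted matching direct memory (e.g. via -XX:MaxDirectMemorySize):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class OffheapCacheConfig {
      public static Configuration withOffheapBucketCache() {
        Configuration conf = HBaseConfiguration.create();
        // Keep the L2 block cache off the Java heap to dodge GC pressure.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        // Illustrative size: 16 GB of off-heap cache, expressed in MB.
        conf.setInt("hbase.bucketcache.size", 16 * 1024);
        return conf;
      }
    }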
9. Lift the ceiling of read throughput (cont'd)
- End-to-end read-path offheap
  - [charts: performance before vs. after the offheap change]
10. Lift the ceiling of write throughput
- MVCC pre-assign (HBASE-16698, HBASE-17509/17471)
  - Why:
    - Issue located from real-world suffering: no more active handlers
    - MVCC is assigned after the WAL append
    - WAL append is designed to be sequential at the RegionServer level, so throughput is limited
  - How (see the sketch below):
    - Assign the MVCC number before the WAL append, while still guaranteeing the append order
    - Originally designed to use a lock inside FSHLog (HBASE-16698)
    - Improved by generating the sequence id inside MVCC's existing lock (HBASE-17471)
  - Effect:
    - SYNC_WAL throughput improved 30%; ASYNC_WAL even more (>70%)
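A hedged sketch of the pre-assign idea: the sequence id is handed out inside MVCC's existing lock before the WAL append, so edits reach the WAL already ordered, and visibility is published only after sync. The classes below mirror the concept, not HBase's actual MultiVersionConcurrencyControl/FSHLog code:

    class MiniMvcc {
      private long nextSeqId = 1;

      static final class WriteEntry {
        final long writeNumber;
        WriteEntry(long n) { this.writeNumber = n; }
      }

      /** Pre-assign: the sequence id is taken under the existing MVCC lock,
       *  BEFORE the WAL append, so appends arrive already ordered. */
      synchronized WriteEntry begin() {
        return new WriteEntry(nextSeqId++);
      }

      /** Publish the write to readers once its WAL sync has completed. */
      synchronized void complete(WriteEntry e) {
        // Real HBase advances the read point in order here; elided for brevity.
        notifyAll();
      }
    }

    // Illustrative write flow on the handler side:
    //   MiniMvcc.WriteEntry e = mvcc.begin();   // seq id assigned first
    //   wal.append(edit, e.writeNumber);        // ordered by construction
    //   memstore.add(cells, e.writeNumber);
    //   wal.sync();
    //   mvcc.complete(e);                       // now visible to readers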
11. Lift the ceiling of write throughput (cont'd)
- Refine the write path (experimental)
  - Why:
    - Far from fully using the IO capacity of new hardware like PCIe SSDs
    - WAL sync is IO-bound, while RPC handling is CPU-bound
      - Write handlers should be non-blocking: do not wait for sync
      - Respond asynchronously
    - WAL append is sequential, while region puts are parallel
      - Unnecessary context switches
    - WAL append is IO-bound, while MemStore insertion is CPU-bound
      - Possible to parallelize?
12. Lift the ceiling of write throughput (cont'd)
- Refine the write path (experimental)
  - How (a sketch of the staged pipeline follows below):
    - Break the write path into 3 stages: pre-append, sync, post-sync
    - Buffer/queue between stages
    - Handlers only handle the pre-append stage; responses are sent in the post-sync stage
    - Bind regions to specific handlers to reduce unnecessary context switches
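A hedged sketch of the 3-stage write path under the assumptions above: handler threads do only the CPU-bound pre-append work, a single WAL thread drains appends and syncs (batched in practice), and a responder thread acknowledges clients after sync. All class and method names are illustrative:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class StagedWritePath {
      static final class WriteOp {
        final byte[] edit;       // serialized WAL edit
        final Runnable respond;  // callback that sends the RPC response
        WriteOp(byte[] edit, Runnable respond) {
          this.edit = edit;
          this.respond = respond;
        }
      }

      // Queues decouple the stages, so handlers never block on disk IO.
      private final BlockingQueue<WriteOp> appendQueue = new LinkedBlockingQueue<>();
      private final BlockingQueue<WriteOp> syncedQueue = new LinkedBlockingQueue<>();

      /** Stage 1 (handler thread): pre-append work, then return immediately. */
      void submit(WriteOp op) {
        appendQueue.add(op);
      }

      /** Stage 2 (single WAL thread): sequential append plus sync. */
      void appendAndSyncLoop() throws InterruptedException {
        while (true) {
          WriteOp op = appendQueue.take();
          // wal.append(op.edit); wal.sync();  // IO-bound, batched in practice
          syncedQueue.add(op);
        }
      }

      /** Stage 3 (responder thread): acknowledge clients after sync. */
      void postSyncLoop() throws InterruptedException {
        while (true) {
          syncedQueue.take().respond.run();
        }
      }
    }

Binding regions to specific handlers (as the slide proposes) would partition submit() calls by region, so each region's ops stay on one thread and avoid extra context switches.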
13. Lift the ceiling of write throughput (cont'd)
- Refine the write path (experimental)
  - Effect (lab data):
    - Throughput tripled: 140K → 420K ops/sec with PCIe SSD
  - TODO:
    - PCIe SSD IO utilization has only reached 20%, so there is much more room to improve
    - Integration with write-path offheap: more to expect
    - Upstream the work after it is verified online
14. About the future
- HBase is still a kid, only 10 years old
  - More ceilings to break:
    - Improving, but still a long way to go
    - Far from fully utilizing the hardware capability, whether CPU or IO
  - More scenarios to try:
    - Embedded mode (HBASE-17743)
  - More to expect:
    - 2.0 coming, 3.0 in planning
    - Hopefully more community involvement from Asia: more upstream, less private
15. Q & A
Thank You!