Apache 社区大神Andrew Purtell,介绍Phoenix+Hbase项目在OLTP应用中碰到的技术和性能问题,进行了非常深入的介绍和讲解,从原理和应用层设计的角度都给了一些建议和解释。



1.HBase+Phoenix for OLTP ​ Andrew Purtell ​ Architect, Cloud Storage @ Salesforce ​ Apache HBase VP @ Apache Software Foundation apurtell@salesforce.com apurtell@apache.org ​ @akpurtell v4  ​

2.whoami ​ Architect, Cloud Storage at Salesforce.com ​  Open Source Contributor, since 2007 •  Committer, PMC, and Project Chair, Apache HBase •  Committer and PMC, Apache Phoenix •  Committer, PMC, and Project Chair, Apache Bigtop •  Member, Apache Software Foundation Distributed Systems Nerd, since 1997

3.Agenda http://riverlink.org/wp-content/uploads/2014/01/grab-bag11.jpg

4.HBase+Phoenix for OLTP ​ Common use case characteristics ​  Live operational information ​  Entity-relationship, one row per instance, attributes mapped to columns ​  Point queries or short range scans ​  Emphasis on update ​ Top concerns given these characteristics ​  Low per-operation latencies ​  Update throughput ​  Fast fail ​  Predictable performance http://www.cn-vehicle.com/prodpic/2011-3-21-16-23-37.JPG

5.Low Per-Operation Latencies ​ Major latency contributors ​  Excessive work needed per query ​  Request queuing ​  JVM garbage collection Typical HBase 99%-ile latencies by operation ​  Network ​  Server outages ​  OS pagecache / VMM / IO NOTE: Phoenix supports HBase’s timeline consistent gets as of version 4.4.0

6.Limit The Work Needed Per Query ​ Avoid joins, unless one side is small, especially on frequent queries ​ Limit the number of indexes on frequently updated tables ​ Use covered indexes to convert table scans into efficient point lookups or range scans over the index table instead of the primary table ​  CREATE INDEX index ON table ( … ) INCLUDE ( … ) ​ Leading columns in the primary key constraint should be filtered in the WHERE clause ​  Especially the first leading column ​  IN or OR in WHERE enables skip scan optimizations ​  Equality or <, > in WHERE enables range scan optimizations Let Phoenix optimize query parallelism using statistics ​  Automatic benefit if using Phoenix 4.2 or greater in production

7.Tune HBase RegionServer RPC Handling hbase.regionserver.handler.count (hbase-site) ​  Set to cores x spindles for concurrency Optionally, split the call queues into separate read and write queues for differentiated service ​  hbase.ipc.server.callqueue.handler.factor •  Factor to determine the number of call queues: 0 means a single shared queue, 1 means one queue for each handler ​  hbase.ipc.server.callqueue.read.ratio (hbase.ipc.server.callqueue.read.share in 0.98) •  Split the call queues into read and write queues: 0.5 means there will be the same number of read and write queues, < 0.5 for more read than write, > 0.5 for more write than read ​  hbase.ipc.server.callqueue.scan.ratio (HBase 1.0+) •  Split read call queues into small-read and long-read queues: 0.5 means that there will be the same number of short-read and long-read queues; < 0.5 for more short-read, > 0.5 for more long-read

8.Tune JVM GC For Low Collection Latencies Use the CMS collector ​  -XX:+UseConcMarkSweepGC ​ Keep eden space as small as possible to minimize average collection time. Optimize for low collection latency rather than throughput. ​  -XX:+UseParNewGC – Collect eden in parallel ​  -Xmn512m – Small eden space ​  -XX:CMSInitiatingOccupancyFraction=70 – Avoid collection under pressure ​  -XX:+UseCMSInitiatingOccupancyOnly – Turn off some unhelpful ergonomics Limit per request scanner result sizing so everything fits into survivor space but doesn’t tenure ​  hbase.client.scanner.max.result.size (in hbase-site.xml) •  Survivor space is 1/8th of eden space (with -Xmn512m this is ~51MB ) •  max.result.size x handler.count < survivor space

9.Disable Nagle for RPC ​ Disable Nagle’s algorithm ​  TCP delayed acks can add up to ~200ms to RPC round trip time ​  In Hadoop’s core-site and HBase’s hbase-site •  ipc.server.tcpnodelay = true •  ipc.client.tcpnodelay = true ​  In HBase’s hbase-site •  hbase.ipc.client.tcpnodelay = true •  hbase.ipc.server.tcpnodelay = true ​  Why are these not default? Good question

10.Limit Impact Of Server Failures ​ Detect regionserver failure as fast as reasonable (hbase-site) ​  zookeeper.session.timeout <= 30 seconds – Bound failure detection within 30 seconds (20-30 is good) ​ Detect and avoid unhealthy or failed HDFS DataNodes (hdfs-site, hbase-site) ​  dfs.namenode.avoid.read.stale.datanode = true ​  dfs.namenode.avoid.write.stale.datanode = true

11.Server Side Configuration Optimization for Low Latency Skip the network if block is local (hbase-site) ​  dfs.client.read.shortcircuit = true ​  dfs.client.read.shortcircuit.buffer.size = 131072 – important to avoid OOME ​ Ensure data locality (hbase-site) ​  hbase.hstore.min.locality.to.skip.major.compact = 0.7 (0.7 <= n <= 1) ​ Make sure DataNodes have enough handlers for block transfers (hdfs-site) ​  dfs.datanode.max.xcievers >= 8192 ​  dfs.datanode.handler.count – match number of spindles

12.Schema Considerations Block Encoding Use FAST_DIFF block encoding ​  CREATE TABLE … (  ​ …  ​ ) DATA_BLOCK_ENCODING=‘FAST_DIFF’ ​  FAST_DIFF encoding is automatically enabled on all Phoenix tables by default ​  Almost always improves overall read latencies and throughput by allowing more data to fit into blockcache ​  Note: Can increase garbage produced during request processing

13.Schema Considerations Salting Use salting to avoid hotspotting ​  CREATE TABLE … (  ​ …  ​ ) SALT_BUCKETS = N ​  Do not salt automatically. Use only when experiencing hotspotting ​  Once you need it, for optimal performance the number of salt buckets should approximately equal the number of regionservers

14.Schema Considerations Primary key (row key) design ​ Primary key design is the single most important design criteria that drives performance ​ Make sure that what ever you're filtering on in your most common queries drives your primary key constraint design ​ Filter against leading columns in the primary key constraint in the WHERE clause, especially the first ​ Further advice is use case specific, suggest writing user@phoenix.apache.org with questions

15.Optimize Writes For Throughput ​ Optimize UPSERT for throughput ​  UPSERT VALUES •  Batch by calling it multiple times before commit() •  Use PreparedStatement for cases where you're calling UPSERT VALUES again and again ​  UPSERT SELECT •  Use connection.setAutoCommit(true), pipelines scan results out as writes without unnecessary buffering •  If your rows are small, consider increasing phoenix.mutate.batchSize •  Number of rows batched together and automatically committed during UPSERT SELECT, default 1000

16.Fast Fail For applications where failing quickly is better than waiting Phoenix query timeout (hbase-site, client side) ​  phoenix.query.timeoutMs – max tolerable wait time HBase level client retry count and wait (hbase-site, client side) ​  hbase.client.pause = 1000 ​  hbase.client.retries.number = 3 ​  If you want to ride over splits and region moves, increase hbase.client.retries.number substantially (>= 20) ​ RecoverableZookeeper retry count and retry wait (hbase-site, client side) ​  zookeeper.recovery.retry = 1 (no retry) ​ ZK session timeout for detecting server failures (hbase-site, server side) ​  zookeeper.session.timeout <= 30 seconds (20-30 is good)

17.Timeline Consistent Reads For applications that can tolerate slightly out of date information HBase timeline consistency (HBASE-10070) ​  With read replicas enabled, read-only copies of regions (replicas) are distributed over the cluster ​  One RegionServer services the default or primary replica, which is the only replica that can service writes ​  Other RegionServers serve the secondaries replicas, follow the primary RegionServer and only see committed updates. The secondary replicas are read-only, but can serve reads immediately while the primary is failing over, cutting read availability blips from ~seconds to ~milliseconds Phoenix supports timeline consistency as of 4.4.0 1.  Deploy HBase 1.0.0 or later 2.  Enable timeline consistent replicas on the server side 3.  ALTER SESSION SET CONSISTENCY = 'TIMELINE’ or set the connection property ‘Consistency’ to “timeline” in the JDBC connect string or set ‘phoenix.connection.consistency’ = “timeline” in client hbase-site for all connections

18.OS Level Tuning For Predictable Performance ​ Turn transparent huge pages (THP) off ​  echo never > /sys/kernel/mm/transparent_hugepage/enabled ​  echo never > /sys/kernel/mm/transparent_hugepage/defrag ​ Set vm.swappiness = 0 ​ Set vm.min_free_kbytes to at least 1GB (8GB on larger memory systems) ​ Disable NUMA zone reclaim with vm.zone_reclaim_mode = 0

19.EXPLAINing Predictable Performance ​ Check the computed physical plan using EXPLAIN ​ Consider rewriting queries when: ​  Prefer operations on SERVER, not CLIENT •  SERVER ops are distributed over the servers and execute in parallel •  CLIENT ops execute within the single client JDBC driver •  Consider tweaking your query to increase the use of SERVER side operations ​  Scanning strategy is TABLE SCAN, prefer RANGE SCAN or SKIP SCAN •  Filter against leading columns in the primary key constraint, PK may need redesign •  Possibly means you need to introduce a global index that covers your query or local index •  If you have an index but the optimizer is missing it, try hinting the query (SELECT /*+ INDEX(<table> <index>) */ …) If using JOINs, please read https://phoenix.apache.org/joins.html

20.What’s New? Exciting new or upcoming features for OLTP use cases ​ Functional Indexes ​ Spark integration ​  Make RDDs or data frames out of fast Phoenix queries for Spark streaming and batch workflows Transactions (WIP) ​  Using Tephra (http://tephra.io/), supports REPEATABLE_READ isolation level ​  Need to manage the Tephra Transaction Manager as a new system component ​  Work in progress, but close to release, try out the ‘txn’ branch Calcite Integration for federated query (WIP) ​  Interoperate and federate queries over other Apache Calcite adopters (Drill, Hive, Samza, Kylin) and any data source with Calcite driver support (Postgres, MySQL, etc.) ​  See the ‘calcite’ branch

21.What’s New? Exciting new or upcoming features for OLTP use cases ​ Query Server ​  Builds on Calcite’s Avatica framework for thin JDBC drivers ​  Offloads query planning and execution to different server(s) – The “fat” parts of the Phoenix JDBC driver become hosted on a middle tier, enablig scaling independent from clients and HBase ​  Shipping now in 4.5+ ​  Still evolving, so no backward compatibility guarantees yet ​  A more efficient wire protocol based on protobufs is WIP

22.Thank you Q&A apurtell@salesforce.com apurtell@apache.org @akpurtell

user picture