Time-Series HBase

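Which application scenarios need a time-series processing system? What are the basic requirements of a time-series data store? How can HBase be used to build an effective time-series processing and storage system? Ideally, which properties should an HBase-based time-series database have? OpenTSDB is a time-series database built on HBase, but it places its own demands on HBase, such as caching data by time range, which creates new requirements and challenges for compaction. The presentation describes this work in detail and the specific requirements that were submitted to the HBase community.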

1.EFFICIENT TIME-SERIES in HBase
Vladimir Rodionov, SMTS, Hortonworks

2.Time Series
- Sequence of data points
- Triplet: [ID][TIME][VALUE] (basic)
- Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE]
- Examples:
  - Stock closing value (DJIA)
  - User behavior (web clicks)
  - Credit card transactions
  - Health data
  - Fitness indicators
  - Sensor data (IoT)
  - Application and system metrics (ODS)
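
The triplet layout above can be illustrated with a minimal sketch. It assumes a 4-byte series ID and an 8-byte timestamp packed big-endian into a row key, so that byte-wise sorting (the order HBase uses for row keys) matches ordering by ID and then by time. The function names and field widths are hypothetical and are not taken from the slides or from OpenTSDB.

```python
import struct

def encode_row_key(series_id, timestamp):
    """Pack [ID][TIME] as big-endian unsigned ints (4 + 8 bytes).

    Big-endian packing makes byte-wise order equal to numeric order
    for non-negative values. Field widths are an assumption.
    """
    return struct.pack(">IQ", series_id, timestamp)

def decode_row_key(key):
    """Inverse of encode_row_key: returns (series_id, timestamp)."""
    return struct.unpack(">IQ", key)

# Three triplets (ID, TIME, VALUE), deliberately out of order.
points = [
    (7, 1_700_000_060, 101.5),
    (7, 1_700_000_000, 100.0),
    (3, 1_700_000_030, 42.0),
]

# Sorting by the encoded key groups each series and orders it by time.
rows = sorted((encode_row_key(sid, ts), value) for sid, ts, value in points)
ordered = [decode_row_key(key) for key, _ in rows]
print(ordered)
```

With this layout, all points of one series are adjacent and time-ordered, which is what makes range scans over a time window cheap.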

3.TSDS requirements
- Data store MUST preserve temporal locality of data for better in-memory caching.
  - Facebook ODS: 85% of queries are for the last 26 hours.
- Data store MUST provide efficient compression.
  - Time series are highly compressible (less than 2 bytes per data point in some cases).
  - Facebook's custom compression codec produces less than 1.4 bytes per data point.
- Data store MUST provide automatic time-based rollup aggregations: sum, count, avg, min, max, etc., by minute, hour, day, and so on (configurable).
  - Most of the time it is the aggregated data we are interested in.
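
The rollup requirement above can be sketched in a few lines. This is an in-memory illustration only, assuming points arrive as (timestamp in seconds, value) pairs; the `rollup` function and its bucket sizes are hypothetical and do not represent any HBase or OpenTSDB API.

```python
from collections import defaultdict

def rollup(points, bucket_seconds):
    """Group (timestamp, value) pairs into fixed windows and aggregate each.

    Returns {window_start: {"sum", "count", "avg", "min", "max"}}.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its window.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "sum": sum(vals),
            "count": len(vals),
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for start, vals in sorted(buckets.items())
    }

points = [(0, 1.0), (30, 3.0), (60, 5.0), (3599, 7.0), (3600, 9.0)]
by_min = rollup(points, 60)     # one bucket per minute
by_hour = rollup(points, 3600)  # one bucket per hour
print(by_hour[0])
```

The same function serves every resolution, so the "configurable" part reduces to choosing which bucket sizes to materialize.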

4.OpenTSDB 2.x
- Data store MUST preserve temporal locality of data for better in-memory caching: NO
  - Size-tiered HBase compaction does not preserve temporal locality of data.
  - Major compaction, for example, creates a single file in which recent data is stored together with data that is months or years old.
  - Compaction trashes the block cache as well, which decreases read performance and increases latencies.
- Data store MUST provide efficient compression: NO
  - OpenTSDB supports compression, but it is very heavy (it runs externally), and users usually disable it in production.
- Data store MUST provide automatic time-based rollup aggregations: NOT IMPLEMENTED

5.Ideal HBase TSDB
- Keeps raw data for hours
- Does not compact raw data at all
- Preserves raw data in the memory cache for periodic compactions and time-based rollup aggregations
- Stores full-resolution data only in compressed form
- Has a different TTL for each aggregation resolution:
  - Days for by_min, by_10min, etc.
  - Months or years for by_hour
- Compaction should preserve the temporal locality of both full-resolution data and aggregated data
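
To illustrate why full-resolution data is worth storing only in compressed form, here is a minimal sketch that delta-encodes timestamps and integer values and then deflates the result with zlib. It is not the Facebook codec or OpenTSDB's compaction mentioned in the slides; the `compress` and `decompress` helpers are hypothetical, and the input is assumed to be a non-empty list of integer pairs.

```python
import struct
import zlib
from itertools import accumulate

def compress(points):
    """Delta-encode (timestamp, int value) pairs, then deflate the bytes.

    Assumes `points` is non-empty and all numbers fit in signed 64 bits.
    """
    ts = [t for t, _ in points]
    vs = [v for _, v in points]
    # Keep the first element as-is, then store differences between neighbours.
    ts_deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
    vs_deltas = [vs[0]] + [b - a for a, b in zip(vs, vs[1:])]
    raw = struct.pack(f">{2 * len(points)}q", *ts_deltas, *vs_deltas)
    return zlib.compress(raw)

def decompress(blob):
    """Inverse of compress: rebuild the original (timestamp, value) pairs."""
    raw = zlib.decompress(blob)
    n = len(raw) // 16  # two 8-byte integers per point
    nums = struct.unpack(f">{2 * n}q", raw)
    # Running sums undo the delta encoding.
    return list(zip(accumulate(nums[:n]), accumulate(nums[n:])))

# One point per minute with a slowly varying value.
points = [(1_700_000_000 + 60 * i, 1000 + i % 7) for i in range(1000)]
blob = compress(points)
print(len(points) * 16, "raw bytes ->", len(blob), "compressed bytes")
```

Regularly spaced timestamps turn into a run of identical deltas, which is why a general-purpose compressor already does well on this kind of data.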

6.TSDS HBase
[Architecture diagram. Labels: Raw Events, Region Server, HDFS, CF:Raw, CF:Compressed, CF:Aggregates, Compressor Coprocessor (C), Aggregator Coprocessor (A)]
- CF:Raw: TTL hours
- CF:Compressed: TTL months
- CF:Aggregates: TTL days/months

7.Exploring (Size-Tiered) Compaction
- Does not preserve temporal locality of data.
- Compaction trashes the block cache.
- No efficient caching of data is possible.
- It hurts the most-recent-most-valuable data access pattern.
- Compression/aggregation is very heavy.
- To read back recent raw data and run it through the compressor, many IO operations are required, because … we can't guarantee that recent data is in the block cache.

8.HBASE-14468 FIFO compaction
- First-In-First-Out
- No compaction at all
- TTL-expired data just gets archived
- Ideal for raw data storage
- No compaction means no block cache trashing
- Raw data can be cached on write or on read
- Sustains 100s of MB/s write throughput per RS
- Patch available
- Can be applied to 1.0/1.1/1.2/1.3/2.0
- 0.98 requires some code changes
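
The FIFO idea above can be sketched as a selection rule: files are never merged, and a file is dropped only once its newest cell is past the TTL. This is a simplified model, not the HBase implementation; the `StoreFile` class and `fifo_select_expired` function are hypothetical names used for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoreFile:
    name: str
    max_timestamp: int  # newest cell timestamp in the file, in seconds

def fifo_select_expired(files, now, ttl_seconds):
    """Return the files whose newest cell is already older than the TTL.

    Nothing is rewritten: a file is either kept whole or removed whole.
    """
    return [f for f in files if now - f.max_timestamp > ttl_seconds]

files = [
    StoreFile("f1", 1_000),
    StoreFile("f2", 5_000),
    StoreFile("f3", 9_000),
]
expired = fifo_select_expired(files, now=10_000, ttl_seconds=3_600)
remaining = [f for f in files if f not in expired]
print([f.name for f in expired], [f.name for f in remaining])
```

Because live files are never rewritten, blocks already cached for recent data stay valid, which is the property the slide highlights.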

9.HBASE-14477 DT compaction
- DateTieredCompactionPolicy
- CASSANDRA-6602
- Works better for time series than ExploringCompactionPolicy
- Adds delayed compaction (not in Cassandra)
- Better temporal locality helps with reads
- Good choice for compressed full-resolution and aggregated data
- Patch will follow shortly. Again, for 1.0 and up. Can be back-ported to 0.98.
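
The date-tiered idea can be sketched as follows: files are assigned to time windows that widen with age, and only files inside the same window are candidates for compaction together. This is a deliberately simplified model and not the algorithm from HBASE-14477 or CASSANDRA-6602; the window sizes, the growth factor, and both function names are assumptions made for illustration.

```python
from collections import defaultdict

def window_for(timestamp, now, base_window=3600, growth=4):
    """Return (start, end) of the window holding `timestamp`.

    The newest window is `base_window` seconds wide; each older window
    is `growth` times wider than the one after it. Assumes timestamp <= now.
    """
    size = base_window
    end = now - now % size + size  # end of the current base window
    start = end - size
    while timestamp < start:
        end = start
        size *= growth
        start = end - size
    return start, end

def group_by_window(file_timestamps, now):
    """Map each window to the files that fall into it."""
    groups = defaultdict(list)
    for name, ts in file_timestamps.items():
        groups[window_for(ts, now)].append(name)
    return dict(groups)

files = {"a": 99_000, "b": 98_000, "c": 90_000, "d": 85_000, "e": 30_000}
groups = group_by_window(files, now=100_000)
for window, names in sorted(groups.items(), reverse=True):
    print(window, names)
```

Recent files share a narrow window and old files share a wide one, so compaction never mixes fresh data with data that is months old.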

10.DateTieredCompactionPolicy

11.Temporal Locality
[Figure. Axis labels: Age, Size. Panels: No compaction, STCP, Major, DTCP]

12.Temporal Locality
[Figure. Axis labels: Age, Size. Panels: No compaction, STCP, Major, DTCP. Additional label: Most Recent Data]

13.HBASE-14496 Delayed compaction
- Files are eligible for minor compaction if their age > delay
- Good for applications where the most recent data is the most valuable
- Prevents frequent minor compactions of fresh store files from trashing the block cache for recent data
- Will enable this feature for Exploring Compaction Policy
- DTCP will have it by default
- DTCP + Delay (1-2 days) is a good option for compressed full-resolution and aggregated data
- Patch available. HBase 1.0+ (can be back-ported to 0.98)
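
The eligibility rule above (age > delay) amounts to a filter applied before any compaction policy picks its candidates. The sketch below assumes file age is measured from a creation timestamp in seconds; how HBase measures age is not stated in the slides, and the function name is hypothetical.

```python
def eligible_for_minor_compaction(files, now, delay_seconds):
    """Keep only the files older than the delay; fresher files are untouched.

    `files` is a list of (name, created_at) pairs with times in seconds.
    """
    return [name for name, created_at in files if now - created_at > delay_seconds]

ONE_DAY = 86_400
files = [("old1", 0), ("old2", 50_000), ("fresh", 150_000)]
eligible = eligible_for_minor_compaction(files, now=172_800, delay_seconds=ONE_DAY)
print(eligible)
```

Files younger than the delay keep their identity on disk, so the blocks cached for them are not invalidated by a rewrite.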

14.TSDS HBase
[Architecture diagram. Labels: Raw Events, Region Server, HDFS, CF:Raw, CF:Compressed, CF:Aggregates, Compressor Coprocessor (C), Aggregator Coprocessor (A)]
- CF:Raw: TTL hours, FIFO
- CF:Compressed: TTL months, DTCP+Delay
- CF:Aggregates: TTL days/months, DTCP+Delay

15.Summary
- Disable major compaction
- Disable region splits (DisabledRegionSplitPolicy)
- Presplit the table in advance
- Increase hbase.hstore.blockingStoreFiles for raw data
- Use separate column families for raw, compressed, and aggregated data (each aggregate resolution gets its own family)
- FIFO for raw data; ECP + Delay for the others (now), DTCP + Delay (in the near future)
- Periodically run an internal job (coprocessor) to compress data and produce time-based rollup aggregations
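
For the presplit recommendation, split keys have to be chosen before any data arrives. The sketch below computes evenly spaced split points, assuming row keys begin with a fixed-width unsigned big-endian prefix (for example, a hashed series ID) that is roughly uniformly distributed. The function name and the 4-byte prefix are assumptions; the resulting byte strings would still need to be passed to whatever table-creation tool is in use.

```python
def split_points(num_regions, key_bytes=4):
    """Return num_regions - 1 evenly spaced split keys.

    Keys are unsigned big-endian integers of `key_bytes` bytes, so the
    key space is divided into `num_regions` equal ranges.
    """
    space = 1 << (8 * key_bytes)
    return [
        (space * i // num_regions).to_bytes(key_bytes, "big")
        for i in range(1, num_regions)
    ]

splits = split_points(4)
print([s.hex() for s in splits])
```

Even spacing only balances load when the key prefix is uniformly distributed; sequential IDs would need a different scheme.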

16.Q&A