1. EFFICIENT TIME-SERIES in HBase
   Vladimir Rodionov, SMTS, Hortonworks
2. Time Series
   - Sequence of data points
   - Triplet: [ID][TIME][VALUE] (basic)
   - Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE]
   - Stock closing value, DJIA
   - User behavior (web clicks)
   - Credit card transactions
   - Health data
   - Fitness indicators
   - Sensor data (IoT)
   - Application and system metrics (ODS)
3. TSDS requirements
   - Data store MUST preserve temporal locality of data for better in-memory caching.
     - Facebook ODS: 85% of queries are for the last 26 hours.
   - Data store MUST provide efficient compression.
     - Time series are highly compressible (less than 2 bytes per data point in some cases).
     - Facebook's custom compression codec produces less than 1.4 bytes per data point.
   - Data store MUST provide automatic, configurable time-based rollup aggregations: sum, count, avg, min, max, etc., by minute, hour, day, and so on.
     - Most of the time it is the aggregated data we are interested in.
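The rollup requirement can be illustrated with a minimal sketch. It assumes data points arrive as (series ID, timestamp in seconds, value) triplets; the function name `rollup` and the sample series are hypothetical and are not part of any HBase or OpenTSDB API.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Aggregate (series_id, ts, value) triplets into fixed-size time buckets."""
    buckets = defaultdict(list)
    for series_id, ts, value in points:
        # Align the timestamp to the start of its bucket.
        bucket_start = ts - ts % bucket_seconds
        buckets[(series_id, bucket_start)].append(value)

    result = {}
    for key, values in buckets.items():
        result[key] = {
            "sum": sum(values),
            "count": len(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
    return result


# Hypothetical sample data: one series, four points within one hour.
points = [
    ("cpu.host1", 0, 10.0),
    ("cpu.host1", 30, 20.0),
    ("cpu.host1", 60, 40.0),
    ("cpu.host1", 3599, 30.0),
]
by_min = rollup(points, 60)     # per-minute resolution
by_hour = rollup(points, 3600)  # per-hour resolution
```

The same function serves every resolution; only the bucket width changes, which is what makes the resolution configurable.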
4. OpenTSDB 2.x
   - Data store MUST preserve temporal locality of data for better in-memory caching: NO
     - Size-tiered HBase compaction does not preserve temporal locality of data. A major compaction, for example, creates a single file in which recent data is stored alongside data that is months or years old.
     - Compaction also trashes the block cache, which decreases read performance and increases latencies.
   - Data store MUST provide efficient compression: NO
     - OpenTSDB supports compression, but it is very heavy (it runs externally), and users usually disable it in production.
   - Data store MUST provide automatic time-based rollup aggregations: NOT IMPLEMENTED
5. Ideal HBase TSDB
   - Keeps raw data for hours
   - Does not compact raw data at all
   - Preserves raw data in the in-memory cache for periodic compactions and time-based rollup aggregations
   - Stores full-resolution data only in compressed form
   - Has a different TTL for each aggregation resolution:
     - Days for by_min, by_10min, etc.
     - Months or years for by_hour
   - Compaction should preserve temporal locality of both full-resolution data and aggregated data.
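One reason full-resolution time series compress so well is that timestamps usually arrive at near-regular intervals. The sketch below uses delta-of-delta encoding, a common technique chosen here for illustration; it is not claimed to be the codec the slides refer to, and the function names are hypothetical.

```python
def delta_of_delta(timestamps):
    """Encode timestamps as (first, first_delta, list of delta-of-deltas).

    Assumes at least two timestamps.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def restore(first, first_delta, dods):
    """Invert delta_of_delta and return the original timestamps."""
    timestamps = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Hypothetical samples taken roughly every 60 seconds.
ts = [1000, 1060, 1120, 1181, 1240]
first, first_delta, dods = delta_of_delta(ts)
decoded = restore(first, first_delta, dods)
```

For near-regular sampling the delta-of-delta values cluster around zero, so a variable-length bit encoding can store most of them in very few bits.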
6. TSDS
   [Diagram: Raw Events, HBase Region Server, HDFS; column families CF:Raw, CF:Compressed, CF:Aggregates; C = Compressor Coprocessor, A = Aggregator Coprocessor]
   - CF:Raw: TTL hours
   - CF:Compressed: TTL months
   - CF:Aggregates: TTL days/months
7. Exploring (Size-Tiered) Compaction
   - Does not preserve temporal locality of data.
   - Compaction trashes the block cache.
   - No efficient caching of data is possible.
   - It hurts the most-recent-most-valuable data access pattern.
   - Compression/aggregation is very heavy: reading recent raw data back and running it through the compressor requires many IO operations, because we can't guarantee that recent data is in the block cache.
8. HBASE-14468 FIFO compaction
   - First-In-First-Out
   - No compaction at all
   - TTL-expired data just gets archived
   - Ideal for raw data storage
   - No compaction means no block cache trashing
   - Raw data can be cached on write or on read
   - Sustains 100s of MB/s write throughput per RS
   - Patch available
   - Can be applied to 1.0/1.1/1.2/1.3/2.0
   - 0.98 requires some code changes
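The FIFO idea can be sketched as a simple selection rule: a store file is never rewritten, and it is dropped as a whole once its newest cell is older than the TTL. This is a simplified model, not the code from the HBASE-14468 patch; `StoreFile` and `fifo_select_expired` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreFile:
    name: str
    max_ts: int  # timestamp (seconds) of the newest cell in the file


def fifo_select_expired(files, now, ttl_seconds):
    """Return files in which every cell is past its TTL.

    If the newest cell has expired, all older cells in the same file have
    expired too, so the file can be removed without reading or rewriting it.
    """
    return [f for f in files if now - f.max_ts > ttl_seconds]


# Hypothetical store with three files of different ages.
files = [StoreFile("f1", 1_000), StoreFile("f2", 5_000), StoreFile("f3", 9_000)]
expired = fifo_select_expired(files, now=10_000, ttl_seconds=6_000)
remaining = [f for f in files if f not in expired]
```

Because no file is ever merged or rewritten, blocks that are already cached stay valid until the file itself expires.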
9. HBASE-14477 DT compaction
   - DateTieredCompactionPolicy
   - CASSANDRA-6602
   - Works better for time series than ExploringCompactionPolicy
   - Adds delayed compaction (not in Cassandra)
   - Better temporal locality helps with reads
   - Good choice for compressed full-resolution and aggregated data
   - Patch will follow shortly. Again, for 1.0 and up; can be back-ported to 0.98.
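To show how date tiering preserves temporal locality, here is a deliberately simplified model: files are grouped into age windows whose width grows geometrically, and only files in the same window are candidates to be compacted together. The real DateTieredCompactionPolicy is more involved; the single-window-per-tier layout, the function names, and the parameters `base_window` and `factor` are assumptions made for illustration.

```python
from collections import defaultdict


def tier_window(age_seconds, base_window, factor):
    """Return (tier, window_start_age, window_end_age) for a file of this age."""
    start, size, tier = 0, base_window, 0
    # Walk back in time through windows that grow by `factor` at each tier.
    while age_seconds >= start + size:
        start += size
        size *= factor
        tier += 1
    return tier, start, start + size


def group_by_window(file_ages, base_window, factor):
    """Group file names by tier; each group is a compaction candidate set."""
    groups = defaultdict(list)
    for name, age in file_ages.items():
        tier, _, _ = tier_window(age, base_window, factor)
        groups[tier].append(name)
    return dict(groups)


# Hypothetical store files with their ages in seconds.
ages = {"a": 100, "b": 2_000, "c": 5_000, "d": 17_000, "e": 20_000}
groups = group_by_window(ages, base_window=3_600, factor=4)
```

Recent files are never merged with much older ones, so a read of recent data touches only files from the youngest windows.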
11. Temporal Locality
   [Figure: axes Age and Size; panels: No compaction, STCP, Major, DTCP]
12. Temporal Locality
   [Figure: axes Age and Size; panels: No compaction, STCP, Major, DTCP; label: Most Recent Data]
13. HBASE-14496 Delayed compaction
   - Files are eligible for minor compaction if their age > delay.
   - Good for applications where the most recent data is the most valuable.
   - Prevents block cache trashing for recent data caused by frequent minor compactions of fresh store files.
   - Will be enabled for the Exploring Compaction Policy; DTCP will have it by default.
   - DTCP + Delay (1-2 days) is a good option for compressed full-resolution and aggregated data.
   - Patch available. HBase 1.0+ (can be back-ported to 0.98).
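The eligibility rule "age > delay" is small enough to sketch directly. The example assumes a file's age is measured from its creation time; the function name and the sample files are hypothetical, and this is not the code from the HBASE-14496 patch.

```python
def eligible_for_minor_compaction(files, now, delay_seconds):
    """Return names of files older than the delay.

    `files` is a list of (name, created_ts) pairs, timestamps in seconds.
    """
    return [name for name, created_ts in files if now - created_ts > delay_seconds]


DAY = 24 * 3600
now = 10 * DAY
# Hypothetical store files: five days old, two days old, and one hour old.
files = [("old", 5 * DAY), ("mid", 8 * DAY), ("fresh", now - 3600)]
candidates = eligible_for_minor_compaction(files, now, delay_seconds=1 * DAY)
```

Fresh files are left alone during the delay, so their blocks in the block cache are not invalidated by an early minor compaction.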
14. TSDS
   [Diagram: Raw Events, HBase Region Server, HDFS; column families CF:Raw, CF:Compressed, CF:Aggregates; C = Compressor Coprocessor, A = Aggregator Coprocessor]
   - CF:Raw: TTL hours, FIFO
   - CF:Compressed: TTL months, DTCP + Delay
   - CF:Aggregates: TTL days/months, DTCP + Delay
15. Summary
   - Disable major compaction.
   - Disable region splits (DisabledRegionSplitPolicy).
   - Presplit the table in advance.
   - Increase hbase.hstore.blockingStoreFiles for raw data.
   - Have separate column families for raw, compressed, and aggregated data (each aggregate resolution gets its own family).
   - FIFO for raw; ECP + Delay for the others (now), DTCP + Delay (in the near future).
   - Periodically run an internal job (coprocessor) to compress data and produce time-based rollup aggregations.