Enabling Efficient OS Paging for Main-Memory OLTP Databases


Radu Stoica
École Polytechnique Fédérale de Lausanne
radu.stoica@epfl.ch

Anastasia Ailamaki
École Polytechnique Fédérale de Lausanne
anastasia.ailamaki@epfl.ch

ABSTRACT

Even though main memory is becoming large enough to fit most OLTP databases, it may not always be the best option. OLTP workloads typically exhibit skewed access patterns where some records are hot (frequently accessed) but many records are cold (infrequently or never accessed). Therefore, it is more economical to store the coldest records on a fast secondary storage device such as a solid-state disk. However, main-memory DBMS have no knowledge of secondary storage, while traditional disk-based databases, designed for workloads where data resides on HDD, introduce too much overhead for the common case where the working set is memory resident.

In this paper, we propose a simple and low-overhead technique that enables main-memory databases to efficiently migrate cold data to secondary storage by relying on the OS's virtual memory paging mechanism. We propose to log accesses at the tuple level, process the access traces offline to identify relevant access patterns, and then transparently re-organize the in-memory data structures to reduce paging I/O and improve hit rates. The hot/cold data separation is performed on demand and incrementally through careful memory management, without any change to the underlying data structures. We validate the data re-organization proposal experimentally and show that OS paging can be efficient: a TPC-C database can grow two orders of magnitude larger than the available memory size without a noticeable impact on performance.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DaMoN '13
Copyright 2013 ACM 978-1-4503-2196-9/13/06 ...$15.00.

1. INTRODUCTION

Database systems have traditionally been designed under the assumption that most data is disk resident and is paged in and out of memory as needed. However, the drop in memory prices over the past 30 years led several database engines to optimize for the case when all data fits in memory. Examples include both research systems (MonetDB [6], H-Store [13], HyPer [14], Hyrise [11]) and commercial systems (Oracle's TimesTen [2], IBM's SolidDB [1], VoltDB [4], SAP Hana [7]). Such systems assume all data resides in RAM and explicitly eliminate traditional disk optimizations that are deemed to introduce an unacceptably large overhead.

Today, storage technology trends have changed: along with the large main memory of modern servers, we have solid-state drives (SSDs) that can support hundreds of thousands of IOPS. In recent years, the capacity improvements of NAND flash-based devices outpaced DRAM capacity growth, accompanied by corresponding trends in cost per gigabyte. In addition, major hardware manufacturers are investing in competing solid-state technologies such as Phase-Change Memory.

In OLTP workloads, record accesses tend to be skewed: some records are "hot" and accessed frequently (the working set), others are "cold" and accessed infrequently. Hot records should reside in memory to guarantee good performance; cold records, however, can be moved to cheaper external solid-state storage.

Ideally, we want a DBMS that has: a) the high performance characteristic of main-memory databases when the working set fits in RAM, and b) the ability of a traditional disk-based database engine to keep cold data on the larger and cheaper storage media, while supporting, as efficiently as possible, the infrequent cases where data needs to be retrieved from secondary storage.

There is no straightforward solution, as DBMS engines optimized for in-memory processing have no knowledge of secondary storage. Traditional storage optimizations, such as the buffer pool abstraction, the page organization, certain data structures (e.g. B-Trees), and ARIES-style logging, are explicitly discarded in order to reduce overhead. At the same time, switching back to a traditional disk-based DBMS would forfeit the fast processing benefits of main-memory systems.

In this paper, we take the first step toward enabling a main-memory database to migrate data to a larger and cheaper secondary storage. We propose to record accesses during normal system runtime at the tuple level (possibly using sampling to reduce overhead) and process the access traces off the critical path of query execution in order to identify data worth storing in memory. Based on the access statistics, the relational data structures are re-organized such that the hot tuples are stored as compactly as possible, which leads to improved main-memory hit rates and to reduced OS paging I/O. The data re-organization is unintrusive since only the location of tuples in memory changes, while the internal pointer structure of existing data structures and query execution remain unaffected. In addition, the data re-organization is performed incrementally and only when required by the workload.
[Figure 1: VoltDB throughput when paging to a SSD. TPC-C throughput (1000s of transactions/sec) vs. database size (GB) for 3GB, 18GB, and 62GB of RAM; swapping to the FusionIO SSD starts once RAM is exhausted.]

[Figure 2: System Architecture. (1) Sample accesses in the DBMS engine; (2) write access logs via a dedicated writer thread; (3) process access logs (a: access frequency, b: memory allocation, c: misplaced tuples) in a monitoring process; (4) read optimal tuple placement; (5) re-organization of the data structures (heap files, binary trees, hash indexes) into hot and cold regions.]

We implement the data re-organization proposal in a state-of-the-art, open-source main-memory database, VoltDB [4]. Our experimental results show that a TPC-C database can grow 50× larger than the available DRAM without a noticeable impact on performance or latency, and we present micro-benchmark results indicating that a system can deliver reasonable performance even in cases where the working set does not fully fit in memory and secondary storage data is frequently accessed.

In this paper we make the following contributions:

1. We profile the performance of a state-of-the-art main-memory DBMS and identify its inefficiencies in moving data to secondary storage.

2. We propose an unintrusive data re-organization strategy that separates hot from cold data with minimal overhead and allows the OS to efficiently page data to a fast solid-state storage device.

3. We implement the data re-organization technique in a state-of-the-art database and show that it can support a TPC-C dataset that grows 50× bigger than the available physical memory without a significant impact on throughput or latency.

The remainder of this paper is organized as follows. Section 2 details the motivation for our work; Section 3 describes the architecture of the data re-organization proposal, which is validated experimentally in Section 4; Section 5 surveys the related work and, finally, Section 6 concludes.

2. MOTIVATION

Our work is motivated by several considerations: i) hardware trends that make storing data on solid-state storage attractive; ii) the workload characteristics of OLTP databases; and iii) the inefficiencies of existing systems, either traditional disk-based databases or newer main-memory optimized DBMS.

2.1 Hardware trends

It is significantly cheaper to store cold data on secondary storage rather than in DRAM. In previous work [17] we computed an updated version of the 5-minute rule [10] and found that it is more economical to store a 200B tuple on a SSD if it is accessed less than approximately once every 100 minutes. In addition to the price, the maximum memory capacity of a server and the memory density can both pose challenges. Today, a high-end workstation fits at most 4TB of main memory, servers can handle around 256GB of DRAM per CPU socket, and a 1U rack slot hosts less than 512GB of DRAM.

2.2 Skewed Accesses in OLTP workloads

Real-life transactional workloads typically exhibit considerable access skew. For example, package tracking workloads for companies such as UPS or FedEx exhibit time-correlated skew: records for a new package are frequently updated until delivery, then used for analysis for some time, and afterwards are accessed only on rare occasions. Another example is the natural skew found on large e-commerce sites such as Amazon, where some items are much more popular than others or where some users are much more active than the average. Such skewed accesses may change over time, but typically not very rapidly (the access skews might shift over days rather than seconds).

2.3 Existing DBMS Architectures

2.3.1 Disk-based DBMS

One possible option is to use a disk-based database architecture that optimizes for the case where data is on secondary storage. Traditional DBMS pack data in fixed-size pages, use logical pointers (e.g. page IDs and record offsets) that allow the buffer pool abstraction to move data between memory and storage transparently, and have specialized page replacement algorithms to maximize hit rates and reduce I/O latency.

However, given current memory sizes and the lower latency of SSDs compared to HDDs, such optimizations might not be desirable. Stonebraker et al. [20] introduced a main-memory database that is two orders of magnitude faster than a general-purpose disk-based DBMS, while Harizopoulos et al. [12] showed that for transactional workloads the buffer pool introduces a significant amount of CPU overhead: more than 30% of execution time is wasted due to the extra indirection layer. This overhead does not even include latching buffer pool pages at every tuple access, or the overhead of maintaining page-level statistics to implement the page replacement algorithm.

2.3.2 Default OS paging

At the other extreme, a straightforward way of extending available memory is to keep main-memory databases unchanged and simply use the default OS paging. Unfortunately, we find such an approach to cause unpredictable performance degradation.

We show in Figure 1 the throughput of the VoltDB DBMS when running the TPC-C benchmark [3] (please see Section 4 for the detailed experimental setup). The working set size of the TPC-C benchmark is constant, while the size of the database grows over time as new records are inserted in the Order, NewOrder, OrderLine, and History tables. We vary the amount of physical DRAM available to the OS and enable swapping to a 160GB Fusion ioDrive PCIe SSD [9]. The scale factor is 16 (the initial database occupies 1.6GB) and in the beginning all data is memory resident. As the database size grows, the available RAM is exhausted and the OS starts paging virtual memory pages that it deems cold to the SSD.

Figure 1 shows the trade-off between extending the memory size of a system by using flash memory and the performance penalty incurred. When extending the memory budget from 62GB to 220GB (3.5× increase), the throughput drops by 20%; increasing the main memory from 18GB to 176GB (9.8× increase), the throughput decreases by 26%; finally, when extending the memory size from 3GB to 161GB (53× increase), the throughput penalty is 66%. The results are sub-optimal, as the TPC-C working set fits in memory even if the database grows. Clearly, the performance impact of paging is significant and needs consideration.

3. DATA RE-ORGANIZATION

Our data re-organization proposal, as depicted in Figure 2, is composed of five independent steps: first, tuple-level accesses are logged as part of query execution (phase one); then, the access logs are shipped (phase two) to the processing stage that identifies which tuples are hot or cold (phase three); finally, the output of the processing stage is propagated back to the database engine (phase four), which performs the data re-organization when needed (phase five). The rest of this section details each step of the data re-organization process.

(1) Sample accesses. In the first phase, each worker thread probabilistically samples tuple accesses and writes the access records to a dedicated log-structure (a circular buffer). Each worker thread has one such log-structure in order to avoid any synchronization-related overhead. Time is split in discrete quanta, each time quantum receiving a single time stamp to further reduce overhead and the access log size.

Each access log quantum has the following syntax: <TimeStamp, AccessRecord*>, while an AccessRecord is composed of <ObjectID, [Key | TupleID], FileOffset>. The ObjectID is an identifier that represents the relational object (data-structure) containing the tuple; it can be either a table or an index. The next field, [Key | TupleID], identifies the tuple accessed; we use the primary table key or indexing key, denoted as Key, for this purpose. However, the primary key can be replaced with the tuple ID in systems that support such identifiers. The last field, FileOffset, represents the memory offset of the tuple relative to the beginning of the relational data-structure. FileOffset is used during the processing phase to determine if the location of a tuple matches its access frequency (i.e. whether the position of a tuple matches its cold/hot status). The total length of an AccessRecord is 1B + 4B + 8B = 13B.

A quantum of 1s is used, as it is sufficiently fine-grained to differentiate between tuples with different access frequencies at the time granularity of interest (as mentioned, a cold tuple should be moved to a SSD if accessed less than once every tens to hundreds of minutes). We experimented with multiple sampling probabilities and decided to use a default sampling probability of 10% (we log at most 10% of the total tuple accesses), which offers a good tradeoff between logging overhead and precision.

(2) Write access logs. A single dedicated thread writes the access logs either to a file or to the network. The writer thread has three functions. Firstly, it decouples transaction execution from any other I/O or communication overhead. Secondly, it throttles the rate at which the logs are generated to ensure minimal system interference; if the access logs are not flushed fast enough and fill up, the backpressure will cause worker threads to reduce the sampling frequency. Thirdly, the design allows us to decouple the database engine from the later processing stage of the access logs.

(3) Process access logs. The access traces are processed offline to identify data placement inefficiencies. The analysis step ideally takes place on a separate server to avoid interference with ongoing query execution. The output of the processing step is: a) a memory budget for each data-structure (table or index), and b) two lists of hot/cold misplaced tuples for each object. The hot misplaced tuple list identifies which tuples are accessed often enough to justify storing them in memory but are not yet placed in the hot memory region of the object, while the cold misplaced tuple list represents the tuples that are the best candidates for eviction from the hot memory region of each object.

The access traces are processed as follows:

Step 1 - estimate access frequency. The access frequency is estimated using a variant of the Exponential Smoothing algorithm [17], although any other standard algorithm, such as ARC [18], CAR [5], or LRU-K [19] variants, can be used. The advantages of the Exponential Smoothing algorithm over other frequency estimation algorithms are twofold: i) it estimates access frequency more accurately, resulting in higher hit rates, and ii) it is efficient, as it supports parallelization and minimizes the number of access records processed (by scanning the access logs in reverse time order).

Step 2 - compute optimal memory allocation. The access frequencies alone are not sufficient to decide what data should be stored in memory – we also need to consider the size of each tuple to maximize the access density. We use the average tuple size corresponding to each object to normalize the tuple access frequency. After the normalization, the hot tuples of each object can be identified. The memory budget of each object is simply the number of tuples qualifying as hot multiplied by the average tuple size.

Step 3 - identify misplaced tuples. Finally, we identify misplaced tuples based on the memory allocation of each object. As mentioned, each tuple access contains the relative memory offset of the tuple. The relational data structures are allocated contiguously in memory (as described below), thus we can identify if the access frequency of a tuple matches its location by simply comparing the tuple's relative memory offset with the hot memory budget of the relational object.
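To make the record layout and the placement check concrete, the sketch below encodes the 13-byte AccessRecord and the Step 3 comparison. The field widths follow the 1B + 4B + 8B description above, but the type and function names are our own illustration, not VoltDB code.

```cpp
#include <cassert>
#include <cstdint>

// Packed 13-byte access record: 1B object ID, 4B key, 8B file offset.
#pragma pack(push, 1)
struct AccessRecord {
    uint8_t  objectId;    // relational object (table or index) containing the tuple
    uint32_t key;         // primary/indexing key identifying the tuple
    uint64_t fileOffset;  // tuple offset from the start of the object's region
};
#pragma pack(pop)
static_assert(sizeof(AccessRecord) == 13, "matches the 1B + 4B + 8B layout");

// A tuple's location is "hot" when its offset falls inside the object's
// hot memory budget (the pinned prefix of the object's region).
bool isHotLocation(uint64_t fileOffset, uint64_t hotBudgetBytes) {
    return fileOffset < hotBudgetBytes;
}

// Step 3: a tuple is misplaced when its hot/cold classification (from the
// frequency estimate) disagrees with its current location.
bool isMisplaced(bool classifiedHot, uint64_t fileOffset, uint64_t hotBudgetBytes) {
    return classifiedHot != isHotLocation(fileOffset, hotBudgetBytes);
}
```

Because the relational objects are allocated contiguously, Step 3 degenerates to a single comparison per sampled access, which keeps the offline analysis cheap.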

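As a sketch of Step 1, the estimator below applies textbook exponential smoothing to per-quantum access counts so that recent quanta dominate the score; the alpha value and the exact decay form are our assumptions, not necessarily the variant of [17].

```cpp
#include <cassert>
#include <vector>

// Exponentially smoothed access-frequency score for one tuple, given its
// sampled access counts per time quantum (oldest first). Higher scores mean
// hotter tuples; a recent burst outweighs an equally large old burst.
double smoothedFrequency(const std::vector<int>& countsPerQuantum, double alpha) {
    double est = 0.0;
    for (int c : countsPerQuantum)
        est = alpha * c + (1.0 - alpha) * est;  // newer quanta dominate
    return est;
}
```

Ranking tuples by this score and cutting at the per-object memory budget (Step 2) yields the hot set; everything below the cut is a candidate for the cold region.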
[Figure 3: Memory allocation and hot/cold data placement. (a) Memory allocation strategy: each object's contiguous region is split into a hot data region and a cold data region, each with its own free-space list; new inserts go to either region and the regions shrink/grow as needed. (b) Node re-organization for binary trees and hash indexes via tuple migration between the two regions.]

(4) Read optimal tuple placement. A dedicated thread reads the output of the processing step. It first reads and sets the target memory budget for each object, then incrementally reads the lists of misplaced tuples for each object as needed by the tuple re-organization.

(5) Re-organization. Each data-structure is allocated memory sequentially in the virtual address space of the database process using the facilities of the OS. The memory layout of a data-structure is presented in Figure 3(a). The beginning portion of the data file is logically considered "hot" and is pinned in memory. The rest of the file is left to contend for system RAM and uses the default OS paging policy. The OS is left to manage at least 10% of available memory to ensure there is enough memory for other processes.

The tuple re-organization is performed without changing the pointer structure or record format of the relational objects – we change only how memory is allocated, as depicted in Figure 3(b). The memory allocator of each data-structure has two free-space lists, one for the hot memory region and the other for the cold memory region. When inserting, the data-structure can request either hot or cold memory; when deleting, memory is reclaimed by comparing the memory offset with the hot/cold threshold and appending the memory region to the corresponding free-space list. A tuple movement can be logically thought of as a delete operation on the misplaced tuple followed by its re-insertion in the appropriate memory region.

Overall, the key insights of our technique are as follows: i) we decouple, to the extent possible, all data re-organization operations (logging, analysis, re-organization) from the critical path of query execution; ii) we optimize data locality such that hot tuples are stored compactly in a contiguous memory region; and iii) we restrict OS paging decisions by preventing code memory sections and the hot data regions of relational objects from being paged out.

4. EVALUATION

In this section, we evaluate experimentally how effective the data re-organization technique is in reducing OS paging overhead. We first demonstrate the end-to-end performance of a main-memory DBMS when running a TPC-C benchmark where the majority of the database resides on an SSD. We show that the data re-organization strategy and paging-related I/O have little impact on the overall system throughput or latency even when the database size outgrows the RAM size by a factor of more than 50×. As a TPC-C database has a predictable working set and data growth, we then explore the impact of a full workload change on the overall system performance through a micro-benchmark.

4.1 Experimental Setup

DBMS. We use the open-source commercial VoltDB [4] DBMS (version 2.7) running on Linux to implement the data re-organization technique. VoltDB uses data partitioning to handle concurrency and ensure scaling with the number of cores. Each worker thread has its own set of data structures (both tables and indexes) that it can access independently of the other worker threads. Thus, each logical table is composed of several physical tables, each worker thread being assigned one such physical table. In all experiments, the database is partitioned to match the number of available cores.

Memory management. The core VoltDB functionality is implemented in C++, while the front-end (network connectivity, serialization of results, query plan generation, DBMS API) is implemented in Java. The memory overhead of the front-end is independent of the database size and we found it crucial to keep the front-end related objects and code memory resident. We pin in memory all code mappings, front-end data structures, and the Java heap, and allow only the large relational data-structures (tables, hash and binary tree indexes) to be paged out to the SSD. We track which address ranges correspond to code sections or to other data-structures by examining the /proc process information pseudo-file system. For relational objects, we allocate memory sequentially in the virtual address space of the database process by using the mmap()-related system calls and pin/unpin pages in memory using the mlock()/munlock() system calls.

We note that the VoltDB core engine already optimizes memory allocation. VoltDB allocates memory in large contiguous chunks for each object in order to reduce malloc() call overhead and memory fragmentation. Each relational data-structure has its own memory allocation pool and memory chunks never contain data from more than one table or index, which ensures that virtual memory pages contain data of the same type. Therefore, the performance benefits of the data re-organization stem only from taking access frequency into account rather than from other memory management optimizations.

Query execution. Transactions run as stored procedures (no query optimization is performed at runtime), with query execution and result externalization being handled by separate threads. The benchmark client, responsible for submitting transactions, is placed on a different machine on the same 10Gb local network. For throughput results, we make sure the server is fully loaded by maintaining sufficient in-flight transactions and measure throughput every second in all experiments. The network bandwidth is never saturated and the network latency does not affect any throughput results due to the batch-style processing architecture.
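A minimal Linux-specific sketch of the memory-management scheme described above might look as follows: one contiguous anonymous mapping per relational object, with only the hot prefix pinned via mlock(). The struct and function names are illustrative, the sizes are tiny, and the error handling is deliberately minimal.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>  // mmap, munmap, mlock, munlock (Linux/POSIX)

// One contiguous virtual-memory region per relational object: the hot
// prefix is pinned, the cold remainder stays pageable by the OS.
struct ObjectRegion {
    void*  base = nullptr;
    size_t size = 0;
    size_t hotBytes = 0;
};

bool reserveObject(ObjectRegion& r, size_t size, size_t hotBytes) {
    r.base = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (r.base == MAP_FAILED) { r.base = nullptr; return false; }
    r.size = size;
    r.hotBytes = hotBytes;
    // Pin only the hot prefix. mlock() may fail under RLIMIT_MEMLOCK, in
    // which case the region simply remains unpinned and fully pageable.
    (void)mlock(r.base, hotBytes);
    return true;
}

void releaseObject(ObjectRegion& r) {
    if (r.base) {
        munlock(r.base, r.hotBytes);
        munmap(r.base, r.size);
        r.base = nullptr;
        r.size = r.hotBytes = 0;
    }
}
```

Growing or shrinking the hot region then amounts to adjusting the pinned prefix length, without relocating the mapping or rewriting any internal pointers.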
Hardware. In all our experiments we use a 4-socket Quad-Core AMD Opteron (16 cores in total) equipped with 64GB of physical DRAM (the amount of memory visible to the OS varies according to each experiment). The paging device is a single 160GB FusionIO PCIe SSD that can support, according to our measurements, up to 65,000 4kB random read IOPS and between 20,000 (long-term throughput) and 75,000 (peak throughput) 4kB random write IOPS.

[Figure 4: TPC-C throughput. Throughput (1000s of transactions/sec) vs. database size (GB) for the in-memory, data-reorganization, and default-paging configurations; the default-paging throughput collapses once the database size exceeds physical RAM.]

Table 1: TPC-C transaction latency (avg = average, σ = standard deviation)

              Server latency (µs)            Client latency (µs)
              No Paging     Paging           No Paging     Paging
Trx. Name     avg    σ      avg    σ         avg    σ      avg    σ
NewOrder      283    135    318    182       3270   5430   3301   5509
OrderStatus    82     26    126     58       2950   5486   2990   5499
Payment       149     52    183    107       3027   5458   3094   5492
Delivery      314    152    347    214       3317   5331   3358   5513
StockLevel    797    243    898    348       3531   5372   3611   5570

4.2 TPC-C Results

We use the TPC-C benchmark [3] to validate the overall efficiency of the data re-organization strategy. The scaling factor is set to 16 to match the number of cores and the database is partitioned on the warehouse ID (standard TPC-C partitioning), where each worker thread is responsible for executing all transactions related to its given warehouse. To maximize the number of growing tables and the database growth rate, we disable for the Delivery transaction the deletion of fulfilled orders from the NewOrder table.

4.2.1 Throughput

We measure the TPC-C transaction execution throughput in three scenarios. In the first case, we run the unmodified VoltDB database with all of the 64GB of RAM available to the OS (~62GB were useful for actual data storage). In the second scenario, we run TPC-C on the same unmodified engine but restrict the amount of physical memory to 5GB (~3GB useful memory) and turn on swapping to the SSD. In the third case, we run VoltDB modified with our data re-organization technique and maintain the same memory configuration (~3GB of useful memory); the memory-mapped data files are backed directly by the SSD without an in-between file-system.

As shown in Figure 4, the hot/cold data separation stabilizes throughput within 7% of the in-memory performance, although the actual data stored grows 50× larger than the amount of RAM available. On the other hand, the throughput of the unmodified engine drops by 66% when swapping to the SSD.

4.2.2 Latency

Paging potentially introduces I/O operations in the critical path of a transaction, and a relevant question is how SSD I/O changes transaction latency. We repeat the same TPC-C experiments, only this time we throttle the maximum number of transactions at 50% of the maximum throughput (to prevent transaction latency from including queuing times).

We show in Table 1 the transaction latency as measured from both the server and the client side, and compute for each transaction its standard deviation. As shown, paging increases the average transaction latency by ~50µs and its standard deviation by ~100µs. However, the variability increase is small relative to the total transaction execution time and becomes insignificant from the client perspective. The transaction latency, as experienced by the client, is dominated not by the actual transaction execution but rather by the software overhead of submitting a transaction to the server, by the overhead of the network I/O, and finally by transaction management on the server side. To put the values into perspective, the advertised I/O latency of the Fusion ioDrive SSD is 25µs and the network latency (round-trip wire latency plus switching) is ~30µs.

4.3 Response to workload changes

The TPC-C working set size is fixed (the size of the hot data does not change over time) and only 4 tables out of 9 are growing. In addition, accesses show a predictable time correlation: the newest tuples in the growing tables are also the hottest. We expect that in most workloads such a time correlation exists; nonetheless, we want to understand the stability of the system if the working set shifts in unpredictable ways or if the working set does not fully fit in RAM.

To answer these questions, we develop a micro-benchmark with a dual purpose: to measure performance for the case where the working set does not fully fit in memory, and to test how fast the system reacts to workload changes. We generate a database composed of 10^8 200B tuples indexed by a 4B integer primary key. The total memory footprint of the VoltDB DBMS (the table, the index, plus VoltDB memory overhead) is 26.2GB. The database size is constant and the available memory is set to 5GB (physical memory is ~20% of the total database size). We execute a single query that retrieves single tuples of the form:

SELECT * FROM R WHERE PK = <ID>;

The ID values for the select query are generated according to a Zipfian distribution with skew factor 1 (80% of accesses target 20% of the data). To measure how fast the system adapts to a workload change, we randomize the set of hot tuples while preserving the same access distribution and measure how throughput varies over time.
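The micro-benchmark's key generator can be sketched with a Zipf(s = 1) distribution over the primary keys; using std::discrete_distribution is our implementation choice and not necessarily how the paper's benchmark was built.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Draws point-query keys from a Zipf(s = 1) distribution: key i (0-based)
// is selected with probability proportional to 1 / (i + 1), so key 0 is
// the hottest. Randomly permuting key ranks would emulate the working-set
// shift used in Section 4.3.
std::vector<uint32_t> zipfianKeys(uint32_t numKeys, size_t numSamples,
                                  unsigned seed) {
    std::vector<double> weights(numKeys);
    for (uint32_t i = 0; i < numKeys; ++i)
        weights[i] = 1.0 / (i + 1);  // P(key i) proportional to 1 / rank
    std::mt19937 gen(seed);
    std::discrete_distribution<uint32_t> dist(weights.begin(), weights.end());
    std::vector<uint32_t> keys(numSamples);
    for (auto& k : keys)
        k = dist(gen);
    return keys;
}
```

With 10^4 keys and s = 1, the hottest 20% of keys attract roughly 85% of the accesses (the ratio of partial harmonic sums), close to the 80/20 skew quoted above.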

6.while preserving the same access distribution and measure In-memory Data-reorganization Uniform accesses how throughput varies over time. 80 We show in Figure 5 the impact of the working set shift on 70 the throughput of the system. We distinguish five distinct Queries /sec (1000s) 60 1 5 phases. The initial steady-state (phase 1) represents the long term behavior of the system when the table and index 50 data structures are fully optimized according to the access 40 pattern. Once the workload shift occurs, the performance of 30 4 the system is immediately a↵ected (phase 2) and drops by 20 96% from 75,000 to 4,500 queries/sec. The OS page replace- 2 3 ment algorithm first reacts to the workload shift (phase 3) 10 as the hottest pages from the cold file portions are identified 0 and bu↵ered in memory. As the access distribution contains 0 30 60 90 120 150 180 significant skew, even the small memory budget available to Time(s) the OS is enough to capture the most frequently accessed pages. As a result, throughput improves to around 25,000 Figure 5: Throughput for a shifting Zipfian working set. queries per second. The throughput remains stable (phase 4) until the data re-organization starts to relocate tuples (with the configurable delay of 60s). Once, the data relo- quency remain the same. In addition, if data is stored based cation commences, the throughput quickly stabilizes to the on its access frequency, the hot and cold datasets can be original level of 65,000 queries/sec (phase 5). compressed through di↵erent techniques. A similar idea is We note that the unmodified VoltDB system is unable to used in [8] where only the cold data is compressed. execute any queries. The high memory pressure combined with the lack of a clear working set cause a high OS paging activity that renders the database unresponsive. 6. CONCLUSIONS Our data re-organization proposal is generally applicable 5. 
RELATED WORK

There has been much work on exploring OLTP engine architectures optimized for main-memory access. Research prototypes include [6, 13, 14, 11, 15], and there are already a multitude of commercial main-memory offerings [2, 1, 4, 7]. Such engines assume that the entire database fits in memory, completely dropping the concept of secondary storage. We diverge from the memory-only philosophy by considering a scenario where a main-memory database may migrate cold data to secondary storage by using OS paging, and we investigate how to efficiently separate hot from cold data in order to improve the memory hit rate and reduce I/O.

The work closest to ours is HyPer's cold-data management scheme [8]. In HyPer, hot transactional data is identified and separated from cold data that might be referenced only by analytical queries. Once identified and separated, the cold data is compressed into a read-optimized format suitable for the analytical queries, reducing its memory footprint. However, HyPer's data re-organization scheme has a different goal, namely to reduce the overhead of creating new OLAP threads. This is achieved by minimizing the number of pages that are actively modified by the OLTP threads and by compressing and storing the OLAP-only data on large virtual memory pages. HyPer also takes a different approach to implementing the data re-organization: it keeps access statistics at a larger granularity (at the VM page level) by overriding the OS's virtual memory infrastructure, and it uses an online heuristic to separate hot from cold data as an integral part of query execution.

Compression techniques can reduce the memory footprint of a database and can help in supporting larger in-memory datasets. Compression, however, is an orthogonal topic. Compressing relational data without changing the tuple order inside the underlying data structure does not create a separation of cold from hot data, but rather can be viewed as a technique for expanding the available system memory.

The incentives of separating data based on the access frequency […]

[…] as it does not change the pointer structure of the physical data structures or the concurrency mechanism of the database engine. The hot/cold data separation can simply be thought of as a better way of allocating memory by taking access frequency into account, while the tuple re-organization can be modeled as short search/delete/insert transactions on individual tuples. At the same time, our proposal can be suboptimal: some data structures maintain invariants that might force accessing cold data on the way to retrieving hot tuples. For example, the hash index proposed in [16] uses linear hashing, i.e., it maintains tuples sorted on keys inside the hash buckets to allow incremental expansion of the hash index. Our data re-organization technique cannot distinguish between logically cold and hot data, as it relies on physical accesses; cold tuples in a hash bucket accessed on the way to the target hot tuple are considered as hot as the target tuple. Possible solutions for such cases could be either to change the original data structures or to use two different data stores, one for the hot and the other for the cold data. Unfortunately, both approaches have deeper architectural implications.

CONCLUSIONS

We proposed a simple solution for a main-memory database to efficiently page cold data to secondary storage. Our technique logs accesses at the tuple level, processes the access logs offline to identify hot data, and re-organizes the in-memory data structures accordingly to reduce paging I/O and improve memory hit rates. The hot/cold data separation is performed on demand and incrementally through careful memory management, without any change to the underlying data structures. Our experimental results show that it is feasible to rely on the OS virtual memory paging mechanism to move data between memory and secondary storage, even if the database size grows much larger than main memory. In addition, the hot/cold data re-organization offers reasonable performance even in cases where the working set does not fully fit in memory, and it can adapt to full workload shifts within a time frame of a few minutes.
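The log/analyze/re-organize pipeline summarized above can be illustrated with a minimal sketch. This is not the paper's implementation: the page capacity, resident-memory budget, toy trace, and plain access-count ranking are all simplifying assumptions made here for illustration.

```python
# Sketch of hot/cold separation: log tuple accesses, rank tuples offline,
# then re-pack them so hot tuples share virtual-memory pages.
# PAGE_CAP, MEMORY_PAGES, and the trace are hypothetical toy values.
from collections import Counter

PAGE_CAP = 4       # tuples per VM page (assumed)
MEMORY_PAGES = 2   # pages the OS can keep resident (assumed)

# Tuple-level access trace: ids 0 and 1 are hot, 5/9/13 lukewarm, rest cold.
trace = [0, 1, 0, 1, 0, 1, 5, 0, 1, 9, 0, 1, 0, 1, 13, 0, 1]

def paginate(order):
    """Pack tuple ids into fixed-capacity pages in the given placement order."""
    return [order[i:i + PAGE_CAP] for i in range(0, len(order), PAGE_CAP)]

def hit_rate(pages, trace):
    """Fraction of accesses that land in the resident (first) pages."""
    resident = {t for page in pages[:MEMORY_PAGES] for t in page}
    return sum(t in resident for t in trace) / len(trace)

# 1) log accesses, 2) process the trace offline, 3) place hottest tuples first.
counts = Counter(trace)
hot_first = sorted(range(16), key=lambda t: -counts[t])

before = hit_rate(paginate(list(range(16))), trace)  # original tuple layout
after = hit_rate(paginate(hot_first), trace)         # hot/cold separated
assert after >= before  # separation packs the working set into fewer pages
```

With the hot tuples co-located, the same memory budget covers the entire working set, so paging I/O drops without any change to the logical data structures.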

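The linear-hashing caveat raised in the related work (cold tuples scanned on the way to a hot tuple look hot in a physical access trace) can also be sketched. The bucket layout and names below are illustrative only, not the index implementation of [16].

```python
# Sketch: physical-access logging over a key-sorted hash bucket.
# Keys 3 and 7 are logically cold, but every lookup of hot key 9 touches them.
access_log = []  # tuple-level trace, appended on every physical access

def bucket_lookup(bucket, key):
    """Scan a key-sorted bucket, logging each tuple physically touched."""
    for k, value in bucket:      # tuples kept sorted on key inside the bucket
        access_log.append(k)     # logged as accessed, whether hot or cold
        if k == key:
            return value
        if k > key:              # sorted order allows early termination
            break
    return None

bucket = [(3, "cold"), (7, "cold"), (9, "hot")]
for _ in range(100):             # workload only ever asks for key 9
    bucket_lookup(bucket, 9)

# The trace now shows keys 3 and 7 exactly as often as key 9, so a
# physical-access-based re-organization would classify all three as hot.
```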
REFERENCES

[1] IBM SolidDB. Available: http://www.ibm.com.
[2] Oracle TimesTen In-Memory Database. Information available at: http://www.oracle.com.
[3] Transaction Processing Performance Council TPC-C Standard Specification. Available: http://www.tpc.org/tpcc/spec/tpcc_current.pdf.
[4] VoltDB In-Memory Database. Available: http://www.voltdb.com.
[5] S. Bansal and D. S. Modha. CAR: Clock with adaptive replacement. In FAST, 2004.
[6] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, 2005.
[7] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. SAP HANA database: data management for modern business applications. ACM SIGMOD Record, 2012.
[8] F. Funke, A. Kemper, and T. Neumann. Compacting transactional data in hybrid OLTP & OLAP databases. In VLDB, 2012.
[9] Fusion IO. Technical specifications. Available: http://www.fusionio.com/PDFs/Fusion%20Specsheet.pdf.
[10] J. Gray and F. Putzolu. The 5 minute rule for trading memory for disc accesses and the 10 byte rule for trading memory for CPU time. 1987.
[11] M. Grund, J. Krüger, H. Plattner, A. Zeier, P. Cudre-Mauroux, and S. Madden. Hyrise: a main memory hybrid storage engine. In VLDB, 2010.
[12] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. OLTP through the looking glass, and what we found there. In SIGMOD, 2008.
[13] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. Jones, S. Madden, M. Stonebraker, Y. Zhang, et al. H-Store: a high-performance, distributed main memory transaction processing system. In VLDB, 2008.
[14] A. Kemper and T. Neumann. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In ICDE, 2011.
[15] P.-Å. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling. High-performance concurrency control mechanisms for main-memory databases. In VLDB, 2011.
[16] P.-Å. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling. High-performance concurrency control mechanisms for main-memory databases. In VLDB, 2011.
[17] J. Levandoski, P.-Å. Larson, and R. Stoica. Identifying hot and cold data in main-memory databases. In ICDE, 2013.
[18] N. Megiddo and D. S. Modha. ARC: A self-tuning, low overhead replacement cache. In FAST, 2003.
[19] E. J. O'Neil, P. E. O'Neil, and G. Weikum. The LRU-K page replacement algorithm for database disk buffering. 1993.
[20] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In VLDB, 2007.