A Hybrid SCM-DRAM Storage Engine for Fast Data Recovery

Storage Class Memory (SCM) has the potential to significantly improve database performance. This potential has been well documented for throughput [4] and response time [25, 22]. In this paper we show that SCM has also the potential to significantly improve restart performance, a shortcoming of traditional main memory database systems. We present SOFORT, a hybrid SCM-DRAM storage engine that leverages full capabilities of SCM by doing away with a traditional log and updating the persisted data in place in small increments. We show that we can achieve restart times of a few seconds independent of instance size and transaction volume without significantly impacting transaction throughput.
展开查看详情

1. SOFORT: A Hybrid SCM-DRAM Storage Engine for Fast Data Recovery Ismail Oukid Daniel Booss Wolfgang Lehner Technische Universität SAP AG Technische Universität Dresden & SAP AG daniel.booss@sap.com Dresden i.oukid@sap.com wolfgang.lehner@tu- dresden.de Peter Bumbulis Thomas Willhalm SAP AG Intel GmbH peter.bumbulis@sap.com thomas.willhalm@intel.com ABSTRACT media. Examples of SCM are Phase Change Memory [16], Spin Storage Class Memory (SCM) has the potential to significantly im- Transfer Torque RAM [12], Magnetic RAM [9], and Memristors [26]. prove database performance. This potential has been well docu- Continuing advances in NVM technology hold the promise that SCM mented for throughput [4] and response time [25, 22]. In this paper will become a reality in the near future. Current NVM technologies we show that SCM has also the potential to significantly improve provide read and write latencies within an order of magnitude of DRAM, with writes being noticeably slower than reads. restart performance, a shortcoming of traditional main memory database systems. We present SOFORT, a hybrid SCM-DRAM stor- In this paper we present SOFORT, a hybrid SCM-DRAM storage age engine that leverages full capabilities of SCM by doing away engine that speeds up restarts by taking advantage of the properties with a traditional log and updating the persisted data in place in of SCM to operate on the persisted data directly without having small increments. We show that we can achieve restart times of a to first cache it in DRAM. SOFORT also speeds up recovery by few seconds independent of instance size and transaction volume doing away with a traditional log and updating the persisted data without significantly impacting transaction throughput. in place in small increments. In this paper we show that, with accepted assumptions for SCM performance, SOFORT can achieve 1. INTRODUCTION this without compromising transactional throughput. To do so, we propose a novel programming model for persistent memory. We Availability guarantees form an important part of many service level carefully choose which structures to put on SCM and which ones agreements (SLAs) for production database systems [7]: minimiz- to keep in DRAM. This flexibility enables performance/restart time ing database downtime has economic as well as usability benefits. trade-offs since DRAM is faster than SCM and the structures that Database systems crash for a variety of reasons including software are not persisted in real time on SCM need to be rebuilt at restart bugs, hardware faults and user errors. Many of these conditions time. To build SOFORT, we have designed persistent, single-level, are transient: for these, restarting - after logging the error - is a lock-free, and concurrent data structures that are self-sufficient to reasonable approach to recovery. In these cases database restart time recover in a consistent way relative to the state of the database. has a significant and direct impact on database availability. To evaluate SOFORT, we compare it with Shore-MT on ramdisk In-memory DBMSs recover by rebuilding DRAM-based data struc- using the TATP benchmark [2]. Our evaluation shows that SOFORT tures from a consistent state persisted on durable media. The per- has up to 4 times higher throughput than Shore-MT on ramdisk. sisted state typically consists of a copy (checkpoint) of the database As we do not know yet the final latencies of SCM, we consider a state at a particular instant in time and a log of subsequent updates. range of latencies and report incurred performance variations using Recovery consists of reloading portions of the checkpointed state, special hardware that enables the tuning of emulated SCM latency. applying subsequent updates and then undoing the effects of unfin- We show that SOFORT stays competitive even in a high latency SCM ished transactions. Database restart time not only includes the time environment, recovers in seconds and is resilient to user contention. to recover the consistent state as it existed before the crash but also The paper is structured as follows: In Section 2, we discuss the time to reload any data required by the current workload. This the design of SOFORT. Section 3 describes our persistent memory second component can be substantial for OLAP workloads. programming model. We describe the core operations of SOFORT Storage Class Memory (SCM) is Non-Volatile Memory (NVM) in Section 4. We evaluate SOFORT in Section 5 and discuss related with latency characteristics close to that of DRAM and density, dura- work in Section 6. Finally, Section 7 concludes the paper and bility and economic characteristics comparable to existing storage outlines future work. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed 2. SOFORT DESIGN for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the SOFORT is a main-memory transactional storage engine intended author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or for mixed OLAP and OLTP workloads. To achieve good OLAP per- republish, to post on servers or to redistribute to lists, requires prior specific permission formance, tables are stored column-wise. While SOFORT has been and/or a fee. Request permissions from Permissions@acm.org. implemented as a column store, the same principles also apply for DaMoN ’14, June 22-27 2014, Snowbird, UT, USA row stores. As in other column-stores, such as SAP HANA, data Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2971-2/14/06 ...$15.00. is split into a larger read-optimized, read-only main storage and http://dx.doi.org/10.1145/2619228.2619236. a smaller write-optimized, read-write delta storage, with periodic

2. Figure 2 is an example of a SOFORT table of employees and their offices. It illustrates how multiple versions are managed. When a row is updated (in the example, Ingo changes office), the current version of the row is invalidated by updating the DTS of the corre- sponding MVCC entry, and a new version of the row is created, with its corresponding CTS equal to the DTS of the deleted row. SOFORT is latch-free and uses atomic instructions only, except when resizing tables, where only writers are blocked. All data structures are concurrent and no centralized lock is needed. This makes SOFORT highly scalable to the number of cores and resilient to user contention. At the moment, SOFORT supports only statement level isolation. 3. PROGRAMMING MODEL First, we discuss what persistent primitives current hardware offers. Then, we give an overview of SOFORT’s memory management. Last, we detail the recovery mechanism of SOFORT. Figure 1: Overview of SOFORT. 3.1 Persistence Primitives In our proof-of-concept, SCM is connected via PCIe. However merges from delta to main to keep the size of the delta bounded. in the future, SCM may be tightly integrated into the memory sub- Currently only the delta is implemented in SOFORT; implementing system and may use the CPU’s cache interface. Independent from the main is straightforward as it is read-only. Given our architectural the implementation, enforcing persistency is critical. Using the choices, we expect SOFORT to have high OLAP performance. existing architecture, one could envision an implementation using SOFORT uses Multi-Version Concurrency Control (MVCC) [14, non-temporal stores, CPU flushing instructions and memory barriers 15] and dictionary encoding. In particular, every column has its own [3]. The primitives we use are: dictionary. Dictionary codes are integers called ValueIDs. Figure 1 gives an overview of SOFORT. An instance of SOFORT consists CLFLUSH: Flushing Instruction. Invalidates the cache line that of multiple persistent tables (PTables) and a persistent array of contains a given linear address. Writes back this cache line if currently running transaction objects (TRX array). In the current it is inconsistent with memory. implementation, the size of the TRX array is fixed and bounds the MOVNT: Store Instruction. Bypasses the cache and writes directly maximum number of concurrent transactions. A transaction object to memory. holds information related to the state of the transaction including a Modern CPUs implement complex out-of-order execution, where transaction time-stamp (TTS). A transaction ID (TXID) is the index for example, a flushing instruction can be reordered with previous of a transaction in the TRX array. Each PTable encompasses several instructions, leading to the eviction of a cache line that may not take persistent columns (PColumns) and a persistent MVCC array. into account the new writes that should have happened before the There is one MVCC entry per row of a table. An MVCC entry flushing instruction. To order memory instructions, we use memory consists of a commit time-stamp (CTS) and a deletion time-stamp barriers. On x86 architecture, we use [3]: (DTS) which are both initialized to ∞. Time-stamps are logical and generated starting from sizeof (TRX array) by a global counter. SFENCE: Memory Barrier. Performs a serializing operation on all When a transaction starts, it is assigned a TTS which is equal to the store-to-memory instructions that were issued prior to this current logical time. The time-stamp counter is incremented for instruction. every commit. A CTS or DTS is interpreted as a TXID if it is lower MFENCE: Memory Barrier. Performs a serializing operation on all than or equal to the size of the TRX array. Based on its MVCC entry memory instructions (load and store) that were issued prior to and relative to a transaction, a row is: this instruction. Visible, if the TTS is greater than or equal to the CTS and lower For example, let persistentInt be a persistent integer variable. We than the DTS. If the DTS is a transaction ID (TXID), then want to persistently write 1 to this variable. To do so, we have to we compare with the commit time-stamp of the transaction execute the following sequence of instructions: pointed to by that TXID. persistentInt = 1; Invisible, if it is not visible. MFENCE(); CLFLUSH(&persistentInt); Locked, if its DTS is a TXID. MFENCE(); A PColumn is append-only and contains a persistent dictionary array The first memory barrier guarantees that value 1 has effectively been (PDict.) that maps ValueIDs to Values, supported by a dictionary written to persistentInt and that the following flushing instruction index that maps Values to ValueIDs. Since not all data structures will write back the new value of persistentInt. The cache line need to be persistent, we carefully choose which ones to put on SCM flushing instruction invalidates the cache line where persistentInt and which ones we keep on DRAM. In general, column structures are is held and writes it back to its location in SCM. This implies that more bandwidth-bound while tree structures are more latency-bound. persistent variables must not be split between two cache lines. The Since the dictionary index is heavily accessed and is latency-bound, last memory barrier ensures that the cache line flushing instruction we keep it on DRAM for better performance. Hence, we need to is not reordered with the next instructions. This is required to order reconstruct it from the dictionary array at restart time. All the other persistent write operations. In the following sections, we use the data structures are persistent in SCM. suffix “Flush” to indicate a persistent write operation.

3. Figure 2: Example of a SOFORT PTable. The office of Ingo is updated from B1.01 to B1.03. Figure 3: SOFORT Persistent Memory Pointer PMPtr. Memory controllers on modern hardware have write buffers that Figure 4: SOFORT Recovery Diagram. can hold a write even after it has been flushed. To overcome this issue, a mechanism needs to be put in place so that the memory to re-convert PMPtrs to new valid work pointers. The PMAllocator controller drains its buffer on power failures. is also persistent. It maintains a persistent memory page counter to know how many pages it has created. It also stores meta-data 3.2 Persistent Memory Management at the beginning of every allocated segment. At restart time, the Persistent memory is managed using a file system. User space access PMAllocator re-opens its persistent memory pages and reconstructs to persistent memory is granted via memory mapping using mmap. the mapping of its persistent memory to virtual memory. The mapping behaves like mmap for traditional files, except that To perform recovery, we need to keep track of a persistent memory the persistent data is directly mapped to the virtual address space, entry point. One entry point is sufficient for the whole storage engine instead of a DRAM-cached copy. since every structure is encapsulated into a larger structure up to the When a program crashes, its pointers become invalid since the full engine. A persistent entry point can simply be two PMPtrs: the program gets a new address space when it restarts. This implies that first points to SOFORT’s persistent structures and the second to the these pointers cannot be used to recover persistent data structures. To PMAllocator persistent object. The entry point is kept in SCM on a solve this problem, we propose a new pointer type, denoted Persis- small persistent memory page with a fixed ID. tent Memory Pointer (PMPtr). As illustrated in Figure 3, it consists 3.3 Recovery Mechanism of a base, which is a persistent memory page ID, and an offset that Figure 4 shows how recovery at restart is performed from a single relates to the start of the allocated block. To manage persistent mem- persistent entry point. Once the PMAllocator has recovered its ob- ory, we propose a persistent memory allocator (PMAllocator) that jects, SOFORT recovers its tables in parallel. Every data structure has provides a mapping of persistent memory to virtual memory which a reload function which checks the sanity of its state, restores work enables the conversion of PMPtrs to regular pointers. We denote pointers from PMPtrs, and recovers from problematic situations, “work pointers” regular pointers that are conversions of PMPtrs. In such as crashing in the middle of a table resize. In every transaction, this context, [27] have proposed “pointer swizzling at page fault updating MVCC information is the last operation to perform. Since time” where disk pointers are converted to virtual memory pointers our data structures are append-only, we do not need to do anything at page fault time. While our pointer conversion principle is the since MVCC entries attest the validity and visibility of every row. same as in [27], we convert pointers only at restart time and not for The only time consuming operation left is rebuilding the dic- every page fault. tionary indexes. We made the choice to keep them volatile for Another solution would be to recover persistent memory in the performance reasons, but we could make them persistent on SCM same virtual address space, in which case regular pointers would using a persistent map structure, such as the CDDS-Btree [23]. remain valid. However, the operating system does not guarantee Therefore, recovery time depends on the size of the dictionary in- that the previous virtual address segments will remain free. For dexes. As demonstrated in Section 5, although we rebuild dictionary example, using the option MAP_FIXED of mmap allows to recover indexes, SOFORT total recovery time takes only a few seconds. data in the same address space but will unmap any mapped files in the specified memory region, eventually leading to undesirable behaviors. Besides, using fixed addresses for data is a bad idea for 4. SOFORT ALGORITHMS security reasons. In this section, we detail some core operations of SOFORT in order The PMAllocator uses big persistent memory pages that are cut to demonstrate how consistency and durability are achieved. The into smaller segments for allocation. It maps these pages to virtual primary challenge is to make the operations failure atomic: regard- memory to enable the conversion of PMPtrs to work pointers. At less of crash condition, we can recover the database to a consistent restart time, a new mapping to virtual memory is created, allowing state. We exclude read operations since they do not change the

4.database state and thus, read operations are not a challenge for consistent recovery. Due to space constraints, we discuss only the Update operation. The Insert and the Delete operations are detailed in Appendix A. 4.1 Update Algorithm 1 Update(TableName,Key,NewRow) 1: // Get table pointer from table name Figure 5: System Setup Overview. 2: ptab = find(TableName) update operations need to be rolled back. 3: 4: // Get RowID from key value 5. EVALUATION 5: DeleteIdx = ptab–>getLatestVisible(Key) In this section, we give an overview of our system setup. Afterwards, 6: we evaluate SOFORT in two steps. First, we evaluate SOFORT’s 7: // Reserve a row and get back its index OLTP performance using the TATP benchmark. Second, we evaluate 8: InsertIdx = ptab–>ReserveRow() SOFORT’s restart time using a micro-benchmark. 9: 10: // Update the transaction object with InsertIdx and DeleteIdx 5.1 System Setup 11: UpdateTransFlush(DeleteIdx, InsertIdx) We assume a hybrid SCM-DRAM environment where both DRAM 12: and SCM can be accessed using load-store semantics. We envision 13: // Try lo lock the row in the future SCM maybe fully integrated into the memory subsys- 14: if !AtomicLock(MVCCArray[DeleteIdx]) then tem where SCM can be accessed using Load/Store semantics. In 15: Abort this paper, we attempted to emulate this scenario. We used a sys- 16: end if tem equipped with a dual socket Intel R Xeon R E5 processor @ 17: // Persistently insert new row 2.60GHz, with 20MB of L3 cache and 8 physical cores per socket. 18: ptab–>pushBackFlush(InsertIdx, NewRow) Figure 5 gives an overview of the system setup. During all our 19: tests, Intel HyperThreading Technology was disabled. In order to 20: // Commit by updating the two MVCC entries avoid NUMA effects in the performance measurements, we bind 21: CommitUpdate(DeleteIdx, InsertIdx) the process of the benchmark to run to a single socket. (NUMA 22: Flush MVCCArray[DeleteIdx] effects could be of the same order of magnitude as the higher SCM 23: Flush MVCCArray[InsertIdx] latencies.) The DRAM latency was measured with Intel Memory Latency Checker[24] as 90ns. Since different SCM media expose different latencies, we used multiple SCM latencies during our tests. The update operation is effectively a Delete operation followed SCM is managed by Persistent Memory File System (PMFS), a file by an Insert operation. Algorithm 1 describes the core steps of system optimized for byte-addressable non-volatile memory [10]. this operation. First, we get a pointer to the target table using a Memory mapped PMFS files are not buffered in DRAM. This ensures mapping from table names to table IDs (line 2). Then, we look up that applications are given direct access to SCM. The PMAllocator the indexes of the target table to get the latest visible row where the uses 1GB PMFS files, each corresponding to a logical memory page. key value (Key) occurs (line 5). Afterwards, we reserve a new row The hardware uses a custom BIOS to emulate the different higher in the table and get back the corresponding row index (line 8). If latencies of SCM. One limitation of this emulation is that the tuned needed, a resize of the table is triggered. The index of the new row latency is read-write symmetric, while SCM is expected to have is computed with an atomic increment of a counter maintained by asymmetric read-write latencies, with writes slower than reads. the table. No other operation will write neither to the row pointed by this index nor to the corresponding MVCC entry. Thus, the operation 5.2 Throughput is latch-free and thread-safe. Afterwards, we persistently update To measure the throughput of SOFORT, we have implemented the the transaction object with the index of the row to be inserted and TATP benchmark, a simple but realistic transactional benchmark that the index of the row to be deleted (line 11). Then, we execute an simulates a telecommunication application [2]. It is composed of atomic compare-and-swap operation to try to lock the row to be 80% read transactions and 20% write transactions. TATP measures deleted by setting its DTS to the transaction ID (line 14) and abort Maximum Qualified Throughput, which is the number of successful the transaction if the atomic operation fails (line 15). If it succeeds, transactions per second. We run two sets of experiments to measure we persistently append the row to to be inserted to the table (line throughput. The first one is read only and consists of running 18). The transaction commits by fetching a commit time-stamp and GetSubData, one of the TATP queries. The second one is running assigning it to the CTS of the row to be inserted and the DTS of the the full TATP benchmark. In both cases, we vary the number of row to be deleted (21). Finally, we persist the commit by flushing users from 1 to 16. We also vary the latency of SCM, from 200 ns to the updated MVCC entries (lines 22-23). 700 ns. To provide an upper bound of SOFORT’s performance, we The only challenging failure scenario is a crash between lines also run SOFORT on shared memory, i.e. on DRAM with a latency 22 and 23 in which case the commit might or might not have been of 90 ns. We run TATP with a scale factor of 100 which corresponds propagated to persistent memory. To address this scenario, during to an initial population of 1M subscribers. recovery, we rollback the update transactions that were active at the To provide a baseline for SOFORT’s throughput performance, time of failure. To do so, we visit every TRX array entry and do we also experiment with Shore-MT [13]. We use Shore-MT on the following for every update transaction: if the CTS of the row to ramdisk (a Temporary File System mount) (Shore-MT-ramdisk) to insert and the DTS of the row to delete are valid time-stamps, then get an upper bound of Shore-MT’s performance. We have also tuned do nothing. Otherwise, reset both the CTS of the row to insert and the configuration of Shore-MT baseline to get the highest possible the DTS of the row to delete to ∞. As shown in Appendix A, only throughput.

5. SOFORT-shm SOFORT-200ns SOFORT-300ns SOFORT-700ns Shore-MT-ramdisk 6 6 [x10 TXs/s] [x10 TXs/s] [% of throughput drop] 2 2 100 Max. Qualified Throughput Perf. diff. to SOFORT-shm 80 1.5 1.5 Throughput 60 1 1 40 0.5 0.5 20 0 0 0 0 4 8 12 16 0 4 8 12 16 0 4 8 12 16 #users #users #users (a) TATP GetSubData (b) TATP Mix (c) Higher latencies impact on TATP Mix Figure 6: TATP benchmark SF 100 throughput results. SOFORT outperforms Shore-MT and resists better to user contention. SOFORT stays competitive even in a high read-write latency (700 ns) SCM environment. SOFORT-SkipList SOFORT-HashMap Table 1: Breakdown of SOFORT’s Restart Time. DB Size=100M rows. Dict. Size=10M entries/column. time[s] time[s] 8 8 Configuration Dict. Indexes Rebuild. Rest of Recov. Recovery Time Recovery Time 6 6 SOFORT-HashMap 1848 ms 74ms 4 4 SOFORT-SkipList 7599ms 58ms 2 2 [x105 TXs/s] [x105 TXs/s] 0 0 20 40 60 80 100 2 4 6 8 10 6 6 Throughput Throughput DB Size(M rows) Dict Size(M entries/col) 4 4 (a) Dict. size fixed(10M entries/col) (b) DB size fixed(10M rows) 2 2 Figure 7: SOFORT Restart Time. Recovery time is dominated by 0 0 10 20 30 40 50 0 0 10 20 30 40 50 the rebuild time of dictionary indexes. time[s] time[s] Figure 6 illustrates throughput results for the experiments de- (a) SOFORT-PMFS-200ns (b) SOFORT-Disk scribed above. SOFORT on shared memory (SOFORT-shm) achieves up to 1.8M transaction per second for the read benchmark (Fig- Figure 8: SOFORT Recovery. TATP throughput with 4 users. The ure 6a), and up to 1.1M transaction per second for the full TATP database is crashed at second 15. benchmark (Figure 6b). We also notice that even in a high SCM latency environment (700 ns), SOFORT stays competitive and still skip list map (default configuration) whereas the other one uses outperforms Shore-MT on ramdisk. Figure 6c highlights the im- the Intel Threading Building Blocks hash map [1]. In Figure 7b, pact of higher SCM latencies on SOFORT’s TATP throughput relative we vary the database size while keeping dictionaries size fixed. to using shared memory. We observe that for a latency of 200 ns, In Figure 7a, we keep a fixed database size and vary dictionaries which is more than the double of shared memory latency (90 ns), size. The first observation is that SOFORT-HashMap is faster to the performance drop is only approximately 20%. This results from recover than SOFORT-SkipList. We also observe that recovery time the hybrid design of SOFORT, where structures can be either volatile increases linearly with the dictionaries size. This shows the price to (hence, faster access) or non-volatile. Additionally, SOFORT resists pay at recovery time for keeping a volatile structure for dictionary user contention regardless of SCM latency. indexes, although it enables better throughput. Besides, we notice The throughput of Shore-MT on ramdisk drops to almost zero that the restart time does not depend on the database size for a given with more than 8 users due to resource contention. However, with dictionaries size. Dora, Shore-MT resists to a certain extent user contention [19]. Table 1 represents the breakdown of the restart time into two parts: the time spent in rebuilding the dictionary indexes and the 5.3 Recovery time spent doing the rest of the recovery process. We observe that We define recovery time as the time it takes the database to recover rebuilding dictionary indexes largely dominates total restart time. and answer a first simple select query. To measure this time, we To illustrate recovery time, we run TATP with 4 users on SOFORT have implemented a micro-benchmark. The database consists of a and crash the database at second 15, as shown in Figure 8. To single table of 4 integer columns. We change the database size by provide a baseline, we have implemented SOFORT on disk using varying the number of rows, and the dictionaries size by varying the memory mapping. This means that data is kept on DRAM and is number of dictionary entries, i.e. the number of distinct values per backed by files on disk. Since SOFORT on disk is not crash safe, column. In the following experiments, SCM latency is set to 200 ns. we call sync to flush all the dirty memory mapped pages to disk We compare two configurations of SOFORT. They differ in the before crashing the system. Figure 8a shows SOFORT’s throughput data structure used for the dictionary index. One uses a lock-free on PMFS before, during, and after recovery. SOFORT recovers in

6.approximately 1 second and delivers right away full throughput [2] Telecommunication Application Transaction Processing performance since it does not need to warm up. Figure 8b shows the (TATP) Benchmark. throughput of SOFORT on disk, where recovery takes approximately http://tatpbenchmark.sourceforge.net/. 18 seconds, even though SOFORT on disk does not have a real [3] Intel R Architecture Instruction Set Extensions Programming persistence. In conclusion, SOFORT on SCM offers a good trade-off Reference. Technical report, 2014. http: as it achieves high OLTP throughput performance and recovers from //software.intel.com/en-us/intel-isa-extensions. system failures in seconds. [4] K. A. Bailey, P. Hornyack, L. Ceze, S. D. Gribble, and H. M. Levy. Exploring storage class memory with key value stores. 6. RELATED WORK In INFLOW 2013. To our knowledge, we are the first to propose a SCM-based trans- [5] S. Chen, P. B. Gibbons, and S. Nath. Rethinking database actional column-store. Bailey et al. [4] presented Echo, a persis- algorithms for phase change memory. In CIDR 2011. tent key-value storage system with snapshot isolation. Contrary [6] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. to SOFORT which assumes that DRAM and SCM are of the same Gupta, R. Jhala, and S. Swanson. Nv-heaps: Making memory hierarchy level, Echo adopts a two-level memory design persistent objects fast and safe with next-generation, with a thin layer of DRAM on top of SCM. Other works use SCM to non-volatile memories. SIGPLAN Not., 47(4), Mar. 2011. optimize OLTP durability management [20, 11]. Their focus is on [7] R. Colledge. SQL Server 2008 administration in action. disk-based, row-store databases while our focus is on main-memory Manning, Greenwich, CT, 2010. column-stores. Additionally, Chen et al. [5] discuss how B+-trees [8] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, and hash joins can benefit from SCM. Venkataraman et al. [23] D. Burger, and D. Coetzee. Better i/o through contributed Consistent and Durable Data Structures that leverage byte-addressable, persistent memory. In SOSP 2009. the non-volatility of SCM using versioning. [9] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen. Circuit To manage SCM, Condit et al. [8] presented BPFS, a carefully and microarchitecture evaluation of 3d stacking magnetic ram designed, high performance transactional file system that runs on (mram) as a universal memory replacement. In DAC 2008. top of SCM. Nayaranan et al. [18] proposed Whole System Persis- [10] S. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, tence (WSP), where data is flushed only on power failures using R. Sankaran, J. Jackson, and D. Subbareddy. System software the residual energy of the system. However, they do not consider for persistent memory. In EuroSys 2014. software failures. Recent works have also looked at how to architect [11] R. Fang, H.-I. Hsiao, B. He, C. Mohan, and Y. Wang. High SCM. Qureshi et al. [21] discussed how to architect SCM-based high performance database logging using storage class memory. In performance main-memory systems and showed that a buffer of ICDE 2011. 1GB of DRAM can hide the higher latency of 32GB of SCM. More- [12] M. Hosomi, H. Yamagishi, T. Yamamoto, K. Bessho, Y. Higo, over, Lee et al. [17] discussed how to use Phase Change Memory K. Yamane, H. Yamada, M. Shoji, H. Hachino, C. Fukumoto, as a scalable DRAM alternative. Zhao et al. [28] proposed Kiln, H. Nagao, and H. Kano. A novel nonvolatile memory with a persistent memory design that employs a non-volatile last level spin torque transfer magnetization switching: spin-ram. In cache and SCM to offer persistent in-place updates without logging IEDM 2005. or copy-on-write. Finally, other works have proposed interfaces to [13] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and ease the use of SCM as persistent memory for programmers [6, 25]. B. Falsafi. Shore-mt: A scalable storage manager for the multicore era. In EDBT 2009. 7. CONCLUSION AND OUTLOOK [14] H. T. Kung and J. T. Robinson. On optimistic methods for In this paper, we investigated how to design a columnar transactional concurrency control. ACM Trans. Database Syst., 6(3), 1981. storage engine that leverages full capabilities of SCM. We proposed [15] P.-A. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, SOFORT, a log-less, single-level columnar data store with high OLTP and M. Zwilling. High-performance concurrency control throughput performance and fast data recovery. SOFORT relies on mechanisms for main-memory databases. Proc. VLDB persistent, self-managed data structures and a novel programming Endow., 5(4), Dec. 2011. model for persistent memory to achieve its goals. Besides, SOFORT [16] B. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, exhibits competitive performance, even in high-latency SCM envi- O. Mutlu, and D. Burger. Phase-change technology and the ronments. future of main memory. Micro, IEEE, 30(1), Jan 2010. We believe that our approach can also benefit main-memory row- [17] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting stores. For future work, we plan to extend the concurrency control phase change memory as a scalable dram alternative. mechanism of SOFORT to support arbitrarily long transactions, de- SIGARCH Comput. Archit. News, 37(3), June 2009. sign persistent index structures and experiment with write-through [18] D. Narayanan and O. Hodson. Whole-system persistence. In caching policy. ASPLOS XVII, 2012. [19] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Acknowledgement Data-oriented transaction execution. Proc. VLDB Endow., Special thanks go to Thomas Kissinger, Anisoara Nica, Norman 3(1-2), Sept. 2010. May and the SAP HANA PhD students for their helpful suggestions [20] S. Pelley, T. F. Wenisch, B. T. Gold, and B. Bridge. Storage and discussions. We are also grateful to Porobic Danica for her help management in the nvram era. PVLDB, 7(2), 2013. with Shore-MT. [21] M. K. Qureshi, V. Srinivasan, and J. A. Rivers. Scalable high performance main memory system using phase-change 8. REFERENCES memory technology. In ISCA 2009. [22] B. M. Sang-Won Lee. Accelerating in-page logging with [1] Intel R Threading Building Blocks. non-volatile memory. IEEE Data Eng. Bull., 33, 2010. https://www.threadingbuildingblocks.org/.

7.[23] S. Venkataraman, N. Tolia, P. Ranganathan, and R. H. Algorithm 2 InsertRow(TableName,NewRow) Campbell. Consistent and durable data structures for 1: // Get table pointer from table name non-volatile byte-addressable memory. In FAST 2011. 2: ptab = find(TableName) [24] V. Viswanathan, K. Kumar, and T. Willhalm. Intel memory 3: latency checker. Technical report. 4: // Reserve a row and get back its index http://www.intel.com/software/mlc. 5: InsertIdx = ptab–>ReserveRow() [25] H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne: 6: Lightweight persistent memory. SIGPLAN Not., 47(4), 2011. 7: // Persistently insert row and get back [26] R. Williams. How we found the missing memristor. Spectrum, 8: ptab–>pushBackFlush(InsertIdx,NewRow) IEEE, Dec 2008. 9: [27] P. R. Wilson. Pointer swizzling at page fault time: Efficiently 10: // Commit by updating the CTS supporting huge address spaces on standard hardware. 11: CommitInsert(InsertIdx) SIGARCH Comput. Archit. News, 19(4):6–13, July 1991. 12: Flush MVCCArray[InsertIdx] [28] J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi. Kiln: Closing the performance gap between systems with and without persistence support. In MICRO-46, 2013. A.2 Delete The delete operation is described in Algorithm 3. First, we get a APPENDIX pointer to the target table (line 2). By a column index lookup, we get the latest visible row where the key value (Key) occurs (line 5). The A. SOFORT ALGORITHMS DTS of the corresponding MVCC entry indicates whether this row is This appendix is the continuation of Section 4 in which we de- being modified by another write transaction. We execute an atomic scribe some of SOFORT’s core algorithms. compare-and-swap operation to try to lock the row and commit the delete at the same time by setting its DTS to the transaction’s A.1 Insert commit time-stamp (line 8). If the atomic operation fails, we abort Algorithm 2 illustrates the core steps of an insert operation. First, the transaction (line 9). Otherwise, the row has been successfully we get a pointer to the table using a mapping from table names to locked and the delete operation committed. The commit is finalized table IDs (line 2). Then, we reserve a new row and get back the by persisting the updated MVCC entry (line 12). corresponding row index (line 5). No other writers will write neither If a crash occurs before locking and committing a row, the trans- to the row pointed by this index nor to the corresponding MVCC action is considered as aborted and there is nothing to do at recovery entry. Thus, the operation is latch-free and thread-safe. If needed, a time. resize of the table is triggered. Afterwards, the row is persistently appended to the table (line 8). The last step is to atomically and Algorithm 3 Delete(TableName,Key) persistently commit the transaction by updating the CTS of the new 1: // Get table pointer from table name MVCC entry (lines 11-12). 2: ptab = find(TableName) If a crash happens at any step before the commit has reached 3: SCM, the transaction is considered as aborted and the row will be 4: // Get RowID from key value recovered in an invisible state (see Section 2), i.e., without any 5: DeleteIdx = ptab–>getLatestVisible(Key) impact on the state of the database. Therefore, there is nothing to 6: do at restart time. 7: // Try lo lock and commit the row 8: if !AtomicTryLockCommit(MVCCArray[DeleteIdx]) then 9: Abort 10: end if 11: // Flush updated MVCC 12: Flush MVCCArray[DeleteIdx]