HYRISE: a main memory hybrid storage engine

In this paper, we describe a main memory hybrid database system called HYRISE, which automatically partitions tables into vertical partitions of varying widths depending on how the columns of the table are accessed. For columns accessed as a part of analytical queries (e.g., via sequential scans), narrow partitions perform better, because, when scanning a single column, cache locality is improved if the values of that column are stored contiguously.

1. HYRISE—A Main Memory Hybrid Storage Engine Martin Grund Jens Kruger ¨ Hasso Plattner Hasso-Plattner-Institute Hasso-Plattner-Institute Hasso-Plattner-Institute Alexander Zeier Philippe Cudre-Mauroux Samuel Madden Hasso-Plattner-Institute MIT CSAIL MIT CSAIL times), the explicit separation between transaction processing and analytics systems introduces a fundamental bottleneck in analytics ABSTRACT response times. For some applications, directly answering analyt- ics queries from the transactional system is preferable. For exam- In this paper, we describe a main memory hybrid database system ple “available-to-promise” (ATP) applications process OLTP-style called HYRISE, which automatically partitions tables into vertical queries while aggregating stock levels in real-time using OLAP-style partitions of varying widths depending on how the columns of the queries to determine if an order can be fulfilled. table are accessed. For columns accessed as a part of analytical Unfortunately, existing databases are not optimized for such queries (e.g., via sequential scans), narrow partitions perform better, mixed query workloads because their storage structures are usu- because, when scanning a single column, cache locality is improved ally optimized for one workload or the other. To address such if the values of that column are stored contiguously. In contrast, for workloads, we have built a main memory hybrid database system, columns accessed as a part of OLTP-style queries, wider partitions called HYRISE, which partitions tables into vertical partitions of perform better, because such transactions frequently insert, delete, varying widths depending on how the columns of the tables are ac- update, or access many of the fields of a row, and co-locating those cessed (e.g., transactionally or analytically). fields leads to better cache locality. Using a highly accurate model of We focus on main memory systems because, like other re- cache misses, HYRISE is able to predict the performance of different searchers [22, 7], we believe that many future databases – partic- partitionings, and to automatically select the best partitioning using ularly those that involve enterprise entities like customers, outstand- an automated database design algorithm. We show that, on a realistic ing orders, products, stock levels, and employees – will fit into the workload derived from customer applications, HYRISE can achieve memory of a small number of machines. Commercially available a 20% to 400% performance improvement over pure all-column or systems already offer up to 1 TB of main memory (e.g., the Fujitsu all-row designs, and that it is both more scalable and produces bet- RX600 S5). ter designs than previous vertical partitioning approaches for main Main memory systems present a unique set of challenges and op- memory systems. portunities. Due to the architecture of modern CPUs and their com- plex cache hierarchy, comparing the performance of different main 1. INTRODUCTION memory layouts can be challenging. In this paper, we carefully pro- Traditionally, the database market divides into transaction pro- file the cache performance of a modern multi-core machine and de- cessing (OLTP) and analytical processing (OLAP) workloads. OLTP velop a cost model that allows us to predict the layout-dependent per- workloads are characterized by a mix of reads and writes to a few formance of a mixed OLTP/OLAP query workload on a fine-grained rows at a time, typically through a B+Tree or other index structures. hybrid row/column database. Conversely, OLAP applications are characterized by bulk updates Our model captures the idea that it is preferable to use narrow par- and large sequential scans spanning few columns but many rows of titions for columns that are accessed as a part of analytical queries, the database, for example to compute aggregate values. Typically, as is done in pure columnar systems [5, 6]. In addition, HYRISE those two workloads are supported by two different types of database stores columns that are accessed in OLTP-style queries in wider systems – transaction processing systems and warehousing systems. partitions, to reduce cache misses when performing single row re- This simple categorization of workloads, however, does not en- trievals. Though others have noted the importance of cache locality tirely reflect modern enterprise computing. First, there is an in- in main memory systems [6, 3, 12, 25], we believe we are the first to creasing need for “real-time analytics” – that is, up-to-the-minute build a dedicated hybrid database system based on a detailed model reporting on business processes that have traditionally been handled of cache performance in mixed OLAP/OLTP settings. Our work is by warehousing systems. Although warehouse vendors are doing as closest in spirit to Data Morphing [12], which also proposes a hybrid much as possible to improve response times (e.g., by reducing load storage model, but we have extended their approach with a more accurate model of cache and prefetching performance for modern Permission to make digital or hard copies of all or part of this work for processors that yields up to 60% fewer cache misses compared to personal or classroom use is granted without fee provided that copies are the layouts proposed by Data Morphing. Furthermore, the layout not made or distributed for profit or commercial advantage and that copies algorithms described in the Data Morphing paper are exponential bear this notice and the full citation on the first page. To copy otherwise, to (2n ) in the number of attributes in the input relations, and as such republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present do not scale to large relations. Our algorithms scale to relations with their results at The 37th International Conference on Very Large Data Bases, hundreds of columns, which occur frequently in real workloads. August 29th - September 3rd 2011, Seattle, Washington. We note that several analytics database vendors have announced Proceedings of the VLDB Endowment, Vol. 4, No. 2 support for hybrid storage layouts to optimize performance of par- Copyright 2010 VLDB Endowment 2150-8097/10/11... $ 10.00. 105

2.ticular workloads. For example, Vertica introduced FlexStore, which of large contiguous blocks of memory. Data types are dictionary- allows columns that are accessed together to be physically stored to- compressed into fixed-length fields to allow direct access (offsetting) gether on disk. VectorWise, Oracle, and GreenPlum have made sim- to any given position (exploring further compression schemes is an ilar announcements. None of these vendors have released detailed area of future work.) Position offsets typically come from another information about how their hybrid schemes work, and they do not container or from a value-index lookup. appear to have database designers such as ours that can automate Figure 2 shows an example of a relation r with eight attributes hybrid partitioning, but these products acknowledge the importance partitioned into three containers. In this example, the first container of hybrid designs such as those we explore in this paper. contains one attribute only. The second and third containers contain In summary, we make several contributions in this paper: five and two attributes respectively. 1. We develop a detailed cache performance model for layout- r = (a1 ... a8) dependent costs in hybrid main memory databases. C1 (a1) C2 (a2 .. a6) C2 (a7 .. a8) 2. We develop an automated database design tool that, given a schema, a query workload, and using our analytical model, Figure 2: Partitioning example recommends an optimal hybrid partitioning. 3. We show that our system, running on a customer-derived Since our findings in enterprise software show that historic data benchmark (which we describe in detail), is 20% to 400% must be kept for legal reasons [18] our system currently focuses on faster than either a pure-row or a pure-column store running selections and insertions. In order to keep track of all data changes, on the same data. We also show that our designs are better we handle updates and deletions using validity timestamps as de- than previous hybrid storage schemes. scribed in [23]. Before describing the details of our model and design algorithms, we provide a brief overview of the architecture of HYRISE. 2.2 Query Processor The HYRISE query processor creates a query plan, consisting of a tree of operators, for every query it receives. HYRISE currently im- 2. HYRISE ARCHITECTURE plements projection, selection, join, sorting, and group by operators. The following section describes the architecture of HYRISE. The For joins HYRISE includes hash and nested loops join algorithms. main architectural components are shown in Figure 1. The storage Most of our operators support both early and late materialization, manager is responsible for creating and maintaining the hybrid con- meaning that HYRISE provides both position or value-based opera- tainers storing the data. The query processor receives user queries, tors [1]. In late materialization, filters are evaluated by determining creates a physical query plan for each query, and executes the query the row indexes (“positions”) that satisfy predicates, and then those plan by calling the storage manager. The layout manager analyzes positions are looked up in the columns in the SELECT list to deter- a given query workload and suggests the best possible layout (parti- mine values that satisfy the query (as opposed to early materializa- tioning) for this workload to the storage manager. tion, which collects value lists as predicates are evaluated.) Non-join queries are executed as follows: index-lookups and Layout Manager Attribute predicates are applied first in order to create position lists. Position Query Processor Groups R Workload lists are combined (e.g., ANDed) to create result lists. Finally, results Layouter Data ●●● are created by looking-up values from the containers using the result R Attribute lists and are merged together to create the output tuples. For join In-Memory Storage Groups Data Container plans, predicates are first applied on the dimension tables. Then, R Manager foreign-key hash-joins are used to build position lists from the fact Figure 1: HYRISE architecture tables. Additional predicates can then be applied on the fact tables to produce additional position lists. All position lists are combined We have built a prototype of this architecture. Our prototype ex- with the output of the joins, and the final list of positions is used to ecutes hand-coded queries based on the query processor API and create the final results. Query execution is currently single-threaded currently lacks support for transactions and recovery. We omit these and handles one operator at a time only; we are extending HYRISE features because we believe they are orthogonal to the question of to support efficient parallel execution for multi-core processors. which physical design will perform best for a given workload. How- ever, to minimize the impact of transactions in HYRISE, in addition to normal write operations, we use non-temporal writes, which make 3. HYBRID MODEL it possible to write directly back to main memory without loading the In this section, we derive a cost model for the most important op- written content into the CPU cache (see Appendix E.) Even though erations performed in HYRISE. This cost model will be used in Sec- our prototype currently executes one query at a time only, we use tion 4 to compare the performance of various hybrid layouts given a thread-safe data structures that include latch acquisition costs to sup- query workload. port later query parallelization. We distinguish between layout-dependent and layout-independent We give an overview of both the storage manager and the query costs. Layout-dependent operations access the data from its primary processor below. The approach used by the layout manager to select physical representation—the costs of these operators vary depend- good layouts is described in detail in Section 4. ing on the physical storage structures used. Layout-independent operations occur when accessing intermediate results that are cre- 2.1 Storage Manager ated as a result of query processing. The cost of such operators Our HYRISE prototype supports a fine-grained hybrid storage does not vary when the physical storage layout changes. The mag- model, which stores a single relation as a collection of disjoint ver- nitude of layout-independent costs depends on the materialization tical partitions of different widths. Each partition is represented by strategy, since early materialization will result in more intermediate a data structure we call container. Each attribute is mapped to one results. We focus on layout-dependent operations in the following and only one container. A container provides methods to access the since these are the only operations that benefit from changes of the various values it holds. Containers are physically stored as a list physical layout. 106

3. C.o a1 a2 a3 a4 a5 pens whenever the non-projected segments of the container (C.w − r0 π.w π.w for each container row) are strictly smaller than a cache line r1 r2 L.w and can never be skipped when retrieving the projected pieces. The r3 number of cache misses incurred by a full scan is: C.w C.w × C.n + C.o Missi (C, π) = (2) Figure 4: A projection projecting the first two attributes of a Li .w 5-attribute container Here, C.o denotes the offset (in bytes) between the beginning of the Most of the layout-dependent costs incurred in a main-memory container and the first preceding address that can be mapped to the system like HYRISE originate from CPU stalls—typically caused beginning of a cache line. In this case, the number of cache misses by cache misses when moving data from main memory; those CPU corresponds to the number of cache lines needed to read the entire stalls are known to account for a significant fraction of the total cost container (C.w × C.n), plus any additional accesses if the address of a query (see [4]). Our experiments in Section 5 show that cache of the beginning of the container is not aligned to the beginning of a misses are a good predictor of query runtime. cache line (i.e., C.o = 0). Our model is based on a detailed analysis of such costs, taking If the condition in equation 1 does not hold, parts of the container into account the cache misses for different cache levels (e.g., L1 and can be skipped when executing the projection. The number of cache L2 cache). Unlike previous cache-aware cost models (Manegold et misses incurred by such a partial projection depends on the align- al. [16] and Boncz et al. [6]), which focused on column-oriented ment of each row r with respect to the cache lines. We first determine designs only, we analyze sets of hybrid data containers that can store the offset r.o from the start of the container to the start of the r-th an arbitrary number of columns or rows. We choose this level of row of the container: detail so that our cost model can be reused in a cost-based optimizer. r.o = C.w × r. (3) To illustrate our model, we provide the cache misses and CPU The offset between the beginning of the projection of the r-th row cycles for different operations as measured in our system. All mea- and the beginning of the nearest previous cache line is: surements were executed using an Intel E5450 quad-core CPU with lineoffseti (r, π) = (C.o + r.o + π.o) mod Li .w. (4) 32KB per core L1 data and instruction cache (8-way associative, 64 To retrieve the projected attributes for the r-th row, the system has byte cache lines), a shared 6MB L2 cache (24-way associative, 64 to read π.w bytes, in addition to the lineoffseti (r, π) bytes implicitly byte cache lines), and 64 GB of PC2 5300 CL2 RAM. read by the cache because of the misalignment between the cache line and the projected segment. The number of cache lines required 3.1 Notation to read the r-th row is thus: We consider a database DB, consisting of a list of relations r ∈ lineoffseti (r, π) + π.w R. Each relation r is defined by a list of attributes (a1r , . . . , amr ) Missi (r, π) = (5) Li .w and contains a certain number of tuples r.n. Each relation is decom- posed into a set of containers C1r , . . . , Cnr . Each container stores a Finally, the total number of cache misses incurred by the partial pro- subset of the attributes of r: Cir = (akr , . . . , alr ) (in the remainder jection is: C.n−1 of this section, we omit the subscript r for readability). We say that Missi (C, π) = Missi (r, π). (6) each container stores a contiguous list of tuple fragments. We write r=0 Ci .w to denote the width of the container in bytes, and Ci .n for the Due to the high number of iterations to calculate the total cache number of rows in the container. misses, we would like to replace equation 6 with an exact calculation In the following, we evaluate the number of cache misses for the (compared to the average calculation in [16]). The key observation main operations supported by our system. We write Li .w to express is that the value of lineoffseti (r, π) follows a repeating pattern, de- the length (in bytes) of a cache line for cache level i, and Li .n to pending on the value of Li.w and C.w. In general, the number of indicate the number of cache lines available for cache level i. The distinct values of lineoffseti (r, π) is known as the additive order of total size of the cache for level i is thus Li .n × Li .w. Loading a C.w mod Li.w [14], and has v distinct values: cache line through the cache hierarchy causes the CPU to stall. We v = Li.w/gcd(C.w, Li.w) (7) write Li .cpu to express the number of CPU cycles spent to load a where gcd(C.w, Li.w) is the greatest common divisor of C.w and line from cache level i to cache level i − 1. Li.w. Hence, it is enough to evaluate equation 6 for the first v rows, and then multiply the result by C.n/v; that is: 3.2 Partial Projections C.n v We start by evaluating the number of cache misses that occur when Missi (π, C) = Missi (r, π). (8) v r=0 performing a projection π on a container C. Initially, we restrict the To illustrate this model, we compare the number of cache misses projections to a series of contiguous attributes in C, starting at an for different layouts. Figure 3(a) shows the results of an experiment offset π.o from the beginning of the container and retrieving π.w on two different layouts, one with 100 narrow one-attribute contain- bytes of attributes. A simple example is depicted in Figure 4 for a 5- ers, and the other one with only one wide 100-attribute container (all attribute container and a projection retrieving the first two attributes attributes are 4 byte long). Both layouts have the same total width. of the container (π.o = 0 and π.w = 8 bytes considering 4-byte The figure reports the total number of L2 cache misses for partial attributes). projections ranging from one to 100 attributes, as well as the num- Projections are executed by reading the relevant portions of the ber of misses predicted by our model (note that the lines completely containers. If the data is not already cached, the system reads it from overlap.) Figure 3(b) further shows that there is a relation between RAM and loads it into the cache hierarchy—assuming an inclusive the number of cache misses and the number of CPU cycles for these cache hierarchy (as in Intel Processors)—one cache line and level at operations. Cache misses are highly correlated with—and a good a time. Two cases can occur depending on the width of the container, predictor of—total CPU cycles in database access methods because the projection, and the cache line. In the first case, when the primary action of these operators is to retrieve values from mem- C.w − π.w < Li .w (1) ory, and cache misses tend to dominate these memory access costs the entire container must be read, resulting in a full scan. This hap- for memory-bound operations. 107

4. 8e+06 1.2e+09 100 x 1 attribute container 100 x 1 attribute container Number of L2 Cache Misses 7e+06 1 x 100 attributes container 1 x 100 attributes container Number of CPU Cycles 1 x 100 attributes container (model) 1e+09 6e+06 100 x 1 attribute container (model) 5e+06 8e+08 4e+06 6e+08 3e+06 4e+08 2e+06 1e+06 2e+08 0 0 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 Number of Attributes in Projection Number of Attributes in Projection Figure 3: Modeled vs Measured L2 misses (a); CPU cycles (prefetcher off) (b); L2 Misses Row/Column Containers and Varying Selectivity (c) 3.3 Combining Partial Projections Independent selections: whenever C.w − π.w − 1 ≥ Li .w, the The previous section discussed the projection of contiguous at- gaps between the projected segments cause each row to be retrieved tributes from a container. However, query plans often need to project independently of the others. In that case, the cache misses incurred non-contiguous sets of attributes. Non-contiguous projections can by each row retrieval are independent of the other row retrievals. be rewritten as a set of partial projections {π1 , . . . , πk }, each of The total number of cache misses incurred by the selection is the which retrieves a list of contiguous attributes in C. Such projec- sum of all the misses incurred when independently projecting each tions define a set of gaps {γ1 , . . . , γl }, i.e., contiguous attributes row r ∈ S from the set of selected rows: C.n−1 groups that are not projected. For instance, a projection on the first Missi (C, π)sel = Missi (r, π) π.s. (12) and third attribute of a five-attribute container is equivalent to two r=0 partial projections—one on the first and one on the third attribute. This number of misses can be efficiently computed using the additive The projection defines two gaps, one on the second attribute, and a order approach described above. second one on the fourth and fifth attributes. Two cases can occur depending on the width γ.w of the gaps: Overlapping selections: when C.w − π.w − 1 < Li .w, retriev- ing the projection for a given row may retrieve parts of the projec- Full-scan: if ∀γi ∈ {γ1 , . . . , γl }, γi .w < Li .w, all gaps are strictly tion for a different row. This effect is particularly apparent for low- smaller than a cache line and cannot be skipped. Thus, the projection selectivity projections on narrow containers, for which several rows results in a full scan of the container. can fit on a single cache line. The average number of rows that can fit Independent projections: if ∃γi ∈ {γ1 , . . . , γl } | γi .w ≥ Li .w, in one cache line is equal to Li .w/C.w. For each cache line fetched there exist portions of the container that might potentially be skipped to retrieve a selected row, there are on average when executing the projection. The projection is then equivalent to Li .w a set of 1 + li=1 1γi .w≥Li .w partial projections defined by the gap totalCachedRows = 1 + π.s −1 (13) C.w boundaries (where 1 is an indicative function used to express the selected rows cached, assuming that the rows are selected indepen- gaps that are greater than the cache line). Writing Γi to express the dently of each other. The average number or misses is thus i-th largest gap whose width Γi .w ≥ Li .w, the equivalent partial π.s projections can be defined as Missi (C, π)sel =∼ Missi (C, π). (14) totalCachedRows πieq .o = Γi .o. (9) Figure 3(c) compares the measured and modeled number of cache and misses for selections on two layouts: one consisting of 16 one- πieq .w = Γi+1 .o − (Γi .o + Γi .w) (10) attribute containers and a second one consisting of one 16-attribute wide container. Both layouts have the same total width and both Similarly, we can merge the first and last projections by taking into have 2M tuples. For low selectivities, using wider containers results account the fact that the last bytes of a row are stored contiguously in fewer cache misses, since each (whether narrow or wide) con- with the first bytes of the following row. Hence, we merge the first tainer generates at least one cache miss per tuple fragment retrieved and last projections when the gap Γrow between them is smaller than for very selective projections. a cache line, i.e., when Γrow = (πfeqirst .o + C.w) − (πlast eq eq .o + πlast .w) < Li .w. (11) 3.5 Joins and Aggregates Our system currently uses late-materialization based hash joins The final projection π eq is in that case defined as follows: π eq .o = and hash-based GROUP BYs. These operations can both be mod- eq πlast .o and π eq .w = πfeqirst .w + πlast eq .w + Γrow . eled as partial projections and selections. For a join of tables R Using this framework, we can model the impact of complex and S, where R is the table that is to be hashed, R is filtered via a queries involving the projection of an arbitrary set of attributes from partial projection and a position lookup (using positions from early a container without erroneously counting misses twice. operators in the plan). The resulting hash table is then probed with a similarly filtered version of S. For GROUP BYs, the grouping 3.4 Selections column is also filtered via a partial projection and position lookup. In this subsection, we consider projections that only retrieve a spe- cific subset S of the rows in a container. We assume that we know the 3.6 Padded Containers and Reconstruction list of rows ri ∈ S that should be considered for the projection (e.g., Padding and Alignment: For partial projections it is possible to from a position lookup operator). The selectivity of the projection reduce the total number of cache misses by performing narrow row- π.s represents the fraction of rows returned by the projection. Our padding such that the beginning of each row coincides with the be- equations in this section must capture the fact that highly selective ginning of a cache line. For a padded container C, the padding projections touching a few isolated rows can generate more cache (empty space) ρ.w to insert at the end of each row to achieve such an misses per result than what full-scans would do. effect is C.w mod Li .w bytes wide for a given cache level. The Two cases can also occur here, depending on the relative sizes of expressions given above to compute the number of cache misses can the container, the projection, and the cache line: be used on padded containers by replacing the width of the container 108

5.C.w and the width of the row r.w by their corresponding expressions this section is devoted to the determination of good layouts that will taking into account padding, namely C.w + ρ.w and r.w + ρ.w re- minimize the query response time. Formally, given a database DB spectively. As an example: For a 90-attribute container, where each and a workload W , our goal is to determine the list of layouts λopt attribute is 4 bytes, an additional 24 bytes of padding is added per minimizing the workload cost: tuple. This increases access performance in 87% of all simple pro- λopt = argmin (CostDB (W )) . (18) jections with an average speedup of ≈ 7%. λ Depending on the associativity of the L1 cache and the imple- 4.2 Layout Selection mented replacement policy for cache lines, HYRISE applies a spe- We use the model defined in the previous section to automatically cial alignment policy to avoid early evictions due to cache set colli- determine good hybrid layouts given a database DB and a workload sions (for details, see Appendix A). W . Based on the model, we make two observations: First, projec- Tuple Reconstruction: If the final result is materialized into one tions retrieving π.w bytes out of a C.w-wide container often incur an result set, output tuples must be built from multiple containers. In the overhead. This overhead is caused by loading attributes into cache rare case that the width of an output tuple is wider than the available that are not used by the projection. This overhead is proportional to cache size at a given level (e.g., 32KB for L1, 6MB for L2), evictions C.w − π.w for full scans, and can vary for partial projections and will occur before the last attribute of the tuple is written, triggering selections depending on the exact alignment of the projection and additional cache misses. To avoid these misses, the output must be the cache lines. We call this overhead container overhead cost. written one container at a time instead of one tuple at a time. Second, when the output tuples can be reconstructed without any cache eviction (Section 3.6), the cost expression distributes over the 4. LOGICAL DATABASE DESIGN set of queries, in the sense that each cost can be decomposed and There are a very large number of possible hybrid physical designs computed separately for each partition and the corresponding subsets (combinations of non-overlapping containers containing all of the of the queries {qi , . . . , qj } accessing the partition: columns) for a particular table. For a table of n attributes, there exist CostDB ({q1 , . . . , qm }) = CostP ({qi , . . . , qj }) (19) a(n) possible hybrid designs, where P ∈λ a(n) = (2n − 1)a(n − 1) − (n − 1)(n − 2)a(n − 2) (15) Based on our model and the above observations our layout algo- where a(0) = a(1) = 1. This corresponds to the number of parti- rithm works in three phases called candidate generation, candidate tions of {1, . . . , n} into any number of ordered subsets. merging, and layout generation phases. An example is described in There are for instance 3,535,017,524,403 possible layouts for a detail in Appendix B. table of 15 attributes. Most previous hybrid database systems do Candidate Generation: The first phase of our layout algorithm de- not automatically suggest designs to the database administrator (see termines all primary partitions for all participating tables. A pri- Section 6) — the only automated hybrid designer we are aware of, mary partition is defined as the largest partition that does not incur the HillClimb Data Morphing algorithm [12], does not work for wide any container overhead cost. For each relation R, we start with the tables in practice since it scales exponentially (2n ) with the number complete set of attributes {a1 , . . . , am } in R. Each operation opj of attributes in both time and space. We propose two new algorithms implicitly splits this set of attributes into two subsets: the attributes that can efficiently determine the most appropriate physical design that are accessed by the operation, and those that are ignored. The for tables of many tens or hundreds of attributes given a database order in which we consider the operations does not matter in this and a query workload. Our first algorithm has a worst-case running context. By recursively splitting each set of attributes into subsets time that is exponential in the problem size, but incorporates several for each operation opj , we end up with a set of |P | primary par- pruning steps that allows it to scale to wide tables in practice. Our titions {P11 , . . . , P|P 1 | }, each containing a set of attributes that are second algorithm includes a partitioning step and can scale to larger always accessed together. The cost of accessing a primary partition problems, while introducing bounded sub-optimality. is independent of the order in which the attributes are laid out, since 4.1 Layouts all attributes are always queried together in a primary partition. We start by extending the model described in Section 3.1. We Candidate Merging: The second phase of the algorithm inspects consider a query workload W , consisting of a set of queries qi : W = permutations of primary partitions to generate additional candidate {q1 , . . . , qm } that are regularly posed against the database. Each partitions that may ultimately reduce the overall cost of the work- query has a weight w1 , . . . , wm that captures the relative frequency load. Our cost model shows us that merging two primary partitions of the query. Furthermore, each query has a cost, representing the Pi1 and Pj1 is advantageous for wide, random access to attributes time required by the database to answer the query. The time needed since corresponding tuple fragments are co-located inside the same to answer all queries of a given workload W is thus proportional to: partition; for projections, the merging process is usually detrimen- m CostDB (W ) ∼ wi CostDB (qi ) (16) tal due to the additional access overhead (which occurs unless both i=1 primary partitions are perfectly aligned to cache lines.) where CostDB (qi ) is the cost (i.e., total number of cache misses This tension between reduced cost of random accesses and penal- weighted by the correct value of Li .cpu for each cache level) asso- ties for large scans of a few columns allows us to prune many of the ciated with the operations performed as part of qi as described in the potential candidate partitions. To do this, we compute the cost of the preceding section. The containers implicitly split the relations into workload W , CostPin (W ) on every candidate partition Pin obtained sets of partitions P1 , . . . , Pn with: by merging n primary partitions (P11 , . . . , Pn1 ), for n varying from 2 ∀ai ∈ R ∃Pi | ai ∈ Pi ∧ ai ∈ / Pj ∀Pj = Pi . (17) to |P |. If this cost is equal to or greater than the sum of the individual We call the sets of partitions following the preceding condition lay- costs of the partitions (due to the container overhead), then this can- outs λ ∈ Λ. Note that layouts consist of unordered sets of parti- didate partition can be discarded: In that case, the candidate partition tions, such that the layouts λ1 = {(a1 , a2 ), (a3 , a4 )} and λ2 = can systematically be replaced by an equivalent or more optimal set {(a3 , a4 ), (a1 , a2 )} are considered identical. of partitions consisting of the n primary partitions P11 , . . . , Pn1 since n We write λ = (λR1 , . . . , λRr ) to express the list of layouts by CostPin ({qi , . . . , qj }) ≥ CostP 1 ({qi , . . . , qj }) (20) m which relations R1 , . . . , Rr are stored in the system. The rest of m=1 109

6.and since the other terms of the total layout cost (Equation 19) are not for instance). The only times it generates suboptimal layouts is affected by this substitution. If a candidate partition is not discarded when a complex combination of partitions belonging to different by this pruning step, it is added to the current set of partitions and sub-layouts yields a smaller cost than the one found using our greedy will be used to generate valid layouts in the following phase. algorithm. The penalty incurred is in any case never greater than 2 ∗ cut M (i, j) (where cut M (i, j) represents the set of edges Layout Generation: The third and last part of our algorithm gen- removed during the graph-partitioning phase), which is the maximal erates the set of all valid layouts by exhaustively exploring all pos- penalty incurred by the partitioning phase. It is also very efficient sible combinations of the partitions returned by the second phase. for relatively small values of K (see Appendix F for details). The algorithm evaluates the cost of each valid layout consisting of a covering but non-overlapping set of partitions, discarding all but the physical layout yielding the lowest cost. This last layout is the 5. PERFORMANCE EVALUATION optimal layout according to our cost model, since all potentially in- In this section we evaluate the performance of HYRISE on a work- teresting permutations of attributes are examined by our algorithm load derived from a customer application. Our goal is to compare (only irrelevant permutations, such as subsets of primary partitions the performance of an all-row or all-column database design against or suboptimal merges from Section 4.2, are discarded). our hybrid approach, and to validate the cost model and database The worst-case space complexity of our layout generation algo- designer described above. To increase the execution performance rithm is exponential with the number of candidates partitions |P |. we performed several additional optimizations, including memory However, it performs very well in practice since very wide rela- alignment, narrow-row padding and cache-set collision avoidance, tions typically consist of a small number of sets of attributes that as described in Appendix A. are frequently accessed together (thus, creating a small number of To evaluate our model we choose a set of queries derived from primary partitions) and since operations across those partitions are an SAP enterprise resource planning (ERP) application that includes often relatively infrequent (thus, drastically limiting the number of several analytical queries that model reporting over the recent history new partitions generated by the second phase above). of these transactions. To show the robustness of our approach we ex- ecute one analytical query in two different versions (Q11 and Q12), 4.3 Divide and Conquer Partitioning with a layout-dependent selectivity of 2% (Q11) and 50% (Q12). For large relations and complex workloads involving hundreds of We built our own application-derived workload because real en- different frequently-posed queries, the running time of the above al- terprise applications (such as those we have encountered at SAP) ex- gorithm may still be high. In this section, we propose an approxi- hibit significant differences in terms of number of attributes per table mate algorithm that clusters the primary partitions that are often co- from benchmarks like TPC-C, TPC-E, and TPC-H. For example, in accessed together, generating optimal sub-layouts for each cluster of TPC-E (the most complex of these three benchmarks) the maximum primary partitions, and finally combining the optimal sub-layouts. number of attributes per relation is about 25; in SAP’s enterprise We start by generating a |P |×|P | matrix M , storing in each entry applications it is not uncommon to see tables with 200 attributes or M (i, j) the number of times the two primary partitions {Pi1 , Pj1 } more. A second reason for creating our own benchmark is that we are accessed together. Computing this number is done using our cost wanted to execute both OLTP-style and analytical-style queries on model to estimate how many rows are accessed by each query in the same data, and is not easy to retrofit an existing benchmark like each primary partition. This matrix is symmetric and can be seen as TPC-C or TPC-H to support both analytical and transactional queries a graph where each vertex is a primary partition, and the weight be- without substantailly changing benchmark. A detailed description of tween two vertices i and j (M (i, j)) represents the co-access affinity the benchmark and its queries are given in Appendix C. between the two primary partitions Pi1 , Pj1 . We partition this graph in order to obtain a series of min-cut sub- 5.1 Schema and Query Selection graphs each containing at most K primary partitions, where K is a The schema for our workload is based on a CRM application. The constant. These cuts seek to minimize the total cost (weight) of all main tables represents sales orders (VBAK) and sales order line items edges that must be removed. This is a very well studied problem (VBAP). The schema also contains tables for materials (MARA), ma- in the theory community, and there exist practical implementations terial texts (MAKT), business partners (KNA1), business partner ad- of these algorithms; in our system, we use metis, an efficient and dresses (ADRC), and the material hierarchy table (MATH). approximate multilevel k-way partitioner [15]. When running queries, we populated tables with sizes obtained At this point, each subgraph contains a set of primary partitions from interviews with consultants familiar with the application. that are accessed together, and which thus represent excellent candi- Specifically, the sales order header table (VBAK) contains 3.6M dates for our merging phase (Section 4.2). We determine the optimal entries and the sales order line item table (VBAP) 144M items. layout of each subgraph separately using the algorithm described Each sales order has between 3 and 5 items. The sizes of the addi- above (Section 4.1), which is in the worst-case exponential with the tional tables are 600, 000 materials (MARA), 600, 000 material texts maximum number of primary partitions in a subgraph (K). (MAKT), 180, 000 addresses, 144, 000 business partners (ADRC) Finally, we combine the sub-layouts obtained in the previous step: and 1M elements in the material hierarchy. The selectivity of the we incrementally combine pairs of partitions Pi and Pj belonging to queries is matched to the results from the actual application deploy- two different sub-layouts and yielding the most savings according to ment. The total system size in memory is about 28 GB. our cost model, until no further cost-reduction is possible. This final As an example, we show the result that our layout generator pro- step requires O(|P | ∗ |P2 | ) partition evaluations in the worst-case duced for the VBAP sales order line-item table. In the following (here |P | is the total number of partitions in all sub-layouts), but is notation 4_block represents an attribute which is 4 blocks wide, much faster in practice since the most similar primary partitions are normalized to the width of 4 bytes per block. already clustered together in the same subgraph, and since narrow 1. ((’VBELN’) 2. (’MATNR’) partitions yielding many different combinations are trivial to evalu- 3. (’KWMENG’,’AEDAT’) ate as they contain few attributes. 4. (’94 block’) This approximate algorithm is very effective in practice (it al- 5. (’1 block’, ’1 block’, ’4 block’, ways finds the optimal layouts for our workload and for K > 3, ’70 block’, ’39 block’,’NETWR’)) 110

7. Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Total R 56770 24030 27050 15250 96780 90 13890.3 52301.5 5431.5 32297.6 29687.8 117471.1 4899.2 475949 C 9050 3510 11220 11930 30940 260 2154.8 9416.0 795.5 6032.4 6744.1 45468.6 2939.8 140461.2 H 9290 2910 4010 12810 11660 100 1795.3 7114.8 723.2 6243.5 6852.6 45751.1 2517.7 111778.2 Figure 5: Benchmark Results; graphs at the top show normalized (slowest system=1.0) CPU cycles (left) and normalized L2 cache misses (right); table shows absolute CPU cycles / 100k Each item represents one vertical partition in the table. While of a few queries to improve overall workload performance. there is no particular order for the partitions, all attributes inside the The second observation is that cache misses are a good predictor partition are stored in order. We chose this table as an example since of performance. In general, the differences in cache misses tend to be it is accessed by most of the queries and is partitioned to optimize for more pronounced than the differences in CPU times, but in all cases, the competing needs of several queries. Since the sales order num- the best performing query is the one with the fewest cache misses. ber (VBELN) and the related material (MATNR) columns are often Third, the model is a good predictor of the actual cache misses. scanned in their entirety (e.g., by Q6 and Q8) the layouter chooses Though there are absolute differences between the normalized and to store them as single columns. The amount KWENG and the deliv- predicted cache misses, the relative orderings of the schemes are al- ery data AEDAT columns are always accessed together (Q11,Q12) ways the same. In general, the differences are caused by very hard to and thus are stored as a group of two attributes. The rest of the at- model differences, such as the gcc optimizer (which we ran at -O3), tributes are merged by the layout algorithm to achieve best possible which can affect the number of cache misses. performance on SELECT* queries. In summary, HYRISE uses 4x less cycles than the all-row layout. HYRISE is about 1.6x faster than the all-column layout on OLTP 5.2 Performance queries (1–9), with essentially identical performance on the analyt- For each of the all-rows, all-columns, and optimal HYRISE de- ical queries (10–13). For some OLTP queries it can be up to 2.5x signs, we used our cost model to estimate the total cost and also faster than the all-column layout. Of course, in a hybrid database executed them in the HYRISE query executor. For all queries we system, the actual speedup depends on the mix of these queries – in captured the complete query execution time both in CPU cycles and practice that many OLTP queries will be run for every OLAP query, last level cache misses (in our case L2 cache misses). For this bench- suggesting that our hybrid designs are highly preferable. mark we choose late materializing plan operators (e.g., joins and aggregates that compute final results by going back to underlying 5.3 Data Morphing Layouts physical representation) so that the performance of all plan oper- In this section, we describe the differences between the behavior ators is directly affected by the physical representation. We tried and performance of our layout algorithm and the Hill-Climb Data other (early materialization) plans and found them to be slower for Morphing algorithm proposed by Hankins and Patel [12] (the paper these queries. Of course, in some settings early materialization- proposes two algorithms; Hill-Climb is the optimized version.) We based operators may perform better than late materialization, but in could not run Hill-Climb on our entire benchmark because (as noted these cases the performance of the operators will be unaffected by in the Introduction) the algorithm scales exponentially in both time our choice of storage layouts. and space with the number of attributes (see Appendix D), and thus The results for all of the queries are shown in Figure 5. The ta- can only be used on relatively simple databases. ble shows the absolute number of CPU cycles for each of the three Instead, we ran a simplified version of our benchmark, focusing on designs. The graphs compare the normalized performance of each the smallest relation (MATH) — the only one Hill-Climb could han- system in terms of the number of CPU cycles (left), actual number of dle — and query 13 which runs over it. Here, Data Morphing sug- L2 cache misses (middle), and number of L2 cache misses predicted gests a complete vertical partitioning, which performs 60% worse in by our model (right). Here “normalized” means that for a given terms of cache misses and 16% worse in terms of CPU cycles com- query, the worst-performing system (in terms of CPU cycles or cache pared to the layout used by HYRISE. The reason for this difference misses) was assigned a score of 1.0, and the other systems were as- is mainly due to the lack of partial projections in the Data Morphing signed a score representing the fraction of cycles/misses relative to cost model. We would expect to see similar performance differences the worst-performing system. For example, on Q1, all-columns and for other queries if Data Morphing could scale to them, since the HYRISE used about 16% of the cycles as all-rows. Data Morphing model is missing several key concepts (e.g. partial There are several things to note from these results. First, in terms projections, data alignment, and query plans—see Appendix D). of actual cache misses and CPU cycles, HYRISE almost always does as well as or outperforms the best of all-rows and all-columns. For those queries where HYRISE does not outperform the other layouts, 6. RELATED WORK our designer determines it is preferable to sacrifice the performance As mentioned in Section 5.3, the approach most related to 111

8.HYRISE is the Data Morphing approach of Hankins and Patel [12]. previous state of the art workload-aware approach for partitioning Data Morphing partitions relations into both row and column- main memory databases. As future work, we plan to examine hor- oriented storage. The main differences between our approaches izontal partitioning as well as additional hybrid-based query opti- and Data Morphing are in the fidelity of our cache-miss model (we mizations, and to optimize HYRISE for future many-core processors model many cases that Data Morphing is unable to capture), and in with multiple memory channels and an increasing parallelism. our physical database design algorithm. Taken together, these make HYRISE significantly faster than Data Morphing, and also allow it 8. REFERENCES to scale to tables with tens or hundreds of attributes, whereas Data [1] D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. Madden. Morphing cannot scale to tables with large numbers of attributes. Materialization Strategies in a Column-Oriented DBMS. In ICDE, Vertical partitioning is a widely used technique that has been ex- pages 466–475, 2007. plored since the early days of the database community [24, 13, 11, [2] S. Agrawal, V. R. Narasayya, and B. Yang. Integrating Vertical and Horizontal Partitioning Into Automated Physical Database Design. In 17, 2]. Some of this work [9, 10, 17, 2] attempts to automatically SIGMOD Conference, pages 359–370, 2004. derive good partitions, but does so with an eye towards minimizing [3] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving disk seeks and I/O performance rather than main memory costs as Relations for Cache Performance. In VLDB, pages 169–180, 2001. we do in HYRISE. As such, these systems do not include careful [4] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a models of cache misses. The work of Agrawal et al [2] is most Modern Processor: Where Does Time Go? In VLDB, pages 266–277, similar to our approach in that it uses as cost-based mechanism to 1999. identify partitions that are likely to work well for a given workload. [5] P. A. Boncz, S. Manegold, and M. L. Kersten. Database Architecture Recently there has been a renewed interest in pure vertical par- Optimized for the New Bottleneck: Memory Access. In VLDB, pages 54–65, 1999. titioning into a “column-store”, e.g., DSM [8], Monet and Mon- [6] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: etDB/X100 [5, 6], C-Store [21]. As “pure” column systems, these Hyper-Pipelining Query Execution. In CIDR, pages 225–237, 2005. approaches are quite different than HYRISE. The Monet system is [7] S. K. Cha and C. Song. P*TIME: Highly Scalable OLTP DBMS for perhaps most related because its authors develop complete models Managing Update-Intensive Stream Workload. In VLDB, pages of cache performance in column stores. 1033–1044, 2004. There have been several attempts to build systems in the spirit of [8] G. P. Copeland and S. Khoshafian. A Decomposition Storage Model. HYRISE that are row/column hybrids. PAX [3] is an early exam- In SIGMOD Conference, pages 268–279, 1985. ple; it stores data from multiple columns in a disk block, but uses a [9] D. W. Cornell and P. S. Yu. An Effective Approach to Vertical Partitioning for Physical Design of Relational Databases. IEEE column-wise data representation for those columns. In comparison Transactions on Software Engineering, 16(2):248–258, 1990. to the cache miss performance of HYRISE when scanning a nar- [10] P. De, J. S. Park, and H. Pirkul. An integrated model of record row projection, PAX will incur somewhat more cache misses when segmentation and access path selection for databases. Information scanning just a few columns from a table (since it will have to jump Systems, 13(1):13–30, 1988. from one page to the next in memory). Similarly, in comparison [11] M. Hammer and B. Niamir. A Heuristic Approach to Attribute to HYRISE scanning a wide projection, PAX will incur more cache Partitioning. In SIGMOD Conference, pages 93–101, 1979. misses when scanning many columns from a table (since it will have [12] R. A. Hankins and J. M. Patel. Data Morphing: An Adaptive, to jump from one column to the next in memory.) Cache-Conscious Storage Technique. In VLDB, pages 417–428, 2003. We chose not to compare our work against PAX directly because [13] J. A. Hoffer and D. G. Severance. The Use of Cluster Analysis in Physical Data Base Design. In VLDB, pages 69–86, 1975. the Data Morphing paper [12] showed that a hybrid system like [14] B. L. Johnston and F. Richman. Numbers and Symmetry: An HYRISE can be up to a factor of 2 faster for workloads that read Introduction to Algebra. CRC-Press, 1997. just a few columns, and as we show in Section 5.3, HYRISE gen- [15] G. Karypis and V. Kumar. Multielvel k-way partitioning scheme for erally performs better than Data Morphing. Fractured Mirrors [19] irregular graphs. Journal of Parallel and Distributed Computing, and Schaffner et al. [20] are hybrid approaches that consider both 48(1):96–129, 1998. row and column representations and answers queries from the rep- [16] S. Manegold, P. A. Boncz, and M. L. Kersten. Generic Database Cost resentation that is best for a given query; this leads to good query Models for Hierarchical Memory Systems. In VLDB, pages 191–202, performance but has substantial synchronization overhead. Unlike 2002. HYRISE, neither of these systems nor PAX vary their physical de- [17] S. B. Navathe, S. Ceri, G. Wiederhold, and J. Dou. Vertical Partitioning Algorithms for Database Design. ACM Transactions on sign based on the workload, and so do not focus on the automated Database Systems, 9(4):680–710, 1984. design problem we address. [18] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD Conf., pages 1–2, 2009. 7. CONCLUSIONS [19] R. Ramamurthy, D. J. DeWitt, and Q. Su. A Case for Fractured In this paper, we presented HYRISE, a main memory hybrid Mirrors. In VLDB, pages 430–441, 2002. database system designed to maximize the cache performance of [20] J. Schaffner, A. Bog, J. Krueger, and A. Zeier. A Hybrid Row-Column queries. HYRISE creates vertical partitions of tables of different OLTP Database Architecture for Operational Reporting. In BIRTE, 2008. widths, depending on the access patterns of queries over tables. To [21] M. Stonebraker, D. J. Abadi, and A. B. et al. C-Store: A determine the best partitioning, we developed an accurate model of Column-oriented DBMS. In VLDB, pages 553–564, 2005. cache misses that is able to estimate the number of misses that a [22] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, particular partitioning and access pattern will incur. We presented a and P. Helland. The End of an Architectural Era (It’s Time for a database design algorithm based on this model that finds partition- Complete Rewrite). In VLDB, pages 1150–1160, 2007. ings that minimize the number of cache misses for a given workload, [23] M. Stonebraker, L. A. Rowe, and M. Hirohama. The Implementation and that is able to scale to tables with a large number of columns. of Postgres. IEEE Transactions on Knowledge and Data Engineering, Our results show that HYRISE is able to produce designs that are 2(1):125–142, 1990. [24] P. J. Titman. An Experimental Data Base System Using Binary 20% to 400% faster than either a pure-column or pure-row approach Relations. In IFIP Working Conference Data Base Management, 1974. on a realistic benchmark derived from a widely used enterprise ap- [25] M. Zukowski, N. Nes, and P. Boncz. DSM vs. NSM: CPU performance plication. We also show that our approach leads to better physical tradeoffs in block-oriented query processing. In DaMoN, 2008. designs and can scale to larger tables than Data Morphing [12], the 112

9.APPENDIX problem often occurs in practice (it occurs for our test system de- In these appendices, we provide several examples of the behavior of scribed in Section 3 for instance). HYRISE’s physical container design and describe several extensions To alleviate this problem, we maintain a variable #containers that further improve HYRISE’s performance on modern machines counting the number of containers. When a new container is cre- (Appendix A.) In Appendix B we give an example of HYRISE’s ated, the system shifts the beginning of the container by Li .w ∗ layout selection. We also describe the details of the new bench- (#containers mod #sets) bytes, to maximize the utilization of mark we have developed for this paper (Appendix C.) In addition, we the cache sets. give a compact comparison to the Data Morphing cost model (Ap- Figure 7 illustrates this point for our test system and 100 one- pendix D), detailed information on Write Operations (Appendix E) attribute wide containers. Each container is 4 bytes wide and the and Layout Generation (Appendix F.) associativity of the L1 cache is 8 in this case. Without cache set colli- sion optimization, the total number of cachable cache lines available when reading several containers in parallel is 8, since the contain- A. PHYSICAL DESIGN AND EXECUTION ers are all aligned to the same cache set and share the same width. Container Alignment Example: As described in Section 3, con- Cache evictions thus occur as soon as more than 8 attributes are read tainer alignment on cache boundaries can have a dramatic effect on in parallel, significantly increasing the number of cache misses (see the number of cache misses. For example, Figure 6 gives the num- Figure 7). By offsetting the containers using the method described ber of cache misses for two containers and for partial projections above, HYRISE is able to read all the containers in parallel without retrieving 0 to 80 attributes from the containers. The first container any early cache eviction (the system can read up to 512 containers in is a 80-attribute wide container while the second container is a 86- parallel in that case). attribute wide container (all attributes are 4 bytes wide). The first Cache set collisions often occur for the L1 cache. They occur less container has a width that is a multiple of the cache line size. The frequently for the L2 cache, which typically contains a much larger 86-attribute container is not aligned to the cache lines and suffers number of sets and has a higher associativity than the L1 cache. more cache misses for the partial projections, although the amount of data retrieved is the same in both cases. If this container were to Prefetcher Selection: In addition to allocating and aligning con- be padded to 384 bytes (instead of using 344 bytes corresponding tainers to minimize cache misses, HYRISE supports several cache to the width of its 86 attributes) then both the 80 and the 86 wide prefetching policies that can be switched on a per-operator basis. containers would behave similarly in terms of cache misses. For this Modern CPUs prefetch cache lines that the processor determines are reason, properly aligning containers as done in HYRISE is essential. likely to be accessed in the future. The advantage of this approach is that the data for a prefetched cache line starts to be loaded while the Cache Set Collision: Cache collisions due to associativity con- previous cache line is still being processed. flicts can be problematic in cache-aware systems. For this reason, Most processors provide several prefetchers and allow applica- HYRISE automatically adjusts its container alignment policy in or- tions to select which prefetcher they wish to use. For example, Intel der to minimize these cache set collisions. processors based on the Intel Core architecture provide two different When the OS allocates a large memory region (for example when L2 hardware prefetchers. The first prefetcher is called Streamer and creating a container), it usually automatically aligns the beginning loads data or instructions from memory to the second-level cache of the region with the beginning of a virtual memory page. Vir- in blocks of 128 bytes. The first access to one of the two cache tual memory pages have a fixed size—the address of their first lines in blocks of 128 bytes triggers the streamer to prefetch the pair byte always is a multiple of the system-level memory page size of lines. The second is the Data Prefetch Logic (DPL) hardware P AGESIZE (which is a system constant that can be determined prefetcher that prefetches data to the second level cache based on by calling getconf PAGESIZE). request patterns observed in L1. The total number of cache sets #sets is equal to Li .n/assoc, DPL is able to detect more complicated access patterns, even where assoc is the associativity of the cache. Each memory address when the program skips access to a certain number of cache lines; it address is mapped to a unique cache set set as follows: is also able to disable prefetching in the presence of random accesses where prefetching may hurt performance. address To evaluate the performance impact of the different hardware mod #sets. set(address) = (21) Li .w prefetchers we created two layouts, λ1 , consisting of a single wide This mapping is cyclic and starts over every #sets ∗ Li .w bytes. container of width w, and λ2 , consisting of a set of containers whose When the memory page size is a multiple of this cycle length, i.e., aggregate width was w. We accessed a list of random positions in when P AGESIZE mod (#sets ∗ Li .w) = 0, the addresses cor- each container, varying the selectivity from 0.0 to 1.0. For accesses responding to the beginning of the containers are all systematically to λ1 there was no visible difference between the two prefetching mapped to the same cache set, thus severely limiting the amount of implementations (Figure 8) but for accesses to λ2 , DPL used 24% cache available when processing several containers in parallel. This fewer CPU cycles as it was able to predict skips between containers 5.5e+06 1.2e+08 1 x 80 attributes container HYRISE Projection Optimized Number of L2 Cache Misses Number of L1 Cache Misses 5e+06 1 x 86 attributes container 1e+08 HYRISE Projection Collisions 4.5e+06 4e+06 8e+07 3.5e+06 6e+07 3e+06 2.5e+06 4e+07 2e+06 2e+07 1.5e+06 1e+06 0 0 10 20 30 40 50 60 70 80 0 10 20 30 40 50 60 70 80 90 100 Number of Attributes in Projection Number of Attributes in Projection Figure 6: L2 Misses for Containers with Different Alignments Figure 7: Experiment from Figure 3(a) with Cache Collision 113

10.Figure 8: Comparison of DPL and Figure 9: Comparison of DPL and Figure 10: Comparison of DPL and Streamer for sequential access and row Streamer for sequential access and col- Streamer for random access and column stores umn stores stores (Figure 9). In a second experiment, we generated a list of sequential {a2 , a4 } (σ1 does not split the partitions any further). The costs positions of varying selectivity and again accessed λ1 and λ2 . In this associated with these partitions are: case the streamer prefetcher was 7% faster than DPL when access- ing λ2 (Figure 10), with no difference when accessing λ1 . Based on 1 1 CostP 1 (W ) = w1 × Cost(π(P1 )) + f3 × Cost(π(P1 )sel=1/N ) (22) 1 these experiments, we use the DPL prefetcher by default in HYRISE. 1 1 CostP 1 (W ) = w2 × Cost(π(P2 )) + f3 × Cost(π(P2 )sel=1/N ) (23) In cases where several small containers (like those in λ2 ) are scanned 2 with a high selectivity, we enable Streamer on a per-operator basis. CostP 1 (W ) = w3 × 1 Cost(π(P3 )sel=1/N ) (24) 3 Database Loading: The output of the logical database designer is an Here Cost(π(P11 )sel=1/N ) reflects the cost of accessing one tuple annotated version of the original schema. When loading the data, we for the selection. adopt several optimization techniques to improve the cache perfor- In this example, merging P11 and P21 would be advantageous for mance: First, all containers are allocated using a memory region that the selection (since it touches all attributes), but introduces some is aligned to the width of a cache line. Second, we perform narrow overhead when performing projections (which never access a1 , a2 row padding to append padding to some of the containers in order and a3 simultaneously). The exact overhead depends on the width of to align their rows to the cache lines (see Section 3.6.) Finally, we the attributes, the cache line size, and the frequency of the operations shift the base pointers of the containers in order to minimize cache (as captured by our cost model in Section 3.) set collisions as described above in Appendix A. Operator Implementation: The layouter component uses the Par- C. BENCHMARK DESCRIPTION tial Projection Model to model materializing projections, predicate Most database benchmarks are built to address a specific market evaluation and the layout-dependent part of our hash join implemen- (e.g., OLAP or OLTP). For example, TPC-C is an OLTP benchmark tation. It uses the Selection Model to capture position-based selec- and TPC-H is an OLAP benchmark. Creating a hybrid of TPC-C tions. The layout-dependent costs of sorting and grouping operators and TPC-H is difficult since each benchmark uses different schemas are modeled by taking into account the data accesses (e.g., projec- and workloads. Consequently, we chose to create a new workload tions, selections of tuples for group-by and join columns) caused by derived from a real database application. As a starting point we used those operations. the sales application of an Enterprise Resource Planning System, Layout-Independent Operations: Depending on the materializa- which covers a full range of operations in a sales scenario. tion strategy chosen for the given query plan not all costs will be The business entities involved in sales are modeled as a large num- layout-dependent. Although all queries of our benchmark only con- ber of relations. This is both due to the application’s use of highly tain layout-dependent costs, for more complex scenarios with differ- normalized OLTP schemas and a result of so-called header-items. ent materialization strategies layout independent operations may be Header-items cause the sales order entity to be partitioned into a needed. For example, there are layout-independent costs (e.g. index sales order header table and a sales line item table. The header con- traversals) that would compete with the amount of cache used by the tains data relevant to the entire sales order. For example, its descrip- other operators; this behavior will need to be modeled, for example, tion, order date, and sold-to-party are stored there. Attributes of the using the notion of the repetitive random accesses presented in [16]. ordered material, number and price are kept in the line item table, As future work, we are investigating building a cost based query with each row representing one item and belonging to one order. In optimizer on top of HYRISE that attempts to choose between layout general, a single sales order consists of several line items. Master dependent and layout independent operators. data tables do not follow this pattern and store data in single tables for each type. For example, in the sales and distribution scenario, material and customer detail tables are both stored. The customer B. LAYOUT SELECTION EXAMPLE details table contains customer attributes, including name, account In this section, we give an example illustrating the candidate gen- type, contact, and billing data. Specifics about a material, such as eration process described in Section 4.2. We consider a very sim- its description, volume, weight and sales-related data are kept in the ple workload on a relation containing N tuples with four attributes material details table and the material hierarchy. In contrast to the a1 , . . . , a4 and consisting of a projection π1 = {a1 , a2 , a4 } with tables used by TPC-E or TPC-C, the tables we consider are modeled weight w1 , a second projection π2 = {a2 , a3 , a4 } with weight w2 , after a real enterprise system and are much wider. The widest tables and a selection σ1 retrieving all attributes of a single tuple with are the sales order line items table with 214 attributes and the mate- weight w3 . The three resulting primary partitions are P11 = {a1 }, rial details table with 204 attributes. The other tables have between P21 = {a2 , a4 }, and P31 = {a3 }, since π1 creates partitions 26 and 165 attributes (e.g. KNA1). {a1 , a2 , a4 } and {a3 } and π2 splits the first partition into {a1 } and Due to the complexity of these schemas, our benchmark queries 114

11. Business Partner KUNNR Business Partner Address # Attributes HYRISE Data Morphing (KNA1) (ADRC) 2 2 1 KUNNR 3 1 1 Sales Document Header Material Text (MAKT) 5 1 7 (VBAK) VBELN MATNR 7 1 38 Sales Document Item Material 9 1 43 (VBAP) MATNR (MARA) 11 1 872 MATNR 13 1 15526 Material Hierarchy (MATH) Table 1: Scaling Comparison: time (in [ms]) to derive layouts Figure 11: Simple Sales Schema for both the optimal HYRISE algorithm and the Data Morphing algorithm for a table of varying width and a single query. focus on sales order processing, leaving out operations for delivery of ordered items and billing of delivered items. Figure 11 illustrates the schema of our benchmark. D. DATA MORPHING Even on a simple workload consisting of only one query, the Hill- C.1 Benchmark Queries Climb algorithm suggested in [12] is not able to scale with the num- Our benchmark includes queries that perform small transac- ber of attributes. Table 1 shows an example where a sample relation tional operations—including writes—as well as more complex, read- is gradually extended one attribute at at time and where a simple mostly aggregates on larger sets of data. query projecting on the first two attributes is run for each configu- These queries composing our workload are as follows: ration. As can be seen on the figure, the cost of running HillClimb Q1 Search for a customer by first or last name (ADRC) becomes rapidly prohibitive, even on this very simple setting. select ADDRNUMBER, NAME CO, NAME1, NAME2, KUNNR Besides this scalability issue, the Data Morphing cost model is from ADRC where NAME1 like (..) OR NAME2 like (..); missing several key concepts that are useful in our context: Q2 Read the details for this customer (KNA1) Partial Projections: Data Morphing only considers full-scan opera- select * from KNA1 where KUNNR = (...); tions on the containers. While the Data Morphing model works fine Q3 Read all addresses belonging to this customer (ADRC) for simple, non-overlapping queries and small, aligned containers, select * from ADRC where KUNNR = (...); it can lead to arbitrarily bad results in more realistic cases. Con- Q4 Search for a material by its text in the material text table (MAKT) sider, for example, performing a projection of the first 4–64 bytes of select MATNR, MAKTX from MAKT where an aligned 256-byte wide container. Here, the Data Morphing cost MAKTX like (..); model will over-estimate the number of misses by a factor of 3 ver- Q5 Read all details for the selected material from the material table sus the true value (correctly predicted by the HYRISE model) since (MARA) only one cache line out of the four required by each row will be read select * from MARA where MATNR = (...); by the scan. This causes Data Morphing to severely penalize wide Q6.a Insert a new row into the sales order header table (VBAK) containers in its designs. We confirmed that real designs produced insert into VBAK (..) values (..); by Data Morphing can show these types of slowdowns vs HYRISE. Q6.b Insert a new row into the sales order line item table based on the results Data Alignment: Data Morphing does not consider data alignment. of query Q5 (VBAP) insert into VBAP (...) values (...); This can lead to very bad cache estimations in many cases, e.g., for Q7 Display the created sales order header (VBAK) containers that are not aligned to the cache (C.o = 0, in terminology select * from VBAK where VBELN = (..); of Section 3), for partial projections (in addition to the case above), Q8 Display the created sales order line items (VBAP) or for selections. For instance, the cost for full but relatively selec- select * from VBAP where VBELN = (..); tive projections (e.g., π.s < 10%) on a 60 byte container is approx- Q9 Show the last 30 created sales order headers (VBAK) imated as 1 cache miss per result by Data Morphing, but is actually select * from VBAK order by VBELN desc limit equal to 1.8 misses per result due to misalignment. 30; Query Plans: In addition to the points above, the data morphing ap- Q10 Show the turnover for customer KUNNR during the last 30 days proach does not include any notion of query plans. This is especially select sum(item.NETWR), header.KUNNR from bad for complex queries which access the same container several VBAK as header, VBAP as item where times. Such repeated accesses are treated as only one full scan by header.VBELN = item.VBELN and Data Morphing. header.KUNNR = $1 and header.AEDAT >= $2; The HYRISE cost model and layout algorithm take those impor- Q11 Show the number of sold units of material MATNR for the next 10 days on a per day basis tant cases into account, leading to superior performance. Further- select AEDAT, sum(KWMENG) from VBAP where more, the grouping and pruning algorithms in Section 4 allow us to MATNR = $1 and AEDAT = (..) group by AEDAT; scale to tables with hundreds of columns (such as those described in Q12 Show the number of sold units of material MATNR for the next 180 our full benchmark above), which Data Morphing is unable to do. days on a per day basis select AEDAT, sum(KWMENG) from VBAP where MATNR = $1 and AEDAT = (..) group by AEDAT; E. WRITING AND TRANSACTIONS Q13 Drill down through the material hierarchy starting on the highest level In this section we briefly describe the effect that transactions and using an internal hierarchy on the table, each drill-down step reduces updates have on cache performance in HYRISE. As noted in Sec- the selectivity, starting from 40% selectivity going down to 2.5% se- tion 2.1, in HYRISE, we use validity (begin and end) timestamps lectivity. for all tuples, as a result we only append new values to existing Queries Q1. . .Q9 can be categorized as typical OLTP queries, tables. This design means that all updates are actually two writes, while queries Q10, Q11, Q12 and Q13 can be categorized as OLAP- one to the end of the table and one to the validity timestamp of the style queries. Q1. . .Q6 are frequent, inexpensive OLTP queries and previous version, although the previous version’s timestamp is very have a weight wi of 100 in our benchmark. The other queries all unlikely to be read again. Furthermore, transactional operations like have a weight of 1. logging and updates read and write sets of data (when using multi- 115

12. Description CPU Cycles L2 Cache Misses No Writes 1, 105, 317 5 Non-temporal Writes 29, 902, 648 11, 683 Normal Writes 40, 289, 157 557, 346 Table 2: Comparing temporal and non-temporal writes version concurrency control) that are not re-accessed frequently and are private to a single transaction. These writes can result in cache pollution with data that is not likely to be read. To avoid this pollution, we use non-temporal writes provided by the Intel SIMD Streaming Extensions in HYRISE. When using non- temporal writes the write-combining buffer of the CPU will capture all writes to a single cache line and then write back the cache line directly to memory without storing the modified data in the cache. To measure the benefit of non-temporal writes, we allocated a single-column container with the size of the L2 cache (6 MB). Af- Figure 12: Contour plot showing the number of OLTP queries ter scanning the container and reading all values we then measured per OLAP query required before workload λ1 ’s cost exceeds the number of cache misses performed by an additional scan over that of λ2 for different values of σ1 and σ2 . the data (which by itself should not produce any L2 cache misses since the data is already cached and the container fits into cache). shows the result of this experiment. The contour lines define the Concurrently, we write data to a second container; we alternate be- region where λ1 ’s cost exceeds that of λ2 for different values of σ1 tween using the SSE extension for non-temporal writes and using and σ2 . For example, when σ1 is .005 and σ2 is .55, if there are plain memcpy(). We use the mm stream si128() function to more than 100 OLTP queries per OLAP query, λ1 is preferable (in generate the required MOVNTDQ assembler instruction, which causes fact, for any σ1 > .01, λ1 is preferable when there are at least 100 the write combined buffer of the CPU to collect all modifications for OTLP queries per OLAP query). a cache-lines worth of data and write it back to main memory without Experiment 2: A related question is how many partitions are typ- storing the data in any cache. This operation models the process of ically generated by our algorithm. In the simplest case, for a table accumulating a read set during the execution of a transaction. with n attributes and i independent queries that access the table on Table 2 shows the results of the experiment. The results show that disjoint attributes with the same weight and selectivity, the table will non-temporal writes use only 75% of the CPU cycles and 0.02% be split into i partitions. of the cache misses when compared to memcpy(), suggesting that For more complex cases, our algorithm will only generate a par- this optimization significantly improves the performance of writes titioning that optimizes the total workload cost, and won’t create a that are not frequently read. separate storage container for infrequently run queries. In most en- terprise applications, there are a small number of query classes (and F. HYRISE LAYOUT SELECTION hence attribute groups) that are accessed frequently; ad-hoc queries In this section we illustrate the influence of workload variations may be run but they will likely represent a very small fraction of on the layouts produced by HYRISE. total access. Hence, these more frequent (highly weighted) queries Experiment 1: In this experiment, instead of simply running our will dominate the cost and be more heavily represented in the storage layouter to find a layout for a given workload we use a workload for layout, resulting in a modest number of storage containers for most which one of two layouts is optimal, given a mix of queries run with workloads (i.e., it is unlikely that a full column-oriented layout will different frequencies (or weights). be optimal for non-OLAP workloads.) We use a table with 10 attributes (a1 . . . a10 ) and a total width Figure 13 shows the result of an experiment where we used our ap- of 56 bytes. The workload consists of two queries: one OLTP-style proximate graph-partitioning algorithm to determine the best layout query that scans attribute a1 and, for the rows that match a highly se- for a wide table of 500 4-byte attributes and an increasing number lective predicate, returns all attributes; and two, an OLAP query that of frequently-posed queries. The experiment starts with a relatively scans attribute a1 , applying a non-selective predicate, and then does expensive OLTP query (selecting all attributes with a selectivity of a GROUP BY on the matching rows of attribute a1 and aggregates 40%). We then iteratively add three random OLAP queries that each over the values of attribute a4 . project a random pair of attributes, until we have a total of 1000 From those two queries one of two layouts is optimal. The first, random queries. As expected, the number of partitions in slowly λ1 , separates attribute (a1 ) and group (a2 . . . a10 ) together; the sec- converges to an “all-column” layout. The time taken to compute the ond layout, λ2 , also separates (a1 ), but in addition splits the remain- layout using our approximate partitioning algorithm and for K = 10 ing group into (a4 ) and (a2 . . . a3 ; a5 . . . a10 ). Given both layouts varies from a few hundred milliseconds initially to a few minutes for we want to observe when the layout algorithm chooses to switch the case with 1000 OLAP queries. from one design to the other. Assuming that OLTP queries occur more frequently than OLAP queries, we wish to visualize when each layout is more appropriate. We vary OLTP selectivities from 0 < σOLT P < 0.5 and OLAP selectivities from 0.5 < σOLAP < 1. We then compute the cost of each query and determine the point at which the layouts have equal cost. x × CostOLT P (λ1 , σ1 ) + CostOLAP (λ1 , σ2 ) = (25) x × CostOLT P (λ2 , σ1 ) + CostOLAP (λ2 , σ2 ) Equation 25 shows the formula used to calculate the cost for layout λ1 and λ2 based on the distinct selectivities σ1 and σ2 . Figure 12 Figure 13: Increasing number of OLAP queries 116