Column Imprints: A Secondary Index Structure
We study indexing techniques for main memory, including hash indexes, binary search trees, T-trees, B+-trees, interpolation search, and binary search on arrays. In a decision-support context, our primary concerns are the lookup time, and the space occupied by the index structure. Our goal is to provide faster lookup times than binary search by paying attention to reference locality and cache behavior, without using substantial extra space.

1 . Cache Conscious Indexing for Decision-Support in Main Memory Jun Rao Kenneth A. Ross∗ Columbia University Columbia University junr@cs.columbia.edu kar@cs.columbia.edu a small space overhead, we can reduce the cost of binary search on the array by more Abstract than a factor of two. We also show that our technique dominates B+-trees, T-trees, and binary search trees in terms of both space We study indexing techniques for main and time. A cache simulation verifies that memory, including hash indexes, binary the gap is due largely to cache misses. search trees, T-trees, B+-trees, interpola- tion search, and binary search on arrays. In a decision-support context, our primary 1 Introduction concerns are the lookup time, and the space As random access memory gets cheaper, it becomes occupied by the index structure. increasingly affordable to build computers with large Our goal is to provide faster lookup times main memories. The recent “Asilomar Report” than binary search by paying attention to ([BBC+ 98]) predicts “Within ten years, it will be reference locality and cache behavior, with- common to have a terabyte of main memory serving out using substantial extra space. We as a buffer pool for a hundred-terabyte database. propose a new indexing technique called All but the largest database tables will be resident “Cache-Sensitive Search Trees” (CSS-trees). in main memory.” But main memory data process- Our technique stores a directory structure ing is not as simple as increasing the buffer pool on top of a sorted array. Nodes in this size. An important issue is cache behavior. The directory have size matching the cache-line traditional assumption that memory references have size of the machine. We store the directory uniform cost is no longer valid given the current in an array and do not store internal-node speed gap between cache access and main memory pointers; child nodes can be found by per- access. So, improving cache behavior is going to be forming arithmetic on array offsets. an imperative task in main memory data processing. In this paper, we focus on how to make indexes cache We compare the algorithms based on their conscious. time and space requirements. We have im- 10000 CPU Performance (60%/yr) plemented all of the techniques, and present DRAM Performance (10%/yr) performance improvement a performance study on two popular mod- 1000 ern machines. We demonstrate that with 100 ∗ This research was supported by a David and Lucile Packard Foundation Fellowship in Science and Engineering, 10 by an NSF Young Investigator Award, by NSF grant number IIS-98-12014, and by NSF CISE award CDA-9625374. 1 1980 1985 1990 1995 2000 Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed Figure 1: Processor-memory performance imbalance for direct commercial advantage, the VLDB copyright notice Index structures are important even in main and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large memory database systems. Although there are no Data Base Endowment. To copy otherwise, or to republish, disk accesses, indexes can be used to reduce overall requires a fee and/or special permission from the Endowment. computation time without using too much extra Proceedings of the 25th VLDB Conference, space. With a large amount of RAM, most of Edinburgh, Scotland, 1999. the indexes can be memory resident. Past work 78

2 . 0.45 0.2 array binary search enhanced B+ tree T tree search full CSS-tree 0.4 enhanced B+ tree 0.19 level CSS-tree full CSS-tree 0.35 level CSS-tree 0.18 hash 0.3 0.17 time (s) time (s) 0.25 0.16 0.2 0.15 0.15 0.14 0.1 0.13 0.05 0.12 0 5e+06 1e+07 1.5e+07 2e+07 2.5e+07 0 200000 400000 600000 800000 1e+06 1.2e+06 1.4e+06 space (bytes) space (bytes) (a) Whole Picture (b) Close-up Figure 2: Space/time Tradeoffs on measuring the performance of indexing in main- structures are space and time. Space is critical in memory databases includes [LC86b, WK90], with a main memory database; we may have a limited [LC86b] being the most comprehensive on the spe- amount of space available for precomputed struc- cific issue of indexing. In the thirteen years since tures such as indexes. Given space constraints, we [LC86b] was published, there have been substantial try to optimize the time taken by index lookups. In changes in the architecture of computer chips. The a main-memory database there are several factors most relevant change is that CPU speeds have been influencing the speed of database operations. An increasing at a much faster rate (60% per year) than important factor is the degree of locality in data memory speeds (10% per year) as shown in Figure 1 references for a given algorithm. Good data locality (borrowed from [CLH98]). Thus, the relative cost of leads to fewer (expensive) cache misses, and better a cache miss has increased by two orders of magni- performance. tude since 1986. As a result, we cannot assume that We study a variety of existing techniques, includ- the ranking of indexing algorithms given in [LC86b] ing hash indexes, binary search on a sorted list of would be valid on today’s architectures. In fact, our record identifiers, binary trees, B+-trees [Com79], experimental results indicate very different relative T-trees [LC86a], and interpolation search. We also outcomes from [LC86b] for lookup speed. introduce a new technique called “Cache-Sensitive Another recent development has been the explo- Search Trees” (CSS-trees). CSS-trees augment bi- sion of interest in On-Line Analytical Processing nary search by storing a directory structure on top (OLAP). [Fre95, Fre97] contrasts the requirements of the sorted list of elements. CSS-trees differ from of OLAP with OLTP systems and contends that the B+-trees by avoiding storing the child pointers in real performance gains can be obtained by separat- each node. The CSS-tree is organized in such a way ing the two systems. A dedicated OLAP system that traversing the tree yields good data reference can have a much better query performance if we locality (unlike binary search), and hence relatively are willing to sacrifice update performance. Com- few cache misses. CSS-trees also take advantage of mercial systems such as Sybase IQ [Syb97] were the OLAP context to optimize search performance, designed for such purposes. Thus, typical OLAP at the expense of update performance. workloads are query-intensive, and have infrequent We summarize the space/time tradeoffs of vari- batch updates. For example, a data warehouse for ous methods in Figure 2. Each point for T-trees, a university, containing all the student records, is enhanced B+-trees and CSS-trees corresponds to probably updated once per day. Census data sets a specific node size (multiple of cache line size). are collected periodically and will remain static for Normally a larger node size means less space but a relatively long period of time. These systems are more search time. The stepped line basically tells in the scale of several gigabytes and can already us what’s the optimal searching time for a given fit in RAM today. Given the current trend, we amount of space. Our conclusion is that CSS-trees expect more and more disk-based applications will dominate T-trees and enhanced B+-trees in both be moved to main memory in the future. Since space and time. There are tradeoffs between space updates are batched in those systems, incremental and time for binary search, CSS-trees and hash in- updates of indexes may not be very important. In dices. CSS-trees reasonably balance space and time. fact, we may want to rebuild indexes from scratch We discuss this issue in more detail in Section 7. after a batch of updates, if that leads to faster index searches. In this paper, we focus on such an OLAP environment. 2 Main Memory Databases Two important criteria for the selection of index Main Memory Database Systems: Some past 79

3 .work on main memory databases has addressed used. the problems of concurrency, transaction process- Assumptions: We assume an OLAP environ- ing and logging [GMS86, LN88, JLRS94], and re- ment, so we don’t care too much about updates. covery [Hag86, LC87]. Systems with a significant Our main concerns are the lookup time for an index, query-processing component include OBE [WK90], and the space required to store the index. There MM-DBMS [LC87], and Starburst [LSC92]. More are many applications in this class, as described in recently, the TimesTen corporation (formerly the Section 1. Smallbase project at Hewlett-Packard) has devel- oped a commercial main-memory system, with claims of tenfold speedups over disk-based sys- 3 Cache Optimization on Indexes tems [Sof97]. Most of the systems do not address the issue of cache reference locality. In this section, we first describe cache memories and the impact of cache optimization. We then give a Data Layout: [AHK85] and others describe the survey of the related work. Finally, we analyze the concept of a domain. When data is first loaded into cache behavior of various existing index structures main memory, distinct data values are stored in an for searching and point out their shortcomings. external structure, the domain, and only pointers to domain values are stored in place in each column. Cache memories are small, fast static RAM mem- This has the benefits of: (a) saving space in the ories that improve program performance by holding presence of duplicates, (b) simplified handling of recently referenced data [Smi82]. Memory refer- variable-length fields and (c) pointer comparisons ences satisfied by the cache, called hits, proceed can be used for equality tests. In the main memory at processor speed; those unsatisfied, called misses, database project at Columbia University, we focus incur a cache miss penalty and have to fetch the on an OLAP main memory database system. We go corresponding cache block from the main memory. further than [AHK85] by keeping the domain values A cache can be parameterized by capacity, block in order and associate each value with a domain size and associativity, where capacity is the size of ID (represented by an integer). As a result, we the cache, block size is the basic transferring unit can process both equality and inequality tests on between cache and main memory, and associativity domain IDs directly, rather than on the original determines how many slots in the cache are potential values. Although keeping values in order has extra destinations for a given address reference. cost, we expect the data is updated infrequently. Cache optimization in a main memory database Independently, Tandem Inc.’s InfoCharger storage system is similar to main memory optimization in engine [Eng98] has also chosen to keep domain values a disk-based system. But the management of the in sorted order. The use of domains means that cache is done by the hardware and the database many indexes can be built with smaller keys. system doesn’t have direct control of which block to Indexing in Main Memory Databases: Al- bring into a cache. This makes cache optimization though sequential data access is much cheaper in more subtle. main memory than in disk-based systems, indexing Typical cache optimization techniques include remains very important in main memory databases. clustering, compression and coloring [CLH98]. Clus- First of all, searching an index is still useful for tering tries to pack, in a cache block, data structure answering single value selection queries and range elements that are likely to be accessed successively. queries. Next, cheaper random access makes indexed Compression tries to remove irrelevant data and thus nested loop joins more affordable in main memory increases cache block utilization by being able to databases. Indexed nested loop join is pipelinable, put more useful elements in a cache block. This requiring minimal storage for intermediate results includes key compression, structure encodings such and is relatively easy to implement. As a matter of as pointer elimination and fluff extraction. Caches fact, indexed nested loop join is the only join method have finite associativity, which means that only a used in [WK90]. This approach requires a lot of limited number of concurrently accessed data el- searching through indexes on the inner relations. ements can map to the same cache line without Last but not least, transforming domain values to causing conflict. Coloring maps contemporaneously- domain IDs (as described in the previous section) accessed elements to non-conflicting regions of the requires searching on the domain. cache. A list of record identifiers sorted by some columns Previous research has attacked the processor- provides ordered access to the base relation. Or- memory gap using the above techniques. Wolf and dered access is useful for range queries and for satis- Lam [WL91] exploited cache reference locality to fying interesting orders [SAC+ 79]. A sorted array is improve the performance of matrix multiplication. an index structure itself since binary search can be LaMarca and Ladner [LL96, LL97] considered the 80

4 .effects of caches on sorting algorithms and improved is low. Since the number of key comparisons is still performance by restructuring these algorithms to the same, T-Trees do not provide any better cache exploit caches. In addition, they constructed a behavior than binary search. cache-conscious heap structure that clustered and Another problem with the T-Tree is that it has aligned heap elements to cache blocks. [CLH98] to store a record pointer for each key within a node. demonstrated that cache optimization techniques Since most of the time the record pointers won’t be can improve the spatial and temporal locality of needed, essentially half of the space in each node is pointer-based data structures. They showed im- wasted. Potentially, one could put just RIDs in the provement on various benchmarks. T-tree with no keys, but then search is much slower In [NBC+ 94], Nyberg et al. have shown that due to indirection. for achieving high performance sorting, one should B+-trees: Although B+-trees were designed for focus carefully on cache memory behavior. disk-based database systems, they actually have a Cache conscious algorithms have been considered much better cache behavior than T-trees. In each in database systems also. In [SKN94], the authors internal node we store keys and child pointers, but suggested several ways to improve the cache refer- the record pointers are stored on leaf nodes only. ence locality of query processing operations such as Multiple keys are used to search within a node. If joins and aggregations. They showed that the new we fit each node in a cache line, this means that a algorithms can run 8% to 200% faster. cache load can satisfy more than one comparison. Although cache optimization has been considered So each cache line has a better utilization ratio. on tree-based structures, nobody has looked at the Enhanced B+-trees: In an OLAP environ- influence of the cache on index structures used in ment, we can use all the slots in a B+-tree node database systems. Some other papers have consid- (similar to compact B-Trees [CS83] and the ISAM ered the issue of compact representations of B-tree method used in the IBM OS/360 operating sys- indexes [CS83, JTR87]. These papers appeared too tem [GR93]) and rebuild the tree when batch up- early to consider cache issues since memory speed dates arrive. We can design the node size to be wasn’t too slow compared with CPU speed. In the exactly the same as a cache line and align each node. rest of this section, we study the cache behavior We can also hard-code the node search since we know of various typical index structures used in main all the slots are used. But B+-trees enhanced in memory database systems. these ways still need to store child pointers within Array Binary Search: The problem with bi- each node (even when compact). So for any given nary search is that many accesses to elements of the node size, only half of the space can be used to store sorted array result in a cache miss. We do not get keys. misses for the first references because of temporal In the rest of this paper we shall consider only locality over many searches. We avoid misses for “enhanced B+-trees” as described above; however, the last references, due to spatial locality, if many we may sometimes use “B+-trees” for brevity. records from the array fit inside a single cache line. Hash: Hash indices can also benefit from cache However, when the array is substantially bigger than optimization. The most common hashing method the cache, many of the intervening accesses cause is chained bucket hashing [Knu73]. In [GBC98], the cache misses. In the worst case, the number of authors use the cache line size as the bucket size and cache misses is of the order of the number of key squeeze in as many < key, RID > pairs as possible. comparisons. This can reduce the number of cache misses when T-Trees: T-Trees have been proposed as a better scanning through the buckets. Hash indices are fast index structure in main memory database systems. for searching only if the length of each bucket chain A T-Tree [LC86a] is a balanced binary tree with is small. This requires a fairly large directory size many elements in a node. Elements in a node con- and thus a fairly large amount of space. Skewed tain adjacent key values and are stored in order. Its data can seriously affect the performance of hash aim is to balance the space overhead with searching indices unless we have a relatively sophisticated hash time and cache behavior is not considered (thirteen function, which will increase the computation time. years ago the gap between processor and main mem- Hash indices do not preserve order. To provide ory speeds was not that large). T-Trees put more ordered access, an ordered list has to be kept in keys in each node and give the impression of being addition to the hash indices. cache conscious. But if we think carefully, we can observe that for most of the T-Tree nodes, only 4 Cache Sensitive Search Trees the two end keys are actually used for comparison (in the improved version [LC86b], only one key is In this section, we present our cache conscious used). This means that the utilization of each node searching methods called CSS-trees. Section 4.1 81

5 . internal nodes leaf nodes 0 0 1 15 16 30 31 80 1 2 3 4 5 directory in an array I II 6-10 11-15 16-20 21-25 26-30 31-55 56-80 sorted array a 0 49 50 64 II I Full CSS-Tree (m=4) 65*4 elements in the array Figure 3: Layout of a full CSS-tree introduces the concept of “full” CSS-trees. We talk one cache miss. As a result we get at most logm+1 n about “level” CSS-trees in Section 4.2. cache misses for a lookup, unlike binary search which Suppose that we have a sorted array a[1..n] of gets up to log2 n cache misses. (Even if a node n elements. The array a could contain the record- occupies two cache lines, half the time only one cache identifiers of records in some database table in the miss will be generated, while half the time there will order of some attribute k. a could alternatively be two cache misses.) Second, we can hard-code the contain column-k keys from the records in the table, traversal within a node, so that calculations needed with a companion array holding the corresponding to find the next node happen logm+1 n times rather record-identifiers, using some extra space to avoid than log2 n times. (Note that the total number of an indirect reference during the search. a could comparisons is the same.) alternatively contain records of a table or domain Suppose that we number the nodes and keys values. starting at 0. If we have an internal node numbered Binary search of a has a serious cache usage b, then it is not difficult to show that the children problem as described in Section 3. A second problem of that node are numbered from b(m + 1) + 1 to with binary search is that it requires a calculation b(m + 1) + (m + 1). Within the directory, we have to be performed log2 n times to find the next el- m keys per node, so key number i in the directory ement to search. Even if this calculation uses a array maps to node number mi . As a result, one shift rather than a division by two [WK90], the can find the offset of the first key of the child nodes calculation represents a significant portion of the within the directory array as ( mi ∗ (m + 1) + 1) ∗ m execution time needed. Nevertheless, binary search through ( mi ∗ (m + 1) + m + 1) ∗ m. has the benefit that no additional space beyond a is needed to perform a search. Our goal is to improve A subtle point in the structure of a CSS-tree is upon the search time of binary search without using that we store the leaf nodes in a contiguous array a significant amount of additional space. in key order. This conflicts with the natural order of the CSS-tree that stores the nodes from left to right, level by level. The CSS-tree order would split 4.1 Full CSS-Trees the array, putting the right half of the array (which We create a search tree with nodes containing ex- appears at a higher level than the left half of the actly m keys. (We’ll see how to choose m later.) If array) ahead of the left half of the array. In Figure 3, the depth of the tree is d, then the tree is a complete the natural tree order is to store nodes 16-30 before (m + 1)-ary search tree up to depth d − 1, and at nodes 31-80. However, when the leaves are stored in depth d the leaves are filled from left to right. An a sorted array, nodes 31-80 come before nodes 16-30. example tree is shown for m = 4 in the left diagram Since maintaining a contiguous array in key order of Figure 3 (the numbers in the boxes are node is desirable for other purposes, and since the array numbers and each node has four keys). The nodes is given to us without assumptions that it can be of a CSS-tree can be stored in an array as shown on restructured, we leave the array in key order. To get the right in Figure 3. Note that we do not need to to the correct leaves when searching the CSS-tree, store explicit pointers to child nodes; the children of we modify the natural search algorithm. a node can be computed from offsets in the CSS-tree When performing a search, we move from parent array. (The exact formulas are given below.) to child by recalculating the offset within the direc- Our basic idea is to store a CSS-tree as a directory tory structure as above. We mark the end of the structure on top of the array a. There are two directory structure (in Figure 3, that’s the last key reasons why we expect a traversal of such a tree in node 15), and terminate this portion of the search to be faster than binary search. First, if we choose when the computed offset exceeds the endpoint. If m such that a node fits in a cache line, then all the leaves were stored in the natural CSS-tree order, local searching within a node happens with at most we’d use this offset to look up the directory directly. 82

6 .However, since the two parts of the leaf nodes are way that the leftmost key will always be found in stored in reverse order in a separate array, we need case of duplicates. So we will never reach the leaf to process this offset further. nodes in the deepest level with an index out of the The two parts of the leaf nodes are mapped into range of the first half of the sorted array. the sorted array as shown in Figure 3. We use y to Although it’s difficult to incrementally update a denote the boundary of the two parts of the leaf full CSS-tree, it’s relatively inexpensive to build such nodes, which is the offset of the first key at the a tree from scratch. Our experiments show that to deeper leaf level in the directory array. (In Figure 3 build a full CSS-tree from a sorted array of twenty- that’s the first key in node 31.) Given an offset x of a five million integer keys takes less than one second on key in a leaf node, we compare it with y to determine a modern machine. Therefore, when batch updates which part of the sorted array to look into. If x > y, arrive, we can afford to rebuild the CSS-tree. we find the element at position x − y from the start Searching a Full CSS-Tree: Once a full CSS- of the sorted array a. Otherwise, we find the element tree is built, we can search for a key. We start from at position y − x from the end of a. For example, in the root node. Every time we reach an internal node, Figure 3, the first key in leaf node 30 can be found we do a binary search to find out which branch to go at the first key in node 64 in the sorted array. to. We repeat until we reach a leaf node. Finally, we Note that our techniques apply to sorted arrays map the leaf node into the sorted array and binary having elements of size different from the size of a search the node. key. Offsets into the leaf array are independent of All the searches within a node consist of hard- the record size within the array; the compiler will coded if-else statements. When doing binary search generate the appropriate byte offsets. in the internal nodes, we keep checking whether the The following lemma tells us how to calculate keys in the left part are greater than or equal to the the node number of the last internal node, and the searching key. We stop when we find the first slot node number y. Key offsets can be obtained by that has a value smaller than the searching key and multiplying these numbers by m. then follow the branch on the right. (If such a slot can’t be found, we follow the leftmost branch.) In Lemma 4.1: Let n = N ∗ m be the number of this way, if there are duplicates in a node, we are elements in the sorted array a (N is the number of guaranteed to find the leftmost key among all the leaf nodes). The total number of internal nodes in duplicates. When there are duplicate keys being k −1 k −N indexed, we can return the leftmost match in a a full CSS-tree is (m+1) m − (m+1) m . The first (m+1)k −1 fashion similar to B+-trees. leaf node in the bottom level is number m . In both formulas, k = logm+1 (N ) . 4.2 Level CSS-Trees 4 Notice that CSS-trees are stored in a way similar to heaps. This is possible because of the way we 2 6 “virtually” split the sorted array. B+-trees can’t use 1 3 5 7 the same technique since they require all the leaves to be on the same level. 8 Building a Full CSS-Tree: To build a full CSS- Figure 4: Node with 8 keys tree from a sorted array, we first split the sorted A full CSS-tree with m entries per node will have array logically into two parts (based on Lemma 4.1) exactly m keys, i.e., all the entries are fully used. and establish the mapping between the leaf nodes Figure 4 shows the binary search tree of a node with and elements in the sorted array. We then start m = 23 entries. Out of the nine possible branches, with the last internal node. For each entry in the seven of them need three comparisons and two of node, we fill it with the value of the largest key in them need four. But if we waste one entry and its immediate left subtree. Finding the largest key just put seven keys per node, we will have a full in a subtree can be done by following the link in the binary search tree and all the branches need three rightmost branch until we reach the leaf nodes. comparisons. This may give us some benefit. So for Some internal nodes, namely ancestors of the last m = 2t , we define a tree that only uses m − 1 entries leaf node at the deepest level, may not have a full per node and has a branching factor of m a level complement of keys. In our algorithm, we simply fill CSS-tree. A level CSS-tree will be deeper than the in those dangling keys with the last element in the corresponding full CSS-tree since now the branching first half of array a. So there may be duplicate keys factor is m instead of m+1. However, we have fewer in some internal nodes. In our searching algorithm, comparisons per node. If N is the number of nodes we tune the searching within each node in such a that the elements in the sorted array can form, a 83

7 . Method branching # of levels comparisons per comparisons per factor (l) internal node (nComp) leaf node (Achild ) Binary search 2 log 2 n 1 1 T-trees 2 log 2 mn −1 1 log2 m enhanced B+-trees m 2 log m m n log2 m − 1 log2 m 2 n 2 Full CSS-trees m+1 log m+1 m (1 + m+1 ) log2 m log2 m n Level CSS-trees m log m m log2 m log2 m Method Total comparisons Moving across Level Cache Misses Cache Misses mK mK Level c <= 1 c >1 Binary search log2 n log2 n ∗ Ab log2 n log 2 n T-trees log2 n log2 n ∗ D log2 n log 2 n log2 n enhanced B+-trees log2 n log m mn ∗D log2 m−1 log m n((log 2 mK c )+ mK c ) 2 2 log2 n Full CSS-trees m+3 m+1 logm+1 m log2 n n logm+1 m ∗ Af css log2 (m+1) mK c log m+1 n((log2 c ) + mK ) log2 n Level CSS-trees log2 n logm m ∗ Alcss n log2 m log m n((log2 mK c c ) + mK ) Table 1: Time analysis level CSS-tree has logm N levels while a full CSS-tree raw data in the hash table). c denotes the size of a has logm+1 N levels. The number of comparisons per cache line in bytes, and s denotes the size of a node node is t for a level CSS-tree and t ∗ (1 + m+1 2 ) for in a T-tree, CSS-tree or enhanced B+-tree measured a full CSS-tree. So the total number of comparisons in cache-lines. We choose typical values as follows: for a level CSS-tree is logm N ∗ t = log2 N and that R = K = P = 4 bytes, n = 107 , h = 1.2, c = 64 for a full CSS-tree is logm+1 N ∗ t ∗ (1 + m+1 ) = 2 bytes and s = 1. log2 N ∗ logm+1 m ∗ (1 + m+1 ). The ratio of the 2 Time Analysis: To make the analysis easy, in (m+1) logm (m+1) this section we assume that R, P and K are the former to the latter is m+3 . Thus a level CSS-tree always uses fewer comparisons than a full same. Thus we have a single parameter m, which is CSS-tree for searching. On the other hand, level the number of slots per node. The size of a node in CSS-trees may require logm N cache accesses and cache-lines is given by s = mKc . logm N node traversals, compared with logm+1 N The first table in Table 1 shows the branching for full CSS-trees. Whether we obtain a net gain factor, number of levels, comparisons per internal in speed depends upon how time-consuming a com- node and comparisons per leaf node for each method. parison operation is compared with node traversals B+-trees have a smaller branching factor than CSS- and cache accesses. A level CSS-tree still utilizes trees since they need to store child pointers explic- most of the data in each cache line. It uses a little itly. The total cost of each searching method has more space than a full CSS-tree, but this may be three parts, namely the comparison cost, the cost of desirable for users who want to trade space for time. moving across levels and the cache miss cost. We The building of level CSS-trees is similar to that show the three costs for each method in the second of full CSS-trees. We can also make good use of the table in Table 1. We use D to denote the cost of empty slot in each node. During the population, we dereferencing a pointer, Ab , Af css , Alcss to denote can use that slot to store the largest value in the last the cost of computing the child address for binary branch of each node. We can thus avoid traversing search, full CSS-tree and level CSS-tree respectively. the whole subtree to find the largest element. The First of all, the comparison cost is more or less searching algorithm of a level CSS-tree is similar to the same for all the methods. Full CSS-trees have that of a full CSS-tree. The only difference is the slightly more comparisons than level CSS-trees as calculation of the offset of a child node. we described earlier. Some of the methods find the child node by following pointers and others by 5 Time and Space Analysis arithmetic calculations. The relative cost depends on computation complexity and the hardware. For In this section, we analytically compare the time example, while Ab could be smaller than D, Af css performance and the space requirement for different is likely to be more expensive than D. Nevertheless, methods. We let R denote the space taken by a methods with a higher branching factor have fewer record-identifier, K denote the space taken by a key, levels and thus usually have a lower cost of moving P denote the space taken by a child pointer and across levels. But that doesn’t mean the larger the n denote the number of records being indexed. h branching factor the better. Too large a node size denotes a hashing fudge factor (typically about 1.2, will increase the cache miss cost, which is probably meaning that the hash table is 20% bigger than the the most important factor since each cache miss can 84

8 . Method Space (indirect) Typical Value Space (direct) Typical Value RID-Ordered Access Binary search 0 0 MB 0 0 MB Y nK 2 nK 2 Full CSS-trees sc 2.5 MB sc 2.5 MB Y nK 2 nK 2 Level CSS-trees sc−K 2.7 MB sc−K 2.7 MB Y nK(P +K) nK(P +K) enhanced B+-trees sc−P −K 5.7 MB sc−P −K 5.7 MB Y Hash table (h − 1)nR 8 MB hnR 48 MB N 2nP (K+R) 2nP (K+R) T-trees sc−2P 11.4 MB sc−2P + nR 51.4 MB Y Table 2: Space analysis be an order of magnitude more expensive than a unit column, we do not count the space used by the computation. record-identifiers themselves since all methods share this space requirement. We assume a cold start in the cache. If the node size is smaller than the cache line size, each level has The column “Space (direct)” describes the space one cache miss. When the node size is larger than required by the algorithms if the structure being the cache line size, we estimate the number of cache indexed is a collection of records that cannot be misses per node to be (log2 s)+ 1s = (log2 mK )+ c rearranged. In other words, it is not acceptable for c mK (the total number of cache misses for all the keys a method to store the records internally within its adds up to s ∗ (log2 s) + 1 and we divide that by structure. In this column, we count the space used s assuming each cache-line is equally likely to be by record-identifiers for T-trees and hash tables since chosen). The results are summarized in the last two record-identifiers would not be necessary with other columns of the table. For most reasonable config- methods. urations, the number of cache misses is minimized All methods other than hash tables support ac- when the node size is the same as cache line size. cess in RID-order. The formula for Level CSS-trees In the third column, we can see that binary search assumes that scK is a power or 2. and T-trees always have a number of misses that are independent of m. B+-trees and CSS-trees have only a fraction of the cache misses of binary search. 6 Experimental Results The fractions for CSS-trees are even smaller than that of B+-trees. So CSS-trees have the lowest We perform an experimental comparison of the al- values for the cache related component of the cost. gorithms on two modern platforms. We analyze the As we can see, as m gets larger, the number of wall-clock time taken to perform a large number cache misses for all the methods approaches log2 n of random successful lookups to the index. We (essentially all methods degrade to binary search). summarize our experiments in this section. There is a tradeoff between full CSS-trees and level Experimental Setup: We ran our experiments CSS-trees. While the latter has slightly more cache on an Ultra Sparc II machine (296MHz, 1GB RAM) misses, it also performs fewer comparisons. It’s hard and a Pentium II (333MHz, 128M RAM) personal to compare the moving cost of the two since Alcss is computer. The Ultra machine has a < 16k, 32B, 1 > cheaper than Af css . We will show an experimental (<cache size, cache line size, associativity>) on-chip comparison in Section 6. cache and a < 1M, 64B, 1 > secondary level cache. The PC has a < 16k, 32B, 4 > on-chip cache and To summarize, we expect CSS-trees to perform a < 512k, 32B, 4 > secondary level cache. Both significantly better than binary search and T-trees machines are running Solaris 2.6. We implemented in searching, and also to outperform B+-trees. If a chained bucket hashing, array binary search, inter- bunch of searches are performed in sequence, the top polation search [Pet57], T-tree, enhanced B+-tree, level nodes will stay in the cache. Since CSS-trees full CSS-tree and level CSS-tree in C++. We chose have fewer levels than all the other methods, it will to implement all the existing methods ourselves also gain the most benefit from a warm cache. because we are considering them in the context of Space Analysis: Table 2 lists the space re- main memory OLAP environment. Existing imple- quirements of the various algorithms. The column mentations won’t be optimized for space allocation “Space (indirect)” describes the space required by and cache related issues such as alignment. As a the algorithms if the structure being indexed is a col- result, we believe our implementation will run faster. lection of record-identifiers that can be rearranged Since cache optimization can be sensitive to compil- if necessary. In other words, it is acceptable for a ers [SKN94], we also chose two different compilers: method to store record-identifiers internally within one is Sun’s native compiler CC and the other is its structure, as opposed to leaving the list of record- GNU’s g++. We used the highest optimization level identifiers untouched as a contiguous list. In this of both compilers. However, the graphs for different 85

9 .compilers look very similar, so we only report the Results: In the first experiment, we test how results for CC. All the keys are integers and are long it takes to build a CSS-tree. Both building time chosen randomly between 0 and 1 million. Each curves are linear in the size of the sorted array. Level key takes four bytes. The lookup keys are generated CSS-trees are cheaper to build than full CSS-trees in advance to prevent the key generating time from because they don’t need to traverse each subtree affecting our measurements. We performed 100,000 to find the largest key. We are not claiming that searches on randomly chosen matching keys. We CSS-trees can be built much faster than other index repeated each test five times and report the minimal structures. Instead, we want to show that since the time to find the first matching key. absolute rebuilding time is small (less than a second Algorithm Implementation Details: We for 25 million keys), we can afford to rebuild CSS- tried our best to optimize all the searching methods. trees periodically. The graph is omitted due to space For methods that can have different node sizes, we limitations. implemented specialized versions for selected node We then measure the searching time for all the size. We allocate a large chunk of memory at the algorithms. We first fix the node size and vary the beginning and create all the nodes from that pool size of the sorted array. We choose the node size to save allocation time. We use logical shifts in to be each of the cache line size of the two levels of place of multiplication and division whenever possi- cache in the Ultra Sparc (32 bytes and 64 bytes). ble. For T-trees, B+-trees and CSS-trees, we unfold Figure 5 shows the result on the Ultra Sparc. First the binary search loop in each internal node by of all, when all the data can fit in cache, there hardcoding all the if-else tests to reduce the amount is hardly any difference among all the algorithms of overhead. The search within a leaf node is also (except for interpolation search). As the data size hardcoded. Additionally, once the searching range increases, we can see that our cache conscious CSS- is small enough, we simply perform the equality trees perform the best among all the methods except test sequentially on each key. This gives us better for hashing. T-tree and binary search are the worst performance when there are less than 5 keys in the and run more than twice as slow as CSS-trees. The range. Code specialization is important. When our B+-tree curve falls in the middle. Although these code was more “generic” (including a binary search searching methods all have to do the same number loop for each node), we found the performance to be of key comparisons, they differ on how many of those 20% to 45% worse than the specialized code. comparisons cause a cache miss. T-tree search and The sorted array is aligned properly according to binary search essentially have one cache miss for the cache line size. For T-trees, B+-trees and CSS- each comparison. For CSS-trees, all comparisons trees, all the tree nodes are allocated at once and within a node are performed with only one cache the starting addresses are also aligned properly. miss. B+-trees also have one cache miss per node. We implemented enhanced B+-trees. They use But since a B+-tree stores half as many keys per all the slots in each node.1 Each node is designed to node as CSS-trees, it has more levels than a CSS- have a prespecified size and all the nodes are aligned tree. Although it’s hard to discern visually, the level properly. We also forced each key and child pointer CSS-tree performs slightly better than the full CSS- to be adjacent to each other physically. tree. Across all of our tests we observed that level CSS-trees were up to 8% faster than full CSS-trees. We avoid storing the parent pointer in each The performance of interpolation search depends on node of a T-tree since it’s not necessary for search- how well the data fits a linear distribution. Although ing. We implemented the improved version of T- not shown here, we also did some tests on non- Trees [LC86b], which is a little bit better than the uniform data and interpolation search performs even basic version. For each T-tree node, we store the worse than binary search. So in practice, we would two child pointers adjacent to the smallest key so not recommend using interpolation search. For B+- that they will be brought into the same cache line trees and the two CSS-trees, the numbers in Fig- together. (Most of the time, the improved version ure 5(b) are smaller than that in Figure 5(a). The checks only the smallest key in each node.) reason is that the miss penalty for the second level of For the chained bucket hashing, we followed the cache is larger than that of the on-chip cache. B+- techniques used in [GBC98] by using the cache line trees get more benefit than CSS-trees from having size as the bucket size. Besides keys, each bucket a larger node size. This is because B+-trees always also contains a counter indicating the number of hold half the number of keys per node as CSS-trees occupied slots in the bucket and the pointer to the and thus its benefit from avoiding extra cache misses next bucket. Our hash function simply uses the low is more significant. Although the T-tree has many order bits of the key and thus is cheap to compute. keys per node, it doesn’t benefit from having a larger 1 Since there is always one more pointer than keys, for node size. In this experiment, we chose the hash nodes with an even number of slots, we leave one slot empty. table directory size to be 4 million. Hashing uses 86

10 . 0.5 0.5 array binary search array binary search 0.45 tree binary search 0.45 tree binary search interpolation search interpolation search 0.4 T-tree 0.4 T-tree enhanced B+ tree enhanced B+ tree 0.35 full CSS-tree 0.35 full CSS-tree level CSS-tree level CSS-tree hash hash 0.3 0.3 time(s) time(s) 0.25 0.25 0.2 0.2 0.15 0.15 0.1 0.1 0.05 0.05 0 0 100 1000 10000 100000 1e+06 1e+07 100 1000 10000 100000 1e+06 1e+07 size of sorted array size of sorted array (a) 8 integers per node (b) 16 integers per node Figure 5: Varying array size, Ultra Sparc II 2e+06 2e+06 array binary search array binary search tree binary search tree binary search interpolation search interpolation search T-tree T-tree B+-tree B+-tree secondary cache accesses 1.5e+06 1.5e+06 secondary cache misses full CSS-tree full CSS-tree level CSS-tree level CSS-tree hash hash 1e+06 1e+06 500000 500000 0 0 100 1000 10000 100000 1e+06 1e+07 100 1000 10000 100000 1e+06 1e+07 size of sorted array size of sorted array (a) Total number of key accesses (b) Cache misses in secondary level cache Figure 6: Ultra Sparc II, 16 integers per node only one third of the time of CSS-trees. But we between CPU and RAM widens, this ratio will get have to keep in mind that it’s using 20 times as much higher and the difference between the performance space. of each method will be even more significant. To compare the number of cache misses, we ran all The results on the Pentium PC are very similar the searching methods through a cache simulator.2 and are omitted here. In [LC86a, LC86b], the Figure 6(a) and 6(b) show the total number of authors reported that T-trees perform better than key accesses and the total number of key misses binary search and B+-trees. Our results show the in the secondary level cache in a configuration that contrary conclusion. The explanation is that the matches the Sun Ultra Sparc. As the data set gets CPU speed has improved by two orders of magnitude larger, a larger proportion of the key accesses are relative to memory latency during the past thirteen cache misses. B+-trees have approximately 50% years [CLH98]. 0.5 more cache misses than CSS-trees. Let’s assume that each RAM access takes 180ns.3 Then enhanced 0.45 B+-trees take 0.06 seconds to load data from RAM 0.4 to cache (when there are 25 million keys in the 0.35 T tree sorted array) and this is about 30% of the total 0.3 enhanced B+ tree full CSS-tree time(s) level CSS-tree amount of time taken (0.22 seconds). Out of the 0.04 0.25 hash seconds time difference between B+-trees and CSS- 0.2 trees, about 0.014 seconds is caused by the difference 0.15 in cache miss penalties. Thus the cache simulation 0.1 verifies that 30% of the observed performance gap is 0.05 due to the cache miss behavior. As the speed gap 0 0 20 40 60 80 100 120 140 number of entries per node 2 We implemented the simulator ourselves and instru- mented our code (using #ifdef SIMULATOR macros) to log all Figure 7: Varying node size, 10 million keys memory accesses. 3 We assume the memory has a peak bandwidth of In our next experiment, we fix the size of the 350MB/s [Inc99]. 64/350M=180ns sorted array and vary the node size of T-trees, B+- 87

11 .trees, full CSS-trees and level CSS-trees. Figure 7 shows the result on the Ultra Sparc. For both CSS- 0.1600 0.0350 trees, the lowest point is 16 integers per node, which 0.1400 0.0300 corresponds to the cache line size of the machine. 0.1200 cpu B+-trees have a minimum at 32 integers per node. 0.0250 Although this doesn’t quite fit in the time analysis 0.1000 cpu 0.0200 in Section 5, the difference between 16-integers and 0.0800 cpu cpu cpu cpu 32-integers per node is not very significant. There 0.0600 0.0150 is a bump for full CSS-trees and B+-trees when 0.0100 cache miss 0.0400 each node has 24 integers. One reason is that the 0.0050 cache miss cache miss node size is not a multiple of the cache line size. 0.0200 cache miss cache miss cache miss Thus nodes are not properly aligned with cache 0.0000 0.0000 line and cause more cache misses. The bumps in enhanced B+ level CSS full CSS enhanced B+ level CSS full CSS full CSS-trees are more dramatic. This is because that the arithmetic computation for the child node (a) now (b) in five years is more expensive for m = 24 since division and Figure 8: Time breakdown (16 integers per node) multiplication must be used to compute child nodes 0.04 instead of logical shifts. We also tested the optimal enhanced B+ tree hash directory size of chained bucket hashing. Each full CSS-tree level CSS-tree point corresponds to a directory size from 223 to 218 0.035 with the leftmost point having the largest directory 0.03 size. When the directory size has been increased to 223 , the system starts to run out of memory and thus time (s) 0.025 the searching time goes up. T-trees perform much worse on all the node sizes. 0.02 7 Space/Time Tradeoffs 0.015 An indexing method is measured by the pair (S, T ) 0 200000 400000 600000 800000 1e+06 1.2e+06 1.4e+06 space (bytes) of space required for the whole index structure (S), Figure 9: Space/time Tradeoffs (in five years) and time taken for a single lookup (T ). Figure 2 shows the space requirement and searching time for time being, the cache miss cost corresponds to a each method.4 Each point for T-trees, B+-trees and relatively small fraction of the total cost. Given the CSS-trees corresponds to a specific node size. Since trend shown in Figure 1, in five years, we expect the keeping an ordered RID list is usually necessary, we CPU to be 10 times faster while the memory speed calculate the space requirement using the formula will be only 1.6 times faster. So the breakdown at for “direct” space listed in Table 2. The stepped line that time will probably look like that in Figure 8(b). basically tells us how to find the optimal searching The cache miss cost will be much more significant. time for a given amount of space. All the methods Figure 9 shows the projected space/time tradeoff on the upper-right side of the line are dominated between B+-trees and CSS-trees in five years. As by some methods on the line. T-trees and B+-trees we can see, the optimal CSS-trees are almost 30% are both dominated by CSS-trees. Methods on the faster than the optimal enhanced B+-trees. In the line have tradeoffs between space and time. On the limit, if cache miss cost dominates the total cost, bottom, we have binary search on the sorted array, CSS-trees can be up to 50% faster than enhanced which doesn’t require any extra space, but uses three B+-trees. Also, the optimal node size in five years times as much time as CSS-trees and seven times is different from what it is now. as much as hashing. If we invest a little bit of extra space, we can use CSS-trees, which reduce the 8 Conclusion searching time to one third of binary search. To make another factor of two improvement, we have We studied the cache behavior of various in-memory to pay 20 times as much space as CSS-trees to be indexes in an OLAP environment and showed that able to use hashing. When space is limited, CSS- cache conscious methods such as CSS-trees can im- trees balance the space and time. prove searching performance by making good use of cache lines. As the gap between CPU and memory We show the breakdown of the cost of enhanced speed is widening, we expect the improvement that B+-trees and CSS-trees in Figure 8(a). For the can be achieved by exploiting the cache will be 4 We present the numbers from searching a sorted array of even more significant in the future. Cache conscious 5 million elements on an Ultra Sparc II. searching behavior is just one step towards efficiently 88

12 .utilizing the cache in database systems. We aim to [Knu73] Donald Ervin Knuth. Sorting and Search- study the effect of cache behavior on other database ing, volume 3 of The Art of Computer Program- operations in the future. ming. Addison-Wesley, Reading, Massachusetts, USA, 1973. Acknowledgment: We thank John Grogan for the implementation of our cache simulator. [LC86a] Tobin J. Lehman and Michael J. Carey. Query processing in main memory database management systems. In Proceedings of the ACM SIGMOD Con- References ference, pages 239–250, 1986. [AHK85] Arthur C. Ammann, et al. Design of a memory [LC86b] Tobin J. Lehman and Michael J. Carey. A resident DBMS. In Proceedings of the IEEE COM- study of index structures for main memory database PCOM Conference, pages 54–57, 1985. management systems. In Proceedings of the 12th VLDB Conference, pages 294–303, 1986. [BBC+ 98] Phil Bernstein, et al. The Asilomar report on database research. ACM Sigmod Record, 27(4), [LC87] Tobin J. Lehman and Michael J. Carey. A 1998. recovery algorithm for a high performance memory- [CLH98] Trishul M. Chilimbi, James R. Larus, and resident database system. In Proceedings of the ACM Mark D. Hill. Improving pointer-based codes SIGMOD Conference, pages 104–117, 1987. through cache-conscious data placement. Technical [LL96] Anthony LaMarca and Richard E. Ladner. The report 98, University of Wisconsin-Madison, Com- influence of caches on the performance of heaps. puter Science Department, 1998. ACM Journal of Experimental Algorithmics, 1, 1996. [Com79] D. Comer. The ubiquitous B-tree. ACM Com- [LL97] Anthony LaMarca, et al. The influence of puting Surverys, 11(2), 1979. caches on the performance of sorting. In Eighth [CS83] F. Cesarini, et al. An algorithm to construct a Annual ACM-SIAM Symposium on Discrete Algo- compact B-tree in case of ordered keys. Information rithms, 1997. Processing Letters, 17(1):1612–1630, 1983. [LN88] K. Li and J. F. Naughton. Multiprocessor main [Eng98] InfoCharger Engine. Optimization for de- memory transaction processing. In International cision support solutions (available from http:// Symposium on Databases in Parallel and Distributed www.tandem.com/prod des/ifchegpd/ifchegpd.htm). Systems, pages 177–189, 1988. 1998. [LSC92] Tobin J. Lehman, et al. An evaluation of star- [Fre95] Clark D. French. “One size fits all” database burst’s memory resident storage component. IEEE architectures do not work for DDS. In Proceedings of Transactions on knowledge and data enginnering, the ACM SIGMOD Conference, page 449–450, 1995. 4(6):555–566, 1992. [Fre97] Clark D. French. Teaching an OLTP database [NBC+ 94] Chris Nyberg, et al. Alphasort: a RISC kernel advanced data warehousing techniques. In machine sort. In Proceedings of the ACM SIGMOD Proc. IEEE Int’l Conf. on Data Eng., 1997. Conference, pages 233–242, May 1994. [GBC98] Goetz Graefe, et al. Hash joins and hash teams [Pet57] W. W. Peterson. In IBM J. Res. & Devel., No. in Microsoft SQL server. In Proceedings of the 24th 1, pages 131–132, 1957. VLDB Conference, pages 86–97, 1998. [SAC+ 79] Patricia G. Selinger, et al. Access path selec- [GMS86] Hector Garcia-Molina, et al. High perfromance tion in a relational database management system. transaction processing with memory resident data. In Proceedings of the ACM SIGMOD Conference, In Proceedings of the International Workshop on pages 23–34, 1979. High Performance Transaction Systems, 1986. [SKN94] Ambuj Shatdal, et al. Cache conscious algo- [GR93] Jim Gray, et al. Transaction processing:concepts rithms for relational query processing. In Proceed- and techniques. The Morgan Kaufmann Publishers, ings of the 20th VLDB Conference, 1994. San Francisco, CA, USA, 1993. [Smi82] Alan J. Smith. Cache memories. ACM Com- [Hag86] R. B. Hagmann. A crash recovery scheme for a puting Surverys, 14(3):473–530, 1982. memory-resident database system. IEEE Transac- [Sof97] TimesTen Performance Software. Main-memory tions on Computing, C-35:839–842, 1986. data management technical white paper (available [Inc99] Sun Microsystems Inc. Datasheet on mem- from http://www.timesten.com). 1997. ory (available from http://www.sun.com/ microelec- [Syb97] Sybase Corporation. Sybase IQ 11.2.1, 1997. tronics/datasheets/sme1040/04.html as of feb. 15, 1999). 1999. [WK90] Kyu-Young Whang and Ravi Krishnamurthy. Query optimization in a memory-resident domain [JLRS94] H.V. Jagadish, et al. Dali: A high perfor- relational calculus database system. ACM Trans- mance main memory storage manager. In Proceed- actions on Database Systems, 15(1):67–95, 1990. ings of the 20th VLDB Conference, 1994. [WL91] Michael E. Wolf, et al. A data locality opti- [JTR87] Wiebren De Jonge, et al. Two access methods mizing algorithm. SIGPLAN Notices, 26(6):30–44, using compact binary trees. IEEE Transactions on 1991. Software Engineering, 13(7):799–810, 1987. 89

3 点赞
0 收藏