Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems

Sang K. Cha, Sangyong Hwang, Kihong Kim, Keunjoo Kwon
School of Electrical Engineering and Computer Science, Seoul National University
{chask, syhwang, next, icdi}

Abstract

Recent research addressed the importance of optimizing L2 cache utilization in the design of main memory indexes and proposed the so-called cache-conscious indexes such as the CSB+-tree. However, none of these indexes took account of concurrency control, which is crucial for running real-world main memory database applications that involve index updates and for taking advantage of off-the-shelf multiprocessor systems to scale up the performance of such applications. Observing that latching index nodes for concurrency control (CC) incurs so-called coherence cache misses on shared-memory multiprocessors, thus limiting the scalability of index performance, this paper presents an optimistic, latch-free index traversal (OLFIT) CC scheme based on a pair of consistent node read and update primitives. An experiment with various index CC implementations for the B+-tree and the CSB+-tree shows that the proposed scheme offers superior scalability on the multiprocessor system as well as performance comparable to that of sequential execution without CC on the uniprocessor system.

1. Introduction

With the price of server DRAM modules continuing to drop, the main memory DBMS (MMDBMS) emerges as an economically viable alternative to the disk-resident DBMS (DRDBMS) in many problem domains. The MMDBMS can show orders-of-magnitude higher performance than the DRDBMS not only for read transactions but also for update transactions. However, such a significant performance gain of the MMDBMS over the DRDBMS does not come automatically by just placing the database in memory; it requires MMDBMS-specific optimization techniques. For example, the so-called differential logging scheme improves the update and recovery performance of the MMDBMS significantly by enabling fully parallel access to multiple log and backup partition disks [LKC01]. Such a degree of parallelism was not achieved with the conventional, DRDBMS-oriented logging and recovery schemes such as ARIES [MH+92].

For the efficient processing of read-intensive transactions, recognizing the ever-widening speed gap between CPU and memory, recent research addressed the problem of optimizing L2 cache utilization in the design of main memory indexes and query processing. It has been shown that the T-tree, proposed in the 80's [LC86], shows poor L2 cache utilization on today's computer architectures [RR99]. Cache-conscious index structures such as the CSB+-tree [RR00] reduce cache misses and thereby improve search performance. The CSB+-tree keeps only one child pointer of the B+-tree per node, almost doubling the fanout of the tree. Recognizing that pointer elimination is not effective for the R-tree, whose MBR key is much bigger than a pointer, the CR-tree focuses on efficiently compressing MBR keys to pack more entries per node [KCK01]. The pkT-tree and pkB-tree significantly reduce cache misses by storing partial key information in the index in the face of nontrivial key sizes [BMR01].
While all of these cache-conscious indexes effectively improve search performance by increasing the index fanout and reducing the so-called cold and capacity cache misses, they were studied without much consideration of concurrency control (CC), which is crucial for running real-world main memory database applications involving index updates and for taking advantage of off-the-shelf multiprocessor platforms to scale up the performance of such applications. One naïve approach to this problem is to use the conventional, disk-resident index CC schemes such as lock coupling [BS77], which involves latching nodes during the index traversal. However, this naïve approach leads to poor scalability because latching involves a memory write and thus incurs so-called coherence cache misses on shared-memory multiprocessor systems [CSG99].

The so-called physical versioning was proposed for the T-tree used in the Dalì main memory storage system [RS+97]. Its key idea is to create a new version of a node for the updater so that index readers do not interfere with updaters. The major advantage of this scheme is the latch-free traversal of indexes. As a result, a degree of concurrency comparable to that of no concurrency control can be achieved for read transactions. Incorporating the updated version into the index involves obtaining latches either at the tree level or on both the tree and the node to update. The major problem with this scheme is the high cost of creating versions. The index performance degrades sharply with an increasing update ratio. The scalability of update performance is also very poor, even on the dual-processor platform where the reported experiment was conducted.

To the best of our knowledge, this paper is the first that addresses the importance of minimizing memory writes in the CC of main memory indexes to achieve multiprocessor scalability of index search and update performance. After discussing the coherence cache miss overhead associated with latching index nodes, we propose a new CC scheme that supports latch-free index traversal for B+-tree variants. Called optimistic, latch-free index traversal (OLFIT) CC, this scheme maintains, for each node, a latch, a version number, and a link to the next neighbor node at the same level. The next-node link is borrowed from the Blink-tree [LY81] to facilitate split handling. The index traversal involves consistent node reads starting from the root. Here, consistency means that no update occurs between the start and the end of a node read. To ensure the consistency of node reads without latching, every node update first obtains the node latch, updates the node content, increments the version number, and releases the latch. A node read begins with reading the version number into a register and ends with verifying that the node latch is free and that the current version number equals the register-stored one. If these two conditions hold, the read is consistent. Otherwise, the node read is retried until the conditions become true.

To verify the superior scalability of the OLFIT scheme, we implemented OLFIT and a few representative index CC algorithms for the B+-tree and the CSB+-tree: lock coupling, tree-level latching, physical versioning with node-level latches, and physical versioning with a tree-level latch. We chose not only the CSB+-tree but also the B+-tree because the latter shows reasonably good cache performance without losing good update performance. The result of our experimental study on an eight-CPU shared-memory multiprocessor system shows that for the pure read workload, OLFIT and the physical versioning schemes show almost the same scalable performance as no concurrency control. However, OLFIT is the only scheme that shows almost linear scalability of update performance with an increasing number of processors.

This paper is organized as follows. Section 2 presents the background of our study, and section 3 presents the basic idea of this paper and formulates our problem. Sections 4 and 5 describe the node and tree operations of OLFIT, respectively. Section 6 presents the result of the experiment conducted to compare OLFIT with other representative index CC algorithms. Section 7 concludes this paper.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 27th VLDB Conference, Roma, Italy, 2001

2. Background

2.1 Memory Hierarchy and Cache Misses

Today's computer systems use fast cache memory to fill the speed gap between CPU and main memory. For example, on the Sun Enterprise 5500 with 400MHz UltraSPARC II CPUs and EDO DRAM, an L2 cache hit takes about 8 processor clocks, while an L2 cache miss takes about 100 processor clocks [Sun97].

For the shared-memory multiprocessor system, which is the most common form of parallel computing environment, there is an additional type of overhead associated with synchronizing multiple caches and memory [CSG99]. So-called coherence cache misses occur when one processor updates a specific cache block and the copies of that block that happen to be cached in other processors are invalidated. L1 and L2 cache misses occur when these other processors attempt to access the updated block later. The overhead associated with cache coherence is severe if multiple processors repeatedly update the same data. Note that coherence cache misses do not occur for read-only applications where each processor simply reads data.

2.2 Cache-Conscious Index Structures

The T-tree, a balanced binary tree with many elements in a node, was proposed in the mid-80's when the speed gap between CPU and memory was not significant. Its traversal pattern of comparing the search key with the two end keys of a node (or only one key in the enhanced version) leads to poor L2 cache behavior on today's computer systems [RR99]. Recognizing the reasonably good cache performance of the B+-tree, Rao and Ross proposed a variant called the CSB+-tree that further enhances cache utilization by eliminating most child pointers [RR00]. This pointer elimination, given a node size on the order of the cache block size, almost doubles the fanout and thus leads to a significant reduction in the tree height and in the cache misses during the index traversal. However, because the CSB+-tree stores child nodes in contiguous memory called the node group, it increases the update cost.

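The fanout benefit of pointer elimination described above is easy to see with back-of-envelope arithmetic. The sketch below uses assumed sizes (128-byte nodes, 4-byte keys and pointers) and ignores header overhead for simplicity; the function names are illustrative, not from the paper.

```python
# Illustrative fanout arithmetic (assumptions: 128-byte node, 4-byte
# keys and pointers, per-node header overhead ignored).
def bplus_fanout(node_bytes=128, key=4, ptr=4):
    # A B+-tree non-leaf entry carries a key plus a child pointer.
    return node_bytes // (key + ptr)

def csb_fanout(node_bytes=128, key=4, ptr=4):
    # A CSB+-tree non-leaf node keeps a single pointer to the child
    # node group, so almost the whole node holds keys.
    return (node_bytes - ptr) // key

# 16 vs. 31 entries: eliminating per-child pointers nearly doubles
# the fanout, consistent with the roughly 2x gap reported later in
# Table 1 (16 vs. 30, once real header overhead is paid).
```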
The so-called full CSB+-tree avoids this problem by preallocating the memory for the full node group and handling node splits by shifting nodes within the full node group.

2.3 Concurrency Control of Indexes

Lock coupling latches index nodes heavily [BS77]. The index traversal proceeds from one node to its child by holding the latch on the node while requesting the latch on the child. The requested latch is in a shared or an exclusive mode, depending on the action to be performed on the target node. The latch on the node is released after the child is latched. One variant of lock coupling is the optimistic descent algorithm. Upon insertion, the algorithm first traverses to a leaf with the optimistic assumption that the target leaf is a safe node. If the assumption turns out not to be true, it tries again without the optimistic assumption.

The Blink-tree removes the need for lock coupling by linking each node to its right neighbor [LY81]. If a node is split by an insertion, the split node is linked with its neighbors. In the original B+-tree without links, the index traversal from one node to its child must hold the latch on the child before releasing the latch on the node, to prevent the split of the child. In the Blink-tree, however, the index traversal can release the latch on the node before holding the latch on the child, because it can reach the split node from the child through the links.

Tree-level locking is a naïve approach that latches the whole tree in the shared mode for search and in the exclusive mode for update. This approach leads to severe degradation of concurrency on multiprocessors.

Physical versioning is a scheme that traverses index nodes without latching, through versioning [RS+97]. However, the index update involves the significant overhead of memory allocation and writes. For every update, the updater allocates a new version of a node, copies the content of the old version into the new one, and updates the new one. It then atomically swaps the old version with the new one. When node latches are used, the node to update and its parent are latched. A garbage collector deallocates the old version of the node when there are no readers reading it. We consider two variants of physical versioning: one that places latches on nodes, and another with a tree latch only.

3. Motivation and Problem Formulation

3.1 Coherence Cache Misses Caused by Latching

For the node-level CC of a main-memory index, latches are typically placed inside the index node. A latching operation, whether it acquires or releases a latch, and whether it is in a shared or an exclusive mode, involves a memory write. With the conventional index CC schemes, the invalidation of the cache block containing the latch therefore occurs even if the index is not updated. It is possible to separate latches from index nodes, but such separation does not bring any advantage because the cache blocks containing the latches are subject to coherence cache misses anyway.

[Figure 1. Coherence Cache Miss Caused by Latching]

Figure 1 illustrates how coherence cache misses occur with an index tree consisting of nodes n1 to n7. For simplicity, assume that each node corresponds to a cache block and contains a latch. Let processor p1 traverse the path n1, n2, n4 upon the cold start of the main memory query processing system; these nodes are then replicated in the cache c1 of p1, and latches are held and released on n1, n2, and n4 on the way. Let p2 traverse the path n1, n2, n5; these nodes are copied into c2, and the copies of n1 and n2 cached in c1 are invalidated by p2 latching them. Note that this cache invalidation occurs even if there is enough room in the cache c1. If p3 traverses the path n1, n3, n6, the copy of n1 in c2 gets invalidated. Finally, if p4 traverses the path n1, n3, n7, the copies of n1 and n3 in c3 get invalidated.

3.2 Analysis for Problem Formulation

To show the problem of latching index nodes for concurrency control, we first analyze the probability of coherence cache misses. We assume that index nodes are latched once before being accessed. This is a good approximation even for lock coupling, which acquires a latch and releases it soon during the traversal, because the interval between the acquisition and the release is very short. We also assume that the size of the cache memory is infinite and that each node has already been cached in a processor, to isolate the effects of capacity and cold cache misses, respectively. Finally, we assume that all processors perform the same task and that the probability of a node being accessed by a specific processor is identical to its probability of being accessed by any other processor.

A cached node is invalidated if a processor other than the one that caches the node accesses it. Therefore, the probability that a coherence cache miss occurs for a node cached in a processor is the same as the probability that another processor accesses the same node. Let p denote the number of processors in the system. Then the (p − 1) processors among the p processors can invalidate the node cached in a specific processor.

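The invalidation sequence described for Figure 1 above can be replayed with a small simulation. The sketch below assumes, as the figure does, that each node maps to one cache block and that every visit latches the node (i.e., performs a memory write); the function name and cache model are illustrative.

```python
# Replay of the Figure 1 scenario: four processors traverse a 7-node
# tree; every latch is a write that invalidates copies of the node
# held in all other caches (infinite cache capacity assumed).
def count_invalidations(paths):
    caches = {p: set() for p in paths}       # per-processor cache contents
    invalidated = 0
    for p, path in paths.items():            # p1, p2, p3, p4 run in turn
        for node in path:
            for q, cache in caches.items():
                if q != p and node in cache: # the write invalidates the copy
                    cache.discard(node)
                    invalidated += 1
            caches[p].add(node)
    return invalidated

paths = {"p1": ["n1", "n2", "n4"], "p2": ["n1", "n2", "n5"],
         "p3": ["n1", "n3", "n6"], "p4": ["n1", "n3", "n7"]}
# p2 invalidates n1 and n2 in c1; p3 invalidates n1 in c2;
# p4 invalidates n1 and n3 in c3 -- five coherence misses in total.
```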
Since each processor has the same probability of accessing the node, the probability of a coherence cache miss is:

    (p − 1) / p        (1)

This formula leads to the conclusion that the probability of coherence cache misses increases with the number of processors. Thus, if we do not pay attention to coherence cache misses, even an infinite cache memory may not be very helpful for reducing the memory access cost.

We showed that if we can read nodes consistently without latching them, as we propose in the OLFIT scheme, we can save many cache misses caused by latching during index search. Now we show that we can also save many cache misses with latch-free traversal during index updates. We support this claim by showing that even during index inserts, many nodes are only read, not updated. If we denote the maximum fanout of nodes as f_M, the probability that a leaf node is split by an insert operation is as follows [JS93]:

    1 / (0.69 f_M)

If we denote the level of a node as l and the height of the tree as h, then, since a split of a child is an insert of an entry into its parent node, the probability of an update at level l is:

    (1 / (0.69 f_M))^(h − l)        (2)

[Figure 2. Probability of Update with 100% Insert Workload]

Figure 2 shows the graph calculated from equation (2) as well as the relative frequency of actual updates at each level l, measured when 1 million keys are inserted into a B+-tree holding 10 million keys. We choose f_M = 15 because it is the maximum fanout of the B+-tree whose node size is two cache blocks. This B+-tree will be used as the basis of our experimental evaluation presented in section 6. With f_M = 15, h = ceil(log_{15×0.69} 10,000,000) = 7. The conclusion to draw from Figure 2 is that most accesses to non-leaf nodes are read-only even when all index operations are insertions.

3.3 Optimistic, Latch-Free Index Traversal CC

The purpose of index CC is to guarantee that index readers reach the correct leaf nodes without interfering with concurrent index updaters, and that index updaters do not interfere with each other. One of the key criteria for judging whether a concurrency control scheme is good is the degree of parallelism that it provides. The objective of this subsection is to draw the rationale for the latch-free traversal for both index readers and updaters.

In the previous analysis, we found that most actual B+-tree index node updates occur at or near the leaf level. Namely, the probability that an update at a leaf node of the B+-tree propagates to the upper-level nodes decreases sharply with the distance from the leaf. In this regard, the conventional index CC schemes that latch nodes during the traversal are too conservative when applied to the in-memory B+-tree or CSB+-tree, which is significantly deeper than the disk-resident B+-tree because its node size is on the order of up to a few times the L2 cache block size. Most latches are needlessly acquired and released by both index readers and updaters, especially for the upper-level nodes, even though the probability of actually updating such nodes is extremely low.

The duration of holding a latch is very short. If we consider that the number of leaf nodes is far bigger than the typical number of processors available on today's multiprocessor platforms, the probability of conflict among concurrent index readers and updaters is also extremely low. So the latches on leaf nodes, too, are acquired too pessimistically.

In disk-resident database systems, the cost of latching may not be significant because of the dominant cost of disk access. However, in main-memory database systems, the latching cost is a dominant portion of the processing cost of read transactions and a nontrivial portion of the processing cost of update transactions. Thus the latching overhead is one of the primary factors limiting the scalability of index performance on the multiprocessor platform. This will be confirmed later in the experimental evaluation by the poor scalability of the lock coupling scheme.

Based on these observations, we propose OLFIT, an index CC scheme that traverses down the index optimistically without latching any nodes. Index updaters start acquiring latches only at the leaf node, after reaching the leaf to update. The latch acquisition proceeds upward if the update of a node leads to a node split or a node deletion. To guard the latch-free index traversal from interfering with node updates, and also to guard node updates from interfering with other node updates, we include the version number and the latch in the B+-tree index node and provide a pair of node read and update primitives that use this information.

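Equation (2) in the analysis above is easy to tabulate. The sketch below uses the paper's parameters (f_M = 15, h = 7) and assumes the leaf is level 7, matching Figure 2's level axis; the function name is illustrative.

```python
import math

# Probability (2) that an insert updates a node at a given level,
# with h = 7 and f_M = 15 as in the text (level h is the leaf level;
# the 0.69 factor reflects average node occupancy [JS93]).
def p_update(level, height=7, fanout=15):
    return (1.0 / (0.69 * fanout)) ** (height - level)

# Height check from the text: h = ceil(log_{0.69*15} 10,000,000) = 7.
height = math.ceil(math.log(10_000_000, 0.69 * 15))

# Every insert updates a leaf, but already one level up fewer than
# 10% of inserts cause an update, and the probability keeps falling
# by roughly a factor of 10 per level.
```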
Updaters hold a latch on a node before the update, increment the version number, and release the latch after the update. This guarantees that an updater changes a node from one consistent state into another consistent state. Readers read the node latch to check whether there is any updater, and read the version number twice, before and after reading the node, to verify that the content they read is in a consistent state. Readers retry reading until the verification succeeds. However, the probability of a retry is very low because of the first and second observations made above.

In the following two sections, we first present the pair of node read and update operations used by the OLFIT scheme, and then present how the index-level traversal and update work.

4. Node Operations for OLFIT

[Figure 3. Node Structure of the B+-Tree for OLFIT: latch, version, level, size, keys and ptrs, high key, link ptr]

Figure 3 shows the structure of the B+-tree node used by OLFIT. The latch and the version are added for OLFIT. The level represents the level of the node in the tree, and the size represents the number of entries in the node. The usage of the high key and the link ptr is explained in section 5. In the following algorithms for reading and updating nodes, we assume that reads and writes of single words are atomic, as they are on most modern architectures.

Algorithm UpdateNode
U1. Acquire latch.
U2. Update the content.
U3. Increment version.
U4. Release latch.

Update operations on a node are isolated and serialized using the latch on the node. Updaters acquire the latch on the node for exclusive access before updating the content of the node: version, level, size, keys, ptrs, high key, and link ptr. The latch is released after the update. Updaters also increment the version before releasing the latch. Differently from node reads, node updates always succeed, and there is no need for retry. The U3 step of incrementing the version enables readers to read the node without latching and to verify whether the node they read is in a consistent state.

Algorithm ReadNode
R1. Copy the value of version into a register R.
R2. Read the content of the node.
R3. If latch is locked, go to R1.
R4. If the current value of version is different from the copied value in R, go to R1.

Steps R3 and R4 of ReadNode guarantee that readers only read a node in a consistent state, without holding any latches. Readers can pass both R3 and R4 only if the data read in R2 is in a consistent state; otherwise, they start again from R1. In other words, if the content of the node read in R2 is in an inconsistent state, either the condition in R3 or the condition in R4 becomes true, and the reader cannot pass both R3 and R4.

An implementation of ReadNode and UpdateNode can be derived directly from the algorithm descriptions if we implement latching using one of the atomic read-and-write operations, such as test-and-set and compare-and-swap, that are universally available on modern computer architectures. However, since consuming two words per node for the latch and the version is expensive given a node size on the order of one or two cache blocks, we combine these two words into a single word named ccinfo in Figure 4. The LSB of ccinfo is used for the latch, and the other bits are used for the version number. This combination enables a further optimization: only one conditional jump is needed to check the conditions of R3 and R4 in the ReadNode algorithm.

In Figure 4, the operation compare-and-swap(A, B, C) is an atomic operation that compares the values of A and B and, if they are equal, replaces the value of A with the value of C; it returns the original value of A. Turn-on-LSB(word) and turn-off-LSB(word) are bit operations that turn on and off the LSB of a word, respectively.

The procedure update_node in Figure 4 is an implementation of the algorithm UpdateNode. Step 2 of the procedure latch copies the value of ccinfo with its LSB turned off. Step 3 of the procedure latch checks whether the LSB of ccinfo is turned on by comparing the value of ccinfo with the value of t copied in step 2, and atomically turns on the LSB of ccinfo by replacing the value of ccinfo with t with its LSB turned on. The procedure unlatch releases the latch and increments the version number by incrementing ccinfo: since the LSB of ccinfo is turned on inside latch, the increment in unlatch turns off the latch and increments the version number simultaneously.

The procedure read_node in Figure 4 is an implementation of the algorithm ReadNode. Step 4 of read_node checks the conditions of R3 and R4 of ReadNode simultaneously, because if t is equal to ccinfo, the LSB of the current ccinfo must be turned off and the other bits must not have changed since step 2.

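The single-word ccinfo scheme just described can be sketched in executable form. The Python below is a sequential illustration only: the cas function stands in for the hardware compare-and-swap, so atomicity is simulated, and the Node class is a hypothetical stand-in for the real node layout.

```python
class Node:
    def __init__(self):
        self.ccinfo = 0      # LSB = latch bit, remaining bits = version
        self.content = None

def cas(node, expected, new):
    """Simulated compare-and-swap on node.ccinfo; returns the old value."""
    old = node.ccinfo
    if old == expected:
        node.ccinfo = new
    return old

def latch(node):
    while True:
        t = node.ccinfo & ~1              # turn-off-LSB
        if cas(node, t, t | 1) == t:      # atomically set the latch bit
            return

def unlatch(node):
    node.ccinfo += 1     # LSB was 1: +1 clears it and bumps the version

def update_node(node, value):
    latch(node)
    node.content = value
    unlatch(node)

def read_node(node):
    while True:
        t = node.ccinfo & ~1              # step 2: snapshot, latch bit off
        value = node.content              # step 3: read the content
        if node.ccinfo == t:              # step 4: R3 and R4 in one test
            return value

n = Node()
update_node(n, "key-42")
# After one update the latch is free (LSB 0) and the version is 1,
# so ccinfo == 2, and read_node returns "key-42" without retrying.
```

Note how the single equality test in read_node checks both conditions at once: if the latch bit were set, or the version had advanced, ccinfo could not equal the snapshot t.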
procedure latch(word) {
  1. do {
  2.   t := turn-off-LSB(word);
  3. } while (compare-and-swap(word, t, turn-on-LSB(t)) ≠ t);
}

procedure unlatch(word) {
  word := word + 1;
}

procedure update_node(node) {
  1. latch(node.ccinfo);
  2. // Update the content of node
  3. unlatch(node.ccinfo);
}

procedure read_node(node) {
  1. do {
  2.   t := turn-off-LSB(node.ccinfo);
  3.   // Read the content of node
  4. } while (t ≠ node.ccinfo)
}

Figure 4. Implementation of UpdateNode and ReadNode

5. Tree Operations for OLFIT

5.1 Dealing with node splits

Since OLFIT does not use lock coupling while traversing down the index, concurrent updaters may split the target child node of the traversal before the traversal reaches it. Moreover, since no latch is held while reading a node, concurrent updaters may also split the node currently being read. To deal with this problem, we use the technique of a high key and a link pointer proposed by [LY81] and improved by [Sag85]. All splits are done from the left to the right, and a high key and a link pointer are added to each node. The high key of a node is the upper bound of the key values in the node, and the link pointer is the pointer to the right neighbor of the node. The purpose of the link pointer is to provide an additional path for reaching a node, and whether to follow the link pointer can be determined from the high key. Since splits are done from the left to the right and each node carries its high key and its link pointer to the right neighbor, all nodes split from a node remain reachable from that node, and the correct child node can be reached even in the presence of concurrent splits.

5.2 Tree traversal algorithm

Figure 5 shows the pseudocode for the tree traversal. The find_next(node, key) primitive finds the next node to traverse. If the key is greater than the high key of the node, this operation returns the link pointer. Otherwise, it returns the pointer to the appropriate child node to traverse down.

procedure traverse(root, key) {
  1. node := root;
  2. while (node is not a leaf) {
  3.   t := turn-off-LSB(node.ccinfo);
  4.   next := find_next(node, key);
  5.   if (node.ccinfo = t) node := next;
  6. }
  7. return node;
}

Figure 5. Pseudocode of Tree Traversal

The procedure read_node is embedded in the procedure traverse of Figure 5. The while loop of read_node is removed by assigning the value of next to the variable node only if the value of next was computed from a node in a consistent state.

5.3 Dealing with node deletion

An updater can delete a node being read if the node is empty. To deal with this case, when a node becomes empty, the updater only removes the links directed to the node and registers the node with a garbage collector. The value of the link pointer in the empty node is preserved until the node is actually deallocated. The garbage collector deallocates a registered node only when there are no index operations that can still read it. To determine whether any such operation exists, we use the algorithm originally proposed for physical versioning [RS+97]. However, the overhead is quite different, because physical versioning invokes the garbage collector on every update, while we use it only when an updater removes an empty node.

5.4 Putting it together with transaction-duration locking

To support the serializability of transactions by transaction-duration locking, both the locking protocol of ARIES/KVL [Moh90] and that of ARIES/IM [ML92] can be combined with our OLFIT algorithm without any modification. OLFIT replaces only the latching protocols of ARIES/KVL and ARIES/IM.

5.5 Adaptation to the CSB+-tree

In the CSB+-tree, nodes with the same parent are clustered in a contiguous space called the node group, and non-leaf nodes of the CSB+-tree store only a pointer to the child node group instead of all the pointers to child nodes. CSB+-tree nodes for OLFIT store only high keys, without link pointers, because the right neighbor within the same node group can be located without a pointer. Only one link pointer to the right node group is stored per node group, to locate the right neighbors in other node groups. This link pointer has its own latch and version, because it does not belong to any node in the node group. Here, the CSB+-tree means the full CSB+-tree; extensions to other variants of the CSB+-tree are straightforward.

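The high-key test in find_next, described in section 5.2 above, can be made concrete with a small sketch. The Node layout below is a hypothetical simplification (single-threaded, no ccinfo validation) meant only to show when the link pointer is followed instead of a child pointer.

```python
import bisect

# Hypothetical, simplified node: no ccinfo, latching, or validation.
class Node:
    def __init__(self, level, keys, children=None, high_key=None, link=None):
        self.level = level        # 1 = leaf level (assumed numbering)
        self.keys = keys          # separator keys (non-leaf) or entries (leaf)
        self.children = children  # child pointers (non-leaf only)
        self.high_key = high_key  # upper bound of keys reachable via this node
        self.link = link          # pointer to the right neighbor, if any

def find_next(node, key):
    # A concurrent split moves keys to the right sibling, so a key above
    # the high key is reachable only through the link pointer.
    if node.high_key is not None and key > node.high_key:
        return node.link
    return node.children[bisect.bisect_left(node.keys, key)]

def traverse(root, key):
    node = root
    while node.level > 1:         # Figure 5's loop, minus the ccinfo check
        node = find_next(node, key)
    return node

# Two leaves under one inner node, with a right sibling at inner level.
leaf_a = Node(1, [10, 20])
leaf_b = Node(1, [30, 40])
sibling = Node(2, [60], children=[], high_key=None)
inner = Node(2, [20], children=[leaf_a, leaf_b], high_key=40, link=sibling)
```

A search for 30 descends to leaf_b through the child pointer, while a search key above the high key (say 50) is routed sideways to the sibling, which is exactly how a traversal catches up with a concurrent split.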
The split operation of the CSB+-tree is different from that of the B+-tree because nodes with the same parent are clustered in a node group. When a node splits, if the node group is not full, all right siblings in the node group shift right to make room for the split. In this case, the node to split, all shifting nodes, and the parent node are latched before the split. If the node group is full, the node group splits into two node groups. In this case, the node to split, the shifting nodes, the parent node, the nodes to be moved into the new node group, and the link pointer to the right node group are latched.

6. Experimental Evaluation

To verify the superior scalability of the OLFIT scheme experimentally, we implemented five index CC schemes for the B+-tree and the full CSB+-tree: lock coupling with node latches (LC), tree-level locking with a tree latch (TL), physical versioning with node latches (VN), physical versioning with a tree latch (VT), and OLFIT (OL) proposed in this paper. The performance of index operations without concurrency control (NO) is also measured for the 100% search and the single-thread experiments. For the lock coupling, we used the optimistic descent algorithm in [BS77], and for the physical versioning schemes, we adapted the algorithm for the T-tree in [RS+97] to the B+-tree and to the full CSB+-tree.

We ran our experiment on a Sun Enterprise 5500 server with 8 CPUs (UltraSPARC II, 400 MHz) running Solaris 7. Each CPU has an 8 MB L2 cache whose cache line size is 64 bytes. We ran each concurrency control scheme with the following workloads: 100% search, 100% insert, 50% insert + 50% delete, and a mixture of search, insert, and delete with a varying update ratio. We do not show the graph of 100% insert due to lack of space; it is very similar to that of 50% insert + 50% delete.

For a fair comparison with the physical versioning schemes, which require substantial memory allocation for updates, we used memory pools to reduce the overhead of system calls for memory allocation. We created the same number of memory pools as the number of processors in the system, and assigned one to each thread to minimize the contention for memory allocation.

For each experiment, we used non-unique indexes that allow duplicate key values, and the indexes are initialized by the insertion of 10 million uniformly distributed 4-byte integer keys and associated pointers. The size of pointers is 4 bytes because we ran our experiment in the 32-bit addressing mode. After the initialization, the height of B+-trees is 7 and the height of full CSB+-trees is 6, and their sizes are about 140 MB and 190 MB, respectively. Indexes are about 70% full because they are initialized by insertions [JS93]. We chose 128 bytes for the size of index nodes because the 128-byte node produces the best performance when the indexes are built by insertions. The eight-thread performance of the OLFIT with varying node size is shown in Figure 6. The 128-byte node is the best, as shown in the graph, because with nodes 70% full on average, there is a high probability of accessing only the first 64-byte block of a 128-byte node.

[Figure 6. Performance of OLFIT with Varying Node Size: eight-thread throughput (operations x1000 / sec) of B+-tree search, B+-tree insert, CSB+-tree search, and CSB+-tree insert, for node sizes up to 512 bytes.]

Table 1 shows the maximum fanout of nodes when each concurrency control scheme is applied. TL and VT allow the largest fanout because they need no concurrency control information on nodes. LC and VN allow a smaller fanout because they need a latch on each node, and OL allows the smallest fanout because it needs a latch, a version, a high key, and a link pointer on each node.

                         OL   LC   TL   VN   VT
  B+-tree    Leaf        14   15   15   15   15
             Non-leaf    14   15   16   15   16
  CSB+-tree  Leaf        14   15   15   15   15
             Non-leaf    28   29   30   29   30

Table 1. Max Node Fanout of Various CC Schemes

6.1 Pure search performance

Figure 7 shows the scalability of the search performance of the experimented CC schemes. We measured the throughput of exact match search, varying the number of threads that perform the search operations. The x-axis shows the number of threads performing search operations, and the y-axis shows the aggregated throughput.

The OLFIT (OL) and the physical versioning schemes (VN, VT) perform similarly to no concurrency control (NO). The gap between these schemes and no concurrency control is the cost of the interaction with the garbage collector. The gap between OL and NO is wider on the CSB+-tree than on the B+-tree because of the different cost of checking the high key. For the B+-tree, high keys need not be specially treated on non-leaf nodes because traversing to the right neighbor and traversing down to one of the children are not different. However, for the CSB+-tree, traversing to the right and traversing downward are different: positions of children are

computed from the pointer to the child node group, while the position of the right neighbor is computed from the position of the current node. This special treatment of high keys consumes slightly more time and makes the small gap from the physical versioning schemes.

The tree-level locking (TL) becomes worse as the number of threads increases due to the contention at the tree latch and two more coherence cache misses generated by holding and releasing the tree latch. The performance of the lock coupling (LC) is worst, and the gap from no concurrency control widens almost linearly with the number of threads, because the lock coupling generates many coherence cache misses by latching many nodes.

[Figure 7. Search Performance of Concurrency Control Schemes: aggregated throughput (operations x1000 / sec) versus number of threads (up to eight), for (a) the B+-tree and (b) the full CSB+-tree; schemes NO, OL, VN, VT, TL, and LC.]

[Figure 8. Insert and Delete Performance of Concurrency Control Schemes: aggregated throughput (operations x1000 / sec) versus number of threads (up to eight), for (a) the B+-tree and (b) the full CSB+-tree; schemes OL, LC, TL, VN, and VT.]

6.2 Pure update performance

Figure 8 shows the scalability of the update performance of the experimented CC schemes. We measured the update throughput, varying the number of threads that perform updates. The x-axis shows the number of threads, and the y-axis shows the aggregated throughput. Half of the operations are insertions and the other half are deletions.

In Figure 8, only OLFIT (OL) shows a scalable performance. The physical versioning schemes (VT, VN) show poor update performance due to the high cost of versioning, especially for the CSB+-tree, where the whole node group must be versioned for each update. Differently from [RS+97], VN is better than VT in update performance because we changed the algorithm of VN, originally proposed for the T-tree, in the process of adapting it to the B+-tree. In [RS+97], if structure modifications take place, VN holds a tree latch. We eliminated the need for the centralized tree latch from VN because the split of the B+-tree is different from the rotation of the T-tree, and the centralized tree latch is not needed for structure modifications. The centralized tree latch degrades the update performance significantly as the number of threads increases.

Although the physical versioning with node latches (VN) and the lock coupling (LC) do not produce good performance, their performance increases slowly as the number of threads increases. However, the physical versioning with a tree latch (VT) and the tree-level locking (TL) degrade due to the contention on the centralized tree latch as the number of threads increases.
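The scalability of OL in these experiments stems from its latch-free reads: a reader takes a copy of the node and validates it against the node's version instead of latching. The following is a minimal single-threaded sketch of this version-validated read and the latched update; the Python structure, names, and retry loop are our own illustration, and the real scheme additionally relies on atomic operations and memory ordering:

```python
import threading

class OlfitNode:
    """Sketch of a node carrying OLFIT's CC information: a latch and a version."""
    def __init__(self, records):
        self.latch = threading.Lock()
        self.version = 0
        self.records = list(records)

def read_node(node):
    """Latch-free read: retry until the contents were copied while the latch
    was free and the version did not change, i.e. no update intervened."""
    while True:
        v = node.version
        snapshot = list(node.records)          # copy the node contents
        if not node.latch.locked() and node.version == v:
            return snapshot                     # consistent snapshot
        # otherwise an updater was active; retry

def update_node(node, record):
    """Latched update: modify under the node latch, then bump the version
    so that concurrent optimistic readers detect the change and retry."""
    with node.latch:
        node.records.append(record)
        node.version += 1
```

Because readers write nothing shared, they incur no coherence cache misses on other processors' caches, which is the effect the experiments above measure.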

[Figure 9. Single Thread Performance with Varying Update Ratio: throughput (operations x1000 / sec) versus insert+delete ratio (0-100%), for (a) the B+-tree and (b) the full CSB+-tree; schemes NO, TL, OL, LC, VN, and VT.]

[Figure 10. Eight-Thread Performance with Varying Update Ratio: throughput (operations x1000 / sec) versus insert+delete ratio (0-100%), for (a) the B+-tree and (b) the full CSB+-tree; schemes OL, LC, TL, VN, and VT.]

6.3 Performance with varying update ratio

Figure 9 and Figure 10 show the throughput of a single thread and of eight threads, respectively, for the mixed workload with a varying update ratio. The workload consists of a sequence of one million operations, each of which is randomly decided to be a search, an insert, or a delete based on the given update ratio. Updates are evenly divided into inserts and deletes.

In Figure 9, which shows the result of sequential execution, the OLFIT (OL) and the tree-level locking (TL) are similar to no concurrency control (NO). The lock coupling (LC) shows worse performance than OL and TL due to the overhead of latching. The physical versioning schemes (VN, VT) show good performance with no updates, but as the update ratio increases, their throughput drops sharply due to the high overhead of versioning. As shown in Figure 9(b), the versioning overhead is even heavier for the CSB+-tree because in the CSB+-tree, nodes with the same parent are grouped in a contiguous space and versions are created on a per-node-group basis.

Figure 10 shows the result with eight threads. When the update ratio is zero, the performance of the physical versioning schemes (VN, VT) is comparable to that of the OLFIT (OL). However, as the update ratio increases, OL becomes significantly better than any other concurrency control algorithm.

The performance gap between OL and the other schemes is wider with eight threads than with a single thread. The gap widens partly due to the large number of coherence cache misses generated by the other algorithms and partly due to the contention at the centralized tree latch in the case of TL and VT. Note that VT is slightly better than TL, but it approaches TL as the update ratio increases further and eventually crosses TL due to the high cost of versioning.

The performance drops more sharply for the CSB+-tree than for the B+-tree with the update ratio, mainly because of the higher split cost of the CSB+-tree. In the CSB+-tree, when a node splits, if the node group that contains the node is not full, all the right neighbors of the node in the same node group are shifted right, and if the node group is full, half of the nodes in the group are moved into a new node group. Note that VN and VT are even worse than TL as the update ratio exceeds 20%, due to the high cost of versioning.
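The mixed workload described in this subsection can be generated as in the following sketch. The random seed, function name, and operation encoding are our own choices for illustration; only the operation mix follows the paper's description:

```python
import random

def make_workload(n_ops, update_ratio, rng=None):
    """Build a sequence of n_ops operations: each operation is an update
    with probability update_ratio, updates split evenly into inserts and
    deletes; the remaining operations are searches."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility (our choice)
    ops = []
    for _ in range(n_ops):
        if rng.random() < update_ratio:
            ops.append("insert" if rng.random() < 0.5 else "delete")
        else:
            ops.append("search")
    return ops
```

A run of one million operations with update_ratio = 0.2 then corresponds to the 20% point on the x-axis of Figures 9 and 10.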

7. Conclusion and Future Work

This paper addressed the importance of considering the cache effect, especially the coherence cache miss overhead associated with latching main-memory index nodes for concurrency control on multiprocessor platforms, and proposed a new optimistic index CC scheme called OLFIT for the B+-tree and the CSB+-tree. Observing that most latches are held too conservatively for the cache-conscious main-memory index, this scheme completely eliminates latching during the index traversal. Even the index update does not incur any latching operation until the traversal reaches the node to update. Latches are requested upward upon the split of nodes.

To prevent the index updates from interfering with index reads and other updates, the OLFIT first provides a pair of consistent node read and update operations. Based on this pair, we presented the design of the tree search and update operations. An experiment comparing the OLFIT with various representative index CC schemes for the B+-tree and the CSB+-tree shows that the OLFIT achieves superior update scalability on the eight-CPU multiprocessor system while showing read scalability comparable to that of no concurrency control.

Although we presented the OLFIT algorithm for the B+-tree and the CSB+-tree, it can be easily adapted to other cache-conscious index schemes such as the CR-tree, the cache-conscious version of the R-tree. For future work, we are integrating the OLFIT with P*TIME, a highly parallel transact-in-memory engine, to investigate the impact of the OLFIT in real main memory database applications.

References

[BMR01] Philip Bohannon, Peter McIlroy, and Rajeev Rastogi, "Improving Main-Memory Index Performance with Partial Key Information," In Proc. of ACM SIGMOD Conf., 2001.

[BS77] Rudolf Bayer and Mario Schkolnick, "Concurrency of Operations on B-Trees," Acta Informatica 9, 1977, pages 1-21.

[CSG99] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture, Morgan Kaufmann Publishers, Inc., 1999.

[JS93] Theodore Johnson and Dennis Shasha, "The Performance of Current B-Tree Algorithms," ACM TODS, Vol. 18, No. 1, March 1993, pages 51-101.

[KCK01] Kihong Kim, Sang K. Cha, and Keunjoo Kwon, "Optimizing Multidimensional Index Trees for Main Memory Access," In Proc. of ACM SIGMOD Conf., 2001.

[LC86] Tobin J. Lehman and Michael J. Carey, "A Study of Index Structures for Main Memory Database Management Systems," In Proc. of VLDB Conf., 1986, pages 294-303.

[LKC01] Juchang Lee, Kihong Kim, and Sang K. Cha, "Differential Logging: A Commutative and Associative Logging Scheme for Highly Parallel Main Memory Database," In Proc. of IEEE ICDE Conf., 2001.

[LY81] Philip L. Lehman and S. Bing Yao, "Efficient Locking for Concurrent Operations on B-Trees," ACM TODS, Vol. 6, No. 4, 1981, pages 650-670.

[MH+92] C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, and Peter Schwarz, "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging," ACM TODS, Vol. 17, No. 1, 1992, pages 94-162.

[ML92] C. Mohan and Frank Levine, "ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging," In Proc. of ACM SIGMOD Conf., 1992, pages 371-380.

[Moh90] C. Mohan, "ARIES/KVL: A Key-Value Locking Method for Concurrency Control of Multiaction Transactions Operating on B-Tree Indexes," In Proc. of VLDB Conf., 1990, pages 392-405.

[RR99] Jun Rao and Kenneth Ross, "Cache Conscious Indexing for Decision-Support in Main Memory," In Proc. of VLDB Conf., 1999, pages 78-89.

[RR00] Jun Rao and Kenneth Ross, "Making B+-trees Cache Conscious in Main Memory," In Proc. of ACM SIGMOD Conf., 2000, pages 475-486.

[RS+97] Rajeev Rastogi, S. Seshadri, Philip Bohannon, Dennis Leinbaugh, Avi Silberschatz, and S. Sudarshan, "Logical and Physical Versioning in Main Memory Databases," In Proc. of VLDB Conf., 1997, pages 86-95.

[Sag85] Yehoshua Sagiv, "Concurrent Operations on B*-Trees with Overtaking," In Proc. of SIGACT/SIGMOD PODS, 1985, pages 28-37.

[Sun97] Sun Microsystems, UltraSPARC User's Manual, 1997.