Dictionary-based Order-preserving String Compression

Column-oriented database systems [19, 23] perform better than traditional row-oriented database systems on analytical workloads such as those found in decision support and business intelligence applications. Moreover, recent work [1, 24] has shown that lightweight compression schemes significantly improve the query processing performance of these systems. One such a lightweight compression scheme is to use a dictionary in order to replace long

1. Dictionary-based Order-preserving String Compression for Main Memory Column Stores Carsten Binnig Stefan Hildenbrand Franz Färber ETH Zurich ETH Zurich SAP AG binnigc@inf.ethz.ch stefanhi@inf.ethz.ch franz.faerber@sap.com ABSTRACT 1. INTRODUCTION Column-oriented database systems [19, 23] perform better than Column-oriented database systems (such as Monet-DB [23] and traditional row-oriented database systems on analytical workloads C-Store [19]) perform better than traditional row-oriented database such as those found in decision support and business intelligence systems on analytical workloads such as those found in decision applications. Moreover, recent work [1, 24] has shown that light- support and business intelligence applications. Recent work [1, weight compression schemes significantly improve the query pro- 24] has shown that lightweight compression schemes for column- cessing performance of these systems. One such a lightweight oriented database systems (called column stores further on) en- compression scheme is to use a dictionary in order to replace long able query processing on top of compressed data and thus lead (variable-length) values of a certain domain with shorter (fixed- to significant improvements of the query processing performance. length) integer codes. In order to further improve expensive query Dictionary encoding is such a light-weight compression scheme operations such as sorting and searching, column-stores often use that replaces long (variable-length) values of a certain domain with order-preserving compression schemes. shorter (fixed-length) integer codes [1]. In order to compress the In contrast to the existing work, in this paper we argue that order- data of a certain column that is loaded into a data warehouse using preserving dictionary compression does not only pay off for at- such a compression scheme, existing column stores usually create tributes with a small fixed domain size but also for long string at- an array of distinct values (i.e., the dictionary) and then store each tributes with a large domain size which might change over time. attribute value of that column as an index into that array. Dictionar- Consequently, we introduce new data structures that efficiently sup- ies are usually used in column stores if the size of the corresponding port an order-preserving dictionary compression for (variable- domain is small. length) string attributes with a large domain size that is likely to Bit packing is then used on top of dictionaries to further com- change over time. The main idea is that we model a dictionary as a press the data [12]. This compression scheme calculates the min- table that specifies a mapping from string-values to arbitrary integer imal number of bits that are necessary to represent the maximal codes (and vice versa) and we introduce a novel indexing approach index into the dictionary. Bit packing makes sense if the size of that provides efficient access paths to such a dictionary while com- the domain is stable (or known a priori). However, in many prac- pressing the index data. Our experiments show that our data struc- tical data warehousing scenarios the domain size is not stable. As tures are as fast as (or in some cases even faster than) other state- an example, think of a cube inside a data warehouse of a big su- of-the-art data structures for dictionaries while being less memory permarket chain which holds the sales of all products per category intensive. (e.g., whole milk, low fat milk, fat free milk) , While the total num- ber of categories is not too large, it is likely that the categories will change over time (i.e., new products are added to the selection of Categories and Subject Descriptors the supermarket). H.2.7 [Database Management]: Database Administration—Data In order to deal with situations where the domain size is not dictionary/directory; E.4 [Data]: Coding and Information The- known a priori, existing column stores usually analyze the first bulk ory—Data compaction and compression; H.3.1 [Information Stor- of data that is loaded in order to find out the current domain size of age and Retrieval]: Content Analysis and Indexing—Dictionar- a certain attribute (e.g., the total number of product categories) and ies, Indexing methods then derive the minimal number of bits (for bit packing). However, if subsequent bulks of data contain new values that were not loaded previously, existing column stores usually have to decode all the General Terms previously loaded data (e.g., the data stored inside the sales cube) Algorithms, Design, Performance and then encode that data again together with the new bulk using more bits to represent the new domain size. This situation becomes even worse if different attributes (that are not known a priori) share the same global dictionary to enable join processing or union oper- Permission to make digital or hard copies of all or part of this work for ations directly on top of the encoded data. personal or classroom use is granted without fee provided that copies are In addition, column stores often use order preserving compres- not made or distributed for profit or commercial advantage and that copies sion schemes to further improve expensive query operations such bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific as sorting and searching because these operations can then be ex- permission and/or a fee. ecuted directly on the encoded data. However, order-preserving SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA. compression schemes either generate variable-length codes (e.g., Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00. 283

2. Product data String-dictionary Product column (non-encoded) (order preserving) (encoded) rid p_name value code rid p_name Query (original): 1 Whole Milk - Gallon … … 1 32000 Select SUM(o_total), p_name … … Whole Milk - Gallon 32000 … … From Sales, Products Query Result String-Dictionary Query Result 499 Whole Milk - Quart Whole Milk - Quart 32100 499 32100 Where p_name='Whole Milk*' (encoded) (order preserving) (non-encoded) Group by p_name … … p_name value code p_name … first bulk 32000 … … Whole Milk - Gallon Query (rewritten): 32050 Whole Milk - Gallon 32000 Whole Milk - Half Gallon rid p_name value code rid p_name 32100 Whole Milk - Half Gallon 32050 Whole Milk - Quart Select SUM(o_total), p_name 500 Whole Milk - Gallon … … … … Whole Milk - Quart 32100 From Sales, Products … … Whole Milk – Gallon 32000 499 32100 Where p_name ≥ 32000 … … 999 Whole Milk - Half Gallon Whole Milk - Half Gallon 32050 500 32000 And p_name ≤ 32100 Whole Milk – Quart 32100 Group by p_name … … second bulk … … 999 32050 (a) Data Loading (2 bulks) (b) Query Compilation (c) Query Execution Figure 1: Dictionary-based order-preserving string compression [2]) that are known to be more expensive for query processing in integer codes for the string values that are already a part of column stores than fixed-length codes [12], or they generate fixed- the dictionary and the second operation is the bulk insertion length codes (e.g., by using indexes in a sorted array) that are more of the new string values as well as the generation of order- difficult to extend when new values should be encoded in an order- preserving integer codes for those new values. As an exam- preserving way. ple, in Figure 1 (a) we see how two bulks of product data In contrast to the existing work, in this paper we argue that order- (i.e., the column p_name) are loaded into the sales cube. preserving dictionary compression does not only pay off for at- • Query Compilation: In order to execute analytical queries tributes with small domain sizes but also for long string attributes directly on top of encoded data, it is necessary to rewrite the with a large domain size. For example, the sales cube mentioned query predicates. If an order preserving encoding scheme before could contain product names of type VARCHAR(100). If is used, this step is trivial: The string constants of equality- we encode one million different product names using a dictionary and range-predicates only have to be replaced by the cor- that generates fixed-length integer codes (e.g., 32-bit), we would responding integer codes. Moreover, prefix-predicates (e.g., get a very good average compression rate for that column. p_name=’Whole Milk*’) can be mapped to range pred- However, using a sorted array and indexes into that array as icates1 . Consequently, a string-dictionary should enable ef- fixed-length integer codes is too expensive for large dictionaries ficient lookups to rewrite string constants as well as string where the domain size is not known a priori. There are different prefixes. As an example, in Figure 1 (b), we see how the reasons for this: First, using a sorted array and binary search as the predicate of a query is rewritten. only access path for encoding data is not efficient for large string • Query Execution: During query execution, the final query dictionaries. Second, if the index into the sorted array is used as in- result (and sometimes intermediate query results) must be teger code, each time a new bulk of string data is loaded it is likely decoded using the dictionary (which can be seen as a semi- that the complete dictionary has to be rebuilt to generate order- join of the encoded result with the dictionary). As most col- preserving codes and all attributes that use that dictionary have to umn stores use vectorized query operations (or sometimes be re-encoded. For the same reasons, strict bit packing on top of even materialize intermediate query results) [12, 24], a string- an order-preserving dictionary compression scheme does not make dictionary should also support the efficient decoding of the sense either. query results for bulks (i.e., bulk-lookups of string values for Motivated by these considerations, this paper introduces data a given list of integer codes). As an example, in Figure 1 structures that efficiently support an order-preserving dictionary- (c), we see how the dictionary is used to decode the column compression of (variable-length) string attributes where the domain p_name of the encoded query result. size is not known a priori (e.g., when the dictionary is shared by different attributes). Furthermore, the integer codes that are gener- While all the tasks above are time-critical in today’s data ware- ated have a fixed length to leverage efficient query processing tech- houses, query processing is the most time-critical one. Conse- niques in column stores and we do not use bit-packing on top of quently, the data structures that we present in this paper should be these integer codes to efficiently be able to support updates. Conse- fast for encoding but the main optimization goal is the performance quently, in this paper we model the dictionary as a table that spec- of decoding integer codes during query execution. In that respect, ifies a mapping from string-values to arbitrary integer codes and in this paper we identify efficient (cache-conscious) indexes that vice versa. Our goal is also to provide efficient access paths (i.e., support the encoding and decoding of string-values using a dictio- index structures) to such a dictionary. More precisely, we identify nary. While there has already been a lot of work to optimize index index structures for a string dictionary that efficiently support the structures for data warehouses on modern hardware platforms (i.e., following tasks: multi-core-systems with different cache levels), much of this work concentrated on cache-conscious indexes for numerical data (e.g., the CSS-Tree [16] and the CSB+ -Tree [17]). However, there has • Data loading: As discussed before, data is usually loaded been almost no work on indexes that enable (cache-)efficient bulk bulk-wise into a data warehouse. This means that the dictio- lookups and insertions of string values. Consequently, we focus on nary must efficiently support the encoding of bulks of string indexes for encoding string data. values using integer codes. The encoding logically consists 1 of two operations: the first operation is a bulk lookup of the In this paper we do not support wildcards at arbitrary positions. 284

3. In addition, the amount of main memory as well as the size of Traditional approach Traditional approach Shared-leaves: the CPU caches of today’s hardware are constantly growing. Con- (direct): (indirect): sequently, data warehouses start to hold all the critical data in main encode memory (e.g., the data that is used for online analytical reporting). index In that respect, in this paper we address the question of how to com- encode decode encode index decode index leaves index index press the dictionary in order to keep as much data as possible in the leaves leaves (code<->value) different levels of the memory hierarchy. Again, while there has leaves leaves (value->rid) (code->rid) decode (value->code) (code->value) been a lot of work on compressing indexes for numerical data [10], Dictionary (rid, value, code) index almost no work exists for string data [5]. Thus, the contributions of this paper are: Figure 2: Traditional approaches vs. Shared-leaves (1) In Section 2, we introduce a new approach for indexing a dictionary of string values (called shared leaves) that lever- ages an order-preserving encoding scheme efficiently. In the insertion of new string values as well as the generation of shared leaves approach, indexes on different attributes (that order-preserving codes for those new values. can be clustered the same way) can share the same leaves in • decode: codes → values: This bulk operation is used dur- order to reduce the memory consumption while still provid- ing query processing in order to decode (intermediate) query ing efficient access paths. results (i.e., a bulk of integer codes) using the corresponding (2) In Section 3, we introduce a concrete leaf structure for the string values (i.e., the values). shared-leaves approach that can be used by the indexes of a dictionary for efficiently encoding and decoding string val- Moreover, the interface of the dictionary should also support the ues while the leaf structure itself is compressed. We also following two operations in order to enable the rewrite of the query discuss the most important operations on this leaf structure predicates (for query processing): (i.e., lookup and update) and analyze their costs. • lookup: (value, type) → code: This operation is used dur- (3) As another contribution, in Section 4, we present two new ing query compilation in order to rewrite a string constant cache-conscious string indexes that can be used on top of (i.e., the value) in an equality-predicate (e.g., p_name = our leaf structure to efficiently support the encoding of string ’Whole Milk - Gallon’) or in a range-predicate (e.g., data in the dictionary. For the decoding of integer codes we p_name ≥ ’Whole Milk - Gallon’) with the cor- argue why the CSS-Tree [16] is optimal in our case. responding integer code (i.e., code). The parameter type (4) Finally, in Section 5, our experiments evaluate the new leaf specifies whether the dictionary should execute an exact-match structure and the new cache-conscious indexes under differ- lookup (as it is necessary for string constants in equality- ent types of workloads and show a detailed analysis of their predicates) or return the integer code for the next smaller (or performance and their memory behavior. As one result, the larger) string value (as it is necessary for string constants in experiments show that in terms of performance our leaf struc- range-predicates). An example will be given in the following ture is as efficient as other read-optimized indexes while us- subsection. ing less memory (due to compression). • lookup: prefix → (mincode, maxcode): This operation is used during query compilation to rewrite the pref ix of a 2. OVERVIEW prefix-predicate (e.g., p_name = ’Whole Milk*’) with In this section, we first discuss the operations that an order- the corresponding integer ranges (i.e., the mincode and the preserving string-dictionary must support. Afterwards, we present maxcode). Again, an example will be given in the following a new idea for indexing such a dictionary (called shared-leaves) subsection. which is not bound to particular index structures and we show how the operations above can be implemented using this approach. Fi- In order to use table T to efficiently support all these operations, nally, we discuss requirements and design decisions for index struc- we propose to build indexes for both attributes of table T (value tures that can be efficiently used for a dictionary together with the and code) because encoding and decoding usually access only a shared-leaves approach. subset of the dictionary. Moreover, indexing the dictionary is es- pecially important if the dictionary is shared between different at- 2.1 Dictionary Operations tributes, because then it is more likely that only a subset of the As mentioned in Section 1, in this paper we model a string dic- dictionary is touched (if the domains of the individual attributes tionary as a table T with two attributes: T = (value, code). Thus, are not completely overlapping). Consequently, we believe that in table T defines a mapping of variable-length string values (defined many cases a sequential scan of the complete dictionary does not by the attribute value) to fixed-length integer codes (defined by pay off. The choice of whether to use a sequential scan or an index the attribute code) and vice versa. In order to support the data load- to access the dictionary, however has to be done as a part of the ing as well as the query processing task inside a column store, the cost-based query optimization (because it strongly depends on the interface of the dictionary should support the following two bulk particular workload). However, this discussion is out of the scope operations for encoding and decoding string values: of this paper. • encode: values → codes: This bulk operation is used dur- 2.2 Shared-leaves Indexing ing data loading in order to encode data of a string column Traditional approaches for indexing can be classified into two (i.e., the values) with corresponding integer codes (i.e., the general categories: direct and indirect indexes. Using these ap- codes). This operation involves (1) the lookup of codes for proaches for indexing the two attributes (value and code) of table those strings that are already in the dictionary and (2) the T would result in the following situations (see Figure 2): 285

4. (1) In the direct indexing approach, two indexes for encoding Bulk Lookup: Bulk Insert / Load: and decoding are created that hold the data of table T directly in their leaves. In this case, the table T itself does not need {zzb, aad, …, aac}> {960, -1, …, -1} {aac, aad} to be explicitly kept in main memory since the data of T is encode index: e encode index: e’ stored in the indexes. shared-leaves: l’ shared-leaves: l (2) In the indirect indexing approach, two indexes for encoding value code value code value code value code aab 10 zzb 960 aab 10 zzb 960 and decoding are created that hold only references to the data aae 20 … zzm 970 aac 13 … zzm 970 inside table T (i.e., a row identifier rid). In this case, the aaf 30 aad 16 table T itself needs to be explicitly kept in main memory. aaz 40 aae 20 decode decode While direct indexing (1) has the disadvantage of holding the {30, 960, …, 20} index: d {aaf, zzb, …, aae} index: d’ data of table T redundantly in the two indexes which is not optimal if the indexes should be main memory resident, indirect indexing (2) (which is standard in main-memory databases [9, 7]) requires one level of indirection more than the direct indexing approach (i.e., pointers into the table T ). Thus, (2) results in higher cache miss Figure 3: Operations on indexes with shared-leaves rates on modern CPUs. Another alternative to index the dictionary data is to extend a standard index (e.g., a B+ -Tree) in order to sup- port two key attributes instead of one (i.e., in our case for value Moreover, the encoding and decoding indexes must be updated if and code) . However, in that case both access paths of the index necessary. need to read the two key attributes during lookup which increases In this paper, we do not focus on how to generate new order- the cache miss rates (especially when decoding the integer codes). preserving integer codes for new string values that are inserted into The new idea of this paper is that the two indexes for encoding the dictionary. In order to generate the codes, we simply partition and decoding share the same leaves (see shared-leaves approach the code range where new string values are inserted in into equi- in Figure 2) where both indexes directly hold the data of table T distant intervals (e.g., in our example the two strings ’aac’ and in their leaves but avoid the redundancy of the direct indexing ap- ’aab’ are inserted into the code range between 10 and 20). The proach. Thus, the shared leaves also avoid the additional indirec- limits of these intervals represent the new codes (e.g., 13 for ’aac’ tion level of the indirect indexing approach. and 17 for ’aad’ in our example). In case that the range is smaller As the string dictionary uses an order-preserving encoding scheme, than the number of new string values that have to be inserted in the string values and the integer codes in table T follow the same the dictionary, re-encoding of some string values as well as up- sort order (i.e., we can have clustered indexes on both columns dating the data (i.e., the columns of a table) that use these string and thus can share the leaves between two direct indexes). Con- values becomes necessary. Analyzing more sophisticated order- sequently, as the attribute values value and code of table T can preserving encoding schemes that are optimal (i.e., that require both be kept in sort order inside the leaves, the leaves can provide minimal re-encoding) under certain workloads as well as strategies efficient access paths for both lookup directions (i.e., for the en- for re-encoding the dictionary data are an avenue of our future re- coding and decoding) using a standard search method for sorted search work. data (e.g., binary search or interpolation search). Moreover, using the shared-leaves for indexing the dictionary means that T does not In the following, we discuss how the shared-leaves approach can have to be kept explicitly in main memory because the leaves hold support the two lookup operations mentioned at the beginning of all the data of table T (as for direct indexes). this section that support the predicate rewrite a column store: The lookup operation which is necessary to rewrite the equality- In the following, we discuss how the shared-leaves approach can and range-predicates is similar to the bulk lookup explained be- support the bulk operations mentioned at the beginning of this sec- fore: the encoding index propagates the string constant to the cor- tion to support the data loading and query processing inside a col- responding leaf and then a standard search algorithm can be used on umn store: the leaf to return the corresponding integer code. For example, in Figure 3 shows an example of how the shared-leaves approach order to rewrite the predicate value ≥ ’zzc’ using the encode can be used to efficiently support the bulk operations for encoding index in Figure 3 (left side), the encode index propagates the string and decoding string values. In order to encode a list of string values value ’zzc’ to the rightmost leaf and this leaf is used to lookup (e.g., the list shown at the top of Figure 3), the encode-index is used the next integer code for that string value that is equal or greater to propagate these values to the corresponding leaves. Once the than the given value (i.e., the integer code 970 for the string value leaves are reached, a standard search algorithm can be used inside ’zzm’). The rewritten predicate thus would be code ≥ 970. a leaf to lookup the integer code for each single string value. The In order to support the other lookup operation that is necessary decoding operation of a list of integer codes works similar. The to rewrite a prefix-predicate, the encoding index needs to propa- only difference is that the decode index is used to propagate the gate the string prefix to those leaves which contain the minimum integer codes down to the corresponding leaves (e.g., the list of and the maximum string value that matches this prefix. For exam- integer codes shown on the bottom of Figure 3). ple, in order to rewrite the predicate value = ’aa*’ using the If some integer codes for string values are not found by the encode index in Figure 3 (left side), the encode index has to propa- lookup operation on the encode-index (e.g., the string values ’aac’ gate the prefix to the first leaf which contains the minimum and the and ’aad’ in our example), these string values must be inserted into maximum string value that matches this prefix. Afterwards, those the dictionary (i.e., the shared-leaves) and new integer codes must leaves are used to map the strings that represent the boundaries for be generated for those values (see the right side of Figure 3). The the given prefix to the corresponding codes (e.g., in our example new codes for these string values have to be added to the result we retrieve the codes for ’aab’ and ’aaz’ and rewrite the predicate (i.e., the list of codes) that are returned by the encoding operation. as 10 ≤ code ≤ 40). 286

5.2.3 Requirements and Design Decisions coded to the corresponding leaves using the encoding index The main requirement is that the dictionary should be fast for and update the encoding index directly (i.e., do updates in- encoding (i.e., the bulk lookup/insert of integer codes for a list of place during propagation). Then, lookup the codes for those string values) but the optimization goal is the performance for de- strings that are already in the leaves. Afterwards, insert the coding (i.e., the bulk lookup of string values for a given list of in- new values that were not found by the lookup into the leaves teger codes). Thus, the data structures of the dictionary (i.e., leaves and generate integer codes for all new string values. Finally, and indexes) should also be optimized for encoding/decoding bulks bulk load a new decode index from the updated leaf level instead of single values. Moreover, the data structures should be (bottom-up). optimized for modern CPUs (i.e., they should be cache-conscious (3) All-In-Place: First, propagate the string values that need to and the operations should be easy to parallelize). In the following be encoded to the corresponding leaves using the encoding we discuss further requirements and design decision for the leaf index and update the encoding index directly (i.e., do up- structure and the indexes of the dictionary. dates in-place during propagation). Then, lookup the codes for those strings that are already in the leaves. Afterwards, Leaf structure: The most important requirement for the leaf insert the new values that were not found by the lookup into structure is that it must be able to hold the string values as well as the leaves and generate integer codes for all new string val- the integer codes in sort order to enable efficient lookup operations ues. Propagate each update on the leaf level that causes an (e.g., binary search) for both encoding and decoding (while the leaf update of the decode index (e.g., a split of a leaf) directly to structure should be optimized for decoding). the decode index and apply the update. As the dictionary must be memory resident, the memory foot- print of the dictionary should be small (i.e., it might make sense In the first two strategies above, the decode index is bulk loaded to apply a lightweight compression scheme such as incremental from the updated leaf level, which means that it should provide encoding to the leaf data). Moreover, the leaf should support the a better search performance for decoding which is our main opti- encoding of variable-length string values. While the lookup opera- mization goal. Consequently, in this paper we focus on the first two tions on a leaf are trivial for fixed-length string-values that are not strategies. compressed, the lookup operations get more complex if the leaves In order to guarantee consistency of the data dictionary, for sim- should also support variable-length string values and compression. plicity we decide to lock the complete indexes as well as the leaves When encoding a bulk of string values, new string values might during data loading (i.e., the encoding of string values) because be inserted into the dictionary which involves updating the shared- this usually happens only at predefined points in time in data ware- leaves. Consequently, the leaf should also enable efficient bulk housing (i.e., once a day). Thus, no concurrent updates and reads loads and bulk updates. are possible during data loading. However, during query process- Finally, note that the leaf structure can be totally different from ing (i.e., for decoding query results), we allow concurrency because the data structure that is used for the index nodes on top. A con- these are read-only operations. crete leaf structure that satisfies these requirements is discussed in For persisting the dictionary, currently we only write the updates detail in the next section. leaves sequentially to disk as a part of data loading. More sophisti- cated persistence strategies are a part of our future work. Encode/Decode index structure: Same as the leaf structure, the indexes for encoding and decoding should keep their keys in sort 3. LEAF STRUCTURE order to enable efficient lookup operations over the sorted leaves. Another requirement is that the encode index must also support the In this section, we present a leaf structure that can be used in the propagation not only of string constants but also of string-prefixes shared-leaves approach (see Section 2) for efficiently encoding and to the corresponding leaves in order to support the predicate-rewrite decoding variable-length string values on a particular platform. The task. Moreover, the indexes should also be memory resident and general idea of this leaf structure is to keep as much string values thus have a small memory footprint. as well as the corresponding fixed-length integer codes sorted and When bulk encoding a list of string values using the encoding compressed together in one chunk of memory in order to increase index, in addition to the lookup of the integer codes for string val- the cache locality during data loading and lookup operations2 . ues that are already a part of the dictionary, it might be necessary to insert new string values into the dictionary (i.e., update the leaves 3.1 Memory Layout as well as the both indexes for encoding and decoding) and gen- Figure 4 shows an example of the memory layout of one concrete erate new order-preserving codes for those values. We propose to instance of a leaf structure that represents the dictionary shown in combine these two bulk operations (lookup and insert) into one op- Figure 1 (c). The leaf structure compresses the string values us- eration. In order to support this, we see different strategies: ing incremental-encoding [21] while each n-th string (e.g., each 16-th string value) is not compressed to enable an efficient lookup of strings without having to decompress the complete leaf data: In (1) All-Bulked: First, propagate the string values that need to be the example, value 16 ’Whole Milk - Gallon’ is not compressed encoded to the corresponding leaves using the encode index and value 17 ’Whole Milk - Half Gallon’ is compressed using and lookup the codes for those strings that are already in the incremental-encoding; i.e., the length of the common prefix com- leaves. Afterwards, insert the new values that were not found pared to the previous value (e.g., 11) is stored together with the by the lookup into the leaves and if appropriate reorganize suffix that is different (e.g., ’Half Gallon’) the updated leaf level (e.g., create a leaf level where all leaves In order to enable an efficient lookup using this leaf structure, an are filled up to the maximal leaf size). Afterwards generate offset vector is stored at the end of the leaf that holds references integer codes for the new string values and bulk load a new (i.e., offsets) and the integer codes of all uncompressed strings of encode and a new decode index from the updated leaf level (in a bottom-up way). 2 In this paper we assume that string values are encoded in ASCII (2) Hybrid: First, propagate the string values that need to be en- using one byte per character. 287

6. 0 W H E A T B R that should be stored in a dictionary. We assume that c = 4 is an … … appropriate choice for this example. 128 W H O L E M I value 16 Defining the value for the decode interval d is more difficult: in (uncompressed) general, we want to make sure that the offset vector once loaded re- 136 L K - G A L L O leaf value 17 mains in the L1 cache (e.g., having a max. size of 32kB) such that data 144 N \0 11 H A L F (prefix-compressed) an efficient binary search is possible for a bulk of strings. With 152 G A L L O N \0 3 the given settings of the example above, we assume that a leaf code 17 160 2 0 5 0 11 Q U A R will be able to store approx. 42000 strings (which results from 168 T \0 3 2 1 0 0 … the max. leaf size of 2MB and the average string length of 50). … …. Moreover, each uncompressed string utilizes 7 bytes of the offset offset 1008 … 3 2 0 0 0 code 16 vector (for the offset and the code). Consequently, the offset vec- vector 1016 1 2 8 3 0 4 0 0 0 tor can store max. 32k/7 ≈ 4681 entries (i.e., offsets and codes) for uncompressed strings which means that we could store each offset 16 code 0 offset 0 d ≈ 42000/4681 ≈ 9th string uncompressed in the example. Figure 4: Leaf for variable-length string values 3.2 Leaf Operations The most important operations on the leaf structure are the lookup operations for encoding string values with their integer codes and a leaf also in a sorted way. For example, the offset 128 and the decoding the integer codes with their string values. Moreover, the code 32000 are stored in the offset vector for value 16 (see Figure leaf also supports updates (i.e., inserting new strings). In the fol- 4). The integer codes of the compressed string-values are stored lowing paragraphs, we examine these operations in detail. together with the compressed string values in the data section and not in the offset vector (e.g., the code 32050 for value 17). The off- Bulk Lookup: In this paragraph, we first explain how the set vector is stored in a reverse way to enable efficient data loading lookup works for a single value and then discuss some optimiza- while having no negative effect on the lookup operations. tions for bulks. The leaf supports the following lookup operations: In order to adapt the leaf structure to a certain platform (e.g., to one to lookup the code for a given string value (i.e., value v → code the cache sizes of a CPU) different parameters are available: c) and another one that supports the lookup vice versa (i.e., code c → value v). In order to search the code c for a given value v, the • Leaf-size l: Defines the memory size in bytes that is initially procedure is as follows: allocated for a leaf. In our example, we used l = 1024. 1. Use the offset vector to execute a binary search over the un- • Offset-size o: Defines the number of bytes that are used as compressed strings of a leaf in order to find an uncompressed offset inside the leaf. This parameter has influence on the string value v ′ that satisfies v ′ ≤ v and no other uncom- maximal leaf size. If we use o = 2 bytes, as in the example pressed value v¯ exists with v ′ < v¯ < v. in Figure 4, then the maximal leaf size is 216 bytes (64kB). 2. If v ′ = v return the corresponding code c for v that is stored • Code-size c: Defines the number of bytes that are used to in the offset vector. represent an integer codeword. If we use c = 4, we can 3. Otherwise, sequentially search value v from value v ′ on until generate 232 different codewords. v is found or the next uncompressed value appears. In the • Prefix-size p: Defines the number of bytes that are used to first case return the code c, in the second case indicate that encode the length of the common prefix for incremental en- the value v was not found. coding. This parameter is determined by the maximal string length. For example, if we use strings with a maximal length Note that for sequentially searching over the incrementally en- of 255, then we can use p = 1 because using one byte allows coded string values no decompression of the leaf data is necessary. us to encode a common prefix length from 0 to 255. Algorithm 1 shows a search function that enables the sequential • Decode-interval d: Defines the interval that is used to store search over the compressed leaf data. The parameters of this func- uncompressed strings (16 in our example). This parameter tion are: the leaf data (i.e., leaf ), the offset where to start and end has influence on the size of the offset vector (i.e., the smaller the sequential search (i.e., start and end), as well as the value that this interval is, the more space the offset vector will use). we search for (i.e., v). The return value is the corresponding code c if the value is found, otherwise the algorithm returns −1. The ba- We do not allow the definition of these parameters on the gran- sic idea of the algorithm is that it keeps track of the common prefix ularity of bits because most CPUs have a better performance on length (i.e., pref ix_len) of the current string (at the offset start) data that is byte-aligned (or even word- or double-word aligned on and the search string v. If this common prefix length is the same as modern CPUs). We decide to use byte-aligned data (and not word- the length of the search string then the correct value is found and or double-word aligned data) because the other variants might lead the code can be returned. The variables p and c are constants that to a dramatic increase in the memory consumption of the leaves. represent the prefix-size and the code-size to increment the offset As an example, assume that we want to tune the leaf structure value start. for a platform using one CPU (with one core) having an L2 cache The lookup operation to find a string value v for a given code of 3MB and an L1 cache of 32kB. Moreover, assume that the leaf c works similar as the lookup operation mentioned before using should only store strings with an average length of 50 characters the offset vector. The differences are that the first step (i.e., the and a maximum length of 255 characters (which means that p = 1 search over the offset vector) can be executed without jumping into can be used). In that case it would make sense to use a leaf size the leaf data section because the codes of the uncompressed strings not bigger than the L2 cache size, say max. 2MB. Consequently, are stored together with the offset vector. In contrast to the other an offset-size o = 3 is sufficient to address all values inside such a lookup operation, for the search over the offset vector we theoreti- leaf. The code-size depends only on the overall number of strings cally expect to get only one L1 cache miss using a simplified cache 288

7.Algorithm 1 Sequential search of string v on compressed leaf Ideally, the new string values start after the last value of the existing function S EQUENTIAL S EARCH(leaf, start, end, v) leaf. In that case, we only have to compress the new string values v ′ ← leaf [start] ⊲ read string v ′ at offset start without decoding the leaf. If the list of string values and the existing start ← start + size(v ′ ) ⊲ increment offest by string size leaf data do not fit into one leaf anymore, the data has to be split. pref ix_len ← pref ix_len(v, v ′ ) ⊲ calculate common prefix len However, as the split strategy depends on the index structure that while start ≤ end and pref ix_len < |v| do curr_pref ix_len ← leaf [start] ⊲ get curr. prefix len we build on top of the leaves, we discuss this in the next section. start ← start + p ⊲ increment offest by prefix-size p = 1 v ′ ← leaf [start] 4. CACHE-CONSCIOUS INDEXES start ← start + size(v ′ ) if curr_pref ix_len <> pref ix_len then In this section, we present new cache-conscious index structures continue ⊲ prefix of curr. value v ′ too short/long that can be used on top of the leaf structure presented in the pre- else if compare(v ′ , v) > 0 then vious section. These indexes support one of the first two update return -1 ⊲ curr. value v ′ comes after search value v strategies (All-Bulked and Hybrid) discussed in Section 2. For the end if encoding index, we present a new cache sensitive version of the pref ix_len ← pref ix_len + pref ix_len(v, v ′ ) start ← start + c ⊲ increment offest by code-size c = 4 patricia trie (called CS-Array-Trie) that supports the Hybrid up- end while date strategy and a cache sensitive version of the Prefix-B-Tree [5] if pref ix_len = |v| then (called CS-Prefix-Tree) that supports the All-Bulked update strat- return leaf [start − c] ⊲ string v found: return code egy. else As decoding index, we reuse the CSS-Tree [16] which is known return -1 ⊲ string v not found to be optimized for read-only workloads. We create a CSS-Tree end if end function over the leaves of the dictionary using the minimal integer codes of each leaf as keys of the index (i.e., the CSS-Tree is only used to propagate the integer values that are to be decoded to the corre- sponding leaves). As the CSS-Tree can be bulk loaded efficiently model (if the offset vector fits into the L1 cache). Another differ- in a bottom-up way using the leaves of the dictionary, it satisfies the ence of this lookup operation is that during the sequential search requirements for both update strategies (Hybrid and All-Bulked). the string values have to be incrementally decompressed. When executing both lookup operations, we expect to theoret- 4.1 CS-Array-Trie ically get one L2 cache miss in a simplified cache model if we As a first cache-conscious index structure that can be used as assume that loading the complete leaf causes one miss and the leaf an encode index to propagate string lookup probes and updates to fits into the L2 cache3 . The average costs for these two lookup the corresponding leaf in a shared leaf approach, we present the operations are as follows (where n is the number of strings/codes CS-Array-Trie. Compared to existing trie implementations the CS- stored in a leaf and d is the decode interval): Array-Trie uses read-optimized cache-conscious data structures for the index nodes and does not decompose the strings completely. O(log(n/d)) + O(d) 4.1.1 General Idea In order to optimize the lookup operation for bulks (i.e., a list Many existing trie implementations are using nodes that hold an of string values or a list of integer codes), we can sort the lookup array of pointers to the nodes of the next level with the size of the probe. By doing this, we can avoid some search overhead by mini- alphabet (e.g., 128 for ASCII). While such an implementation al- mizing the search space after each search of a single lookup probe lows efficient updates and lookups on each node, it is not memory- (i.e., we do not have to look at that part of the offset vector/leaf data efficient because the array trie allocates space for a pointers per that we already analyzed). node, where a is the size of the alphabet (while a pointer uses 8 bytes on a 64-bit system). Bulk Update: In this paragraph, we explain how to insert new Other trie implementations avoid the memory overhead and use strings into the leaf structure. As we assume that data is loaded in a sorted linked list [13] to hold only the characters of the indexed bulks, we explain the initial bulk load and the bulk insert of strings strings together with a pointer to the next level of the trie. While into an existing leaf. this implementation still offers efficient node updates, the lookup In order to initially bulk load a leaf with a list of string values, must execute a search over the characters stored in the linked list. we first have to sort the string values. Afterwards, the leaf data However, linked lists are known to be not very cache efficient on can be written sequentially from the beginning of the leaf while the modern CPUs because of pointer-chasing [20]. Moreover, most offset vector is written reversely from the end. If the string values existing trie implementations decompose the indexed strings com- do not utilize the complete memory allocated for a leaf (because pletely (i.e., each letter of a string results in a node of the trie). we do not analyze the compression rate before bulk loading) then Compared to the implementations mentioned above, a node of the offset vector can be moved to the end of the compressed data the CS-Array-Trie uses an array instead of a linked list to store the section and the unused memory can be released. characters of the indexed string values. Compared to a linked list an In order to insert a list of new string values into an existing leaf, array is not efficient when sequentially inserting single values into we again have to sort these string values first. Afterwards, we can a trie. However, when bulk inserting new values into a CS-Array- do a sort merge of these new string values and the existing leaf in Trie, we need grow the array of a node only once for each bulk. In order to create a new leaf. The sort merge is cheaper if we can reuse order to lookup a string value of a CS-Array-Trie, the search over as much of the compressed data of the existing leaf as possible the characters of a node (i.e., the array) is more efficient than the and thus do not have to decode and compress the leaf data again. sequential search on a linked list because the array supports binary 3 A more fine grained cache model would respect cache lines and search and all characters are stored clustered in memory. cache associativity. However, as we focus on bulk lookups our The second key idea of the CS-Array-Trie is that it does not de- simplified model is sufficient to estimate the costs of cache misses. compose the string values completely. The CS-Array-Trie stores 289

8.a set of strings that have the same prefix together using the leaf leaves (i.e., the new strings are inserted into the leaves), new in- structure that we discussed in the previous section. A leaf of the teger codes are generated for the new string values. This is done CS-Array-Trie stores the complete strings and not only their suf- by analyzing the number of strings that are inserted in between two fixes (without the common prefix) in order to enable efficient de- existing string values. For example, in order to generate codes for coding of integer codes using the same leaves (as described in our the three new string values that are inserted into the first two leaves shared-leaves approach). Moreover, storing the complete strings in between ’aam’ and ’amd’ in Figure 5 (right side), we have to gen- a leaf (i.e., repeating the same prefix) is still space efficient because erate three new codes that must fit into the range between 40 and we use incremental encoding to compress the strings. Finally, for 50. Using our equi-distance approach (discussed in Section 2), we those trie nodes that only hold a pointer to one child node, we use generate 43, 45, and 48 as new codes. Therefore, the trie must al- the path compression used in patricia tries [15]. low us to sequentially analyze all leaves in sort order (i.e., each leaf Figure 5 (left side) shows an example of a CS-Array-Trie that has a pointer to the next leaf). Finally, after generating the integer indexes nine different strings. If we follow the path ’aa’ in the trie, codes for the new string values of the trie, we use the buffer pages we reach a leaf that holds only strings that start with the prefix ’aa’. that we kept on the leaf level (e.g., the new strings in the buffers The leaves are shown as uncompressed tables for readability. The (2), (4), and (5) in the example) to lookup the integer codes for the physical memory layout is as discussed in Section 3. new string values and use the sequence number to put the integer code at the right position in the answer (see bottom of Figure 5). 4.1.2 Index Operations Another task that can efficiently be supported using the CS-Array- In order to encode a bulk of string values during data loading Trie, is the predicate rewrite: for equality- and range-predicates the using that trie, we implement the Hybrid update strategy for the constants are simply propagated through the trie without buffering. CS-Array-Trie which means that new string values are inserted into For prefix-predicates, the prefix is used to find the minimal and the trie when the strings that should be encoded are propagated maximal string value that matches this prefix (which is also trivial through the encoding index (i.e, the trie). In order to leverage the when using a trie). fact of having bulk inserts, we propagate the string values (in pre- order) to the leaves using variable buffers at each node of the trie 4.1.3 Cost Analysis to increase the cache locality during lookup as described in [22]. For propagating a bulk of string values from the root of a CS- Moreover, when using buffers at each node we can grow the array Array-Trie to the leaves, we theoretically expect to get one L2 data of characters stored inside a node only once per bulk. The work in cache miss for each node in our simplified cache model. In addi- [22] showed that this effectively reduces data cache misses for tree- tion, for leaf that has to be processed during the lookup, we expect based indexes and results in a better overall lookup performance. to get another L2 data cache miss using our simplified cache model. Figure 5 (right side) shows an example for encoding a bulk of six Moreover, the input and output buffers of a node should be de- strings using the existing encode index (i.e., the trie shown on the signed to fit into the L2 cache together with one node (to allow left side). First, all strings that need to be encoded are propagated efficient copying from input to output buffers). In that case, we from the root node of the trie to the first level of nodes creating expect to get one cache miss for each buffer page that needs to be three buffers ((1), (2), and (3)). In order to keep the order of strings loaded. Moreover, generating the new integer codes will cause one in the lookup probe for creating the encoded result (at the bottom of L2 data cache miss for each leaf of the trie. Finally, executing the Figure 5), a sequence number is added for each value in the buffers lookup of the new integer codes, will also cause one cache miss for (as suggested in [22]). The strings that are propagated from the root each buffer page that has to be loaded plus the L2 cache misses for to the first level of buffers are analyzed and the missing character executing the lookup operations on the leaves (as discussed in the ’m’ for the string ’mzb’ is added to the root node4 . section before). Afterwards, the buffer (1) is propagated to the next level creating The costs for encoding a bulk that contains no new string values two new buffers (4) and (5) and buffer(1) is returned to a buffer pool (i.e., a pure lookup) are composed of the costs for propagating the (to avoid expensive memory allocation for buffer pages). Next, strings through the trie plus the costs for the lookup of the codes buffers (4) and (5) are processed, which means that values for the for all strings using the leaves. If we assume that all strings in the codes for existing string values are looked up and new strings are lookup probe have the same length m and are distributed uniformly, inserted into the existing leaves with a placeholder as their integer the height of the trie is loga (s/l) in the best case and equal to the code (e.g., −1): While buffer (4) contains two new strings ’aax’ length of the string m in the worst case (where s is the total number and ’aay’ that are inserted to the leftmost leaf, buffer (5) contains of strings, l is the number of strings that fit in one leaf, and a the only one new string ’amc’ that is inserted the next leaf. Note, that size of the alphabet). The lookup costs on each node are O(log(a)). the new values in buffers (4) and (5) are not yet deleted and kept Thus, the average costs for propagation are: to lookup the integer codes for those string values (that are not yet generated). O(s ∗ ((loga (s/l) + m)/2) ∗ log(a)) A question that arises here is, whether the new strings fit into the existing leaf (i.e., the new leaf size is expected to be less than 4.1.4 Parallelization the maximal leaf size) or whether the leaf must be split up into The propagation of string values from the root of the trie to the several leaves. In order to estimate the expected leaf size, we add leaves (including the update of the nodes) can be easily parallelized the size (uncompressed) of all new strings in a buffer page as well for different sub-tries because sub-tries share no data. as the size of their new codes (without eliminating duplicates) to Moreover, the generation of new integer codes can be done in the current leaf size. If the bulk is heavily skewed and contains parallel as well (without locking any data structures). For this we many duplicates it is likely that we decide to split a leaf even if it is need to find out which leaves hold contiguous new string values not necessary. (i.e., sometimes a contiguous list of new string values might span When all string values are propagated to their corresponding more than one leaf as shown in Figure 5 on the right side for the 4 All updates on the trie and the leaves are shown in light gray in first two leaves). Figure 5 (right side). Finally, the lookup operation of the new string values can also be 290

9. String values: {zzb, aax, amc, amo, aay, mzb} a m z Encode index a z (1) <aax,1> <amc,2> o o o o o <amo,3> <aay,4> a m a m o o o o (4) <aax,1> (5) <amc,2> (2) (3) <mzb,5> <zzb,0> <aay,4> <amo,3> value code value code value code Shared leaves value code value code value code value code aab 10 amd 50 zzb 80 aab 10 amc 48 mzb 75 zzb 80 aae 20 amk 60 zzm 90 aae 20 amd 50 zzm 90 aaf 30 amo 70 aaf 30 amk 60 aam 40 aam 40 amo 70 aax 43 aay 45 Integer codes: {80, 43, 48, 70, 45, 75} Figure 5: Encode data using the CS-Array-Trie parallelized without locking any data structures. In Figure 5 (right the new string values (that are not yet a part of the dictionary) must side), the new values in the buffer pages (2), (4), and (5) can be first be inserted into the leaves (in sort order) and then the index processed by individual threads for example. We can see that each can be created. thread writes the resulting integer codes to different index positions Thus, if the first bulk of string values should be encoded using of the result (shown at the bottom) using the sequence number of the All-bulked update strategy, the complete leaf level has to be the buffer pages. Thus, no locks on the result vector are necessary. built using these string values. Therefore, we reuse the idea in [18] and create a trie (more precisely a CS-Array-Trie) to partition the 4.2 CS-Prefix-Tree string values into buckets that can be sorted efficiently using multi- As a second cache-conscious index structure that can be used as key quicksort [6]. The sorted string values can then be used to an encode index to propagate string lookup probes and updates to create leaves that are filled up to the maximum leaf size. From the corresponding leaf in a shared leaf approach, we present the CS- these leaves, a new encode index (i.e., a CS-Prefix-Tree) is bulk Prefix-Tree. This index structure combines ideas from the Prefix- loaded in a bottom-up way. B-Tree [5] and the CSS-Tree [16]. Figure 6 shows an example of a CS-Prefix-Tree. In the follow- ing, we describe the bulk load procedure of a CS-Prefix-Tree from 4.2.1 General Idea a given leaf level: Same as the Prefix-B-Tree, a node of a CS-Prefix-Tree contains the shortest prefixes that enable the propagation of string values 1. In order to bulk load the CS-Prefix-Tree, we process the leaf to the corresponding child nodes. However, instead of storing a level in sort order: We start with the first two leaves and pointer to each child, the CS-Prefix-Tree uses a contiguous block of calculate the shortest prefix to distinguish the largest value memory for all nodes and offsets to navigate through this block (as of the first leaf and the smallest value of the second leaf. In the CSS-Tree). This effectively reduces memory consumption and our example it is sufficient to store the prefix ’am’ in order avoids negative effects on the performance due to pointer chasing. to distinguish the first two leaves. This prefix is stored in a In order to further reduce the memory footprint of the CS-Prefix- node of the CS-Prefix-Tree. Note, a node does not store a Tree, we only store the offset to the first child node explicitly. Since pointer to each child since we can derive an offset into the the nodes have a fixed size s, we can calculate the offset to a child leaf level (i.e., the sorted list of leaves). Since we do not node using offset arithmetics (i.e., the i-th child of a node can be know the size of the offset vector in advance, we write the found at offset o = of f set(f irst_child) + (i ∗ s)). offset vector from left to right and store the keys from right In order to enable fast search over the variable-length keys of a to left. The example assumes a fixed node size of 32 bytes node (e.g., binary search), we store the offsets to the keys (i.e., the and therefore we store offset 29 in the offset vector and write string prefixes) in an offset vector at the beginning of each node. the prefix at the corresponding position. Moreover, the node size has to be fixed in order to use offset arith- 2. Next, we calculate the shortest prefix to distinguish the largest metics for computing the index to the child nodes. Thus, the num- value of the second leaf and the smallest value of the third ber of children of a node is variable because we store variable- leaf and so on until all leaves are processed. If a node is full, length keys inside a node. we start a new node and store the index to the first leaf that will be a child of this new node as an anchor. In our exam- 4.2.2 Index Operations ple the first node covers the first four leaves and therefore the The CS-Prefix-Tree (as the CSS-Tree) can only be bulk loaded index of the first child in the second node is 4. Note that the in a bottom-up way which means that it is only suitable for the All- nodes are stored continuously in memory. bulked update strategy discussed in Section 2. Using the All-bulked 3. As long as more than one node is created for a certain level of update strategy for the encoding a list of string values means that the CS-Prefix-Tree, we add another level on top with nodes 291

10. Offset to root = 2 Offset to 1st child No of keys Offset vector Keys (reversed) Encode index … 0 1 [29] [bcz] Idx of 1st child No of keys Offset vector Keys (reversed) Idx No Offset vector Keys … Shared leaves 0 3 [29, 25, 22] [bc, amq, am] 4 2 [29, 25] [dfe, cd] value code value code value code value code value code value code value code aab 10 amd 50 amq 80 bca 110 bcz 140 cda 170 dfe 200 aae 20 amk 60 bbd 90 bct 120 bdf 150 cdd 180 aaf 30 amo 70 bbk 100 bcx 130 bdl 160 dfb 190 Figure 6: Example of a CS-Prefix-Tree that store prefixes that distinguish their child nodes. In our number of nodes that we need to store the data. One possibility to example we have one node on top of the lowest level of the overcome this problem is to leverage the idea of a CSB-Tree [17] CS-Prefix-Tree. This node is the root of the tree and we store to use a mix of pointers and offset arithmetics (e.g., one pointer per the offset to this node in the contiguous memory block (2 in node) to identify the correct child and thus allow multiple blocks our example). Since the tree is built bottom up, the nodes of memory instead of one single block. have to be stored in that sequence in memory. 4.2.3 Cost Analysis For subsequent bulks, we use the existing CS-Prefix-Tree to prop- We suggest to set the size of a node of a CS-Prefix-Tree at most agate the string values that are to be encoded to the corresponding to the size of the L2 cache. Thus for propagating a bulk of string leaves. In our current implementation we buffer the string values values from the root of a CS-Prefix-Tree to the leaves, we theoreti- only on the leaf level (and not within the CS-Prefix-Tree). After- cally expect to get one L2 data cache miss (in our simplified cache wards, we do a sort-merge of the existing leaf with the new string model) for each node that is traversed during the lookup for each values stored in the buffers. If the new string values in the buffers string value and another L2 data cache miss when the value is writ- and the values of the existing leaf do not fit into one leaf, we sim- ten to the corresponding buffer of the leaf (which should also be ply create another leaf. This very simple sort-merge strategy can designed to fit into the cache). Moreover, the generation of new be improved in different ways: for example, one could try to create codes and the encoding itself will each cause one L2 data cache equally sized leaves to avoid degenerated leaves that contain only a miss for each leaf in the leaf level (if the leaf and the output buffer small number of strings. Moreover, after encoding several bulks it fit in the L2 cache together). might make sense to reorganize the complete leaf level by filling all The costs for encoding a bulk that contains no new string values leaves to their maximum. Once all updates are processed we bulk (i.e., a pure lookup) is composed of the costs for propagating the load a new CS-Prefix-Tree from the merged leaf level. strings through the tree plus the costs of the lookup in the leaf. If For example, in order to propagate a bulk of strings to the leaves we assume that all strings in the lookup probe have the same length using the CS-Prefix-Tree, we proceed as follows: assume we are and are distributed uniformly, the height of the tree is logk (s/l) looking for value ’amk’ in our example in Figure 6. We start at the (where s is the total number of strings, l is the number of strings root node and see that ’amk’ is smaller than ’bcz’, thus we proceed that fit in one leaf, and k the number of keys that fit into one node). at the first child of this node at offset 0. We find that ’amk’ is The lookup costs on each node are O(log(k)). Thus, the average between ’am’ and ’amq’ and thus proceed at the second child of this costs for propagation are: node at index 1 (calculated from the index of the first child). The CS-Prefix-Tree stores the information that nodes below a certain offset point to leaves instead of internal nodes. O(s ∗ logk (s/l) ∗ log(k)) To rewrite query predicates using the CS-Prefix-Tree, we do a Compared to the CS-Array-Trie, it is more expensive to build a simple lookup with the string constants if it is an equality-predicate CS-Prefix-Tree because the data has to be sorted first and then the or a range-predicate. If it is a prefix-predicate, the prefix is used to CS-Prefix-Tree is loaded bottom-up as opposed to the CS-Array- find the minimal string value that matches the prefix. The lookup Trie that is loaded top-down and implicitly sorts the data during for the prefix will end up at leaf that contains this value even if the that process. We show in our experiments that the CS-Prefix-Tree value itself is not in the dictionary. From that leaf on, we execute performs slightly better than the CS-Array-Trie for a pure lookup a sequential search for the maximum string value that matches the workloads (i.e., encoding a bulk of strings that does not contain prefix. We could save some effort compared to a sequential search, new values) since on average the tree is expected to be less high if we also use the index to find the leaf that holds the maximum than the trie and the leaves are organized more compact. value directly (similar to a skip list). One problem of using a contiguous block of memory and offsets 4.2.4 Parallelization is that the memory has to be allocated in advance. We calculate the Our current implementation supports multiple threads at the leaf maximum amount of memory that all nodes of the CS-Prefix-Tree level. Once the bulk is completely buffered at the leaves, the lookup need by introducing an artificial limit on the maximum length of can be executed in parallel as described for the CS-Array-Trie. the keys in the tree. We then can calculate the minimum number Since we want to support variable length keys we cannot parallelize of keys that fit into one node and thus can estimate the maximal the bottom-up bulk loading of the tree. 292

11. Currently, the bulk lookup on the index is single threaded since each workload uses a different fixed string-length and thus repre- we are not buffering the requests within the tree and thus have no sents a different number of strings. We only used distinct unsorted point to easily distribute the workload to multiple threads. One workloads (i.e., no skew) because these workloads represent the way to parallelize the lookup would be to partition the bulk be- worst case for all lookup operations (i.e., each string value/integer fore accessing the index and process these partitions using multiple code of the leaf is encoded/decoded once for each workload). We threads. Since the buffers at the leaf level would then be shared by used each of these workloads to load a set of leaves that hold the several threads, either locking would be needed or the partitioning workload in a sorted way while for each workload we used differ- has to be done according to the sort order, which is expensive. ent leaf sizes varying from 64kB to 16MB (resulting in a different set of leaves for each combination of workload and leaf size). 5. PERFORMANCE EXPERIMENTS The first goal of this experiment is to show the costs (of the bulk This section shows the results of our performance experiments loading and bulk lookup operations) caused by the different work- with the prototype of our string-dictionary (i.e., the leaf structure loads (of approximately the same size in memory) using leaves with discussed in Section 3 and the cache conscious indexes discussed different sizes without the overhead of an encoding and decoding in Section 4). We executed three experiments: the first experiment index on top (by simulating the pure lookup on the leaves). We (Section 5.1) shows the efficiency of our leaf structure and com- measured the time as well as the L2 cache misses that resulted from pares it to other read-optimized indexes5 , the second experiment executing the bulk loading of the leaves and executing the lookup (Section 5.2) examines the two new cache conscious indexes using operations for encoding as well as decoding. different workloads, and finally the last experiment (Section 5.3) In order to load the leaves, we first sorted the workloads and then shows the overall performance and scalability of our approach. bulk loaded each leaf up to its maximal size (see Section 3). After- We implemented the data structures in C++ and optimized them wards, we generated the integer codes for these leaves. As shown for a 64-bit Suse Linux Server (kernel 2.6.18) with two Intel Xeon in the table before, we use 8 bytes for the integer code in this ex- 5450 CPUs (each having four cores) and 16 GB of main memory. periment to show the memory consumption expected for encoding Each CPU has two L2 caches of 6MB (where two cores share one attributes with a large domain size. In order to measure the pure L2 cache) and one L1 cache for each core with 32kB for data as lookup performance of the leaf structures, we assigned each string well as 32kB for instructions. value of the workloads mentioned above to the corresponding leaf In order to generate meaningful workloads, we implemented a using a buffer and then we looked up the code for each string in the string data generator that allows us to tune different parameters like individual buffers (i.e., we created a corresponding encoded work- the number of strings, string length, alphabet size, the distribution, load for each leaf). Finally, we used the encoded workload to exe- and others. We did not use the TPC-H data generator dbgen, for cute the lookup operation (in the same way as described before) on example, because most string attributes either follow a certain pat- the leaves to decode the integer codes again. tern (e.g., the customer name is composed of the prefix ’Customer’ A second goal of this experiment is to show a comparison of and a unique number) or the domain size of such attributes is too the 16MB leaf structure (which can be used for encoding as well as low (e.g., the name of a country). Thus the data generated by dbgen for decoding) and two cache-conscious read-optimized index struc- does not allow us to generate workloads that let us analyze our data tures using the workloads (1) and (2): for encoding the string values structures with workloads that have certain interesting properties. we compare the leaf structure to the compact-chain hash table [4] We will show the properties of the workloads that we generated for and for decoding integer codes we compare the leaf structure to the each experiment individually. CSS-tree [16]. 5.1 Efficiency of Leaf Structure The main results of this experiment are that (1) the leaf structure In this experiment, we analyze the efficiency of the leaf struc- is optimal when having a medium size (∼ 512kB) and (2) the per- ture discussed in Section 3. The general idea of this experiment formance of our leaf structure is comparable to the read-optimized is to show the performance and memory consumption of the leaf index structures mentioned above while using less memory: operations for two different workloads. We used the parameters in the following table to configure the leaf structure (first part) and to • Figure 7 (a) shows the time and memory that is needed to generate our workloads (second part). bulk load the leaf structure (of 16MB size) compared to bulk loading the compact-chain hash table and the CSS-tree. As a Parameter Value result, we can see that bulk loading the leaf structure is faster Leaf-size l 64kB - 16MB for both workloads (i.e., string length 25 and 100) and uses Offset-size o 4 bytes less memory compared to the compact-chain hash table and Code-size c 8 bytes the CSS-tree. Prefix-size p 1 byte • Figure 7 (b) shows that the time and the L2 data cache misses Decode-interval d 16 (L2CM) for encoding the two workloads of this experiment String-length (1) 25, (2) 100 (using the bulk loaded leaves) are increasing for large leaf String-number (1) ∼ 450000, (2) ∼ 150000 sizes. Moreover, in terms of performance we can see that Alphabet-size 128 the 16MB leaf structure is comparable to the compact-chain Distribution Distinct (unsorted) hash map (Map) while offering sorted access (i.e., the leaf structure is a little slower). The two different workloads ((1) and (2)) were designed that • Finally, in Figure 7 (c) we can see that our leaf structure each of these workloads fits into a leaf with a size of 16 MB while is optimized for decoding: While decoding 450k strings of length 100 using the smallest leaf size takes about 50ms, it 5 We used PAPI to measure the performance counters: takes 200ms to encode them. Moreover, the L2 cache misses http://icl.cs.utk.edu/papi/. for encoding are almost twice as high as for decoding these 293

12. 400 4.5e+06 120 1.5e+06 160 24 Time25 4e+06 Time100 22 Time Map25 L2CM Map25 140 Memory25 350 Time Map100 L2CM Map100 100 1.25e+06 Memory100 20 Time Leaf25 L2CM Leaf25 3.5e+06 120 18 Time Leaf100 L2CM Leaf100 300 3e+06 80 1e+06 L2 cache misses L2 cache misses 16 Time Tree25 L2CM Tree25 100 Time Tree100 L2CM Tree100 Memory [MB] Time [ms] Time [ms] 14 2.5e+06 Time Leaf25 L2CM Leaf25 Time [ms] 80 12 250 60 Time Leaf100 L2CM Leaf100 750000 2e+06 10 60 8 200 1.5e+06 40 500000 40 6 1e+06 4 20 150 20 250000 2 500000 0 0 Leaf Hash Map CSSTree 100 0 0 0 64 128 256 512 1024 2048 4096 8192 16384 64 128 256 512 1024 2048 4096 8192 16384 Leaf size (log) [kB] Leaf size (log) [kB] (a) Bulk Loading (b) Bulk Lookup (Encoding) (c) Bulk Lookup (Decoding) Figure 7: Performance and memory overhead of the leaf structure strings. Finally, compared to the CSS-Tree (Tree) our 16MB 50 leaf structure is again only a little slower when used for de- CS-Array-Trie coding. 40 CS-Prefix-Tree List-Trie 5.2 Efficiency of Encoding Indexes 30 This experiment shows the efficiency of our new encoding in- Time [s] dexes compared to an implementation of the list-trie that decom- 20 poses the strings completely (i.e., it does not use our leaf struc- ture). The idea of this experiment is to show the costs for encod- 10 ing workloads that produce different update patterns on the indexes (which cause different costs). Therefore, we first bulk load a dictio- 0 nary with 10m strings and afterwards encode another bulk of 10m Bulkload Pure lookup Interleaved 10 Interleaved 50 Interleaved 100 Append 100 strings that represents a certain update pattern (details later). All workloads consist of string values with a fixed length of 20 char- acters (while the other parameters to generate the workload are the same as in the experiment before). For the leaves we fixed the max- imum leaf size to be 512kB (while the other leaf parameters are the same as in the experiment before). As update patterns we used: Figure 8: Lookup/update costs of encoding indexes • No-updates: The second bulk contains no new strings. • Interleaved 10%: The second bulk contains 10% new string values where each 10th string in sort order is new. which has more new string values) has a better performance on • Interleaved 50%: The second bulk contains 50% new string the CS-Array-Trie. Comparing the costs of our CS-Array-Trie to values where every other string in sort order is new. the pure list-trie (that also uses buffers to speed-up the lookup of bulks), we can see that our index performs much better. • Interleaved 100%: The second bulk contains 100% new string values whereas each new string is inserted between two string values of the first bulk. 5.3 Scalability of the Dictionary • Append: The second bulk contains 100% new string values This experiment shows the performance of our data structures for whereas all string values are inserted after the last string of unsorted workloads of different sizes with 1m up to 64m distinct the first bulk. strings (with a fixed string length of 10). The remaining configu- The interleaved patterns cause higher costs (independent from ration for data generation was the same as in Section 5.1. For the the index on top) because all leaves must be decompressed and leaves, we used the same configuration as in the experiment before. compressed again to apply the inserts of the new strings. Figure Figure 9 shows the time for encoding these workloads using dif- 8 shows the time to first bulk load the dictionary with 10m strings ferent encoding indexes for the dictionary (starting with an empty and afterwards encode the other 10m strings for the different update dictionary). After encoding, we bulk loaded a decode index (i.e., patterns (when we either use the CS-Array-Trie or the CS-Prefix- a CSS-Tree) from the leaves of the encoding indexes and used the Tree as the encoding index). encoded workloads that we created before for decoding. For exam- While the CS-Prefix-Tree is more expensive to bulk load (be- ple, encoding 8m strings with the CS-Array-Trie takes 4.8s and the cause we need to sort the complete input before bulk loading the decoding takes 1.9s (while using one thread for both operations). index) than the CS-Array-Trie (which supports efficient top-down We also scaled-up the number of threads (up to 16) to show the bulk loading), the workload which represents the no-update pat- effects of parallelization on the CS-Array-Trie. For example, us- tern is faster on the CS-Prefix-Tree because the CS-Prefix-Tree is ing 8 threads reduces time for encoding 8m strings from 4.8s to (1) read-optimized and (2) not as high a the CS-Array-Trie. Fur- 2s. Figure 9 does not show the results for 16 threads because the thermore, in general a workload that is more update-intensive (i.e., performance slightly decreased compared to 8 threads due to the 294

13. 100 discussed a concrete leaf structure and two new cache-conscious indexes that can leverage the shared-leaves indexing approach. As a part of our future work we plan to investigate different or- der preserving encoding schemes that are optimal for certain up- date patterns. Moreover, we want to analyze how to efficiently per- 10 sist the dictionary data. Finally, the distribution of the dictionary Time (log) [s] over different nodes is very important for efficiently supporting the scale-up of the data in column stores. 1 8. REFERENCES CSATrie, enc [1] D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression CSPTree, enc and Execution in Column-Oriented Database Systems. In SIGMOD CSATrie, dec CSPTree, dec Conference, pages 671–682, 2006. CSATrie, enc, 4T [2] G. Antoshenkov, D. B. Lomet, and J. Murray. Order Preserving CSATrie, enc, 8T String Compression. In ICDE, pages 655–663, 1996. 0.1 [3] N. Askitis and R. Sinha. HAT-Trie: A Cache-Conscious Trie-Based 1 2 4 8 16 32 64 Data Structure For Strings. In ACSC, pages 97–105, 2007. Millions of Strings (log) [4] N. Askitis and J. Zobel. Cache-Conscious Collision Resolution in String Hash Tables. In SPIRE, pages 91–102, 2005. Figure 9: Scalability of the Dictionary [5] R. Bayer and K. Unterauer. Prefix B-Trees. ACM Trans. Database Syst., 2(1):11–26, 1977. [6] J. L. Bentley and R. Sedgewick. Fast Algorithms for Sorting and overhead for thread synchronization. Moreover, pinning individual Searching Strings. In SODA, pages 360–369, 1997. threads to a single CPU to avoid thread migration did not improve [7] P. Bohannon, P. McIlroy, and R. Rastogi. Main-Memory Index Structures with Fixed-Size Partial Keys. In SIGMOD Conference, the performance as well. pages 163–174, 2001. [8] Z. Chen, J. Gehrke, and F. Korn. Query Optimization In Compressed Database Systems. In SIGMOD Conference, pages 271–282, 2001. [9] H. Garcia-Molina and K. Salem. Main Memory Database Systems: 6. RELATED WORK An Overview. IEEE Trans. Knowl. Data Eng., 4(6):509–516, 1992. There has been recent work on dictionary compression in column- [10] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing Relations stores [1, 12, 24]. As mentioned before, this work focuses on small and Indexes. In ICDE, pages 370–379, 1998. dictionaries for attributes with a stable domain size. Other work on [11] G. Graefe and P.-Å. Larson. B-Tree Indexes and CPU Caches. In dictionary compression of strings [2, 8] generates variable-length ICDE, pages 349–358, 2001. integer keys to support the order-preserving encoding of attributes [12] S. Harizopoulos, V. Liang, and D. J. Abadi. Performance Tradeoffs in with a variable domain size. However, none of this work addresses Read-Optimized Databases. In VLDB, pages 487–498, 2006. [13] D. E. Knuth. The Art of Computer Programming, Volume III: Sorting the problem of efficiently encoding a huge set of variable-length and Searching. Addison-Wesley, 1973. string values using fixed-length integer keys and how to efficiently [14] D. B. Lomet. The Evolution of Effective B-tree: Page Organization support updates on such a dictionary without having to re-encode and Techniques. SIGMOD Record, 30(3):64–69, 2001. all existing data. Furthermore, to the best of our knowledge, there [15] D. R. Morrison. PATRICIA - Practical Algorithm To Retrieve exists no work that exploits the idea of different indexes sharing the Information Coded in Alphanumeric. J. ACM, 15(4):514–534, 1968. same leaves. [16] J. Rao and K. A. Ross. Cache Conscious Indexing for Moreover, there exists a lot of work on cache-conscious indexes Decision-Support in Main Memory. In VLDB, pages 78–89, 1999. and index compression for numerical data [16, 17, 10, 14, 11]. [17] J. Rao and K. A. Ross. Making B+ -Trees Cache Conscious in Main However, not much work is focused on cache-conscious indexing Memory. In SIGMOD Conference, pages 475–486, 2000. of string values. Compared to our indexes, [3] presents a cache- [18] R. Sinha, J. Zobel, and D. Ring. Cache-Efficient String Sorting Using Copying. ACM Journal of Experimental Algorithms, 11, 2006. conscious trie that holds the string values in a hash table as its [19] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, leaf structure and thus does not hold the strings in sort order. [5] M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O’Neil, P. E. O’Neil, presents a string index that holds the keys in sort order but is not A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A Column-oriented designed to be cache-conscious. DBMS. In VLDB, pages 553–564, 2005. An area that is also generally related to our work is the work on [20] S. P. Vanderwiel and D. J. Lilja. Data Prefetch Mechanisms. ACM indexing in main memory databases [7, 9]. Finally, we applied the Comput. Surv., 32(2):174–199, 2000. ideas of [22] for buffering index lookups to increase cache locality [21] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes (2nd to our indexes and showed the benefits that buffering can also help ed.): Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999. for bulk loading these indexes. [22] J. Zhou and K. A. Ross. Buffering Accesses to Memory-Resident Index Structures. In VLDB, pages 405–416, 2003. 7. CONCLUSIONS AND FUTURE WORK [23] M. Zukowski, P. A. Boncz, N. Nes, and S. Héman. MonetDB/X100 - A DBMS In The CPU Cache. IEEE Data Eng. Bull., 28(2):17–22, In this paper, we have shown an approach for efficiently using 2005. dictionaries for compressing a large set of variable-length string [24] M. Zukowski, S. Héman, N. Nes, and P. A. Boncz. Super-Scalar values with fixed-length integer keys in column stores. The dic- RAM-CPU Cache Compression. In ICDE, page 59, 2006. tionary supports updates (i.e., inserts of new string values) without changing codes for existing values in many cases. Furthermore, we have presented a new approach for indexing such a dictionary (called shared-leaves) that compresses the dictionary itself while offering efficient access paths for encoding and decoding. We also 295