SQL Server Column Store Indexes

The SQL Server 11 release (code named “Denali”) introduces a new data warehouse query acceleration feature based on a new index type called a column store index. The new index type combined with new query operators processing batches of rows greatly improves data warehouse query performance: in some cases by hundreds of times and routinely a tenfold speedup for a broad range of decision support queries. Column store indexes are fully integrated with the rest of the system, including query processing and optimization.

1. SQL Server Column Store Indexes Per-Åke Larson, Cipri Clinciu, Eric N. Hanson, Artem Oks, Susan L. Price, Srikumar Rangarajan, Aleksandras Surna, Qingqing Zhou Microsoft {palarson, ciprianc, ehans, artemoks, susanpr, srikumar, asurna, qizhou}@microsoft.com ABSTRACT SQL Server column store indexes are “pure” column stores, not a The SQL Server 11 release (code named “Denali”) introduces a hybrid, because they store all data for different columns on new data warehouse query acceleration feature based on a new separate pages. This improves I/O scan performance and makes index type called a column store index. The new index type more efficient use of memory. SQL Server is the first major combined with new query operators processing batches of rows database product to support a pure column store index. Others greatly improves data warehouse query performance: in some have claimed that it is impossible to fully incorporate pure column cases by hundreds of times and routinely a tenfold speedup for a store technology into an established database product with a broad broad range of decision support queries. Column store indexes are market. We’re happy to prove them wrong! fully integrated with the rest of the system, including query To improve performance of typical data warehousing queries, all a processing and optimization. This paper gives an overview of the user needs to do is build a column store index on the fact tables in design and implementation of column store indexes including the data warehouse. It may also be beneficial to build column enhancements to query processing and query optimization to take store indexes on extremely large dimension tables (say more than full advantage of the new indexes. The resulting performance 10 million rows). After that, queries can be submitted unchanged improvements are illustrated by a number of example queries. and the optimizer automatically decides whether or not to use a column store index exactly as it does for other indexes. Categories and Subject Descriptors H.2.4 [Database Management]: Systems – relational databases, We illustrate the potential performance gains by an example query Microsoft SQL Server against a 1TB TPC-DS [10] test data warehouse. In this database, the catalog_sales fact table contains 1.44 billion rows. The General Terms following statement created a column store index containing all Algorithms, Performance, Design 34 columns of the table: CREATE COLUMNSTORE INDEX cstore on catalog_sales Keywords ( cs_sold_date_sk, cs_sold_time_sk, ··· Columnar index, column store, OLAP, data warehousing. ···,cs_net_paid_inc_ship_tax, cs_net_profit) 1. INTRODUCTION We ran the following typical star-join query on a pre-release build Database systems traditionally store data row-wise, that is, of Denali, with and without the column store index on the fact values from different columns of a record are stored together. This table. All other tables were stored row-wise only. The test data organization works well for transaction processing where machine had 40 cores (hyperthreading was enabled), 256GB of requests typically touch only a few records. However, it is not memory, and a disk system capable of sustaining 10GB/sec. well suited for data warehousing where queries typically scan select w_city, w_state, d_year, many records but touch only a few columns. In this case, a SUM(cs_sales_price) as cs_sales_price column-wise organization where values from the same column in from warehouse, catalog_sales, date_dim different records are stored together performs much better. It where w_warehouse_sk = cs_warehouse_sk reduces the data processed by a query because the query reads and cs_sold_date_sk = d_date_sk only the columns that it needs and, furthermore, column-wise data and w_state = 'SD' can be compressed efficiently. Systems using column-wise and d_year = 2002 storage are usually referred to as column stores. group by w_city, w_state, d_year order by d_year, w_state, w_city; SQL Server is a general-purpose database system that stores data in row format. To improve performance on data warehousing The query was run twice, first with a cold buffer pool and then queries, SQL Server 11.0 (code named “Denali”) adds column- with a warm buffer pool. With a warm buffer pool, the input data wise storage and efficient column-wise processing to the system. for the query all fit in main memory so no I/O was required. This capability is exposed as a new index type: a column store index. That is, an index can now be stored either row-wise in a B- Table 1: Observed CPU and elapsed times (in sec) tree or column-wise in a column store index. Cold buffer pool Warm buffer pool Permission to make digital or hard copies of all or part of this work for CPU Elapsed CPU Elapsed personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that Row store only 259 20 206 3.1 copies bear this notice and the full citation on the first page. To copy Column store 19.8 0.8 16.3 0.3 otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Improvement 13X 25X 13X 10X SIGMOD’10, June 12–16, 2011, Athens, Greece. Copyright 2011 ACM 978-1-4503-0661-4/11/06…$10.00. 1177

2.The results are shown in Table 1. The column store index Compressed improves performance dramatically: the query consumes 13 times A B C D column segments less CPU time and runs 25 times faster with a cold buffer pool and 10 times faster with a warm buffer pool. SQL Server column store Encode, technology gives subsecond response time for a star join query compress against a 1.44 billion row table on a commodity machine. This level of improvement is significant, especially considering that SQL Server has efficient and competitive query processing capabilities for data warehousing, having introduced star join Encode, query enhancements in SQL Server 2008. compress The machine used has a high-throughput I/O system (10GB/sec) which favors the row store. On a machine with a weaker I/O system, the relative improvement in elapsed time would be even Encode, higher. compress The rest of the paper provides more detail about column store indexes. Section 2 describes how they are stored including how they are compressed. Section 3 describes extensions to query Figure 1: Converting rows to column segments processing and query optimization to fully exploit the new index type. Section 4 provides some experimental results and section 5 summarizes related work. 2. INDEX STORAGE SQL Server has long supported two storage organization: heaps (unordered) and B-trees (ordered), both row-oriented. A table or a materialized view always has a primary storage structure and may have additional secondary indexes. The primary structure can be either a heap or a B-tree; secondary indexes are always B-trees. SQL Server also supports filtered indexes, that is, an index that stores only rows that satisfy a given selection predicate. Column store capability is exposed as a new index type: a column store index. A column store index stores its data column-wise in compressed form and is designed for fast scans of complete columns. While the initial implementation has restrictions, in principle, any index can be stored as a column store index, be it primary or secondary, filtered or non-filtered, on a base table or on a view. A column store index will be able to support all the Figure 2: Storing column segments same index operations (scans, lookups, updates, and so on) that additional metadata about each segment such as number of rows, heaps and B-tree indices support. All index types are functionally size, how data is encoded, and min and max values. equivalent but they do differ in how efficiently various operations can be performed. Storing a column store index in this way has several important benefits. It leverages the existing blob storage and catalog 2.1 Column-Wise Index Storage implementation - no new storage mechanisms are needed – and We now outline how a column store index is physically stored. many features are automatically available. Locking, logging, Figure 1 illustrates the first step that converts rows to column recovery, partitioning, mirroring, replication and other features segments. The set of rows to be stored is first divided into row immediately work for the new index type. groups, each group consisting of, say, one million rows. Each row group is encoded and compressed independently. The result is one 2.2 Data Encoding and Compression compressed column segment for each column included. Figure 1 Data is stored in a compressed form to reduce storage space and shows a table divided into three row groups where three of the I/O times. The format chosen allows column segments to be used four columns are included in the column store index. The result is without decompression in query processing. Compressing the nine compressed column segments, three segments for each of columns in a segment consists of three steps. columns A, B, and C. 1. Encode values in all columns. The column segments are then stored using existing SQL Server 2. Determine optimal row ordering. storage mechanisms as shown in Figure 2. Each column segment 3. Compress each column is stored as a separate blob (LOB). Segment blobs may be large, 2.2.1 Encoding requiring multiple pages for storage, but this is automatically The encoding step transforms column values into a uniform type: handled by the existing blob storage mechanisms. A segment a 32-bit or 64-bit integer. Two types of encoding are supported: a directory keeps track of the location of each segment so that all dictionary based encoding and a value based encoding. segments of a given column can be easily located. The directory is stored in a new system table and visible through the catalog view The dictionary based encoding transforms a set of distinct values sys.column_store_segments. The directory also contains into a set of sequential integer numbers (data ids). The actual 1178

3.values are stored in a data dictionary, essentially an array that is 2.3 I/O and Caching indexed by data ids. Each data dictionary is stored in a separate A blob storing a column segment or dictionary may extend over blob and kept track of in a new system table which is visible multiple disk pages. When brought into memory, column through the catalog view sys.column_store_dictionaries. segments and dictionaries are stored not in the page-oriented The value based encoding applies to integer and decimal data buffer pool but in a new cache designed for handling large types. It transforms the domain (min/max range) of a set of objects. Each object in the cache is stored in consecutive storage, distinct values into a smaller domain of integer numbers. The not scattered across discrete pages. This simplifies and speeds up value based encoding has two components: exponent and base. scanning of columns because there are no “page breaks”. For decimal data types the smallest possible positive exponent is To improve I/O performance, read-ahead is applied both within chosen so that all values in a column segment can be converted and among segments. That is, when reading a blob storing a into integer numbers. For example, for values 0.5, 10.77, and column segment, read-ahead is applied at the page level. A 1.333, the exponent would be 3 (1000) and the converted integer column may be stored as multiple segments so read-ahead is also numbers would be 500, 10770, and 1333 correspondingly. applied at the segment level. Finally, read-ahead is also applied to data dictionaries. For integer data types the smallest possible negative exponent is chosen so that the distance between the min and the max integer For on-disk storage additional compression could be applied. values in a column segment is reduced as much as possible When a column segment is written to disk, it could be further without losing precision. For example, for values 500, 1700, and compressed by applying some standard streaming compression 1333000, the exponent would be -2 (1/100) and the converted technique and automatically decompressed when being read into integer numbers would be 5, 17, and 13330. memory. Whether or not to apply additional compression is a tradeoff: it reduces both disk space and I/O requirements but Once the exponent is chosen and applied, the base is set to the min integer number in the column segment. Each value in the column increases CPU load. segment is then adjusted (rebased) by subtracting the base from 3. QUERY PROCESSING AND the value. For the decimal example above, the base would be 500 and the final encoded values (data ids) would be 0, 10270, and OPTIMIZATION 833. For the integer example, the base would be 5 and the final 3.1 Query Processing Enhancements encoded values would be 0, 12, and 13325. For queries that scan a large number of rows, using a column store index may reduce the amount of data read from disk by orders of 2.2.2 Optimal Row Ordering magnitude. Such a large reduction in disk I/O very likely causes Significant performance benefits accrue from operating directly CPU resources to become the next bottleneck. To keep the system on data compressed using run-length encoding (RLE), so it is balanced it was thus necessary to significantly reduce CPU important to get the best RLE compression possible. RLE gives consumption for queries processing large numbers of rows. the best compression when many identical values are clustered together in a column segment. Since the ordering of rows within a Standard query processing in SQL Server is based on a row-at-a- row group is unimportant, we can freely rearrange rows to obtain time iterator model, that is, a query operator processes one row at the best overall compression. For a schema containing only a a time. To reduce CPU time we introduced a new set of query single column, we get the best clustering simply by sorting the operators that instead processes a batch of rows at a time. As has column as this will cluster identical values together. For schemas been observed before [8], batch-at-a-time processing significantly with two or more columns it is not that simple - rearranging rows reduces the overhead for data movement between operators. The based on one column can negatively affect clustering of identical batch operators are optimized for data warehouse scenarios; they values in other columns. are not intended as replacements for row-at-a-time operators in OLTP workloads. We use the Vertipaq™ algorithm to rearrange rows within a row group in an order that achieves maximal RLE compression. This We chose not to implement a new engine specialized for data patented compression technology is shared with SQL Server warehousing applications but instead opted to extend the existing Analysis Services and PowerPivot. SQL Server engine. This has several advantages. 2.2.3 Compression 1. Customers don’t have to invest in a new engine and transfer Once the rows within a row group have been rearranged, each data between engines. It is the same SQL Server system that column segment is compressed independently using RLE they are already used to; it just has a new index type that compression or bit packing. greatly speeds up decision support queries. This is an important and tangible customer benefit. RLE (run-length encoding) compression stores data as a sequence of <value, count> pairs. The actual value is a 32-bit or 64-bit 2. It greatly reduces implementation costs. SQL Server is a number containing either an encoded value or a value stored as is. mature product with lots of features that are automatically available, for example, query plan diagrams, query execution RLE compression thrives on long runs of identical values. If a statistics, SQL profiling, SQL debugging, and so on. column contains few long runs, RLE compression may even increase the space required. This occurs, for example, when all 3. A query plan can mix the two types of operators. The query values are unique. Since values in a column segment get encoded optimizer has been extended to select the best operator type into a smaller domain of integer numbers (data ids) in most cases, to use. It typically choses the new faster operators for the the actual range of encoded values will usually require fewer bits expensive parts of a query that process large numbers of to represent each encoded value. Therefore, we also support a bit- rows. pack compression and different bit-pack compression sizes. 1179

4.4. Queries may dynamically switch at runtime from using batch Column store indexes are further complemented by the new batch operators to using row operators as necessary. For example, operators. The set of batch operators is limited in the initial currently joins and aggregations that spill to disk release and it is up to the query optimizer to ensure batch automatically switch to row-at-a-time operators. execution is used where possible and beneficial. Mixing batch and 5. We get feature orthogonality. For example, the new operators row operators is possible but converting data from one format to support the majority of existing SQL Server data types, the other has its own costs. Rather than creating a complex cost session level settings and so on. Any SQL query can take based mechanism to decide when to convert between the two advantage of the faster processing offered by the new modes the initial implementation uses a more conservative operators, not just stylized star-join queries. approach that limits the number of conversions. The new batch iterator model is independent of the access We introduced the concept of a batch segment in a query plan. A methods supported by different data sources. Similarly to access batch segment consists of a sequence of batch operators that methods for other data sources, access methods for column store execute using the same set of threads. An operator can only indexes support filtering by predicates and bitmap filters. extend the segment to a single child. In our implementation a join Whenever possible they perform operations directly on always extends its segment to the probe side. A new compressed data. The new batch operators are then typically used BatchHashTableBuild operator builds the hash table for a batch for the data intensive part of the computation, performing more hash join. It is a batch operator that can accept input either in complex filtering, projections, joins and aggregation. Row batch or row mode. operators may sometimes be needed to finish the computation. In order to make generation of these plan shapes possible, the Certain access methods may expose additional optimizations such query optimizer introduced a new physical property used to as delayed string materialization, and the new iterators are distinguish between batch and row output of an operator. Using designed to transparently take advantage of these features this property, it is now easy to make sure we do not have whenever possible. unnecessary transitions between row and batch operators. All batch operators will request that their children provide batches While processing batches of rows at a time can by itself achieve and all row operators will request their children to provide rows. very significant reduction in processing time, it cannot achieve Transitioning between the two modes is achieved by the ability of orders of magnitude reduction. some operators to output either rows or batches. Several additional improvements were also implemented: As stated above, batch hash join does not build its hash table; it is 1. The new iterators are fully optimized for the latest generation built by the BatchHashTableBuild operator. BatchHashTableBuild of CPUs like Intel Nehalem or AMD Opteron architecture. In also has the ability to build a bitmap (Bloom filter) on the hash particular, the algorithms were designed to take advantage of key; this bitmap can be used to filter out rows early from the increased memory throughput and better utilize 64-bit probe side input. These bitmaps are even more important when processing and multicore CPUs. multiple batch hash joins are stacked. Bitmap filters are typically very effective in filtering out rows early. In the same way as 2. The bitmap filter implementation was modified to have regular selection predicates they can be pushed down all the way several specific data layouts based on data distribution. into the storage engine (access method). 3. Runtime resource management was improved by making We also introduced new methods for handling multiple joins. The operators share available memory with each other in a more SQL Server optimizer tries to collapse inner joins into a single n- flexible manner. ary join operator. Our plan generation for column store indexes For queries that process large numbers of rows, the net result of uses the n-ary join as the starting point because the n-ary join all these improvements is order of magnitude performance gives us the benefit of being able to inspect the entire join graph at improvements. Row-at-a-time based operators are still preferred once rather than having to deal with just a subset of the tables for short OLTP queries. The query optimizer automatically involved. Our algorithm consists of several steps. chooses the proper operators at query compilation time; no We first go over the expressions joined together and analyze the application changes are required. join predicates involved, trying to identify which join keys are 3.2 Query Optimization unique. Using the uniqueness information we identify fact tables In order to leverage the storage engine and query execution as the tables that do not have any unique key involved in a join. capabilities described above, several changes were required in the Once this is done, starting from the smallest fact table, we expand query optimizer. Unlike regular indexes, column store indexes do the join graph to cover as many dimension tables as possible by not (efficiently) support point queries and range scans, do not only traversing many-to-one relationships. This will build a snow- offer any sort order and, since sorting is not required to build flake around the fact table. We then continue with the other fact them, they do not offer any statistics. All these properties were tables in increasing size order. This step ends when there are no “de facto” assumptions in the row store world. On the other hand, expressions from the original set left. column store indexes offer high data compression, highly If multiple fact tables were identified initially, we will have effective usage of modern CPUs, and much reduced I/O. While multiple snowflakes and these snowflakes need to be joined point queries will still heavily favor row stores, queries requiring together. Beginning with the snowflake with the largest fact table, expensive scans will benefit greatly from column store indexes. we recursively add neighboring snowflakes as dimensions of the The query optimizer can get very accurate information about the snowflake built so far. At this point we have a single snowflake actual on-disk size of columns and uses this information to expression and can start generating the final plan shape. estimate the amount of IO required. 1180

5.First we try to identify which joins are worthy candidates for 4.2 Example queries building bitmaps. Once the bitmap expressions are identified we In this section we illustrate the performance improvements that start building a right deep join tree with the dimensions on the left can be achieved by using column store indexes. Four queries side and the fact table on the right side of the rightmost join. Keep against a TPC-DS [10] database at scale factor 100 are included. in mind that each dimension can in turn be another snowflake and the algorithm has to expand them recursively. At each join, certain A TPC-DS database at scale factor 100 is intended to require conditions are checked to ensure batch execution compatibility. If about 100GB for the base tables. For SQL Server the space the fact table does not meet the criteria for batch execution then requirements were 92.3 GB for data, 15.3GB for secondary (row the tree generated cannot use batch hash joins. If the fact table store) indexes and 36.6GB for column store indexes covering all meets the criteria, then each join is analyzed and all the joins that columns on every table. Our example queries used the five tables are batch-able are placed at the bottom and the remaining joins are listed below. Each table had a B-tree clustering index and a left at the top. column store index that included all columns. The two larger (fact) tables had one secondary index each but not the three In practice, this algorithm reliably builds maximal-size batch hash smaller (dimension) tables. We found that dropping the column join pipelines for star join sub-expressions within a query, for star store indexes on the three small tables had virtually no effect on joins centered on a fact table with a column store index. the observed times. The optimizer chooses the best plan based on estimated costs. The Catalog_sales (144M rows) optimizer’s cost model of course had to be augmented to include the new batch operators and column store indexes.  Clustering index: cs_sold_date_sk_cluidx (cs_sold_date_sk)  Secondary index: cs_item_sk_cs_order_number_idx 4. EXPERIMENTAL RESULTS (cs_item_sk, cs_order_number) This section presents early experimental results on compression  Column store index: catalog_sales_cstore(all columns) rates for six databases and performance improvement for four Catalog_returns (14.4M rows) example queries. All results were obtained on a pre-release build of SQL Server 11.0 (code named “Denali”).  Clustering index: cr_returned_date_sk_cluidx (cr_returned_date_sk) 4.1 Data Compression  Secondary index: cr_item_sk_cr_order_number_idx Compression ratios for artificially generated data are uninteresting (cr_item_sk, cr_order_number) at best and misleading at worst; what matters are the results on  Column store index: catalog_returns_cstore(all columns). real data. Table 2 below shows the size of an uncompressed fact Customer_address (1M rows) table, the size of a column store index containing all columns of the table, and the compression ratio for six real data sets. The  Clustering index: pk_ca_address_sk(ca_address_sk) “Cosmetics” data set is from an orders fact table from a cosmetics  Column store index: customer_address_cstore (all columns) manufacturer. The “SQM” data set is from an internal Microsoft Item (204,000 rows) data warehouse that tracks SQL Server usage patterns. The  Clustering index: pk_i_item_sk(i_item_sk) “Xbox” data set tracks use of Xbox Live. “MSSales” contains  Column stored index: Item_cstore (all columns) Microsoft sales data. The “Web Analytics” data set contains web clickstream information. The “Telecom” data set contains call Date_dim (73049 rows): detail records from a telecommunications data warehouse.  Clustering index: pk_d_date_sk(d_date_sk) Table 2: Column store compression on real data sets  Column store index: date_dim_cstore (all columns) Uncompressed Column Com- The experiments were run on a small commodity server with the Data Set  table size store index pression following characteristics: Nehalem EP Xeon L5520 with two (MB)  size (MB)  Ratio  four-core processors for a total of eight cores running at 2.27GHz, hyper-threading off, 24G of memory, four 146G SAS drives in Cosmetics  1,302  88.5  14.7 RAID0 configuration, each with read throughput in the range 100 SQM  1,431  166  8.6 to 125 MB/sec. Xbox  1,045  202  5.2 We ran the four queries in two ways: a) restricting the optimizer to use only row store indexes and b) allowing, but not forcing, the MSSales 642,000 126,000 5.1 optimizer to use column store indexes. Each query was run twice, Web Analytics  2,560  553  4.6 in isolation, first with a cold buffer pool and second with a warm buffer pool. The database is large enough that all the required data Telecom  2,905  727  4.0 did not fit in memory. Tables 3 and 4 below show the observed   elapsed times and total CPU times (in seconds) and the improvement when using column store indexes. On average, column store compression is about 1.8 times more We will discuss the individual queries in more detail below but effective than SQL Server PAGE compression, the highest form they can be briefly characterized as follows. Query 1 is a of compression implemented for the SQL Server row store straightforward star-join query with one fact table and two structures (heaps and B-trees), which was introduced in SQL dimension tables. Query 2 exemplifies a drill-down query with a Server 2008. In other words, a column store index is about 1/1.8 = very restrictive predicate where using a column store index 0.56 times the size of a PAGE compressed B-tree or heap provides little or no benefit. Query 3 is a narrow single-table containing the same data. 1181

6.query with several expensive expressions. Query 4 is a complex (Bloom filters). The predicate d_year > 2001 is pushed all the way query with a common table expression and a subquery. down and reduces the stream from date_dim_cstore from 73K to As shown in tables 3 and 4, all but Q2 show an improvement of 35.8K. While the hash table on date_dim is built, a bitmap on the over 10X in elapsed time with a warm buffer pool. With a cold join column d_date_sk is constructed. The input from buffer pool the improvements in elapsed time are somewhat less catalog_sales_cstore is immediately filtered using the bitmap but still very substantial. reducing it from 144M to 29.1M rows. The data is reduced in two ways: a) some column segments can be immediately eliminated Table 3: Observed query times (sec) with warm buffer pool based on segment metadata alone and b) in the remaining Warm Row store only Column store Improvement segments, rows with no match in the bitmap are eliminated. Query Elapsed CPU Elapsed CPU Elapsed CPU Q1 4.9 36.4 0.3 1.9 16.4X 19.3X Q2 0.2 1.4 0.3 1.2 0.8X 1.2X Q3 21.0 166.9 1.8 13.4 11.9X 12.5X Q4 49.2 101.6 4.9 30.0 10.1X 3.4X Table 4: Observed query times (sec) with cold buffer pool Cold Row store only Column store Improvement Query Elapsed CPU Elapsed CPU Elapsed CPU Q1 19.7 35.6 1.6 2.0 12.3X 18.2X Q2 1.7 0.9 1.3 1.3 1.3X 0.7X Q3 55.1 168.4 8.5 13.7 6.5X 12.3X Q4 55.5 102.3 7.2 20.5 7.7X 3.4X 4.2.1 Query one This query is a typical OLAP query: a simple star-join query over three tables with catalog_sales as the fact table and two dimension tables, date_dim and item. The only selection predicate is on 4.2.2 Query two date_dim. The next query is similar to Q1 but with a very restrictive select i_brand, count(*) selection predicate. It is an example of a drill-down query that an from catalog_sales, date_dim, item analyst might issue to get more detailed information. It illustrates where cs_sold_date_sk = d_date_sk and the fact that the optimizer may choose a plan that includes both cs_item_sk = i_item_sk and d_year > 2001 row store and column store indexes. group by i_brand select i_brand, count(*) from catalog_sales, date_dim, item The query plan selected by the optimizer when allowed to use where cs_sold_date_sk = d_date_sk and column store indexes is shown in Figure 3. This plan was 16.4X cs_item_sk = i_item_sk and d_year = 2001 and faster with a warm buffer pool and 12.3X faster with a cold buffer d_moy = 11 and d_weekend = 'Y' pool than the best plan with row store indexes only. The query group by i_brand processed over 144M tuples in less than a third of a second. With a warm buffer pool, the row store only plan is slightly faster The optimizer chose to use column store indexes for all three but, with a cold buffer pool, it is slightly slower. tables. Execution begins by scanning the two dimension tables and building hash tables. Next multiple threads scan different The optimizer selects all indexes based on estimated cost. When segments of catalog_sales column store index and probe the hash allowed to use column store indexes, the optimizer chose a mix of tables in parallel. Batched partial aggregation is then performed column store and row store indexes. Execution begins by scanning on each stream of joined tuples after which the tuples are date_dim_cstore which produces only nine rows because of the redistributed on i_brand to be able to finish the aggregation. Final highly restrictive predicate. The result is then joined with aggregation is completed in parallel on each stream and finally the catalog_sales using a nested-loop join against the clustering index result tuples are gathered into a single output stream. which is a row store index. This produces an output of 1.4M rows which is then preaggregated on cs_item_sk reducing it to 102K The plan also illustrates how batch and row processing can be rows. The result is hash-joined with the item table using the mixed in the same plan. The bulk of the processing is done in column store index item_cstore and final aggregation is done batch mode. Only the final processing, from repartition on, is using hash aggregation. done in row mode but by then only 5416 rows remain. Evaluation of certain predicates can be pushed all the way down 4.2.3 Query three into the storage engine. This includes selections with bitmaps Q3 (on the next page) is an aggregation query over a single table where the aggregation contain somewhat more complex (and 1182

7.expensive) expressions. Batched expression evaluation is much more efficient than evaluation one row at a time. select cs_warehouse_sk, sum(cs_sales_price*(1-cs_ext_discount_amt) as s1, sum(cs_sales_price*(1-cs_ext_discount_amt)* (1 + cs_ext_tax)) as s2, avg(cs_quantity) as avg_qty, avg(cs_coupon_amt) as avg_coupon from catalog_sales where cs_sold_date_sk > 2450815+500 and cs_sold_date_sk < 2452654-500 and cs_quantity >=1 group by cs_warehouse_sk order by cs_warehouse_sk; With a warm buffer pool, the column store plan reduces the elapsed time from 21.0 sec to 1.8 sec and the CPU time from 166.9 sec to 13.4 sec. We will outline the main causes of this reduction in a moment. The query plan using the column store index is shown in Figure 4. The selection predicates are all pushed into the storage engine so Query four is shown in Figure 5 below. It is a complex query with no filtering operator is required. The “compute scalar” operator a common table expression and a subquery in the where-clause. It computes the arithmetic expression used in the aggregates. Each illustrates the fact that the full power of SQL Server’s query parallel stream is then preaggregated, after which the streams are processing is available; arbitrarily complex queries can make use gathered for the final aggregation which is done in row mode. of column store indexes. The row-store only plan is exactly the same except it begins with The use of column store indexes reduced the elapsed time from a range scan of the clustering index. The predicate on cs_sold_- 49.2 sec to 4.9 sec (with warm buffer pool) and CPU time from date_sk defines the range so only part of the index is scanned. 101.6 sec to 30.0 sec. So if the plans are the same, what explains the huge difference in The execution plan is too large to include here but the overall elapsed time and CPU time? The reduced elapsed time is caused structure is relatively simple. The outer query joining partly by reduced CPU time and partly by reduced I/O time. The customer_total_return, customer_address, and customer is reduced I/O time is caused by two factors: scanning only a few computed entirely in batch mode using column store indexes. The columns and quick elimination of many segments based on subquery is then computed, with the joins done in batch mode but metadata alone. The reduced CPU time is caused by the fact that aggregation done in row mode. Next, the result of the subquery is the batched operators and evaluation of expressions are much joined with the result of the outer query. Up to this point, more efficient. In batched expression evaluation, significant everything has been done in parallel. Finally, the different streams savings accrue from switching between the query execution are gathered and the top operator evaluated. The plan processes a component and the expression evaluation component once per total of 87.9M rows and about 62% is done in batch mode. With batch instead of once per row. The average cost of arithmetic around 1/3 of the rows processed in row mode, we can get at most operations is reduced to around 10~20 cycles per operation. 1/(1/3)=3 times improvement if the plan shape remains similar. 4.2.4 Query four We obtained 3.4 X CPU improvements but the improvement in with customer_total_return as (select cr_returning_customer_sk as ctr_customer_sk, ca_state as ctr_state, sum(cr_return_amt_inc_tax) as ctr_total_return from catalog_returns, date_dim, customer_address where cr_returned_date_sk = d_date_sk and d_year >=2000 and cr_returning_addr_sk = ca_address_sk group by cr_returning_customer_sk, ca_state ) select top 5 c_customer_id, c_salutation, c_first_name, c_last_name, ca_street_number, ca_street_name, ca_street_type, ca_suite_number,ca_city, ca_county, ca_state, ca_zip, ca_country, ca_gmt_offset, ca_location_type, ctr_total_return from customer_total_return ctr1, customer_address, customer where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 from customer_total_return ctr2 where ctr1.ctr_state = ctr2.ctr_state) and ca_address_sk = c_current_addr_sk and ca_state = 'GA' and ctr1.ctr_customer_sk = c_customer_sk order by c_customer_id, c_salutation, c_first_name, c_last_name, ca_street_number, ca_street_name,ca_street_type, ca_suite_number, ca_city, ca_county, ca_state, ca_zip, ca_country, ca_gmt_offset, ca_location_type, ctr_total_return; Figure 5: Query 4 1183

8.elapsed time is much higher. This is because the row-mode-only 7. ACKNOWLEDGMENTS execution plans causes some intermediate results to spill to disk Many other people have contributed to the success of this project. while the batch mode plan avoids this. We would especially like to thank Amir Netz for his many ideas 5. WORK IN PROGRESS and for challenging us, Cristian Petculescu for answering endless For reasons of scope and schedule, direct update and load of questions, Hanuma Kodavalla for initiating and nurturing the tables with column store indexes is not supported in the Denali project, and Jose Blakeley and Mike Zwilling for their advice and release of SQL Server, but we still accommodate the need to continuing support of the project. modify and load data. In Denali, you can add data to tables in a number of ways. If the table is small enough, you can drop its 8. REFERENCES [1] Abadi, D.J., Madden, S.R., and Ferreira, M.: Integrating column store index, perform updates, and then rebuild the index. compression and execution in column-oriented database Column store indexes fully support range partitioning. So for systems. SIGMOD, 2006, 671-682. larger tables, you can use partitioning to load a staging table, index it with a column store index, and switch it in as the newest [2] Abadi, D.J., Myers, D.S., DeWitt, D.J., and Madden, S.R.: partition. This can handle many loads per day, allowing you to Materialization strategies in a column-oriented DBMS. keep data current. We plan to support direct update in a future ICDE, 2007, 466-475. release. [3] Abadi, D.J., Madden, S.R., and Hachem, N.: Column-stores In SQL Server, the primary organization of a table can be either a vs. row-stores: how different are they really? SIGMOD, heap or a B-tree. Again, for reasons of scope and schedule, using 2008, 981-992. a column store index as the primary organization is not supported [4] Batory, D. S.: On searching transposed files. ACM Trans. in the initial release; they can only be used for secondary indexes. Database Syst. 4, 4 (1979), 531-544. We plan to lift this restriction in a future release. [5] Copeland, G.P., Khoshafian, S.N.: A Decomposition Storage 6. RELATED WORK Model. In Proc. SIGMOD, 1985, 268-279. Storing data column-wise is a rather old idea. The concept of [6] Harizopoulos, S., Liang, V., Abadi, D.J., and Madden, S.: decomposing records into smaller subrecords and storing them in Performance tradeoffs in read-optimized databases. VLDB, separate files goes back to the early 1970s. Hoffer and Severance 2006, 487-498. [7] published a paper as early as 1975 on the optimal decomposition into subrecords. Batory [4] investigated how to [7] Jeffrey A. Hoffer, Dennis G. Severance: The Use of Cluster compute queries agains such files in a paper from 1979. In a paper Analysis in Physical Data Base Design. VLDB 1975: 69-86 from 1985, Copeland and Khoshafian [5] studied fully [8] S. Padmanabhan, T. Malkemus, R. Agarwal, and A. decomposed storage where each column is stored in a separate Jhingran. Block oriented processing of relational database file, that is, a column store. operations in modern computer architectures. ICDE, 2001, 567-574. Research on column stores then lay largely dormant for almost twenty years. The 2005 paper on C-Store by Stonebraker et al [9] [9] M. Stonebraker et al. C-Store: A Column-oriented DBMS. and subsequent papers [1][2][3] [6] revived interest in column VLDB, 2005, 553-564. stores. [10] TPC Benchmark DS (Decision Support), Draft Specification, A considerable number of prototype and commercial systems Version 32, available at http://tpc.org/tpcds. relying on column-wise storage have been developed. References [11] Aster Data, http://www.asterdata.com [4][5] include references to several prototype systems built during [12] ExaSolution, http://www.exasol.com the 1970s. [13] Greenplum Database, http://www.greenplum.com Several commercial systems are available today. They are targeted for data warehousing and most are pure column stores, [14] InfoBright, http://www.infobright.com that is, data is stored only in column format. The earliest such [15] Ingres VectorWise, systems are Sybase IQ [19] and MonetDB [16], which have been http://www.ingres.com/products/vectorwise available for over a decade. Newer players include Vertica [20], [16] MonetDB, http://monetdb.cwi.nl Exasol [12], Paraccel [17], InfoBright [14] and SAND [18]. [17] ParAccel Analytic Database, http://paraccel.com SQL Server is the first general-purpose database system to fully integrate column-wise storage and processing into the system. [18] SAND CDBMS, http://www.sand.com Ingres VectorWise [15] is a pure column store and engine [19] Sybase IQ Columnar database, embedded within Ingres but it does not appear to interoperate with http://www.sybase.com/products/datawarehousing/sybaseiq the row-oriented Ingres engine, that is, a query cannot access data [20] Vertica, http://www.vertica.com both in the VectorWise column store and the standard Ingres row store. Greenplum [13] and Aster Data [11] offer systems targeted for data warehousing that began as row stores but have now added column store capabilities. However, we have found no information on how deeply column-wise processing has been integrated into their engines. 1184