Sort vs. Hash Revisited
1.934 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO. 6, DECEMBER 1994 Sort vs. Hash Revisited Goetz Graefe. Ann Linville, and Leonard D. Shapiro Abstract- Efficient algorithms for processing large volumes in the form of hash indices [ a ] . Only in the last decade of data are very important both for relational and new object- have hash-based query-processing algorithms gained interest, oriented database systems. Many query-processing operations acceptance, and popularity, in particular for relational database can be implemented using sort- or hash-basedalgorithms, e.g., in- tersection, join, and duplicate elimination. In the early relational machines such as Grace ,  and Gamma [lo], , but database systems, only sort-based algorithms were employed. In also for sequential query execution engines , . Rea- the last decade, hash-based algorithms have gained acceptance sons why hash-based algorithms were not considered earlier and popularity, and are often considered generally superior include that large main memories are required for optimal to sort-based algorithms such as merge-join. In this article, performance, and that techniques for avoiding or resolving we compare the concepts behind sort- and hash-based query- processing algorithms and conclude that 1) many dualities exist hash table overflow were needed, i.e., algorithms to handle between the two types of algorithms, 2) their costs differ mostly the case where none of the sets to be processed fits in main by percentages rather than factors, 3) several special cases exist memory. that favor one or the other choice, and 4) there is a strong reason Hash-based algorithms are now widely viewed as signif- why both hash- and sort-based algorithms should be available icantly faster than their sort-based equivalents, and major in a query-processingsystem. Our conclusions are supported by experiments performed using the Volcano query execution engine. database system vendors are incorporating hash join and ag- gregation into their products, e.g., Tandem . Furthermore, Index Terms-Database query processing, value-matching,per- hash-based algorithms are frequently associated with parallel formance, sorting, merge-join, hashing, hash join, hybrid hash join, comparison, duality query processing and linear speedup, even though hash-based partitioning of data to several processors can also be combined with sort-based algorithms, as the Teradata machine proves I. INTRODUCTION . In fact, the choices of partitioning and local processing ITH the emergence of relational query languages and methods are independent or orthogonal from one another. algebra, database systems required algorithms to oper- In this article, we compare sort- and hash-based algorithms, ate on large sets, e.g., for join, intersection, union, aggregation, and argue, contrary to current “wisdom,” as follows. and duplicate elimination. For today’s emerging database 1) Many dualities exist between the two types of algo- systems and their projected applications, algorithms for ma- rithms. nipulating large data volumes remain very important because 2) Their costs differ mostly by percentages rather than they are the key to providing acceptable performance not only factors. for traditional value matching such as relational joins but also 3) Many special cases exist that favor one or the other for manipulation of large set-valued attributes, maintenaince choice. of some access paths such as access support relations , 4) There is a strong reason why both hash- and sort-based and data reduction in statistics and decision support. algorithms should be available in a query-processing In early relational research and implementation efforts, e.g., system. Ingres , , [ a ] , System R [23, , PRTV , and The remainder of this article is organized as follows. We ABE , only sort-based methods were employed, and sort discuss sort- and hash-based algorithms as used in real systems costs were one (or even the) major component of query- or proposed in the literature in Section 11. In Section 111, processing costs. Consequently, ordering of stored relations we consider dualities and differences between sort- and hash- and intermediate query-processing results were an important based query-processing algorithms. An experimental compar- consideration in query optimization and led to the concept of ative study of sort- and hash-based algorithms follows in interesting orderings in System R . Section IV, using relational join as a representative for binary Although set processing was based on sorting, even early set matching algorithms. Section V contains a summary and systems employed hash-based algorithms and data structures our conclusions. Manuscript received July 1991; revised December 1991. This work was supported in part by the National Science Foundation under Grants IRI- 8996270, IRI-8912618, and IRI-9006348, and in part by the Oregon Advanced 11. RELATEDWORK Computing Institute (OACIS), ADP, Intel Supercomputer Systems Division, After the investigations of Blasgen and Eswaran , , and Sequent Computer Systems. G. Graefe is with Microsoft Corp., Redmond, WA USA. merge-join was universally regarded as the most efficient join L. D. Shapiro is with the Department of Computer Science, Portland State method for large input files. After sorting both join inputs on University, Portland, OR 97207475 1 USA. the join attribute, tuples with matching join attribute values can A. Linville is with the Department of Computer Science, University of Colorado at Boulder, CO 8 0 3 0 9 4 3 0 USA. be found efficiently and without much memory, independently IEEE Log Number 9213324. of the file sizes. 1041-4347/94$04.00 0 1994 IEEE ~~ __
2.GRAEFE et al.: SORT VS. HASH REVISITED 935 or for partitioning data evenly across a set of machines to achieve good load balancing. In this article, we do not concem ourselves much with parallelism, because we believe that the issues of data manipulation and parallelism can be made orthogonal , , and that our conclusions are directly applicable to algorithms used in parallel environments. For duplicate elimination and aggregate functions, e.g., a sum of salaries by department, Epstein’s work has led to Fig. 1. Naive merging. the use of sorting for aggregation, too . Aggregation and grouping are frequently assumed to require sorting. It is interesting to note that sorting for aggregation permits a clever optimization . Instead of sorting the input file completely and then combining (adjacent) duplicates, aggregation can be done early, namely, whenever two records with matching grouping attributes are found while writing a run file. Consider an aggregation with 100 000 input records being aggregated into 1000 groups using a sort with a maximal fan-in of Fig. 2. Optimized merging. 10. If aggregation is done separately from sorting; i.e., after sorting is complete, the largest run file may contain 10000 Significant effort has been spent on devising and improving records. If aggregation is done early, the largest run file sort algorithms for database systems; recent work includes will contain at most 1000 records. If the reduction factor [ 11, . The main memory algorithms employed in all these (output over input size) is larger than the maximal fan-in, studies are either quicksort or replacement selection. The significant improvements can be realized. In the extreme case, variations and new ideas mainly concern optimizing the I- if replacement selection is used for creating initial runs and O cost of writing and merging temporary files or runs by the output (not the input) fits into memory, the entire sort may considering larger units of 1-0 than pages at the expense be accomplished without any run files on disk. of smaller merge fan-in, i.e., the number of runs merged Starting in about 1983, query-processing algorithms based simultaneously. Larger units of 1-0 allow for faster 1-0 on hashing experienced a sudden surge of interest , [lo], because the number of seek operations and rotational latencies , predominantly for relational join. Because they were is reduced. However, since one input buffer is required for each used in a number of relational database machines, hash-based input run during merging, the fan-in is decreased with larger join algorithms were frequently identified with parallel query units of 1-0. Considering that the number of merge levels, i.e., execution [ 1 11, [ 121, [ 181, even though they make equal sense the number of times each record is merged from one run into in sequential environments. In its simplest form, called classic another, is the logarithm of the number of initial runs using the hashjoin in , a join algorithm consists of two phases. First, fan-in as base, the number of merge levels may increase with an in-memory hash table is built using one input (typically reduced fan-in. The most interesting recent insight was that it the smaller input) hashed on the join attribute. Second, tuples may be beneficial to use larger units of 1 - 0 even if the fan- from the second input are hashed on their join attribute, and in is decreased and the number of merge levels is increased the hash table is probed for matches. [211, P91. The various forms of hash join differ mainly in their Another important optimization for sorting concems the strategies for dealing with hash table overflow, i.e., the case merge strategy. Let us explain it with an example shown in that the smaller input (and therefore the hash table) is larger Fig. 1. Consider a sort with a maximal fan-in of 10 and an than main memory. All overflow strategies use overflow files, input file that requires 12 initial runs. Instead of merging only either one per input or many partition files for each input runs of the same level, it is better to delay merging until the [ I l l . Overflow avoidance as used in the Grace database end of the input has been reached, and then merge first three of machine [ 181 builds the overflow files before any overflow the 12 runs, and finally to merge the output with the remaining can occur. Bucket tuning and dynamic destaging can be used nine runs, as shown in Fig. 2. The 1 - 0 cost (measured by the to optimize the performance of overflow avoidance , . number of memory loads that must be written to disk for all Overflow resolution creates overflow files after it has occurred. + + of the runs created) for the first strategy is 12 10 2 = 24, A clever combination of in-memory hash table and overflow + wheeas for the second strategy it is 12 3 = 15, meaning resolution called hybrid hash join [lo],  optimizes the I- that the first strategy requires 60% more 1-0 than the second O for overflow files by retaining as much as possible of the one. The general rule is to merge just the right number of runs first input relation in memory; i.e., one of the partition files after the end of the input file has been reached, and to always is kept in memory and probed immediately as the other input merge the smallest runs available for merging. More detailed is partitioned. If the partition or overflow files are still larger examples are given in . than memory, they can be partitioned further using a recursive Recently, parallel sorting has found increased interest, e.g., algorithm until classic or hybrid hash join can be applied. in 131, 151, 1153, , 1231, 1251, 1331, 1341, 1401. Most in- Fig. 3 shows how two inputs, say, R and S , are partitioned vestigations concem either clever designs for parallel merging recursively in hash join algorithms. In practice, the files on
3.936 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO. 6, DECEMBER 1994 TABLE I DUALITY OF SORT-AND HASH-BASED QUERY-PROCESSING ALGORITHMS Aspect sorting Hashing In-memory algorithm Quicksort Classic Hash Divide-and-conquerpara- Physical division, logical Logical division, physical digm combination combination Large inputs Single-level merge Partitioning into overflow files U 0 Pattern Seuuential write. random read Random write, sequential read Fai-in Fan-out W Optimization Read-ahead, forecasting Write-behind Very large inputs Multi-level merge Recursive overtlow resolution Number of merge levels Recursion depth Non-optimal final fan-in Non-optimal hash table size Optimizations Merge optimizations Bucket tuning Better use of memory Reverse runs & LRU Hybrid hash Replacement selection ? ? Single input in memory Aggregation Aggregation in replacement Aggregation in hash table selection Interesting orderings Merge-Joinwithout sorting N-way joins, hash-merging 1 Fig. 3. Recursive hash join. - Fig. 4. Duality of partitioning and merging. If a data set fits into memory, quicksort can be employed for sorting, and classic (in-memory) hash can be used as a hashing technique. Both quicksort and classic hash are also used in memory to operate on subsets after “cutting” an entire each level will not be of exactly equal size, depending on the large data set into pieces. The cutting process is part of the data values and the hash function. If, in the deepest partitioning divide-and-conquer paradigm employed for both sorting and level, some of the R-outputs fit into the available memory, hashing. This is an important similarity of sorting and hashing bucket tuning will choose which ones to keep in memory for and has been observed before, e.g., by Bratbergsengen  and immediate join processing while partitioning the S input. On Salzberg . There exists, however, an important difference. the other hand, hybrid hash join will retain some of the R-files In the sort-based algorithms, a large data set is divided into in memory without regard to their final size. subsets using a physical rule, namely, into chunks as large Hashing can also be used for aggregation and duplicate as memory. These chunks are later combined using a logical elimination by finding duplicates while building the hash step, merging. In the hash-based algorithms, the large data set table. Overflow occurs only if the output does not fit into is cut into subsets using a logical rule, by hash values. The main memory, independently of the size of the input. Once resulting partitions are later combined using a physical step, overflow does occur, however, input records have to be written simply concatenating the subsets or result subsets. In other to overflow files, including records with duplicate keys that words, a single-level merge in a sort algorithm is a dual to eventually will have to be combined. partitioning in hash algorithms. Fig. 4 illustrates this duality and the opposite directions. 111. DUALITY OF SORTING AND HASHING This duality can also be observed in the behavior of a disk In this section, we outline the similarities and duality of sort- arm performing the 1-0 operations for merging or partitioning. and hash-based algorithms, but also point out where the two While writing initial runs after sorting them with quicksort, the types of algorithms differ. We try to discuss the approaches 1-0 is sequential. During merging, read operations access the in general terms, ignoring whether the algorithms are used many files being merged, and require random 1-0 capabilities. for relational join, union, intersection, aggregation, duplicate During partitioning, the 1-0 operations are random, but when elimination, or other operations. When appropriate, however, reading a partition later, they are sequential. we indicate specific operations. For both approaches, sorting and hashing, the amount of Table I gives an overview of the features that correspond to available memory limits not only the amount of data in a one another. Both approaches permit in-memory versions for basic unit processed using quicksort or classic hash but also small data sets and disk-based versions for larger data sets. the number of basic units that can be accessed simultaneously.
4.GRAEFE et al.: SORT VS. HASH REVISITED 937 For sorting, it is well known that merging is limited to the of saving 1-0 operations. This can be done particularly easily quotient of memory size and buffer space required for each if the initial runs are written in reverse (descending) order run, called the merge fun-in. Similarly, partitioning is limited and scanned backward for merging. However, if one does not to the same fraction, called the fun-out, because the limitation believe in buffer hints or prefers to absolutely ensure desired is encountered while writing partition files. 1-0 savings, using a final memory-resident run explicitly in In order to keep the merge process active at all times, the sort algorithm and merging it with the disk-resident runs many merge implementations use read-ahead controlled by can guqantee this effect. forecasting , trading reduced 1 - 0 delays for a reduced A well-known technique to improve sort performance is to fan-in. The dual to read-ahead during merging is write-behind generate runs twice as large as main memory using a priority during partitioning, i.e., keeping a free output buffer that can heap for replacement selection . If the runs’ sizes are be allocated to an output file while the previous page for that doubled, their number is cut in half. Therefore, merging can be file is being written to a disk. reduced to some amount, namely, by 10gF(2) = l/log,(F) Considering the limitation on fan-in and fan-out, additional merge levels where F is the fan-in of the merge, i.e., the techniques must be used for very large data sets. Merging number of run files that can be combined in a single step. can be performed in multiple levels, each combining multiple Note that if the fan-in F is large, the effect of replacement runs into larger ones. Similarly, partitioning can be repeated selection and larger runs on the merge depth and the total recursively, i.e., partition files are repartitioned, the result merge effort is negligible. However, if two sort operations repartitioned, and so forth, until the partition files fit into main feed into a merge-join and both final merges are interleaved memory. During merging, the runs grow in each level by a with the join, each merge can employ only half the memory, factor equal to the fan-in. For each recursion step, the partition and cutting the number of runs in half (on each merge level, files decrease in size by a factor equal to the fan-out. Thus, including the last one) allows performing the two final merges the number of levels during merging is equal to the recursion in parallel. depth during partitioning. There are two exceptions to be made The effect of cutting the number of runs in half offsets a regarding hash value distribution and relative sizes of inputs disadvantage of sorting in comparison to hashing when used to in binary operations such as join. We ignore those for now join (intersect, union) two data sets. In hash-based algorithms, and come back to them later. only one of the two inputs resides in or consumes memory If merging is done in the most nalve way, i.e., merging all beyond a single input buffer, not both, as in two final merges runs of a level as soon as their number reaches the fan-in, the concurrent with a merge-join. last merge on each level might not be optima; i.e., it might Heap-based run generation has a second advantage over not use the maximal possible fan-in. During hashing, if the quicksort; this advantage has a direct dual in hashing. If a hash highest possible fan-out is used in each partitioning step, the table is used to compute an aggregate function using grouping, partition file in the deepest recursion level might be smaller e.g., sum of salaries by department, hash table overflow occurs than memory, and less than the entire memory is used when only if the operation’s output does not fit in memory. Consider, processing files on that level. Thus, in both approaches, the for example, the sum of salaries by department for 100000 memory resources are not used optimally in the most na’ive employees in 1000 departments. If the lo00 result records fit version of the algorithms. in memory, classic hashing (without overflow) is sufficient. On In order to make best use of the final merge (which, the other hand, if sorting based on quicksort is used to compute by definition, includes all output items), this merge should this aggregate function, the input must fit into memory to proceed with the maximal possible fan-in. This can be ensured avoid temporary files.’ If replacement selection is used for by merging fewer runs than the maximal fan-in after the end run generation, however, the same behavior as that achieved of the input file has been reached (as illustrated in the previous with classic hash is easy to achieve. section). There is no direct dual in hash-based algorithms for The final entry in Table I concerns interesting orderings this optimization. With respect to memory utilization, the fact used in the System R query optimizer , and presumably that a partition file, and therefore a hash table, might actually other query optimizers as well. A strong argument in favor of be smaller than memory is the closest to a dual. Using memory sorting and merge-join is the fact that merge-join delivers its more effectively and using less than the maximal fan-out in output in sorted order; thus, multiple merge-joins on the same hashing has been addressed in research on bucket tuning . attribute can be performed without sort operators between The development of hybrid hash algorithms [IO],  was merge-join operators. For joining three relations, as shown in a logical consequence of the advent of large main memories Fig. 5, pipelining data from one merge-join to the next without that had led to the consideration of hash-based join algorithms sorting translates into a 3 : 4 advantage in the number of sorts in the first place. If the data set is only slightly larger than the compared to two joins on different join keys. For joining N available memory, e.g., 10% larger or twice as large, much relations on the same key, only N sorts are required, instead of the input can remain in memory and is never written to a of 2N2 for joins on different attributes. disk-resident partition file. To obtain the same effect for sort- Hash-based algorithms tend to produce their outputs in based algorithms, if the database system’s buffer manager is a very unpredictable order. To take advantage of multiple sufficiently smart or receives and accepts appropriate hints, it is possible to retain some or all of the pages of the last ’A scheme using quicksort and avoiding temporary 1-0 in this case could be devised, but would be cumbersome. We do not know of any report or run written in memory, and thus to achieve the same effect system with such a scheme.
5.938 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO. 6, DECEMBER 1994 MergeJoin b=b Volcano project is to provide efficient, extensible tools for / \ Merge-Join a=a query and request processing in novel application domains, Son on b Son on b / \ particularly in object-oriented and scientific database systems, O1 I I Merge-Join a=a Son on a and for experimental database performance research. Volcano Merge-Joina=a Input I3 / \ I includes its own file system, which is similar to WiSS . / \ Sort on a Son on Input I3 Smona I Smona I I Input I1 I Input I2 Much of Volcano’s file system is rather conventional. It provides data files, B+-tree indices, and bidirectional scans Input I1 Input I2 with optional predicates. The unit of 1-0 and buffering, called Fig. 5 . The effect of interesting orderings. a cluster in Volcano, is set for each file individually when it is created. Files with different cluster sizes can reside on joins on the same attribute, the equality has to be considered the same device and can be buffered in the same buffer pool. in the logical step of hashing, i.e., during partitioning on Volcano uses its own buffer manager and bypasses operating the input side. In other words, such join queries could be system buffering by using raw devices. executed effectively by a hash join algorithm that has N Queries are expressed as complex algebraic expressions; the inputs, partitions them all concurrently, and then performs N - operators of this algebra are query-processing algorithms. All way joins on each N-tuple of partition files (not pairs as in algebra operators are implemented as iterators; i.e., they sup- binary hash join with one build and one probe file for each port a simple open-next-close protocol similar to conventional partition). However, since such an algorithm is cumbersome file scans. Associated with each operator is a state record. The to implement, in particular if some of the “join” operations arguments for the algorithms, e.g., hash table size or a hash can actually be semijoin, outer join, set intersection, union, function, are part of the state record. or difference, it might well be that this distinction, joins on Since almost all queries require more than one operator, the same or on different attributes, determines the right choice state records can be linked by means of input pointers. All between sort- and hash-based algorithms for complex queries. state information for an iterator is kept in its state record; Another use of interesting orderings is the interaction of thus, an algorithm may be used multiple times in a query (sorted, B-tree) index scans and merge-join. Although it has by including more than one state record in the query. The not been reported explicitly in the literature, it is perfectly input pointers are also kept in the state records. They are possible to implement a join algorithm that uses two hash in- pointers to a quadruple of pointers to the entry points of the dices like merge-join uses two B-trees, provided that the same three procedures implementing the operator (open, next, and hash function was used to create the indices. For example, it close) and a state record. An operator does not need to know is easy to imagine “merging” the leaves (data pages) of two what kind of operator produces its input, and whether its input extendable hash indices [ 171, even if the key cardinalities and comes from a complex query tree or from a simple file scan. distributions are very different. We call this concept anonymous inputs or streams. Streams are In summary, there exist many dualities between sorting a simple but powerful abstraction that allows combining any using multilevel merging and recursive hash table overflow number of operators to evaluate a complex query. Together management. Since there are so many similarities, it is inter- with the iterator control paradigm, streams represent the most esting to compare their costs in detail. This is done in the next efficient execution model in terms of time and space for single section. process query evaluation. Calling open for the topmost operator results in instanti- COMPARISON OF SORTING Iv. EXPERIMENTAL ations for the associated state record, e.g., allocation of a AND HASHING hash table, and in open calls for all inputs. In this way, all iterators in a query are initiated recursively. In order to process In this section, we report on a number of experiments to the query, next for the top-most operator is called repeatedly demonstrate that the duality of sorting and hashing leads to until it fails with an end-of-stream indicator. Finally, the similar performance in many cases, to illustrate transfer of close call recursively “shuts down” all iterators in the query. optimization ideas from one type of algorithm to the other, and This model of query execution matches very closely the to identify the main decision criteria for the choice between model used in relational products, e.g., DB2, Ingres, Informix, sort-based and hash-based query-processing algorithms. We and Oracle, but also the iterator concept in the E database have chosen relational join as a representative of binary set language  and the algebraic query evaluation system of matching algorithms because it is a very frequently used the Starburst extensible-relationaldatabase system . Table database operation, and because many fundamental operations I1 gives algorithm outlines for some operators’ open, next, useful in all database systems, e.g., intersection, union, and and close procedures. difference, can all be realized with sort- and hash-based join Fig. 6 shows a simple query plan that might illustrate the algorithms. We first describe the experimental environment interaction of operators and their procedures. Calling open on and then report on a series of experiments. the print operator results in an open call on the hash join operator. To load the hash table, hash join opens the left file A . Experimental Environment scan, requests all records from the file scan by calling its next The test bed for our experiments was the Volcano extensible function, and closes it. After calling open on the right file and parallel query-processing engine . The purpose of the scan, the hash join operator is ready to produce data. Its open
6.GRAEFE et al.: SORT VS. HASH REVISITED 939 TABLE I1 EXAMPLES OF ITERATOR FUNCTIONS Iterator Open Next Close Print open input call next on input; for- close input mat the item on screen SCan open file read next item close file Select open input call next on input until close input an item qualifies Hash join allocate hash directory; call next on probe in- close probe input; (without open left "build input; put until a match is deallocate hash direc- overflow build hash table calling next found tory resolution) on build input; close build input; open right "probe"in- put Merge-Join open both inputs get next item from in- close both inputs put with smaller key until a match is found sort open input; build all initial determine next output destroy remaining run run files calling next on in- item; read new item files put and quicksortor re- from the c o m t run placement selection; close file input; merge run files until only one merge step is left; open the remaining run files .. . Rint Grace hash join [ 181 and overflow resolution as the original I Hash Join hybrid hash join [lo], . We are currently studying how to incorporate bucket tuning and management of skew into the recursive overflow resolution algorithm. / \ File Scan File Scan For creating initial runs in Volcano's sort operator, we decided to use quicksort, not replacement selection, even Fig. 6. A simple query plan. though replacement selection can create runs larger than memory. The basic idea of replacement selection is that after a record has been written to a run file, it is immediately replaced procedure terminates, and print's open procedure retums to in memory by another record from the input. Because the the driver module. new input record can frequently be included in the current Now the entire query evaluation plan is ready to produce output run, runs tend to be about twice as large as memory. data. Calling next on the print operator results in a nexr call In a page-based environment, however, the advantage of of the hash join operator. To produce an output item, the hash larger initial runs is not without cost. Either the heap size is join operator calls nexr on the right input until a match is found reduced by about one-half to account for record placement that can be retumed to the print operator. After formatting the and selective retention in input pages (which would offset record on the screen, the print operator's nexr function retums. the expected increase in run length), or a record holding area The query execution driver calls the topmost operator's next and another copying step are introduced. We considered this function repeatedly until it receives an error status. When, in prohibitively expensive,* unless the previous query operator a subsequent nexr call, the right file scan retums an end-of- must perform a copy step anyway that can be moved into stream status, the hash join and then the print operators retum the sort operator, and we abandoned the idea of using heaps this status. Query execution completes with a close call to the for creating initial runs. Furthermore, replacement selection print operator, which results in close calls for the hash join with copying into a holding area does not work easily for and the right file scan operators. variable-length records. Volcano's one-to-one match operator implements all func- Volcano is operational on a variety of UNIX machines, tions in which a record is included in the output, depending on including several parallel systems [ 191, . The experiments the match with another record, namely, join, semijoin, outer were run on a Sun SparcStation running SunOS with two CDC join, antijoin, union, intersection, difference, antidifference, Wren VI disk drives. One disk was used for normal UNIX aggregation, and duplicate elimination . Volcano includes file systems for system files, source code, executables, and so both a sort- and a hash-based implementation of the one-to- forth, and the other was accessed directly by Volcano as a raw one match operator. The sort-based version combines a sort device. operator that includes aggregation and duplicate elimination  with a generalized merge-join operator. The hash-based *Note that many recent computer systems have been designed and op- version is a recursive implementation of hybrid hash join timized for a high MIPS number, sometimes without similar performance advances in mundane tasks such as copying . In a shared-memory parallel hash augmented with aggregation during the build phase and machine in which bus bandwidth may be scarce, avoiding copying is even parameterized to allow both overflow avoidance similar to more important.
7.940 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO. 6, DECEMBER 1994 5000 Thus, the total 1-0 for sorting with quicksort is proportional to 137 memory loads for each input. For replacement selection, 2000 there would have been about 51 runs, each about twice as Elapsed Time 500 large as memory, for which one final merge would suffice. [seconds1 2oo Thus, the total 1-0 for sorting with replacement selection is proportional to 100 memory loads for each input. For hybrid 100 50 hash, the entire inputs have to be partitioned into overflow files of about 0.39 megabytes (50 megabytes/l27). Each file will 2 3 5 7.510 1520 30 50 75100 fit into memory when joining partition files. Thus, the total 1-0 will be proportional to the input sizes, or 100 memory Size of Each Input [MB], ‘h MB Memory loads for each input, exactly the same as for sorting using Fig. 7. Join performance for equal input sizes. replacement selection. We would like to discuss why we have obtained different results than Schneider and DeWitt  and Shapiro . First, B. Joins with Equal Input Sizes Schneider and DeWitt joined two relations with different sizes In order to demonstrate the relative performance of sorting (about 2 megabytes and 20 megabytes). Later we come back to and merge vs. hybrid hash join, we repeatedly joined two join inputs of different size. A second reason is that we used a relations similar to the Wisconsin benchmark [ 141. The two more sophisticated sort operator than was implemented in the relations had the same cardinality, and each tuple was 208 Gamma database machine at the time. Gamma’s sort operator bytes long. The join attribute was a 4-byte integer; each value was the same as WiSS’s ; i.e., it sorts from a disk-resident between 1 and the relation cardinality appeared exactly once. file into a disk-resident file. Therefore, an intermediate result Each tuple (in either relation) had exactly one match in the must be written to disk before it can be sorted, rather than other relation, and the join result cardinality was equal to each being sorted into initial runs before the first write step, and input cardinality. The join result tuples were immediately dis- the entire sorted file is written back to disk rather than being carded after the join, because we were interested in the relative pipelined into the next operation, e.g., a merge-join. Thus, the join performance, not in the performance of writing results WiSS sort algorithm can easily require three trips to disk when to disk. The memory allocated for quicksort, for merging, for actually one could have sufficed. Furthermore, neither heap- partitioning, and the hash table was l / 2 megabyte. The cluster based run creation nor merge optimizations are implemented size (unit of 1-0) was 4 kilobytes. in WiSS. Thus, the comparison in  is biased against sort- Fig. 7 shows the performance for merge and hybrid hash based algorithms. Shapiro  analyzed only the case in which join for input sizes between about 2 megabytes (10 000 tuples) hybrid hash’s advantage is most pronounced, i.e., when less and about 100 megabytes (500000 tuples). Sort and merge- than one full recursion level is required, based on the argument join performance is indicated with circles (O), hybrid hash that most memories are fairly large and multilevel recursion with squares (U). Note that both axes are logarithmic. The or merging are not common. This argument does not always performance is not exactly linear with the input sizes, because hold, however, as discussed in the next section. both algorithms, merge and hybrid hash join, require multi- ple levels of merging or overflow resolution for the larger inputs. C. Performance Optimizations The difference between merge-join and hybrid hash join is In this section, we focus on using duality to transfer tuning small, certainly far from an order of magnitude. The difference ideas from sorting to hashing, and vice versa. Originally, the in the performance of sort- and hash-based joins stems from performance of sorting and merge-join in Volcano had been the fact that sorting requires both inputs in memory, whereas clearly inferior to that of hybrid hash join, in particular for hashing “filters” the second input through the hash table, which input sizes relatively close to memory size. The big advantage contains only items from the first input. As expected from of hybrid hash over na’ive overflow avoidance (write all the discussion in the section on duality, this disadvantage partitions to disk, do not retain some data in memory) is that as of sorting could be offset by using replacement selection much data as possible can be kept in memory; i.e., it is never for creating initial sorted runs. To verify this claim for the written to temporary files. This led us to search for a dual in concrete example, we calculate the relative 1-0 required for the realm of sorting. To achieve the same effect, we changed sorting using quicksort, sorting using replacement selection, Volcano’s sort operator so that it retains data in memory from and hybrid hash to join two 50-megabyte inputs. We calculate the last quicksort until the first merge. In order to achieve write costs for only one input because the 1-0 is equal for both that, it writes runs in reverse, i.e., in descending order for an inputs, and all files written will be read exacrly once. Using ascending sort, and for the clusters written after the end of the quicksort, 50 megabytes of data divided by 1/2 megabyte input has been found, it gives a hint to the buffer manager to of memory results in 100 runs. Because each sort can use ensure that those clusters are replaced in a LRU discipline. As a final merge fan-in 64(1/2 megabytes/4 megabytes/2), 100 many clusters as possible will remain in the 1-0 buffer until runs must be reduced to 64 by using a fan-in of 127 ( l / 2 the first merge, which is ascending and uses a backward scan + megabytes/4 kilobytes - 1 requiring 37 (100 - 64 1) original on the run files. Therefore, these clusters are never written to runs to be merged into one larger run. disk, and a similar effect to hybrid hash join could be achieved.
8.GRAEFJZ et al.: SORT VS. HASH REVISITED 94 1 / = loo0 OMergeJoin 500 0Hybrid Hash Join 700 Elapsed Time [seconds] 1 2 3 4 5 6 7 8 9 10111213141516 2 3 5 7.510 1520 30 50 75100 Cluster Size [x 4 KB], 20 MB Inputs Sue of Large Input [MB]. Small Input 2 MB, M MB Memory Fig. 8. Join performance by cluster size. Fig. 9. Join performance for different input sizes. This optimization has been analyzed in some studies, e.g., match in the larger relation. For sorting input of a merge- , but was not considered a dual of hybrid hash. Without join, each input determines the number of merge levels. The the focus on duality, we probably would have overlooked it. large input is merged over more levels than the small input. This optimization makes the most difference for inputs only The only possible optimization we found is the division of slightly larger than main memory, precisely the same case memory between the two final merges (of the two inputs), when hybrid hash join shows the largest difference to nalve which are overlapped with the actual merge-join. To determine overflow avoidance. the optimal memory division between two final merges, we In a recent study of sequential and parallel sorting, we approximated the sum of two sort costs with a continuous found that the unit of 1-0 can have a significant impact on function and found that the memory allocated to each final sort performance  beyond the effect of read-ahead and merge should be proportional to the size of the inputs. In the double buffering . In Volcano, the cluster size is defined following experiments, we divided memory proportionally to for each file individually. Small clusters allow high fan-ins and the input sizes. For equal input sizes, the two final merge fan- therefore few merge levels. Large clusters restrict the fan-in ins were equal; for extremely different sizes, the smaller input and may force more merge levels, but they allow more efficient is merged into one run, so that the final merge is actually just 1-0, because more data is moved with each 1-0, and each a file scan. merge level can be completed with fewer seeks. For sorting, For hashing, the build input determines the recursion depth, we found that the optimal performance is typically obtained because partitioning can be terminated as soon as the build with moderate merge fan-ins and relatively large clusters. If partition fits into memory. The recursion depth does not merging and partitioning are indeed duals, the same effect of depend at all on the size of the probe input. This is the cluster size on hybrid hash performance can be expected. reason why the smaller of two relations should be chosen to Fig. 8 shows the performance of joins of two 20-megabyte be the build input into a binary hash operation. Reversing the inputs for various cluster sizes. As can be seen, hash perfor- roles of build and probe inputs dynamically, e.g., after a first mance is as sensitive to cluster size as sorting. A similar effect partitioning step, is possible, but is not considered further in was considered in the Gamma database machine [ 131, but only this article. for cluster sizes that did not change the recursion depth in Fig. 9 shows the performance of merge- and hybrid hash hash table overflow resolution. Both algorithms perform best join for input of equal to very different size. The smaller with large cluster sizes and moderate fan-in or fan-out, even (build) input size is fixed at 2 megabytes, and the larger (probe) if multiple merge or recursion levels are required. Around the input size varies between 2 megabytes and 100 megabytes. As optimal cluster size, the effect of small changes in the cluster can be seen, the performance advantage of hybrid hash join size is fairly small, making a roughly optimal choice sufficient. increases with the size difference between the input relations. In an earlier study, we found that the optimal cluster size The reason is that for hybrid hash join, 1/4 of the build for sorting (when one ignores the effects of rounding in the relation fits into memory and 3/4 of both relations is written precise cost function) depends only on the memory size, not on to overflow files independently of the probe input size. For the input sizes . The same is true for hashing. Exploiting merge-join, sorting the larger input dominates the total cost and a proven sort optimization for hash-based algorithms is the makes merge-join the inferior join method for unsorted inputs second optimization we transferred, based on our duality of very different size. Similarly, algorithms for semijoin, outer considerations. In the following experiments, we used clusters join, intersection, union, and difference derived from merge- of 32 kilobytes and fan-ins and fan-outs of 15. and hybrid hash join will perform very differently for inputs of different sizes. On the other hand, if the query optimizer cannot reliably predict which input is smaller, merge-join may D. Joins with Different Input Sizes be the superior choice. As suggested by Bratbergsengen , we decided to include joins of relations with different sizes in the comparison of E . Joins with Skewed Data and Hash Value Distributions sorting and hashing. We adjusted the data generation function Finally, we experimented with some skewed join value so that each tuple in the smaller relation has exactly one distributions. Instead of using a uniform random number
9.942 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO. 6, DECEMBER 1994 0.05 O . 4 0.01 0.005 0.002 . . . . . . 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 25 50 75 100 Random Number Domain, 1-100 Data Skew [z]. 2 MB Inputs, H MB Memory Fig. 10. Probability distributions for selected values of 2. Fig. 1 1 . Join performance for skewed data. generator to create test data, we used a generalized random is equally efficient for uniform and skewed data. Hashing, function borrowed from Knuth . Using a continuous however, divides the inputs logically, by hash value. Thus, it parameter, z probabilities are assigned to the numbers 1 to is susceptible to skewed hash value distributions. Obviously, N as Pi = l / i " / c for i = 1,.. . , N , where c = N cj=l l/j" skewed hash value distributions are undesirable and are against is a normalization factor used to ensure that the sum of all the idea of hashing, i.e., randomizing, the data. To counteract probabilities is 1. For z = 0, this random function creates and possibly even exploit hash value skew, we are working uniform data; for z = 1, the function can be used to create on using hash value skew to assign hash values to the in- random data according to Zipf's law .The reason why memory hash table and to partition files, applying overflow Zipfian distributions are relevant for our purpose is that they avoidance, hybrid hash, or nested loops join for partition files were defined to model real data and their frequencies. as appropriate. Fig. 10 shows the probability of values N = l , . . . ,100 with z = i / 5 for i = 0 , . . . ,5. Since the domain of N is discrete, it is not entirely right to draw the probability functions V. CONCLUSION with continuous lines; however, we have taken the liberty In this paper, we have outlined many dualities between to indicate which data points belong to the same values of sort- and hash-based query processing algorithms, e.g., for z. Note that the y-axis is logarithmic. z = 0 is shown by intersection, join, and duplicate elimination. Under many cir- the horizontal line, a uniform distribution. With increasing z, cumstances, the cost. differs by percentages rather than by the distribution becomes increasingly skewed. For z = 1, the factors, presuming that the algorithms have been implemented probability values at N = 100 is two orders of magnitude and tuned with similar care. We expected this result from the smaller than for N = 1 following Zipf's law. Probabilities large number of dualities and verified it with the Volcano with more skew can be obtained with higher values of z. query-processing system. We used the same skewed data distribution in both inputs. Two special cases exist that favor one or the other, however. Compared to uniform distributions of join keys in both inputs, First, if the inputs of a binary operator are of very different size this increases the number of matches between the inputs, (and the query optimizer can reliably predict this difference), resulting in significantly more data copying to create new hash-based algorithms will outperform sort-based algorithms, records and in more backing-up in the inner input of merge- because only the smaller of the two inputs determines how join. many recursion levels are required or what fraction of the Fig. 11 shows the effect of skew on the performance of input files must be written to temporary disk files during merge- and hybrid hash join, including the relative perfor- partitioning whereas each file determines its own disk 1-0 in mance of merge- and hybrid hash join under skew. It is evident sorting. In other words, sorting the larger of two join inputs that merge-join is less affected by the skew. For uniform is more expensive than writing a small fraction of that file data, hybrid hash join outperforms merge-join, as shown in to hash overflow files. Second, if the hash value distribution the previous figures. For highly skewed data, sorting and is not uniform, hash partitioning performs very poorly and merge-join outperforms hybrid hash join. The reason is that creates significantly higher costs than sort-based methods do. the partitioning is not even; for z = 1, a large fraction of the If the quality of the hash function cannot be predicted or build and probe inputs (3/4 of their data items) is written to improved (tuned) dynamically, sort-based query processing al- one pair of overflow files. Therefore, instead of performing gorithms are superior, because they are less vulnerable to data the join with a single level of overflow resolution, multiple distributions. Since both cases, join of differently sized files levels are needed. and skewed hash value distributions, are realistic situations The reason for this difference between sort- and hash-based in database query processing, we recommend that both sort- algorithms is that sort-based algorithms divide the input file and hash-based algorithms be included in a query-processing into physical units; i.e., run files are built according to memory engine and be chosen by the query optimizer according to size, and an input record is written to a particular run file the two cases above. If both cases arise simultaneously, i.e., solely because of it position in the input, without regard if a join of differently sized inputs with unpredictable hash for its sort key. Thus, dividing a sort input into run files value distribution, the query optimizer must estimate which
10.GRAEFE, er al.: SORT vs. HASH REVISITED 943 one poses the greater danger to system performance and D.J. Dewitt, J. Naughton, and D. Schneider, “Parallel sorting on a predictability, and must choose accordingly. shared-nothing architecture using probabilistic splitting,” Proc. Int. Conf. Parallel Distrib. Inform. Syst., Miami Beach, EX,USA, Dec. 1991. The important conclusion from this research is that neither R. Epstein, “Techniques for processing of aggregates in relational the input size nor the memory size determines the choice database systems,” UCB/Electron. Res. Lab. Memo. M79/8, Univ. of Califomia, Feb. 1979. between sort- and hash-based query-processing algorithms. R. Fagin, J. Nievergelt, N. Pippenger, and H.R. Strong, “Extendible Instead, the choice should be govemed by the relative sizes hashing: A fast access method for dynamic files,” ACM Trans.Darubase of the two inputs into binary operators, by the danger of Syst., vol. 4, no. 3, p. 315-, Sept. 1979. S. Fushimi, M. Kitsuregawa, and H. Tanaka, “An overview of the system nonuniform hash value distributions, and by the opportunities software of a parallel relational database machine GRACE,” Proc. Int. to exploit interesting orderings. Furthermore, because nei- Conf. Very Lurge Datu Bases, Kyoto, Japan, Aug. 1986. ther algorithm type outperforms the other in all situations, G. Graefe, “Encapsulation of parallelism in the Volcano query process- ing system,” Proc. ACM SIGMOD Conf., 1990, p. 102-. and because realistic situations exist that favor one or the G. Graefe and D. L. Davison, “Encapsulation of parallelism architecture- other, both should be available in a query execution engine independence in extensible database query processing,” lEEE TRans. for a choice to be made in each case by the query opti- Software Eng., vol. 19, no. 8, pp. 747, Aug. 1993. G. Graefe, “Parallel extemal sorting in Volcano,” Tech. Rep. 459, Univ. mizer. of Colorado, Boulder, USA, Dept. Comput. Sci., 1991. -, “Volcano: An extensible and parallel dataflow query processing system,”IEEE Trans. Knowledge. Data Eng., vol. 6, no. 1, pp. 120-135, Feb. 1994. ACKNOWLEDGMENT G. Graefe and S. S. Thakkar, “Tuning a parallel database algorithm on a shared-memory multiprocessor,” SofhYare4ractice and Experience, The initial interest in comparing sort- and hash-based algo- vol. 22, no. 7, p.495, July 1992. rithms in greater detail resulted from a spirited discussion with L. M. Haas, W.F. Cody, J.C. Freytag, G. Lapis, B.G. Lindsay, G.M. Lohman, K. Ono, and H. Pirahesh, “An extensible processor for an B. Lindsay and H. Pirahesh during the VLDB Conference in extended relational query language,” Comput. Sci. Res. Rep., San Jose, 1988. D. DeWitt, D. Schneider, and the anonymous reviewers CA, USA, Apr. 1988. B. R. Iyer and D. M. Dias, “System issues in parallel sorting for database made insightful comments on earlier drafts. systems,” Proc. IEEE Conf. Data Eng. 1990, p. 246. T. Keller and G. Graefe, “The one-to-one match operator of the Volcano query processing system,” Oregon Graduate Center, Comput. Sci. Tech. Rep., Beaverton, OR, USA, June 1989. REFERENCES A. Kemper and G. Moerkotte, “Access support in object bases,” Proc. ACM SIGMOD Conf., 1990, p. 364. A. Aggarval and J. S. Vitter, “The input/output complexity of sorting M. Kitsuregawa, H. Tanaka, and T. Motooka, “Application of hash to and related problems,” Commun. ACM vol. 31, p. 1116, Oct. 1988. data base machine and its architecture,” New Generation Computing, M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. vol. 1, 1983. Gray, P. P. Griffiths, W. F. King, R. A, Lone, P. R. McJones, J. W. Mehl, A. M. Kitsuregawa, M. Nakayama, and M. Takagi, “The effect of bucket G. R. Putzolu, I. L. Traiger, B. W. Wade, and’V. Watson, “System R A size tuning in the dynamic hybrid GRACE hash join method,” Proc. In?. relational approach to database management,” ACM Trans. Database Conf. Very Large Data Bases, 1989, p. 257. Syst. vol. 1, no. 2, p. 97, June 1976 (reprinted in M. Stonebraker, A. Klug, “Access paths in the ‘ABE’ statistical query facility,” Proc. Readings in Database Systems. San Mateo, C A Morgan Kaufmann, ACM SIGMOD Conf., 1982, p. 161. 1988 D. Knuth, The Art of Computer Programming: Sorting and Searching , M. Beck, D. Bitton, and W.K. Wilkinson, “Sorting large files on a vol. 111. Reading, MA: Addison-Wesley, 1973 backend multiprocessor,” IEEE Trans. Comput., vol. 37, p. 769, 1988. R. P. Kooi, “The optimization of queries in relational databases,” Ph.D. D. Bitton and D. J. DeWitt, “Duplicate record elimination in large data dissertation, Case Westem Reserve Univ., OH, USA, Sept. 1980. files,” ACM Trans. Database Syst., vol. 8, no. 2, p. 255-, June R.A. Lone and H. C. Young, “A low communication sort algorithm for 1983. a parallel database machine,” Proc. In[. Conf. Very Large Data Bases, D. Bitton Friedland, “Design, analysis, and implementation of parallel 1989, p. 125. extemal sorting algorithms,” Comput. Sci. Tech. Rep. 464, University J. Menon, “A study of sort algorithms for multiprocessor database of Wisconsin-Madison, Jan. 1982. machines,” Proc. Int. Conf. Very Large Data Bases, 1986, p. 197. M. Blasgen and K. Eswaran, “On the evaluation of queries in a relational M. Nakayama, M. Kitsuregawa, and M. Takagi, “Hash-partitioned join database system,” IBM Res. Rep. RJ-1745, San Jose, CA, USA, Apr. method using dynamic destaging strategy,” Proc. Int. Conf. Very Large 8, 1976. Data Bases, 1988, p. 468. -, “Storage and access in relational databases,” IBM Syst. J., vol. J. Ousterhout, “Why aren’t operating systems getting faster as fast as 16, no. 4, 1977. hardware?’ WRL Tech. Rep. TN-11, Palo Alto, CA, USA, Oct. 1989. K. Bratbergsengen, “Hashing methods and relational algebra opera- J. E. Richardson and M. J. Carey, “Programming constructs for database tions,” Proc. Int. Conf. Very Large Data Bases, 1984, p. 323. system implementation in EXODUS,” Proc. ACM SIGMOD Conf., 1987, H.T. Chou, D.J. DeWitt, R.H. Katz, and A.C. Klug, “Design and p. 208. implementation of the Wisconsin storage system,” Software: Practice B. Salzberg, File Structures: An Analytic Approach. Englewood Cliffs, and Experience, vol. 15, no. 10, p. 943, Oct. 1985. NJ: Prentice-Hall, 1988. D.J. DeWitt, R. Katz, F. Olken, L. Shapiro, M. Stonebraker, and D. -, “Merging sorted runs using large main memory,” Acta Infor- Wood, “Implementation techniques for main memory database systems,” marica, vol. 27, p. 195, 1990. Proc. ACM SIGMOD Conf., 1984, p. 1. B. Salzberg, A. Tsukerman, J. Gray, M. Stewart, S. Uren, and B. D. J. DeWitt and R. H. Gerber, “Multiprocessor hash-based join algo- Vaughan “Fastsort: A distributed single-input single-output extemal rithms,” Proc. Int. Conf. Very Large Data Bases, 1985, p. 151. sort,” Proc. ACM SIGMOD Con$.. 1990, p. 94. D. J. DeWitt, R.H. Gerber, G. Graefe, M.L. Heytens, K.B. Kumar, D. Schneider and D. DeWitt, “A performance evaluation of four parallel and M. Muralikrishna, “GAMMA: A high performance dataflow data- join algorithms in a shared-nothing multiprocessor environment,” Proc. base machine,” Proc. Int. Conf. Very Large Data Bases, 1986, p. 228 ACM SIGMOD Conf., 1989, p. 110. (reprinted in M. Stonebraker, Readings in Database Systems. San P.G. Selinger, M. M. Astrahan, D. D. Chamberlin, R.A. Lorie, and Mateo, CA: Morgan Kaufmann, 1988). T. G. Price, “Access path selection in a relational database management D. J. DeWitt, S. Ghandeharizadeh, and D. Schneider “A performance system,” Proc. ACM SICMOD Conf., 1979, p. 23 (reprinted in M. analysis of the GAMMA database machine,” Proc. ACM SIGMOD Stonebraker, Readings in Database Systems. San Mateo, CA: Morgan- Conf., 1988, p. 350. Kaufman, 1988). D. J. DeWitt, “The Wisconsin benchmark: Past, present, and future,” in J. L.D. Shapiro, “Join processing in database systems with large main Gray, Ed., Darabuse and Transactions Processing Systems Performance memories,” ACM Trans. Database Syst., vol. 11, no. 3, p. 239, Sept. Handbook. San Mateo, CA: Morgan Kaufmann, 1991. 1986.
11.944 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO. 6, DECEMBER 1994  M. Stonebraker, E. Wong, P. Kreps, and G. D. Held, “The Design and A. Linville received the B.S. degree in geology from Implementation of INGRES,” ACM Trans Database Syst , vol. 1, no. 3, Flonda Atlantic University, Boca Raton, FJ+ USA. p. 189, Sept. 1976 (reprinted in M. Stonebraker, Readings in Darabase She worked in the oil industry for several years Systems. San Mateo, CA: Morgan-Kaufman, 1988). before retuming to school. She is currently a gradu-  Teradata Corp., DBC/1012, Dura Base Computer, Concepts, and Facl/- ate student in computer science at the University of ities, Los Angeles, CA, USA, 1983. Colorado at Boulder. She has worked for the past  S. Todd, “PRTV: An efficient implementation for large relational data year on the Volcano extensible query-processing bases,” Proc Int Conf Very Large Data Bases, 1975, p. 554. system.  H Zeller and J. Gray, “an adaptive hash join algonthm for multiuser environments,” Proc Int Conf. Very Large Data Bases, Bnsbane, Aus- tralia, 1990.  G. K. Zipf, Human Behavior and the Principle of Least Effort An Intro- duction to Human Ecology Reading, MA: Addison-Wesley, 1949. L.D. Shapiro received the B.A. degree in mathe- matics from Reed College, Portland, OR, USA, in 1965, and the Ph.D. degree in mathematics from Yale University, New Haven, CT, USA, in 1969. G. Graefe was an undergraduate student in business administration and Currently, he is a Professor and Chair of Com- computer science in Germany before he received the M.S. and Ph.D. degrees puter Science at Portland State University, Port- in computer science from the University of Wisconsin-Madison, in 1984 land, OR, USA. Previously, he was a member and 1987, respectively. of the faculty at North Dakota State University, In 1987, he joined the faculty of the Oregon Graduate Institute, where he Fargo, ND, USA, and the University of Minnesota, initiated both the Volcano project on extensible query processing and, with Minneapolis. He has served as an investigator on David Maier, the REVELATION project on OODB performance. From 1989 to several research projects funded by, among others, 1994, he was an Assistant Professor of Computer Science at the University the National Science Foundation, the U.S. Department of Health, Education, of Colorado at Boulder. He is currently working on extensions to Volcano, and Welfare, the U.S. Air Force Office of Scientific Research, and the U.S. including a new optimizer generator, request processing in object-oriented Department of Agriculture. In addition, he has done extensive consulting work and scientific database systems, optimization and execution of very complex for local and national businesses and industries. His current research interests queries, and physical database design. His thesis work at the University of are in database management systems performance issues. Wisconsin was the EXODUS Optimizer Generator. Dr. Shapiro is a member of ACM and the IEEE Computer Society.