A Modular Query Optimizer Architecture for Big Data
The performance of analytical query processing in data management systems depends primarily on the capabilities of the system’s query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-theart

1 .Orca: A Modular Query Optimizer Architecture for Big Data Mohamed A. Soliman∗, Lyublena Antova∗, Venkatesh Raghavan∗, Amr El-Helw∗, Zhongxian Gu∗, Entong Shen∗, George C. Caragea∗, Carlos Garcia-Alvarado∗, Foyzur Rahman∗, Michalis Petropoulos∗, Florian Waas‡, Sivaramakrishnan Narayanan§, Konstantinos Krikellas†, Rhonda Baldwin∗ ∗ ‡ † § Pivotal Inc. Datometry Inc. Google Inc. Qubole Inc. Palo Alto, USA San Francisco, USA Mountain View, USA Mountain View, USA ABSTRACT Despite a plethora of research in this area, most exist- The performance of analytical query processing in data man- ing query optimizers in both commercial and open source agement systems depends primarily on the capabilities of projects are still primarily based on technology dating back the system’s query optimizer. Increased data volumes and to the early days of commercial database development [22], heightened interest in processing complex analytical queries and are frequently prone to produce suboptimal results. have prompted Pivotal to build a new query optimizer. Realizing this significant gap between research and prac- In this paper we present the architecture of Orca, the new tical implementations, we have set out to devise an architec- query optimizer for all Pivotal data management products, ture that meets current requirements, yet promises enough including Pivotal Greenplum Database and Pivotal HAWQ. headroom for future developments. Orca is a comprehensive development uniting state-of-the- In this paper, we describe Orca, the result of our recent re- art query optimization technology with own original research search and development efforts at Greenplum/Pivotal. Orca resulting in a modular and portable optimizer architecture. is a state-of-the-art query optimizer specifically designed for In addition to describing the overall architecture, we high- demanding analytics workloads. It is distinguished from light several unique features and present performance com- other optimizers in several important ways: parisons against other systems. Modularity. Using a highly extensible abstraction of meta- data and system description, Orca is no longer confined Categories and Subject Descriptors to a specific host system like traditional optimizers. In- H.2.4 [Database Management]: Systems—Query pro- stead it can be ported to other data management sys- cessing; Distributed databases tems quickly through plug-ins supported by its Meta- data Provider SDK. Keywords Extensibility. By representing all elements of a query and Query Optimization, Cost Model, MPP, Parallel Processing its optimization as first-class citizens of equal foot- ing, Orca avoids the trap of multi-phase optimization where certain optimizations are dealt with as an af- 1. INTRODUCTION terthought. Multi-phase optimizers are notoriously Big Data has brought about a renewed interest in query difficult to extend as new optimizations or query con- optimization as a new breed of data management systems structs often do not match the previously set phase has pushed the envelope in terms of unprecedented scal- boundaries. ability, availability, and processing capabilities (cf. e.g., [9, 18, 20, 21]), which makes large datasets of hundreds of Multi-core ready. Orca deploys a highly efficient multi- terabytes or even petabytes readily accessible for analysis core aware scheduler that distributes individual fine- through SQL or SQL-like interfaces. Differences between grained optimization subtasks across multiple cores for good and mediocre optimizers have always been known to speed-up of the optimization process. be substantial [15]. However, the increased amount of data these systems have to process magnifies optimization mis- Verifiability. Orca has special provisions for ascertaining takes and stresses the importance of query optimization correctness and performance on the level of built-in more than ever. mechanisms. Besides improving engineering practices, these tools enable rapid development with high confi- Permission to make digital or hard copies of all or part of this work for personal or dence and lead to reduced turnaround time for both classroom use is granted without fee provided that copies are not made or distributed new features as well as bug fixes. for profit or commercial advantage and that copies bear this notice and the full cita- tion on the first page. Copyrights for components of this work owned by others than Performance. Orca is a substantial improvement over our ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- previous system and in many cases offers query speed- publish, to post on servers or to redistribute to lists, requires prior specific permission up of 10x up to 1000x. and/or a fee. Request permissions from permissions@acm.org. SIGMOD’14, June 22–27, 2014, Snowbird, UT, USA. Copyright 2014 ACM 978-1-4503-2376-5/14/06 ...$15.00. We describe the architecture of Orca and highlight some http://dx.doi.org/10.1145/2588555.2595637. of the advanced features enabled by its design. We provide 337

2 . 12/5/13 gp-dia-3-0.png (650×502) During query execution, data can be distributed to seg- ments in multiple ways including hashed distribution, where tuples are distributed to segments based on some hash func- tion, replicated distribution, where a full copy of a table is stored at each segment and singleton distribution, where the whole distributed table is gathered from multiple segments to a single host (usually the master). 2.2 SQL on Hadoop Processing analytics queries on Hadoop is becoming in- creasingly popular. Initially, the queries were expressed as MapReduce jobs and the Hadoop’s appeal was attributed to its scalability and fault-tolerance. Coding, manually op- timizing and maintaining complex queries in MapReduce though is hard, thus SQL-like declarative languages, such Figure 1: High level GPDB architecture as Hive [28], were developed on top of Hadoop. HiveQL queries are compiled into MapReduce jobs and executed by a blueprint of various components and detail the engineer- Hadoop. HiveQL accelerated the coding of complex queries ing practices we have pioneered and deployed to realize this but also made apparent that an optimizer is needed in the project. Lastly, we give performance results based on the Hadoop ecosystem, since the compiled MapReduce jobs ex- TPC-DS benchmark comparing Orca to other systems. In hibited poor performance. particular, we focus on query processing systems contributed Pivotal responded to the challenge by introducing to the open source space. HAWQ [21], a massively parallel SQL-compliant engine on The remainder of this paper is organized as follows. We top of HDFS. HAWQ employes Orca in its core to devise give preliminaries on the computing architecture in Sec- www.gopivotal.com/assets/images/gp-dia-3-0.png 1/1 efficient query plans minimizing the cost of accessing data tion 2. In Section 3, we present the architecture of Orca in Hadoop clusters. The architecture of HAWQ combines and describe its components. Section 4 presents the query an innovative state-of-the-art cost-based optimizer with the optimization workflow. Section 5 describes how Orca ex- scalability and fault-tolerance of Hadoop to enable interac- changes metadata with the backend database system. We tive processing of data at petabyte scale. describe in Section 6 the tools we built to maintain a veri- Recently, a number of other efforts, including Cloudera’s fiable query optimizer. Section 7 presents our experimental Impala [17] and Facebook’s Presto [7], introduced new op- study, followed by a discussion of related work in Section 8. timizers to enable SQL processing on Hadoop. Currently, We summarize this paper with final remarks in Section 9. these efforts support only a subset of the SQL standard fea- tures and their optimizations are restricted to rule-based. 2. PRELIMINARIES In comparison, HAWQ has a full-fledged standard compli- We give preliminaries on massively parallel processing ant SQL interface and a cost-based optimizer, both of which databases (Section 2.1), and Hadoop query engines (Sec- are unprecedented features in Hadoop query engines. We il- tion 2.2). lustrate in our experimental study in Section 7 the key role that Orca plays in differentiating HAWQ from other Hadoop 2.1 Massively Parallel Processing SQL engines on both functional and performance sides. Pivotal’s Greenplum Database (GPDB) [20] is a mas- sively parallel processing (MPP) analytics database. GPDB 3. ORCA ARCHITECTURE adopts a shared-nothing computing architecture with two or more cooperating processors. Each processor has its own Orca is the new query optimizer for Pivotal data man- memory, operating system and disks. GPDB leverages this agement products, including GPDB and HAWQ. Orca is a high-performance system architecture to distribute the load modern top-down query optimizer based on the Cascades op- of petabyte data warehouses, and use system resources in timization framework [13]. While many Cascades optimizers parallel to process a given query. are tightly-coupled with their host systems, a unique feature Figure 1 shows a high level architecture of GPDB. Stor- of Orca is its ability to run outside the database system as a age and processing of large amounts of data are handled stand-alone optimizer. This ability is crucial to supporting by distributing the load across several servers or hosts to products with different computing architectures (e.g., MPP create an array of individual databases, all working to- and Hadoop) using one optimizer. It also allows leverag- gether to present a single database image. The master is ing the extensive legacy of relational optimization in new the entry point to GPDB, where clients connect and sub- query processing paradigms like Hadoop [7, 10, 16, 17]. Fur- mit SQL statements. The master coordinates work with thermore, running the optimizer as a stand-alone product other database instances, called segments, to handle data enables elaborate testing without going through the mono- processing and storage. When a query is submitted to the lithic structure of a database system. master, it is optimized and broken into smaller components DXL. Decoupling the optimizer from the database system dispatched to segments to work together on delivering the requires building a communication mechanism to process final results. The interconnect is the networking layer re- queries. Orca includes a framework for exchanging informa- sponsible for inter-process communication between the seg- tion between the optimizer and the database system called ments. The interconnect uses a standard Gigabit Ethernet Data eXchange Language (DXL). The framework uses an switching fabric. XML-based language to encode the necessary information 338

3 . Database%System% join of two tables). Group members, called group expres- % Query! Parser! Catalog! % Executor! Results! sions, achieve the group goal in different logical ways (e.g., % different join orders). Each group expression is an operator ! Query2DXL! MD!Provider! ! DXL2Plan! that has other groups as its children. This recursive struc- ! ture of the Memo allows compact encoding of a huge space of possible plans as we illustrate in Section 4.1. DXL!Query! DXL!MD! DXL!Plan! Search and Job Scheduler. Orca uses a search mecha- nism to navigate through the space of possible plan alter- Orca% natives and identify the plan with the least estimated cost. The search mechanism is enabled by a specialized Job Sched- Figure 2: Interaction of Orca with database system uler that creates dependent or parallel work units to perform query optimization in three main steps: exploration, where DXL'Query' DXL'Plan' equivalent logical expressions are generated, implementation where physical plans are generated, and optimization, where Orca% required physical properties (e.g., sort order) are enforced Search' Op*mizer%Tools% Operators' and plan alternatives are costed. We discuss the details of optimization jobs scheduling in Section 4.2. Job'Scheduler' Property'Enforcement' Transforma9ons' Transformations. [13] Plan alternatives are generated Memo% by applying transformation rules that can produce either Card.'Es9ma9on' Cost'Model' equivalent logical expressions (e.g., InnerJoin(A,B) → In- nerJoin(B,A)), or physical implementations of existing ex- MD'Cache' pressions (e.g., Join(A,B) → HashJoin(A,B)). The results of applying transformation rules are copied-in to the Memo, GPOS% which may result in creating new groups and/or adding new Excep9on' Concurrency' File'I/O' Memory'Manager' group expressions to existing groups. Each transformation Handling' Control' rule is a self-contained component that can be explicitly ac- OS% tivated/deactivated in Orca configurations. Property Enforcement. Orca includes an extensible Figure 3: Orca architecture framework for describing query requirements and plan char- acteristics based on formal property specifications. Prop- for communication, such as input queries, output plans and erties have different types including logical properties (e.g., metadata. Overlaid on DXL is a simple communication pro- output columns), physical properties (e.g., sort order and tocol to send the initial query structure and retrieve the data distribution), and scalar properties (e.g., columns used optimized plan. A major benefit of DXL is packaging Orca in join conditions). During query optimization, each oper- as a stand-alone product. ator may request specific properties from its children. An Figure 2 shows the interaction between Orca and an ex- optimized child plan may either satisfy the required proper- ternal database system. The input to Orca is a DXL query. ties on its own (e.g., an IndexScan plan delivers sorted data), The output of Orca is a DXL plan. During optimization, or an enforcer (e.g., a Sort operator) needs to be plugged in the database system can be queried for metadata (e.g., ta- the plan to deliver the required property. The framework ble definitions). Orca abstracts metadata access details by allows each operator to control enforcers placement based allowing database system to register a metadata provider on child plans’ properties and operator’s local behavior. We (MD Provider) that is responsible for serializing metadata describe this framework in more detail in Section 4.1. into DXL before being sent to Orca. Metadata can also be Metadata Cache. Since metadata (e.g., table definitions) consumed from regular files containing metadata objects se- changes infrequently, shipping it with every query incurs an rialized in DXL format. overhead. Orca caches metadata on the optimizer side and The database system needs to include translators that only retrieves pieces of it from the catalog if something is consume/emit data in DXL format. Query2DXL transla- unavailable in the cache, or has changed since the last time tor converts a query parse tree into a DXL query, while it was loaded in the cache. Metadata cache also abstracts DXL2Plan translator converts a DXL plan into an executable the database system details from the optimizer, which is plan. The implementation of such translators is done com- particularly useful during testing and debugging. pletely outside Orca, which allows multiple systems to use Orca by providing the appropriate translators. GPOS. In order to interact with operating systems with The architecture of Orca is highly extensible; all compo- possibly different APIs, Orca uses an OS abstraction layer nents can be replaced individually and configured separately. called GPOS. The GPOS layer provides Orca with an exten- Figure 3 shows the different components of Orca. We briefly sive infrastructure including a memory manager, primitives describe these components as follows. for concurrency control, exception handling, file I/O and synchronized data structures. Memo. The space of plan alternatives generated by the optimizer is encoded in a compact in-memory data struc- ture called the Memo [13]. The Memo structure consists of 4. QUERY OPTIMIZATION a set of containers called groups, where each group contains We describe Orca’s optimization workflow in Section 4.1. logically equivalent expressions. Memo groups capture the We then show how the optimization process can be con- different sub-goals of a query (e.g., a filter on a table, or a ducted in parallel in Section 4.2. 339

4 .<? xml version = " 1.0 " encoding = " UTF -8 " ? > GROUP&0& < dxl:DXLMessage xmlns:dxl = " http: // greenplum . com / dxl / v1 " > < dxl:Query > 0:%Inner%Join%[1,2]% Inner%Join% < d x l : O u t p u t C o l umns > (T1.a=T2.b)%% GROUP&1& < dxl:Ident ColId = " 0 " Name = " a " Mdid = " " / > </ d x l : O u t p u t C o lumns > 0:%Get(T1)%[]% < dxl:SortingColumnList > Get(T1)% Get(T2)% GROUP&2& < d x l : S o r t i n g C olumn ColId = " 0 " OpMdid = " " > </ d x l : S o r t i n g C o l u m n L i s t > Logical'Expression' 0:%Get(T2)%[]% < d xl : Di s tr i bu t ion Type = " Singleton " / > < dxl: Logi cal Join JoinType = " Inner " > Ini$al'Memo' < dxl:LogicalGet > < d x l : T a b l e D e sc ri p to r Mdid = " 0.1639448.1.1 " Name = " T1 " > < dxl:Columns > < dxl:Ident ColId = " 0 " Name = " a " Mdid = " " / > Figure 4: Copying-in initial logical expression < dxl:Ident ColId = " 1 " Name = " b " Mdid = " " / > </ dxl:Columns > </ d x l : T a b l e D es c ri pt o r > </ dxl:LogicalGet > (1) Exploration. Transformation rules that generate log- < dxl:LogicalGet > < d x l : T a b l e D e sc ri p to r Mdid = " 0.2868145.1.1 " Name = " T2 " > ically equivalent expressions are triggered. For example, a < dxl:Columns > Join Commutativity rule is triggered to generate InnerJoin[2,1] < dxl:Ident ColId = " 2 " Name = " a " Mdid = " " / > out of InnerJoin[1,2]. Exploration results in adding new < dxl:Ident ColId = " 3 " Name = " b " Mdid = " " / > </ dxl:Columns > group expressions to existing groups and possibly creating </ d x l : T a b l e D es c ri pt o r > new groups. The Memo structure has a built-in duplicate de- </ dxl:LogicalGet > tection mechanism, based on expression topology, to detect < dxl:Comparison Operator = " = " Mdid = " " > < dxl:Ident ColId = " 0 " Name = " a " Mdid = " " / > and eliminate any duplicate expressions created by different < dxl:Ident ColId = " 3 " Name = " b " Mdid = " " / > transformations. </ dxl:Comparison > </ dxl: Logi calJ oin > (2) Statistics Derivation. At the end of exploration, </ dxl:Query > the Memo maintains the complete logical space of the given </ dxl:DXLMessage > query. Orca’s statistics derivation mechanism is then trig- gered to compute statistics for the Memo groups. A statis- Listing 1: DXL query message tics object in Orca is mainly a collection of column his- tograms used to derive estimates for cardinality and data skew. Derivation of statistics takes place on the compact 4.1 Optimization Workflow Memo structure to avoid expanding the search space. We illustrate query optimization workflow using the fol- In order to derive statistics for a target group, Orca picks lowing running example: the group expression with the highest promise of deliver- ing reliable statistics. Statistics promise computation is SELECT T1.a FROM T1, T2 expression-specific. For example, an InnerJoin expression WHERE T1.a = T2.b with a small number of join conditions is more promising ORDER BY T1.a; than another equivalent InnerJoin expression with a larger where the distribution of T1 is Hashed(T1.a) and the distri- number of join conditions (this situation could arise when bution of T2 is Hashed(T2.a) (cf. Section 2.1). generating multiple join orders). The rationale is that the Listing 1 shows the representation of the previous query larger the number of join conditions, the higher the chance in DXL, where we give the required output columns, sort- that estimation errors are propagated and amplified. Com- ing columns, data distribution and logical query. Metadata puting a confidence score for cardinality estimation is chal- (e.g., tables and operators definitions) are decorated with lenging due to the need to aggregate confidence scores across metadata ids (Mdid’s) to allow requesting further informa- all nodes of a given expression. We are currently exploring tion during optimization. An Mdid is a unique identifier several methods to compute confidence scores in the com- composed of a database system identifier, an object identi- pact Memo structure. fier and a version number. For example, ‘’ refers to After picking the most promising group expression in the GPDB’s integer equality operator with version ‘1.0’. Meta- target group, Orca recursively triggers statistics derivation data versions are used to invalidate cached metadata objects on the child groups of the picked group expression. Finally, that have gone through modifications across queries. We the target group’s statistics object is constructed by com- discuss metadata exchange in more detail in Section 5. bining the statistics objects of child groups. The DXL query message is shipped to Orca, where it is Figure 5 illustrates statistics derivation mechanism for the parsed and transformed to an in-memory logical expression running example. First, a top-down pass is performed where tree that is copied-in to the Memo. Figure 4 shows the ini- a parent group expression requests statistics from its child tial contents of the Memo. The logical expression creates groups. For example, InnerJoin(T1,T2) on (a=b) requests three groups for the two tables and the InnerJoin operation. histograms on T1.a and T2.b. The requested histograms We omit the join condition for brevity. Group 0 is called are loaded on demand from the catalog through the regis- the root group since it corresponds to the root of the logical tered MD Provider, parsed into DXL and stored in the MD expression. The dependencies between operators in the log- Cache to service future requests. Next, a bottom-up pass is ical expression are captured as references between groups. performed to combine child statistics objects into a parent For example, InnerJoin[1,2] refers to Group 1 and Group 2 statistics object. This results in (possibly modified) his- as children. Optimization takes place as described in the tograms on columns T1.a and T2.b, since the join condition following steps. could impact columns’ histograms. 340

5 . {%%}% bution. Gather operator gathers tuples from all segments to Hist(T1.a)%Hist(T2.b)% the master. GatherMerge operator gathers sorted data from Inner%Join(a=b)%[1,2]% Inner%Join(a=b)%[1,2]% all segments to the master, while keeping the sort order. Re- distribute operator distributes tuples across segments based {T1.a}% {T2.b}% on the hash value of given argument. Hist(T1.a)% Hist(T2.b)% Figure 7 shows the optimization of req. #1 by Inner- GROUP%1% GROUP%2% HashJoin[1,2]. For this request, one of the alternative plans GROUP%1% GROUP%2% is aligning child distributions based on join condition, so (c)$Bo8om(up$sta.s.cs$deriva.on$ that tuples to be joined are co-located2 . This is achieved (a)$Top(down$sta.s.cs$requests$ by requesting Hashed(T1.a) distribution from group 1 and GROUP&1& Hashed(T2.b) distribution from group 2. Both groups are GROUP&0& requested to deliver Any sort order. After child best plans 0:%Get(T1)%[]% Hist(T1.a)% are found, InnerHashJoin combines child properties to deter- 0:%Inner%Join%[1,2]% GROUP&2& mine the delivered distribution and sort order. Note that Hist(T1.a)% Hist(T2.b)% 1:%Inner%Join%[2,1]% 0:%Get(T1)%[]% the best plan for group 2 needs to hash-distribute T2 on Hist(T2.b)% T2.b, since T2 is originally hash-distributed on T2.a, while (b)$Computed$sta.s.cs$are$ (d)$Combined$sta.s.cs$are$ the best plan for group 1 is a simple Scan, since T1 is already a8ached$to$child$groups$ a8ached$to$parent$group$$ hash-distributed on T1.a. When it is determined that delivered properties do not Figure 5: Statistics derivation mechanism satisfy the initial requirements, unsatisfied properties have to be enforced. Property enforcement in Orca in a flexible Constructed statistics objects are attached to individual framework that allows each operator to define the behav- groups where they can be incrementally updated (e.g., by ior of enforcing required properties based on the properties adding new histograms) during optimization. This is crucial delivered by child plans and operator local behavior. For ex- to keep the cost of statistics derivation manageable. ample, an order-preserving NL Join operator may not need to enforce a sort order on top of the join if the order is (3) Implementation. Transformation rules that create already delivered by outer child. physical implementations of logical expressions are trig- Enforcers are added to the group containing the group gered. For example, Get2Scan rule is triggered to gener- expression being optimized. Figure 7 shows two possible ate physical table Scan out of logical Get. Similarly, Inner- plans that satisfy req. #1 through property enforcement. Join2HashJoin and InnerJoin2NLJoin rules are triggered to The left plan sorts join results on segments, and then gather- generate Hash and Nested Loops join implementations. merges sorted results at the master. The right plan gathers (4) Optimization. In this step, properties are enforced and join results from segments to the master, and then sorts plan alternatives are costed. Optimization starts by submit- them. These different alternatives are encoded in the Memo ting an initial optimization request to the Memo’s root group and it is up to the cost model to differentiate their costs. specifying query requirements such as result distribution and Finally, the best plan is extracted from the Memo based on sort order. Submitting a request r to a group g corresponds the linkage structure given by optimization requests. Fig- to requesting the least cost plan satisfying r with a root ure 6 illustrates plan extraction for the running example. physical operator in g. We show the local hash tables of relevant group expressions. For each incoming request, each physical group expres- Each local hash table maps incoming optimization request sion passes corresponding requests to child groups depending to corresponding child optimization requests. on the incoming requirements and operator’s local require- We first look-up the best group expression of req. #1 in ments. During optimization, many identical requests may the root group, which leads to GatherMerge operator. The be submitted to the same group. Orca caches computed corresponding child request in the local hash table of Gath- requests into a group hash table. An incoming request is erMerge is req #3. The best group expression for req #3 computed only if it does not already exist in group hash is Sort. Therefore, we link GatherMerge to Sort. The cor- table. Additionally, each physical group expression main- responding child request in the local hash table of Sort is tains a local hash table mapping incoming requests to the req #4. The best group expression for req #4 is Inner- corresponding child requests. Local hash tables provide the HashJoin[1,2]. We thus link Sort to InnerHashJoin. The same linkage structure used when extracting a physical plan from procedure is followed to complete plan extraction leading to the Memo, as we show later in this section. the final plan shown in Figure 6. Figure 6 shows optimization requests in the Memo for The extracted plan is serialized in DXL format the running example. The initial optimization request is and shipped to the database system for execution. req. #1: {Singleton, <T1.a>}, which specifies that query DXL2Plan translator at the database system translates results are required to be gathered to the master based on DXL plan to an executable plan based on the underling query the order given by T1.a1 . We also show group hash ta- execution framework. bles where each request is associated with the best group Multi-Stage Optimization. Our ongoing work in Orca expression (GExpr) that satisfies it at the least estimated involves implementing multi-stage optimization. An opti- cost. The black boxes indicate enforcer operators that are 2 plugged in the Memo to deliver sort order and data distri- There can be many other alternatives (e.g., request children to be gathered to the master and perform the join there). Orca 1 Required properties also include output columns, rewindability, allows extending each operator with any number of possible common table expressions and data partitioning. We omit these optimization alternatives and cleanly isolates these alternatives properties due to space constraints. through property enforcement framework. 341

6 . Groups#Hash#Tables# Memo# Extracted#final#plan# GROUP#0# ## Opt.#Request# Best#GExpr# 2:#Inner#NLJoin#[2,1]# 3:#Inner#NLJoin#[1,2]# 4:#Inner#HashJoin#[1,2]# 5:#Inner#HashJoin#[2,1]# GatherMerge(T1.a)# 1# Singleton,##<T1.a># 8" …# …# …# …# #4# #7,##10# …# …# 2# Singleton,#Any# 7" …# …# Sort(T1.a)# 3# Any,##<T1.a># 6" 6:#Sort(T1.a)#[0]# 7:#Gather[0]# 8:#GatherMerge(T1.a)#[0]# 4# Any,##Any# 4" Inner#Hash#Join# #3# #4# …# …# #1# #3# …# …# …# …# ## Opt.#Request# Best#GExpr# GROUP#1# 5# Any,#Any# 1" 1:#Scan(T1)[]# 2:#Sort(T1.a)#[1]# 3:#Replicate[1]# 6# Replicated,#Any# 3" #7# …# …# …# …# Scan(T1)# 7# Hashed(T1.a),#Any# 1" …# …# 8# Any,#<T1.a># 2" GROUP#2# ## Opt.#Request# Best#GExpr# 1:#Scan(T2)[]# 2:#Replicate[2]# 3:#Redistribute(T2.b)#[2]# 9# Any,#Any# 1" Redistribute(T2.b)# 10# Hashed(T2.b),#Any# 3" #9# …# …# #10# #9# 11# Replicated,#Any# 2" …# …# …# …# Scan(T2)# Figure 6: Processing optimization requests in the Memo {Singleton,%<T1.a>}% {Singleton,%<T1.a>}% lates to better query plans and hence better system perfor- Inner%Hash%Join%[1,2]% Inner%Hash%Join% mance. Parallelizing query optimizer is crucial to benefit {Hashed(T1.a),%Any}% {Hashed(T2.b),%Any}% from advanced CPU designs that exploit an increasing num- Scan(T1)% Redistribute(T2.b)/ bers of cores. GROUP%1% GROUP%2% Orca is a multi-core enabled optimizer. Optimization pro- Scan(T2)% cess is broken to small work units called optimization jobs. (a)$Passing$requests$to$child$groups$ (b)$Combining$child$groups$best$plans$ Orca currently has seven different types of optimization jobs: GatherMerge(T1.a)/ Sort(T1.a)/ • Exp(g): Generate logically equivalent expressions of all Sort(T1.a)/ Gather/ group expressions in group g. Inner%Hash%Join% Inner%Hash%Join% • Exp(gexpr): Generate logically equivalent expressions Scan(T1)% Redistribute(T2.b)/ Scan(T1)% Redistribute(T2.b)/ of a group expression gexpr. Scan(T2)% Scan(T2)% • Imp(g): Generate implementations of all group expres- (c)$Enforcing$missing$proper:es$to$sa:sfy${Singleton,$<T1.a>}$request$ sions in group g. • Imp(gexpr): Generate implementation alternatives of a Figure 7: Generating InnerHashJoin plan alternatives group expression gexpr. mization stage in Orca is defined as a complete optimiza- • Opt(g, req): Return the plan with the least estimated tion workflow using a subset of transformation rules and cost that is rooted by an operator in group g and sat- (optional) time-out and cost threshold. A stage terminates isfies optimization request req. when any of the following conditions is met: (1) a plan • Opt(gexpr, req): Return the plan with the least esti- with cost below cost threshold is found, (2) time-out oc- mated cost that is rooted by gexpr and satisfies opti- curs, or (3) the subset of transformation rules is exhausted. mization request req. The specification of optimization stages can be given by the user through Orca’s configuration. This technique allows • Xform(gexpr, t) Transform group expression gexpr us- resource-constrained optimization where, for example, the ing rule t. most expensive transformation rules are configured to run in later stages to avoid increasing the optimization time. This For a given query, hundreds or even thousands of job in- technique is also a foundation for obtaining a query plan stances of each type may be created. This introduces chal- as early as possible to cut-down search space for complex lenges for handling job dependencies. For example, a group queries. expression cannot be optimized until its child groups are also optimized. Figure 8 shows a partial job graph, where Query Execution. A copy of the final plan is dispatched optimization of group g0 under optimization request req0 to each segment. During distributed query execution, a dis- triggers a deep tree of dependent jobs. Dependencies are tribution enforcer on each segment acts as both sender and encoded as child-parent links; a parent job cannot finish be- receiver of data. For example, a Redistribute(T2.b) instance fore its child jobs finish. While child jobs are progressing, running on segment S sends tuples on S to other segments the parent job needs to be suspended. This allows child jobs based on the hash value of T2.b, and also receives tuples to pick up available threads and run in parallel, if they do from other Redistribute(T2.b) instances on other segments. not depend on other jobs. When all child jobs complete, the suspended parent job is notified to resume processing. 4.2 Parallel Query Optimization Orca includes a specialized job scheduler designed from Query optimization is probably the most CPU-intensive scratch to maximize the fan-out of job dependency graph process in a database system. Effective usage of CPUs trans- and provide the required infrastructure for parallel query 342

7 . Opt(g0,(req0)( Vanilla! Oracle,!MS!SQL! Hadoop,! ! Postgres! DB2,!MySQL,!etc.! MongoDB! op1mize'group' File! Imp(g0)( Opt(g0.gexpr,(req0)( …( expressions'in'g0''' based!! GPDB! HAWQ! PG! DB! NoSQL! …( Provider! Exp(g0)( Opt(g1,(req1)( …( op1mize'children' ! …( …( of'g0.gexpr' …( Imp(g1)( Opt(g1.gexpr,(req1)( …( op1mize'group' ! expressions'in'g1'''' ! …( …( MD!Provider!PlugIins! ! implement'group' Exp(g1)( Imp(g1.gexpr)( …( expressions'in'g1' ! MD!Cache! …( ! Exp(g1.gexpr)( …( explore'group' ! expressions'in'g1'' ! …( ! ! MD!Accessor! …( ! …( Xform(g1.gexpr,(t)( Xform(g1.gexpr,(t’)( …( ! ! MD!Accessor! …( …( explora1on'rules' implementa1on' ! ! explore'children' of'g1.gexpr' rules'of'g1.gexpr' ! MD!Accessor! ! of'g1.gexpr' ! ! Op'miza'on!Engine! ! Figure 8: Optimization jobs dependency graph Orca! Figure 9: Metadata exchange framework optimization. The scheduler provides APIs to define opti- mization jobs as re-entrant procedures that can be picked Dumpfile! up by available processing threads. It also maintains job de- ! DXLStacks! pendency graph to identify opportunities of parallelism (e.g., ! running transformations in different groups), and notify sus- DXL!Query! ! DXL!MD! DXL!Plan! ! pended jobs when the jobs they depend on have terminated. During parallel query optimization, multiple concurrent requests to modify a Memo group might be triggered by MD!Cache! ! different optimization requests. In order to minimize syn- ! ! chronization overhead among jobs with the same goal (e.g., ! Op+miza+on! exploring the same group), jobs should not know about the ! Engine! ! existence of each other. When an optimization job with Orca! some goal is under processing, all other incoming jobs with the same goal are forced to wait until getting notified about Figure 10: Replay of AMPERe dump the completion of the running job. At this point, the sus- pended jobs can pick up the results of the completed job. This functionality is enabled by attaching a job queue to eliminating the need to access a live backend system. Orca each group, such that incoming jobs are queued as long as includes an automated tool for harvesting metadata that there exists an active job with the same goal. optimizer needs into a minimal DXL file. We show in Sec- tion 6.1 how this tool is used to replay optimization of cus- tomer queries while the backend database system is offline. 5. METADATA EXCHANGE Orca is designed to work outside the database system. One major point of interaction between optimizer and 6. VERIFIABILITY database system is metadata exchange. For example, the Testing a query optimizer is a task as challenging as build- optimizer may need to know whether indices are defined on ing one. Orca is built with testing in mind since the early a given table to devise an efficient query plan. The access to development phase. There is a built-in testing scheme that metadata is facilitated by a collection of Metadata Providers makes it difficult for developers to introduce regressions as that are system-specific plug-ins to retrieve metadata from part of adding new features, and makes it simple for test en- the database system. gineers to add test cases to be verified with every build. In Figure 9 shows how Orca exchanges metadata with differ- addition, we leverage several tools and testing frameworks ent backend systems. During query optimization, all meta- we built to assure the quality and verifiability of Orca, in- data objects accessed by Orca are pinned in an in-memory cluding a cardinality estimation testing framework, a num- cache, and are unpinned when optimization completes or an ber of benchmark tests at various scales, a data generator error is thrown. All accesses to metadata objects are accom- that can generate data by reversing database statistics [24], plished via MD Accessor, which keeps track of objects being and two unique testing tools we discuss next. accessed in the optimization session, and makes sure they The first tool, discussed in Section 6.1, is automatic cap- are released when they are no longer needed. MD Accessor turing and replaying of optimizer’s anomalies. The second is also responsible for transparently fetching metadata data tool, discussed in Section 6.2, implements an automated from external MD Provider if the requested metadata object method to measure the accuracy of optimizer’s cost model. is not already in the cache. Different MD Accessors serving different optimization sessions may have different external 6.1 Minimal Repros MD providers for fetching metadata. AMPERe [3] is a tool for Automatic capture of Minimal In addition to system-specific providers, Orca implements Portable and Executable Repros. The motivation for build- a file-based MD Provider to load metadata from a DXL file, ing AMPERe was to be able to reproduce and debug cus- 343

8 .<? xml version = " 1.0 " encoding = " UTF -8 " ? > p4! < dxl:DXLMessage xmlns:dxl = " http: // greenplum . com / dxl / v1 " > Estimated Cost! < dxl:Thread Id = " 0 " > < dxl:Stacktrace > 1 0 x000e8106df g p o s : : C E x c e p t i o n : : R a i s e 2 0 x000137d853 C O p t T a s k s : : P v O p t i m i z e T a s k p2! 3 0 x000e81cb1c g p os : : C T a sk : : E x ec u t e p3! 4 0 x000e8180f4 g p o s : : C W o r k e r : : E x e c u t e 5 0 x000e81e811 g p o s : : C A u t o T a s k P r o x y : : E x e c u t e p1! </ dxl:Stacktrace > Actual Cost! < dxl:TraceFlags Value = " g p _ o p t i m i z e r _ h a s h j o i n " / > < dxl:Metadata SystemIds = " 0. GPDB " > < dxl:Type Mdid = " " Name = " int4 " Figure 11: Plan space I s R edistributable = " true " Length = " 4 " / > < dxl:RelStats Mdid = " 2.688.1.1 " Name = " r " Rows = " 10 " / > might generate a plan different from the expected one (e.g., < dxl:Relation Mdid = " 0.688.1.1 " Name = " r " Di str ibu tion Pol icy = " Hash " because of changes in the cost model). Such discrepancy D is tr i bu ti o nC ol u mn s = " 0 " > causes the test case to fail, and triggers investigating the < dxl:Columns > root cause of plan difference. Using this framework, any < dxl:Column Name = " a " Attno = " 1 " Mdid = " " / > bug with an accompanying AMPERe dump, whether filed </ dxl:Columns > </ dxl:Relation > by internal testing or through customer reports, can be au- </ dxl:Metadata > tomatically turned into a self-contained test case. < dxl:Query > < d x l : O u t p u t C o lumns > < dxl:Ident ColId = " 1 " Name = " a " Mdid = " " / > 6.2 Testing Optimizer Accuracy </ d x l : O u t p u t C olumns > The accuracy of Orca’s cost model can be impacted by < dxl:LogicalGet > a number of error sources including inaccurate cardinality < d x l : T a b l e D e sc ri p to r Mdid = " 0.688.1.1 " Name = " r " > < dxl:Columns > estimates and not properly adjusted cost model parameters. < dxl:Column ColId = " 1 " Name = " a " Mdid = " " / > As a result, cost model provides imperfect prediction of the </ dxl:Columns > wall clock time for the execution of a plan. Quantifying opti- </ d x l : T a b l e D es c ri pt o r > </ dxl:LogicalGet > mizer’s accuracy is crucial to avoid performance regressions </ dxl:Query > introduced by bug fixes and newly added features. </ dxl:Thread > Orca includes a built-in tool called TAQO [15] for Testing </ dxl:DXLMessage > the Accuracy of Query Optimizer. TAQO measures the abil- ity of optimizer’s cost model to order any two given plans Listing 2: Simplified AMPERe dump correctly, i.e., the plan with the higher estimated cost will in- deed run longer. For example, in Figure 11, the optimizer or- ders (p1 , p3 ) correctly, since their actual cost is directly pro- tomer issues in the optimizer without having access to the portional to computed cost estimates. On the other hand, customer production system. the optimizer orders (p1 , p2 ) incorrectly, since their actual An AMPERe dump is automatically triggered when an un- cost is inversely proportional to computed cost estimates. expected error is encountered, but can also be produced on TAQO measures the optimizer’s accuracy by costing and demand to investigate suboptimal query plans. The dump executing plans that the optimizer considers when optimiz- captures the minimal amount of data needed to reproduce ing a given query. Evaluating each single plan in the search a problem, including the input query, optimizer configura- space is infeasible, in general. This limitation can be over- tions and metadata, serialized in DXL (cf. Section 3). If the come by sampling plans uniformly from the search space. dump is generated due to an exception, it also includes the Optimization requests’ linkage structure (cf. Section 4.1) exception’s stack trace. provides the infrastructure used by TAQO to build a uni- Listing 2 shows an example of a simplified AMPERe dump. form plan sampler based on the method introduced in [29]. The dump includes only the necessary data to reproduce the Given a sample of plans from the search space of a given problem. For example, the dump captures the state of MD query, TAQO computes a correlation score between the rank- Cache which includes only the metadata acquired during the ing of sampled plans based on estimated costs and their course of query optimization. AMPERe is also built to be ranking based on actual costs. The correlation score com- extensible. Any component in Orca can register itself with bines a number of measures including importance of plans the AMPERe serializer to generate additional information in (the score penalizes optimizer more for cost miss-estimation the output dump. of very good plans), and distance between plans (the score AMPERe allows replaying a dump outside the system does not penalize optimizer for small differences in the es- where it was generated. Any Orca instance can load the timated costs of plans that are actually close in execution dump to retrieve the input query, metadata and configura- time). The correlation score also allows benchmarking the tion parameters in order to invoke an optimization session optimizers of different database systems to evaluate their identical to the one that triggered the problematic situation relative quality. We discuss the testing methodology imple- at hand. This process is depicted in Figure 10, where the mented in TAQO in more detail in [15]. optimizer loads the input query from the dump, creates a file-based MD Provider for the metadata, sets optimizer’s configurations and then spawns the optimization threads to 7. EXPERIMENTS reproduce the problem instantly. In our experimental study, we chose to conduct an end- AMPERe is also used as a testing framework, where a to-end evaluation of a database system equipped with Orca, dump acts as a test case that contains an input query and rather than evaluating Orca’s individual components, to its expected plan. When replaying the dump file, Orca highlight the added value of our new query optimizer. We 344

9 .first compare Orca to the legacy query optimizer of Pivotal • Join Ordering. Orca includes a number of join or- GPDB. We then compare Pivotal HAWQ (which employs dering optimizations based on dynamic programming, Orca in its core) to other popular SQL on Hadoop solutions. left-deep join trees and cardinality-based join ordering. 7.1 TPC-DS Benchmark • Correlated Subqueries. Orca adopts and extends a uni- fied representation of subqueries to detect deeply cor- Our experiments are based on the TPC-DS benchmark [1]. related predicates and pull them up into joins to avoid TPC-DS is a widely adopted decision support benchmark repeated execution of subquery expressions. that consists of a set of complex business analytic queries. It has superseded the well-known TPC-H by providing a • Partition Elimination. Orca introduces a novel frame- much richer schema and a larger variety of business prob- work for on-the-fly pruning of partitioned tables [2]. lems ranging from business reporting, ad-hoc exploration, This feature is implemented by extending Orca’s en- iterative queries to data mining. In our development process forcers framework to accommodate new properties. we have observed that TPC-H often lacks the sophistication of the workload from our enterprise customers. On the other • Common Expressions. Orca introduces a new hand, TPC-DS with its 25 tables, 429 columns and 99 query producer-consumer model for WITH clause. The model templates can well represent a modern decision-supporting allows evaluating a complex expression once, and con- system and is an excellent benchmark for testing query op- suming its output by multiple operators. timizers. The rich SQL syntax (WITH clause, window func- The interplay of the previous features is enabled by Orca’s tions, subqueries, outer joins, CASE statement, Intersect, architecture and components abstraction. Each feature is Except, etc.) in the TPC-DS queries is a serious SQL com- designed, implemented and tested with minimal changes in pliance test for any query engine. the behavior of other features. The combined benefit and clean interaction of features are manifested by Figure 12. 7.2 MPP Databases For a smaller number of queries, Orca produced sub- In this part, we compare the performance of Orca with the optimal plans with up to 2x slow down compared to Planner. GPDB legacy query optimizer (a.k.a. Planner) that inherits These sub-optimal plans are partly due to cardinality esti- part of its design from the PostgreSQL optimizer. The Plan- mation errors or sub-optimal cost model parameters that ner is a robust optimizer that has been serving hundreds of need further tuning. We are actively investigating these is- production systems well, and has been improving over the sues and constantly improving Orca. past decade. We have also measured optimization time and Orca’s memory footprint when using the full set of transformation 7.2.1 Experiment Setup rules. The average optimization time is around 4 seconds, For the comparison between Orca and Planner, we use a while the average memory footprint is around 200 MB. As cluster of 16 nodes connected with 10Gbps Ethernet. Each we mentioned in Section 4.1, our ongoing work involves im- node has dual Intel Xeon processors at 3.33GHz, 48GB of plementing techniques to shortcut optimization and improve RAM and twelve 600GB SAS drives in two RAID-5 groups. resource consumption for complex queries. The operating system is Red Hat Enterprise Linux 5.5. We installed two isolated instances of the same version of 7.3 SQL on Hadoop GPDB (one uses Orca, and the other uses Planner). We Hadoop has quickly become a popular analytics ecosys- use 10 TB TPC-DS benchmark with partitioned tables for tem due to its scalability. In recent years, many Hadoop performance evaluation. systems have been developed with SQL or SQL-like query in- terfaces. In this section, we compare the performance of Piv- 7.2.2 Performance otal HAWQ (powered by Orca) against three Hadoop SQL We generated 111 queries out of the 99 templates of TPC- engines: Impala [17], Presto [7], and Stinger [16]. Please DS. Both Orca and Planner support all the queries in their refer to Section 8 for a discussion on these systems. original form without any re-writing. The full SQL compli- ance provides maximum compatibility of BI tools and ease- 7.3.1 Experiment Setup of-use for data analysts from different backgrounds. As we The experiments are conducted on a cluster with 10 nodes; show in the SQL on Hadoop experiments in Section 7.3, two for HDFS name node and coordinator services of SQL many Hadoop SQL engines currently support only a small engines, and eight for HDFS data nodes and worker nodes. subset of TPC-DS queries out of the box. Each node has dual Intel Xeon eight-core processors at The performance speed up of Orca compared to Planner 2.7GHz, 64GB of RAM and 22 900GB disks in JBOD. The for all queries is shown in Figure 12, where bars above the operating system is Red Hat Enterprise Linux 6.2. speed-up ratio of 1 indicate performance improvement of We used CDH 4.4 and Impala 1.1.1 for Impala, Presto Orca. We observe that Orca is able to produce similar or 0.52, and Hive 0.12 for Stinger. We made our best efforts better query plan for 80% of the queries. For the entire to tune the optimal configurations of each system, including TPC-DS suite, Orca shows a 5x speed-up over Planner. enabling short circuit read, allocating as much memory as In particular, for 14 queries Orca achieves a speed-up ra- possible to workers and having one standalone node for co- tio of at least 1000x - this is due to a timeout we enforced ordinator services. For HAWQ, we used Pivotal HD version at 10000 seconds. These queries took more than 10000 sec- 1.1 in our experiment. onds with Planner’s plan while they were able to finish with Optimization of TPC-DS queries in different systems Orca’s plan in minutes. turned out to be quite challenging because systems currently The performance improvement provided by Orca is due to have limited SQL support. For example, Impala does not a combination of salient features including the following: yet support window functions, ORDER BY statement without 345

10 . x   x   x   x   x   x   x   x   x   x   00 00 00 00 00 00 00 00 00 00 10 10 10 10 10 10 10 10 10 10 Orca Speed-up ratio 5   1.0   0.5   1   3   5   6   8   10   12   15   20   22   23   25   28   30   32   36   39   41   43   45   49   51   52   56   58   60   62   66   69   72   76   79   82   86   89   91   93   95   99   14a   18a   67a   70a   77a   80a   17   27   34   37   47   54   64   74   84   87   97   Query ID (every other ID is shown due to space limit) Figure 12: Speed-up ratio of Orca vs Planner (TPC-DS 10TB) 100" 100" HAWQ speed-up ratio HAWQ speed-up ratio 10" 10" 1" 1" 3" 11" 15" 19" 21" 22a" 25" 26" 27a" 29" 42" 43" 46" 50" 52" 55" 59" 68" 75" 76" 79" 82" 85" 93" 96" 4" 7" 37" 54" 74" 97" 3" 12" 17" 18" 20" 22" 25" 29" 37" 42" 52" 55" 67" 76" 82" 84" 86" 90" 98" Query ID Query ID (*)"Query"runs"out"of"memory"in"Impala" Figure 14: HAWQ vs Stinger (TPC-DS 256GB) Figure 13: HAWQ vs Impala (TPC-DS 256GB) LIMIT and some analytic functions like ROLLUP and CUBE. 111   111   111   Presto does not yet support non-equi joins. Stinger cur- op7miza7on   rently does not support WITH clause and CASE statement. In execu7on   #  of  queries   addition, none of the systems supports INTERSECT, EXCEPT, disjunctive join conditions and correlated subqueries. These 31   unsupported features have forced us to rule out a large num- 20   19   19   12   ber of queries from consideration. 0   0   After excluding unsupported queries, we needed to re- HAWQ   Impala   Presto   S7nger   write the remaining queries to work around parsing limita- tions in different systems. For example, Stinger and Presto do not support implicit cross-joins and some data types. Af- Figure 15: TPC-DS query support ter extensive filtering and rewriting, we finally managed to get query plans for 31 queries in Impala, 19 queries in Stinger unable to successfully run any TPC-DS query in Presto (al- and 12 queries in Presto, out of the total 111 queries. though we managed to run much simpler join queries in Presto). For Impala and Stinger, we managed to run a num- 7.3.2 Performance ber of TPC-DS queries, as we discuss next. Our first attempt was to evaluate the different systems us- Figure 15 summarizes the number of supported queries ing 10TB TPC-DS benchmark. However, most of the queries in all the systems. We show the number of queries that from Stinger did not return within a reasonable time limit, each system can optimize (i.e., return a query plan), and and almost all the queries from Impala and Presto failed due the number of queries that can finish execution and return to an out of memory error. This mainly happens due to the query results for the 256GB dataset. inability of these systems to spill partial results to disk when Figure 13 and Figure 14 show the speedup ratio of HAWQ an operator’s internal state overflows the memory limits. over Impala and Stinger. Since not all the queries are To obtain a better coverage across the different systems, supported by the two systems, we only list the successful we used 256GB TPC-DS benchmark, considering that the queries. The bars marked with ‘∗’ in Figure 13 indicate the total working memory of our cluster is about 400GB (50GB queries that run out of memory. For query 46, 59 and 68, × 8 nodes). Unfortunately even with this setting, we were Impala and HAWQ have similar performance. 346

11 . For queries where HAWQ has the most speedups, we find Building on the principles of Cascades, a parallel opti- that Impala and Stinger handle join orders as literally spec- mization framework is proposed in [30] that enables building ified in the query, while Orca explores different join orders Cascades-like optimizers for multi-core architectures. The to suggest the best one using a cost-based approach. For ex- parallel query optimization framework in Orca (cf. Sec- ample in query 25, Impala joins two fact tables store_sales tion 4.2) is based on the principles introduced in [30]. and store_returns first and then joins this huge interme- diate results with another fact table catalog_sales, which 8.2 SQL Optimization On MPP Databases is quite inefficient. In comparison, Orca joins the fact tables The exponential growth in the amount of data being with dimension tables first to reduce intermediate results. stored and queried has translated into a wider usage of In general, join ordering is a non-trivial optimization that Massively Parallel Processing (MPP) systems such as Tera- requires extensive infrastructure on the optimizer side. data [27], Oracle’s Exadata [31], Netezza [25], Pivotal Green- Impala recommends users to write joins in the descend- plum Database [20], and Vertica [18]. Due to space limita- ing order of the sizes of joined tables. However, this sug- tion, we summarize a few recent efforts in re-designing the gestion ignores the filters (which may be selective) on the query optimizer to meet the challenges of big data. tables, adds a non-trivial overhead to a database user for SQL Server Parallel Data Warehouse (PDW) [23] makes complex queries and may not be supported by BI tools that extensive re-use of the established Microsoft’s SQL Server automatically generates queries. The lack of join ordering optimizer. For each query, PDW triggers an optimization optimizations in a query optimizer has negative impact on request to the SQL Server optimizer that works on a shell the quality of produced plans. Other possible reasons for database that maintains only the metadata and statistics of HAWQ speedups such as resource management and query the database and not its user data. The plan alternatives ex- execution is outside the scope of this paper. plored by SQL Server optimizer are then shipped to PDW’s The average speedup ratio of HAWQ in this set of exper- Data Movement Service (DMS) where these logical plans iments is 6x against Impala and 21x against Stinger. Note are retrofitted with distribution information. While this ap- that the queries listed in Figure 13 and Figure 14 are rel- proach avoids building an optimizer from scratch, it makes atively simple queries in TPC-DS benchmark. More com- debugging and maintenance harder since the optimization plex queries (e.g., queries with correlated sub-queries) are logic is spread across two different processes and codebases. not supported by other systems yet, while being completely Structured Computations Optimized for Parallel Execution supported by Orca. We plan to revisit TPC-DS benchmark (SCOPE) [6] is Microsoft’s data analysis platform that lever- performance evaluation in the future when all the queries ages characteristics of both parallel databases and MapRe- are supported by other systems. duce systems. SCOPE’s scripting language, like Hive [28], is based on SQL. SCOPE is developed for the Cosmos dis- tributed data platform that employs an append-only file sys- 8. RELATED WORK tem, while Orca is designed with a vision to work with mul- Query optimization has been a fertile field for several tiple underlying data management systems. ground breaking innovations over the past decades. In this SAP HANA [11] is a distributed in-memory database sys- section, we discuss a number of foundational query optimiza- tem that handles business analytical and OLTP queries. An- tion techniques, and recent proposals in the space of MPP alytical queries in MPP databases can potentially generate databases and Hadoop-based systems. a large amount of intermediate results. Concurrent analyti- cal queries can exhaust the available memory, most of which 8.1 Query Optimization Foundations is already consumed to store and index raw data, and will Volcano Parallel Database [12] introduced basic princi- trigger data to be spilled to disk that results in a negative ples for achieving parallelism in databases. The proposed impact on query performance. framework introduced exchange operators, which enable two Vertica [18] is the commercialized MPP version of the C- means of parallelism, namely inter-operator parallelism, via Store project [26] where the data is organized into projec- pipelining, and intra-operator parallelism, via partitioning of tions and each projection is a subset of table attributes. The tuples across operators running on different processes. The initial StarOpt and its modified StratifiedOpt optimizer were proposed design allows each operator to execute indepen- custom designed for queries over star/snowflake schemas, dently on local data, as well as work in parallel with other where the join keys of the same range are co-located. When copies of the operator running in other processes. Several data co-location cannot be achieved, the pertinent projec- MPP databases [6, 8, 18, 20, 23] make use of these principles tions are replicated on all nodes to improve performance, as to build commercially successful products. addressed by Vertica’s V2Opt optimizer. Cascades [13] is an extensible optimizer framework whose principles have been used to build MS-SQL Server, 8.3 SQL On Hadoop SCOPE [6], PDW [23], and Orca, the optimizer we present The classic approach of executing SQL on Hadoop is in this paper. The popularity of this framework is due to converting queries into MapReduce jobs using Hive [28]. its clean separation of the logical and physical plan spaces. MapReduce performance can be unsatisfactory for interac- This is primarily achieved by encapsulating operators and tive analysis. Stinger [16] is an initiative to optimize query transformation rules into self-contained components. This evaluation on Hadoop by leveraging and extending Hive. modular design enables Cascades to group logically equiv- This approach, however, could entail significant redesign alent expressions to eliminate redundant work, allows rules of MapReduce computing framework to optimize passes on to be triggered on demand in contrast to Volcano’s [14] ex- data, and materialization of intermediate results on disk. haustive approach, and permits ordering the application of Several efforts have addressed interactive processing on rules based on their usefulness to a given operator. Hadoop by creating specialized query engines that allow 347

12 .SQL-based processing of data in HDFS without the need P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, to use MapReduce. Impala [17], HAWQ [21] and Presto [7] A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, are key efforts in this direction. These approaches are dif- R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google’s ferent in the design and capabilities of their query optimiz- Globally-distributed Database. In OSDI, 2012. ers and execution engines, both of which are differentiating [10] D. J. DeWitt, A. Halverson, R. Nehme, S. Shankar, factors for query performance. Co-location of DBMS and J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Hadoop technologies allows data to be processed natively Split Query Processing in Polybase. In SIGMOD, 2013. on each platform, using SQL in the DBMS and MapReduce [11] F. F¨arber, S. K. Cha, J. Primsch, C. Bornh¨ ovd, S. Sigg, and in HDFS. Hadapt [4] pioneered this approach. Microsoft has W. Lehner. SAP HANA Database: Data Management for also introduced PolyBase [10] to offer the ability to join ta- Modern Business Applications. SIGMOD Rec., 40(4), 2012. bles from PDW [23] with data on HDFS in order to optimize [12] G. Graefe. Encapsulation of Parallelism in the Volcano Query Processing System. In SIGMOD, 1990. data exchange from one platform to another. [13] G. Graefe. The Cascades Framework for Query AsterixDB [5] is an open-source effort to efficiently store, Optimization. IEEE Data Eng. Bull., 18(3), 1995. index and query semi-structured information based on a [14] G. Graefe and W. J. McKenna. The Volcano Optimizer NoSQL style data model. Currently, AsterixDB’s query Generator: Extensibility and Efficient Search. In ICDE, planner is driven by user hints rather than a cost driven ap- 1993. proach like Orca. Dremel [19] is a scalable columnar solution [15] Z. Gu, M. A. Soliman, and F. M. Waas. Testing the from Google used to analyze outputs of MapReduce pipeline. Accuracy of Query Optimizers. In DBTest, 2012. Dremel provides a high level scripting languages similar to [16] Hortonworks. Stinger, Interactive query for Apache Hive. AsterixDB’s scripting language (AQL) [5] and SCOPE [6] http://hortonworks.com/labs/stinger/, 2013. [17] M. Kornacker and J. Erickson. Cloudera Impala: to process read-only nested data. Real-Time Queries in Apache Hadoop, for Real. http://www.cloudera.com/content/cloudera/en/ 9. SUMMARY products-and-services/cdh/impala.html, 2012. With the development of Orca, we aimed at developing [18] A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, and C. Bear. The Vertica Analytic Database: a query optimization platform that not only represents the C-store 7 Years Later. VLDB Endow., 5(12), 2012. state-of-the-art but is also powerful and extensible enough to [19] S. Melnik, A. Gubarev, J. J. Long, G. Romer, support rapid development of new optimization techniques S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: and advanced query features. Interactive Analysis of Web-Scale Datasets. PVLDB, In this paper, we described the engineering effort needed 3(1):330–339, 2010. to build such a system entirely from scratch. Integrating into [20] Pivotal. Greenplum Database. Orca a number of technical safeguards posed a significant http://www.gopivotal.com/products/pivotal-greenplum- database, 2013. investment, yet, has paid already significant dividends in the [21] Pivotal. HAWQ. http://www.gopivotal.com/sites/ form of the rapid pace of its development and the resulting default/files/Hawq_WP_042313_FINAL.pdf, 2013. high quality of the software. Orca’s modularity allows it to [22] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. be adapted easily to different data management systems by Lorie, and T. G. Price. Access Path Selection in a encoding a system’s capabilities and metadata using a clean Relational Database Management System. In SIGMOD, and uniform abstraction. 1979. [23] S. Shankar, R. Nehme, J. Aguilar-Saborit, A. Chung, 10. REFERENCES M. Elhemali, A. Halverson, E. Robinson, M. S. [1] TPC-DS. http://www.tpc.org/tpcds, 2005. Subramanian, D. DeWitt, and C. Galindo-Legaria. Query [2] L. Antova, A. ElHelw, M. Soliman, Z. Gu, M. Petropoulos, Optimization in Microsoft SQL Server PDW. In SIGMOD, and F. Waas. Optimizing Queries over Partitioned Tables 2012. in MPP Systems. In SIGMOD, 2014. [24] E. Shen and L. Antova. Reversing Statistics for Scalable [3] L. Antova, K. Krikellas, and F. M. Waas. Automatic Test Databases Generation. In Proceedings of the Sixth Capture of Minimal, Portable, and Executable Bug Repros International Workshop on Testing Database Systems, using AMPERe. In DBTest, 2012. pages 7:1–7:6, 2013. [4] K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and [25] M. Singh and B. Leonhardi. Introduction to the IBM E. Paulson. Efficient Processing of Data Warehousing Netezza Warehouse Appliance. In CASCON, 2011. Queries in a Split Execution Environment. In SIGMOD, [26] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, 2011. M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. [5] A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, O’Neil, P. E. O’Neil, A. Rasin, N. Tran, and S. B. Zdonik. N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, C-Store: A Column-oriented DBMS. In VLDB, 2005. and V. J. Tsotras. ASTERIX: Towards a Scalable, [27] Teradata. http://www.teradata.com/, 2013. Semistructured Data Platform for Evolving-world Models. [28] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, Dist. Parallel Databases, 29(3), 2011. N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - A Petabyte Scale Data Warehouse using Hadoop. In ICDE, [6] R. Chaiken, B. Jenkins, P.-˚A. Larson, B. Ramsey, 2010. D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. PVLDB, [29] F. Waas and C. Galindo-Legaria. Counting, Enumerating, 1(2), 2008. and Sampling of Execution Plans in a Cost-based Query Optimizer. In SIGMOD, 2000. [7] L. Chan. Presto: Interacting with petabytes of data at Facebook. http://prestodb.io, 2013. [30] F. M. Waas and J. M. Hellerstein. Parallelizing Extensible Query Optimizers. In SIGMOD Conference, pages 871–878, [8] Y. Chen, R. L. Cole, W. J. McKenna, S. Perfilov, A. Sinha, 2009. and E. Szedenits, Jr. Partial Join Order Optimization in the Paraccel Analytic Database. In SIGMOD, 2009. [31] R. Weiss. A Technical Overview of the Oracle Exadata Database Machine and Exadata Storage Server, 2012. [9] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, 348

0 点赞
0 收藏