The MemSQL Query Optimizer


1.The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database Jack Chen, Samir Jindel, Robert Walzer, Rajkumar Sen, Nika Jimsheleishvilli, Michael Andrews Presentation by Andria Trigeorgi and Elena Constantinou Instructor: Demetris Zeinalipour

2.Motivation Enterprises need to run complex analytic queries on real-time for interactive real-time decision making Analytical queries need to be optimized and executed very quickly 2

3. MemSQL Is a distributed memory-optimized SQL database Real-time transactional and analytical workloads Can store data in two formats: in-memory row-oriented disk-backed column-oriented Sub-second query latencies over large volumes of changing data 3

4.MemSQL - Architecture Shared-nothing architecture Two types of nodes: Aggregator nodes = scheduler nodes Leaf nodes = execution nodes Two ways to distribute the user data based on table Distributed tables - rows are sharded across the leaf nodes Reference tables - the table data is replicated across all nodes 4 A L A L user query L partitions |||||||| |||||||| ||||||||

5.MemSQL: Execution of a query Aggregator node: Converts the query into a distributed query execution plan – DQEP Series of DQEPs = operations which are executed on nodes Representation of DQEPs using a SQL-like syntax and interface Query plans are compiled to machine code and are cached, without values for the parameters 5

6.Components of the Optimizer To find the best query execution plan with the least cost requires: Query rewrites Cost model of query execution Complex queries: joins across star and snowflake schemas, sorting, grouping and aggregations, and nested subqueries → powerful and fast query optimization 6

7.Components of the Optimizer Rewriter Applies SQL-to-SQL rewrites on the query, using heuristics or cost (based on the characteristics of the query and the rewrite itself) Applies some rewrites in a top-down manner , while applying others in a bottom-up manner and interleaves rewrites 7

8.Components of the Optimizer Enumerator Central component of the optimizer Determines the distributed join order and data movement decisions Selects the best plan, based on the cost models of the database operations and the network data movement operations Called by the Rewriter to cost rewrites 8

9.3 Components of the Optimizer Planner Converts the logical execution plan to a sequence of distributed query and data movement operations Uses SQL extensions: RemoteTables and ResultTables 9

10.Important Contributions Rewriter calls Enumerator to cost rewritten queries based on distributed cost Enumerator uses pruning techniques (heuristics), to enumerate faster Parts of the join graph run as bushy joins 10

11.Steps to optimize a query Forms an operator tree for the query and sent it to the query optimizer Rewriter applies the beneficial query rewrites to the operator tree Enumerator uses a search space exploration algorithm with pruning to find the best plan for join order Planner generates the DQEP, that consists SQL-like DQEP Steps, and these steps can be sent as queries over the network to be executed on nodes across the cluster 11

12.Rewriter: Cost-Based Rewrites Column Elimination transformation: removes any projection columns that are never used → reduce I/O cost and network resources Group-By Pushdown: reorders a ‘group by’ before a join to evaluate the group by earlier This transformation is not always beneficial, depending on the sizes of the joins and the cardinality of the group by → needing of cost estimates 12

13.Rewriter: Heuristic Rewrites Sub-Query Merging: Merges subselects D isadvantage: In the case of joining very large numbers of tables under a number of simple views, merging all the subselects would result in a single large join of all these tables → discards information about the structure of the join graph & expensive for the Enumerator to effectively optimize Solution: Uses heuristics to detect this type of situation and avoid merging all the views in such cases 13

14.Interleaving of Rewrites Pushing a predicate down may enable Outer Join to Inner Join conversion if that predicate rejects NULLs of the outer table Interleaving of two rewrites: going top-down over each select block (before processing any subselects ) and apply 1) Outer Join to Inner Join and then 2) Predicate Pushdown Rewrites like bushy join are done bottom-up , because they are cost-based 14 Inner Join Outer Join

15.Costing Rewrites Estimation of the cost of a candidate query transformation by calling the Enumerator, to see how the transformation affects the potential execution plans of the query tree The Enumerator determines the best execution plan taking into account data distribution, because a rewrite can affect the efficient distributed plan that the Optimizer can chose 15

16.Costing Rewrites CREATE TABLE T1 (a int , b int , shard key (b)) CREATE TABLE T2 (a int , b int , shard key (a), unique key (a)) Q1: SELECT sum(T1.b) AS s FROM T1, T2 WHERE T1.a = T2.a GROUP BY T1.a, T1.b Q2: SELECT V.s from T2, (SELECT a, sum(b) as s FROM T1 GROUP BY T1.a, T1.b ) V WHERE V.a = T2.a; 𝑅 1 =200,000 be the rowcount of T1 and 𝑅 2 =50,000 be the rowcount of T2 lookup cost of 𝐶 J =1 units (unique key on T2.a ) the group-by is executed using a hash table with an average cost of 𝐶 G =1 units per row 𝐶𝑜𝑠𝑡 Q1 =𝑅 1 𝐶 J +𝑅 1 𝑆 J 𝐶 G =200,000𝐶 J +20,000𝐶 G =220,000 𝐶𝑜𝑠𝑡 Q2 =𝑅 1 𝐶 G +𝑅 1 𝑆 G 𝐶 J =200,000𝐶 G +50,000𝐶 J =250,000 16

17.Costing Rewrites R un the query in a distributed setting T2 is sharded on T2.a , but T1 is not sharded on T1.a → compute this join by reshuffling 𝐶 R =3 units per row 𝐶𝑜𝑠𝑡 Q1 =𝑅 1 𝐶 R +𝑅 1 𝐶 J +𝑅 1 𝑆 J 𝐶 G =200,000(𝐶 R +𝐶 J )+20,000𝐶 G =620,000 𝐶𝑜𝑠𝑡 Q2 =𝑅 1 𝐶 G +𝑅 1 𝑆 G 𝐶 R +𝑅 1 𝑆 G 𝐶 J =200,000𝐶 G +50,000(𝐶 R +𝐶 J )=400,000 17

18.Bushy Joins Finding the optimal join permutation extremely costly and time consuming Many database systems do not consider bushy joins limiting their search join trees Query rewrite mechanism to generate bushy join plans is not new and has already been explored MemSQL use Bushy joins plan Use heuristic – based approach which consider only hopeful bushy joins

19.19 Bushy Joins

20.Bushy Plan Heuristic The rewrite mechanism consider promising bushy joins by forming one or more subselects . The enumerator to determine the cost in order to decide which candidate option is better

21.Generate bushy plans - Algorithm Build a graph where vertexes represent tables and edges represent join predicate Identify candidate satellite tables Select only the satellite tables, which are the tables connected to only other table in the graph Identify seed tables, which are tables that are connected to at least two different tables, at least one of which is a satellite table. For each seed table: Compute the cost C 1 of the current plan Create a derived table containing the seed table joined to its adjacent satellite tables Apply the Predicate Pushdown rewrite followed by the Column Elimination rewrite Compute the cost C2. If C1 < C 2, discard the changes made in steps (b) and (c), and otherwise keep them.




25.MemSQL Rewriter does not do any categorization of tables based on cardinalities and statistics. Looks only at number of connections in the graph to identity the set of seed tables These fundamental differences enable us to cover a lot of more generic cases that might benefit from bushy join plans.

26.Enumerator Optimizes the join plan within each select block Processes the select blocks bottom-up Huge search space → bottom-up System-R with sharding distribution S hard keys: (1) predicate columns of equality joins and (2) grouping columns 26

27.Enumerator Data Movement Operations: Broadcast: Tuples are broadcasted from each leaf node to all other leaf node s Broadcast Cost: R D Reshuffle: Tuples are moved from each leaf node to a target leaf node based on a hash of a chosen set of distribution columns Reshuffle Cost: 1 / N (R D + R H) 27

28.Planner: Remote Tables and Result Tables Remote Tables Communication between each leaf and all the partitions Problem: Each partition querying all other partitions Result Tables (SQL SELECT ) Store intermediate results for each partition and then compute the final select 28

29.Planner Broadcast CREATE RESULT TABLE r1 AS SELECT * FROM x WHERE x.b < 2 (on every partition) CREATE RESULT TABLE r2 AS SELECT * FROM REMOTE(r1) (on every node) SELECT * FROM r2 JOIN y WHERE y.c > 5 AND r2.a = y.a (on every partition) Reshuffle CREATE RESULT TABLE r1 PARTITION BY ( y.a ) AS SELECT * FROM y WHERE y.c > 5 (on every partition) SELECT * FROM x JOIN REMOTE(r1(p)) WHERE x.b < 2 AND x.a = r1.a (on every partition) 29 SELECT * FROM x JOIN y WHERE x.a = y.a AND x.b < 2 AND y.c > 5 (table x is sharded on a but table y is not)