The Case for Predictive Database Systems: Opportunities and Cha

This paper argues that next generation database management systems should incorporate a predictive model management component to effectively support both inward-facing applications, such as self management, and user-facing applications such as data-driven predictive analytics. We draw an analogy between model management and data management functionality and discuss how model management can leverage profiling, physical design and query optimization techniques, as well as the pertinent challenges. We then describe the early design and architecture of Longview, a predictive DBMS prototype that we are building at Brown, along with a case study of how models can be used to predict query execution performance.

1. The Case for Predictive Database Systems: Opportunities and Challenges Mert Akdere, Ugur Cetintemel, Matteo Riondato, Eli Upfal, Stan Zdonik Department of Computer Science Brown University, Providence, RI, USA {makdere, ugur, matteo, eli, sbz} ABSTRACT them in the process of data management and query processing. This paper argues that next generation database management We make the case that such a Predictive Database Management systems should incorporate a predictive model management System (PDBMS) is the natural progression beyond the current component to effectively support both inward-facing applications, afterthought or specialized approaches. We outline the potential such as self management, and user-facing applications such as performance and usability advantages that PDBMSs offer, along data-driven predictive analytics. We draw an analogy between with the research challenges that need to be tackled when model management and data management functionality and realizing them. discuss how model management can leverage profiling, physical A PDBMS enables declarative predictive queriesby providing design and query optimization techniques, as well as the pertinent predictive capability in the context of a declarative language like challenges. We then describe the early design and architecture of SQL; users will not need to concern themselves with the details of Longview, a predictive DBMS prototype that we are building at tasks like model training and selection. Such tasks will be Brown, along with a case study of how models can be used to performed by the optimizer behind the scenes, optionally using predict query execution performance. hints from the user. Much as SQL has made programmers more productive in the context of data processing, this approach will have a similar effect for predictive analytics tasks. While there 1. INTRODUCTION will no doubt be some predictive applications that can benefit from custom, manually optimized prediction logic, we expect that Predictive modeling has been used with varying degrees of many users will be satisfied with “commodity” predictive success for many years [GH05]. As models grow more functionality. The success of the recent Google Prediction API sophisticated, and data collection and storage become increasingly [GP] is early evidence in this direction. This service allows users more extensive and accurate, the quality of predictions improves. to upload their historical data to the service, which automatically As such, model-based, data-driven prediction is fast emerging as and transparently performs model training and selection to an essential ingredient of both user-facing applications, such as produce predicted results. predictive analytics, and system-facing applications such as autonomic computing and self management. Predictive queries have a broad range of uses. First, they can support predictive analytics to answer complex questions At present, predictive applications are not well supported by involving missing or future values, correlations, and trends, which database systems, despite their growing prevalence and can be used to identify opportunities or threats (e.g., forecasting importance. Most prediction functionality is provided outside the stock-price trends, identifying promising sponsor candidates, database system by specialized prediction software, which uses predicting future sales, monitoring intrusions and performance the DBMS primarily as a backend data server. Some commercial anomalies). database systems (e.g., the data mining tools for Oracle [Ora], SQL Server [SS08], and DB2 [DB2]) provide basic extensions Second, predictive functionality can help build introspective that facilitate the execution of predictive models on database services that assist in various data and resource management and tables in a manner similar to stored procedures. As we discuss optimization tasks. Today, many systems either use very simple, below, and also noted by others (e.g., [DB07, AM06]), this loose mostly static predictive techniques or do not use any prediction at coupling misses significant opportunities for improved all. This is primarily due to the difficulty of acquiring the performance and usability. There has also been recent work on appropriate statistics and efficiently and confidently predicting custom integration of specific models (e.g., [JXW08, HR07, over them. For example, pre-fetching algorithms are often based ACU10, AU07, APC08]). on simple linear correlations to decide future data or query requests. Most admission control schemes are based on static This paper argues that next generation database systems should estimations (thresholds) of the maximum number of tasks that the natively support and manage predictive models, tightly integrating system can cope with. Load distribution algorithms typically detect-and-react instead of predict-and prevent problems. This article is published under a Creative Commons License Agreement Query optimizers commonly use simplistic analytical models to (, which permits distribution and reason about query costs. There is major recent interest and reproduction in any medium as well allowing derivative works, provided that success in applying sophisticated statistical and learning models to you attribute the original work to the author(s) and CIDR 2011. such problems [GKD09, BBD09, SBC06]. An integrated, readily available predictive functionality would make it easy to not only 5th Biennial Conference on Innovative Data Systems Research (CIDR ‘11) consolidate and replace existing solutions but also build new ones. January 9-12, 2011, Asilomar, California, USA. As such, an integrated predictive functionality would be an

2.important step towards building the truly autonomic database attributes (often based on their correlation to the prediction systems of the future. attribute using metrics such as information gain or correlation A PDBMS integrates predictive models as first-class entities, coefficients [CT06]) and use this ranking to guide a heuristic managing them in much the same way as data. Thus, we consider search [GH05] to identify the most predictive attributes tested model management as the key underlying component of a over a disjoint test data set. The training data set may be sampled PDBMS. Model management may greatly benefit from analogues to speed up the process. of many well-established data management techniques: Prediction accuracy is a function of the quality of the estimated models. The quality of the model (and the resulting predictions)  Profiling and modeling: Cost and accuracy characteristics can be measured by metrics such as the variation distance [MU05] of models need to be modeled, and fed to the optimizer so or the mean square error between the predictions and the true that the proper model(s) can be chosen for a given predictive values. With assumptions about the underlying stochastic process, task. one may be able to bound these measures analytically, using large  Physical design and specialized data structures: Data can deviation theory, appropriate versions of the central limit theorem be structured to facilitate efficient model building and and martingale convergence bounds [MU05]. Alternatively, one predictive query execution (e.g., I/O-aware skip-lists can use multiple tests on available data to compute the empirical [GZ08]). values for these measures. However, using empirical values to  Pre-computation and materialization: Model building is estimate the model or prediction error adds another layer of error often prohibitively expensive for ad hoc or interactive to the estimate, namely the gap between the empirical statistics queries. In such cases, models can be pre-built and and the true value they estimate. While the empirical statistic is an materialized for use by the optimizer and executor. unbiased estimate, the variance of the estimate can be large, Furthermore, this process can be automated in many cases. depending on the size and variance of the test set.  Query optimization: The optimizer considers the alternative Hypothesis testing and confidence interval estimations are two ways of model building, selection, and execution, as well as common techniques for determining predictive accuracy the inherent cost-accuracy tradeoffs when selecting an [MWH98]. In general, it is not possible to estimate a priori what execution plan. model would be most predictive for a given data set without In the rest of the paper, we discuss these model management training and testing it. One form of hypothesis testing that is techniques as well as the technical challenges that arise when commonly used is K-Fold Cross Validation (K-CV). K-CV building a PDBMS. Our discussion is centered on Longview, a divides up the training data into k non-overlapping partitions. prototype predictive DBMS that we have been building at Brown One of the partitions is used as validation data while the other k-1 University. Longview is being designed to efficiently support partitions are used to train the model. declarative predictive analytics through novel integrated model management techniques. Users can plug new model types into the 3. LONGVIEW: A PREDICTIVE DBMS system along with a modest amount of meta-data, and the system 3.1 Design and Architecture Overview uses these models to efficiently evaluate queries involving predictions. 3.1.1 Data and Query Model We sketch the basic architecture of Longview and its early Longview provides two interfaces for access to its predictive implementation on top of PostgreSQL. We also discuss an internal functionality. The first access method is the direct interface, predictive application, query performance prediction, which which consists of a collection of SQL functions that offers direct exercises some of the model management issues we raise. Finally, access to the functionality of the integrated prediction models. we discuss prior work and finish with concluding remarks. The direct interface does not provide the user with automated model management tools and is thus targeted towards advanced 2. BACKGROUND: PREDICTION WITH users who want to exert hands-on control on the prediction models MODELS and their operations. For example, using this interface a user can We use the term model to refer to any predictive function such as ask the system to build a linear regression model with specific Multiple Regression, Bayesian Nets, and Support Vector configuration parameters or perform prediction with a pre-built Machines. Training a model involves using one or more data sets support vector machine instance. We summarize the details of the to determine the best model instance that explains the data. For direct interface in Section 3.2.1. example, fitting a function to a time series may yield a specific The second access method is the declarative interface, which polynomial instance that can be used to predict future values. offers additional, high-level predictive functionality on top of the In general, model training (or building) involves selecting (i) the low-level direct interface. feature attributes, a subset of all attributes in the data set, and (ii) a This declarative interface extends SQL in a few simple ways to training data set. In some cases, a domain expert can manually accommodate the extra specifications needed for expressing specify the feature attributes. In other cases, this step is trivial as predictive queries. In particular, queries may refer to predictors the prediction attribute(s) directly determine the feature and predictor relations (p-relations) to access predictive attribute(s), e.g., as in the case of auto-regressive models. functionality. Predictors are essentially SQL functions that Alternatively, feature attributes can be learned automatically. provide declarative predictive functionality using system- Most solutions for automatic learning are based on heuristics, managed prediction models. P-relations are essentially views since given a set of n attributes, trying the power set is produced by the application of predictors on select subsets of prohibitively expensive if n is not small or training is costly input features. Both p-relations and predictors can be used in [GH05, MWH98]. A common approach is to rank the candidate conjunction with regular relations within standard SQL queries. A

3.p-relation is virtual by default; however, it can also be The output schema of a predictor is defined by the associated materialized to enable further optimizations. p_schema. In addition, one can add special ERROR attributes to a We give a simple example that illustrates some of the key p_schema to access the estimated errors for each predicted value. concepts of the query language that we are developing. Consider For instance, adding the attribute “ERROR relerr real” to the following schema: p_schema would extend the output schema of a predictor with the relerr attribute, which represents the estimated prediction error. Customer(cid, name, city), Orders(oid, cid, total), Now, we describe how to define p-relations with the following TrainData(cid, status) example: CREATE VIEW StatusPRelation AS In addition to the Customer and Orders relations, which store the SELECT cid, StatusPredictor(name,city,total) records for customers and their orders, we define the TrainData FROM ( relation that stores the status (either “preferred” or “regular”) of a SELECT cid, name, city, sum(total) as total subset of the customers. We first show how to build a predictor FROM Customer C, Orders O for predicting the status of any customer based on the training WHERE C.cid = O.cid data supplied for a subset of the customers. Next, we discuss p- GROUP BY cid, name, city, status relations and their use through an example p-relation representing HAVING sum(total) > 1000) the status predictions of a select subset of customers based on a predictor. With the above statement, we create a p-relation named The first step in creating a predictor is to define a schema StatusPRelation, which is basically a view consisting of status describing the set of involved features and target attributes. For predictions from StatusPredictor for the set of features specified this purpose, we define a schema, named StatusSchema, with the with the provided query (i.e., customers with order totals greater target attribute customer status and features name, city and total than 1000) and the features themselves. using the CREATE P_SCHEMA statement: When a view definition that accesses a predictor function is CREATE P_SCHEMA StatusSchema ( submitted to the system, Longview registers the given data set as a name text, specific target feature set for that predictor. In turn, the model city text, generation process for the predictor works to generate more total int, efficient and accurate prediction models based on the properties of TARGET status text) the given feature set. To create a predictor, we use the CREATE PREDICTOR The use of declarative queries for the specification of data sets in statement that can be used to automatically build prediction model building and prediction offers an easy and flexible method model(s) using the given training data set: of expressing predictive operations over complex data sets. Users can easily specify complex queries (e.g., computing aggregates CREATE PREDICTOR StatusPredictor over groups) to supply input data sets for prediction models. ON StatusSchema(name, city, total, status) Moreover, it is also possible to use database views as data WITH DATA providers. For instance, a database view can be used to perform SELECT name, city, sum(total) as total, status standard pre-processing tasks such as cleaning, normalization, and FROM Customer C, Orders O, TrainData T discretization [DB07], and can cook the raw data into a form that WHERE T.cid = C.cid and T.cid = O.cid is more amenable for effective learning. GROUPBY cid, name, city, status WITH ERROR CVERROR(10, “relative_error”, 0.1) 3.1.2 Basic Architecture We illustrate the high-level architecture of Longview in Figure 1, With the statement shown above, we instruct the system to create which shows the primary functional units of interest, along with a predictor named StatusPredictor by training a set of prediction the data they require. The architecture reflects the notion of models using the training data specified through a query. The last models as first-class citizens by depicting the data manager and part, WITH ERROR, defines the error estimation process. In this model manager as co-equal modules. example, we want to use 10-fold cross-validation and the relative_error accuracy metric with a target average error of 0.1. The Data Manager is very similar to a typical data manager in a Notice that the decoupling between the schema and predictor conventional DBMS. The Model Manager is responsible for definitions allows us to create multiple predictors with different creating materialized models a priori (materialized) or in an on- data sets or accuracy requirements over a single schema. demand fashion when an adequate materialized model does not exist. The role of materialized models in the model world is The example query below illustrates the use of the StatusPredictor similar to that of indices and materialized views in the data world: for estimating the status of all customers: they are derived products that can be used to quickly generate data SELECT C.cid, StatusPredictor(,, of interest. Indices and materialized views improve query speed FROM Customer C, while materialized models improve prediction speed. (select cid, sum(total) as total from Orders The Model Manager trains appropriate models (based on the Group By Cid) as O available model templates) for each predictor in the database. The WHERE C.cid = O.cid Model Manager can run as a background process, constantly instantiating models for improved accuracy and efficiency. In order to build and maintain prediction models, the Model

4.Manager can utilize many different strategies. For example, it can 3.2 Model Management choose to sample data at different amounts and times and it can build different types of prediction models over different subsets of As a design philosophy towards a generic model manager, we available features. In addition, the Model Manager also strive to build on existing database extension mechanisms such as determines the best model for the query at hand: if an appropriate views, triggers, rules and user defined functions to simplify our model has already been pre-computed and materialized, it will implementation and produce highly portable functionality. identify and use that model; if not, it will create a new instantiation on the fly. 3.2.1 Database Integration of Prediction Models Prediction Model API. Longview currently supports a black- box-style integration approach that allows existing model implementations (available from a plethora of standalone applications and libraries such as libsvm [CC01]) to be used by the system as database functions. This approach offers an easy and effective way of utilizing pre-tested and optimized prediction logic within a SQL execution framework. New prediction models are registered into the system by providing implementations of a simple model interface (the prediction model API) describing function templates for training and application of prediction methods. This interface decouples implementation and predictive functionality, while allowing multiple predictive models to be used for the same task. Table 1 summarizes the basic interface methods. Function Arguments Description Figure 1: High-level Longview architecture. The system training data feature and target values provides full-fledged support for models; model and data Build model parameters model-specific training parameters management are tightly integrated. Predict model pointer pointer to previously built model feature list feature values for use in prediction The Model Manager and the Data Manager must cooperate in their decision making. As we discuss later, special data structures Serialize model pointer can assist the process of model training. This is consistent with the fact that DBMSs, in general, get much of their performance gains Deserialize byte array serialized model from supporting specialized data structures like indices. Table 1 - Prediction Model API Model meta-data entered through the model interface as well as The build function is used to train a prediction model based on the those derived during run-time such as the list of materialized given features and target values, as well as model-specific training models and their parameters, training results, error values for parameters. The predict function uses a previously built model to various data sets, are all stored in a model catalog. The model predict a target attribute based on the input feature values. Finally, manager is responsible for updating the catalog. Longview uses the serialize and de-serialize functions to store and Longview will try to produce a good model whenever possible by retrieve prediction models. Most third-party model libraries trying various parameter assignments (e.g., history length, include built-in model (de)serialization methods for this purpose. sampling density, etc.) and using hypothesis testing to find the Prediction Model Direct Interface. The prediction model API is best fit. While Longview aggressively tries to optimize this model used internally by the Longview system to access the functionality search process, in some cases, this is either not possible or would of prediction models and is not visible to the user. However, as require testing too many alternatives. In these cases, Longview mentioned earlier, Longview also provides an interface for direct will provide a set of tools with which the DBA can inspect the access to the prediction models by the user. The main functions data and add additional information about the datasets which included in this interface are given in Table 2. These functions might indicate, for example, that the data is seasonal, or that the have dynamic implementations in Longview, as wrappers around data might best be modeled using exponential smoothing. the prediction model API, and provide a unified method of access Traditional DBMSs provide such tuning tools for DBAs as well. to all the available prediction model types within SQL statements. P-relation queries are written against views that include predicted The create function is used to create a prediction model entry in attributes. When a p-relation query is received by the system, the the model catalogs for the given model type and attribute schema. optimizer might generate a query plan that contains prediction The Longview model catalog stores all model data and associated operators. These operators are selected from a collection of meta-data. Each model instance built is recorded in a relation that instantiated models that are managed by the Model Manager, or contains a unique (auto-generated) instance id, model type, and a created on the fly. Alternatively, tuples in the predicted view can serialization field storing the type-specific representation of a be computed eagerly and materialized as resources become prediction model (e.g., the coefficients of a regression model). We available, in which case p-relation queries can be executed as also store model attributes; each is represented with a name, id, a scans over the materialized tuples. type (e.g., double) and a role (i.e., feature, target). The build and predict SQL-functions are similar to the corresponding functions in the prediction model API. The build

5.function trains the prediction model specified by the model id, and In addition to the techniques mentioned earlier such as sampling, stores its serialized representation in the model catalog. The feature selection and materialized models, there are further predict function performs prediction with the given model id over opportunities to reduce the execution costs of these tasks. First, the provided feature data set. We also provide a test function that these operations can be done in parallel for multiple models on the can be used to apply the model on a feature data set and compute same data. In this multi-model building process (akin to multi- its accuracy over the true values of the target attributes. We query optimization), data can be read once and all relevant models provide an argument to specify the accuracy function for use (e.g., can be updated at the same time. Moreover, we can build and absolute error, squared error). The outputs of the test and update models in an opportunistic manner based on memory- prediction functions are represented as relations and can be used resident data. as data sources in other queries. Auto Design. The auto-design problem is a related problem in Function Arguments Description which the goal is to choose and build a set of prediction models based on a given workload that contains a set of predictive queries model schema description of features and target that are most likely to be submitted, i.e., queries that we would Create attributes like to execute quickly and with good predictive accuracy. For model type prediction model type this purpose, the database system would need to identify the most model id specifies the model instance common prediction attributes in the workload and then the set of training query query computing the feature and features that are highly predictive of those attributes. Build target values model model-specific training parameters Specialized Data Structures. There are opportunities for a parameters PDBMS to leverage data representations that are tuned to the model id process of prediction. In particular, structures that can enhance Predict feature list | feature values for use in prediction model training have the most potential to yield major performance query improvements with the idea being accessing “just enough” data to model id build a model of acceptable accuracy. training query Test accuracy parameters for the accuracy Data-driven training commonly involves accessing select regions options function in the underlying feature space, combined with sampling techniques that can be used to further reduce I/O requirements. Table 2 - Prediction Model Direct User Interface This process is often iterative: more data is systematically included to check if the resulting model is better. In general, 3.2.2 Model Building and Maintenance multi-dimensional index structures defined over the feature space can be effectively used here, but care must be taken that index- Model Materialization. Longview builds and materializes model based sampling does not introduce any biases. Multi-dimensional instances much as a conventional DBMS pre-computes indices or clustering, when performed in a manner that facilitates efficient materialized views. For each predictor and associated p-relations, sampling, can provide further benefits. As an alternative to the there can be multiple materialized prediction models built using index-based sampling of disk-resident data, we can also opt to different model types and different feature subsets. As a result, replicate the data (or materialize the results of a training query) model building and maintenance may easily become a bottleneck using disk organizations tuned for efficient sampling, e.g., as the number of pre-built models increases. Therefore, methods horizontally partition the data into uniform samples so that for decreasing the cost of building and maintaining models are an sampling can be done with sequential I/O. essential part of Longview. As a concrete example for time-series prediction, we introduced a The quality of a model is primarily a function of its training data variant of skip lists to efficiently access arbitrary ranges of the and model-specific configuration parameters. In the limit, we underlying time dimension with different sampling granularities. would like to produce one materialized model for each prediction query. This approach will likely be infeasible for two reasons: (1) M5 the time required to build a model per query is larger than some ( ) ( M4 30 ) NIL target threshold, e.g., in applications involving interactive queries; 15 ( ) 7 M3 and (2) the estimated time required to update these models in the 24 34 2 9 18 ( 27 33 ) ( M1 ) face of newly arriving data is greater than some maintenance M2 threshold. Figure 2: I/O conscious skip-lists. Each node indicates a block In many ways, this problem is very similar to the problem of of tuples sampled from the original relation. Unlike in automatic index or materialized view selection. We require (1) a standard skip lists, nodes (blocks) are not shared across levels. reasonably good cost and accuracy model that can be used to M’s indicate different time ranges and sampling density. compare the utility of the materialized models, and (2) a way to heuristically prune the large space of possible models. The original skip-list formulation is modified to make it I/O conscious by copying the relevant data from each lower level up A good solution to this problem involves covering the underlying to the higher-levels. Each level is essentially a materialized “feature space” well such that a prediction with acceptable sample view [JJ08] stored in clustered form on disk, allowing us accuracy can be made for a large set of queries subject to a limit to access a particular time range with desired density with a small on model maintenance costs. In prior work, we proposed a number of disk accesses (see Figure 2 for an illustration). solution along these lines for time-series-based forecasting using multi-variate regression [GZ08]. Query Execution and Optimization such, while they do a good job of comparing the costs of alternative query plans, they are poor predictors of plan execution Predictor optimization. Declarative predictive queries specify [GKD09]. what to predict but not how. For a given prediction task, it is the responsibility of the predictor to build and use an appropriate As an alternative, we express the QPP task using the declarative prediction model satisfying the desired accuracy. For this purpose, prediction interface in Longview. In addition to describing the each Longview predictor continuously tries to build accurate query specification and execution, we also show different prediction models for as much of its input feature space as modeling approaches to achieve accurate QPP under various possible, while keeping resource consumption under a workload scenarios. If a representative workload is available, for configurable threshold to avoid negatively impacting the other example, we can build good models using coarse-grained, plan- database tasks. In the case of p-relations, predictors can build level models. Such models, however, do not generalize well, and more targeted prediction models using select parts of the training perform poorly for unseen or changing workloads. In these cases, data (i.e., model segmentation) based on the target data of p- fine-grained, operator-level modeling performs much better due to relations. We discuss an application of the model segmentation its ability to capture the behavior of arbitrary plans, although they idea and demonstrate its potential in Section 4. do not perform as well as plan-level models for fixed workloads. We then build hybrid models that combine plan- and operator- In addition, Longview automatically keeps track of model cost- level models to provide the best of both worlds by striking a good accuracy characteristics. For each model instance, the run-time balance between generality and accuracy. cost and quality of the predictions during build and test operations are recorded. Using this information, Longview can monitor the Plan-level Prediction. We first consider a basic approach that evolution of models, track the used training data sets and the extracts features from query plans and then couples them with performance values on test data sets. These model profiles guide sample plan executions to build models using supervised learning query optimization decisions. We may also expect expert users (or (as also explored in [GKD09]). Once built, these models can model developers) to supply simple cost functions, akin to those perform predictions using only static plan information. The for the relational operators, for training and prediction costs, following features are extracted from each query plan for which can also be stored and leveraged as part of model profiles. modeling purposes: optimizer estimates for query plan costs, number of output tuples and their average sizes (in bytes), and Finally, we observed the need for a more formal, expressive tool instance (i.e., occurrence) and cardinality counts for each operator when working with sophisticated prediction models. To this end, type included in the query plan. we believe that a model algebra that captures common model operations such as choice (selection), composition, and merge is We integrated two prediction models, Support Vector Machines warranted. Properties of these operations could introduce further and Linear Regression, into the PostgreSQL database system functionality as well as optimization opportunities. A model (version 8.4.1) through the use of machine learning libraries ensemble, which uses a set of prediction models collectively to LIBSVM [CC01] and Shark [IMT08]. We used the TPC-H perform a prediction task, is an example for this complex model decision support benchmark to generate our database and query case. Model ensembles rely on the collective power of multiple workload. The database size is set to 10GB and experiments were prediction models to smooth their predictions and mitigate the run on a 2.4 GHz machine with 4GB memory. Our query potential errors from a single prediction model. workload consists of 500 TPC-H queries, which are generated from 18 TPC-H query templates and executed one after another Online execution. Online execution of predictive queries (along with clean start (i.e., file system and database buffers are cleared). the lines of online aggregation), in which predictions, and thus the query results, get progressively better over time, is an important Fixed Workload Experiment: In the first experiment, we defined a usage model for interactive, exploratory tasks. Predictive accuracy plan-level predictor using the described query plan features and can be improved over time using more data, more features, or the execution time target attribute as our p_schema (named more models. The challenge is to effectively orchestrate this PlanSchema). For this purpose, we first inserted the runtime query process and perform efficient revision of query results. plan features and the execution times of all queries in our TPC-H workload to database tables (runtimefeats and qexec). Then, we 4. CASE STUDY: PREDICTING QUERY defined our predictor to use 90% of the workload for building EXECUTION LATENCIES prediction models to estimate the execution times of the remaining 10% of the queries. We provide the definition of the We now describe our ongoing work on an inward-looking plan-level predictor below; however at this point we do not have a predictive task, query performance prediction (QPP), which SQL parser for the extensions proposed in the declarative involves the estimation of the execution latency of query plans on predictor interface and thus performed our operations using the a given hardware platform. Modern database systems can greatly direct interface along with a few additional SQL functions that benefit from accurate QPP. For example, resource managers can provide functionality similar to the declarative predictor interface. utilize QPP to allocate workload such that interactive behavior is CREATE PREDICTOR PlanLvlPredictor achieved or specific quality of service targets are met. Optimizers ON PlanSchema(…) can choose among alternative plans based on expected execution WITH DATA latency instead of total work incurred. SELECT R.*, Q.exec_time While accurate QPP is important, it is also challenging: database FROM runtimefeats R, qexec Q systems are becoming increasingly complex, with several WHERE R.qid = Q.qid and R.qid <= 450 database and operating system components interacting in sophisticated and often unexpected ways. Analytical cost models The qid attribute is a key in both tables that uniquely defines a are not designed to capture these interactions and complexity. As query in the TPC-H workload. Next, we used the pre-runtime

7.estimations of the query plan features from the query optimizer Linear Regression models to estimate the execution time for each (stored in table estimatedfeats) for performance prediction of the operator type and compose them in a bottom-up manner up the remaining 10% of the queries. The following query is used to query tree to predict the total execution time. express this operation: Each operator is modeled using a generic set of features such as the number of input/output tuples and estimated execution times SELECT qid, PlanLvlPredictor(…).exec_time for child operators (runtime and estimated values for these FROM estimatedfeats E features are stored in opruntimefeats and opestimatedfeats tables). WHERE E.qid > 450 Bottom-up prediction requires a nested use of predictors. In this experiment, we used support vector machines (SVMs) as Moreover, the connections between predictors are dynamic as our prediction model. In addition, our current predictor optimizer they depend on the plan of the query at hand. Currently, we uses a standard feature selection algorithm for choosing the set of perform this nested prediction operation within a user-defined features to use in prediction models. The set of features (7 of the database function that uses the operator predictors as required by total 29 features) used in the resulting model are: number of the plan of each query. We think that such complex models can be Group Aggregate, Hash Aggregate, and Materialization operators, built and used more effectively with a model algebra as mentioned estimated total plan cost, cardinality of Hash Aggregate and Hash in Section 3. Join operators and the estimated total number of output rows from The results for the Changing Workload experiment using the all operators in the query plan. The error value (defined as |true operator-level prediction methods are shown in Figure 3 for 10 value – estimate| / true value) for each TPC-H template is shown TPC-H templates. The average error rate is 56%, which represents in Figure 3 (The average prediction accuracy is 90%). a major improvement over query-level prediction for this We observed that queries from the 9th template (which has the workload. unusual high errors) run close to the 1 hour time limit (after which Hybrid Prediction. Looking closer, we observe that the error (of we killed and discarded queries) and therefore execute longer than 233%) for the operator-level prediction of template 4 queries is most other queries. We then performed manual model- much higher than those for other templates. To gain more insight, segmentation by building a separate prediction model for the we provide the error values for each operator in the execution plan queries of the 9th template, which achieved 93% accuracy. for an example template-4 query (Figure 4). Observe that the This example illustrates the potential efficiency of using errors originate from the segmented models built from different data partitions. As highlighted sub-plan and discussed before, intelligent model-building algorithms that propagate to the upper automatically identify such partitions in the feature space are levels. Here, the error is essential for improved accuracy. due to the inability of the Finally, when we added the additional feature used by that model models to capture the per- (cardinality of the Nested Loop operator) to the general prediction tuple processing time of model and retrained it, we increased its accuracy to 93% (shown the Hash Aggregate with the bars for 2-step feature selection in the figure). operator, which in this case is computation-bound. Thus the I/O cost of the Sequence Scan operation that normally determines the overall execution latency is dominated by Hash Aggregate’s high computational cost in this case. The fundamental problem is that operator- Figure 4: Query Tree and level training inherently Prediction Errors for Template 4. fails to capture the “context” of the operator behavior. To solve this problem, we combined the plan- and operator-level Figure 3: Query Prediction Performance for TPC-H Queries. prediction methods for template 4 by modeling the highlighted sub-plan with a plan-level model and using operator-level prediction for the remainder of the query plan. With this approach, Changing Workload Experiment: In this experiment, we built we reduced the template error to 53% and the overall average separate prediction models for each TPC-H template using only error across templates to 38% for the Changing Workload the queries from the other TPC-H templates for training. In this scenario. case the average prediction error increased to 232%. In addition, the error values were highly dependent on the target query 5. RELATED WORK template and were distributed in a large range (2% to 1692%). We draw from a large number of subject areas, which we Operator-level Prediction. We also studied an operator-level summarize below. Other closely related work was cited inline as modeling approach with the goal of building better models for the appropriate. Changing Workload scenario. In this case, we build separate

8.Major commercial DBMSs support predictive modeling tools [ACU10] M. Akdere, U. Cetintemel, E. Upfal: Database-support (e.g., Oracle Data Mining tools, SQL Server Data Mining and for Continuous Prediction Queries over Streaming Data. PVLDB DB2 Intelligent Miner). Such tools commonly allow users to 3(1), 2010. invoke model instances, typically implemented as stored [AM06] A. Deshpande and S. Madden, MauveDB: Supporting procedures, using extended SQL (e.g., the “FORECAST” clause Model-based User Views in Database Systems. SIGMOD 2006. [Ora]). In SQL Server Analysis Services, users are provided with a graphical interface within Visual Studio in which they can [AU07] Declarative temporal data models for sensor-driven query interactively build and use a number of prediction models such as processing. Y. Ahmad and U. Cetintemel. DMSN 2007. decision trees and naïve Bayes networks. While we utilize similar [APC08] Simultaneous Equation Systems for Query Processing prediction models and techniques, our goal is to create a more on Continuous-Time Data Streams. Y. Ahmad, O. automated and integrated system in which predictive functionality Papaemmanouil, U. Cetintemel, J. Rogers. ICDE 2008. is mostly managed by the system with help from the user (akin to [BBD09] S. Babu, N. Borisov, S. Duan, H. Herodotou, and V. existing data management functionality). Thummala. Automated Experiment-Driven Management of On the academic side, MauveDB [AM06] was an early system to (Database) Systems. HotOS 2009. support model-based views defined using statistical models. Such [CC01] Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library views can be used for a variety of purposes including cleaning, for support vector machines, 2001. Software available at interpolation and prediction (with a focus on sensor network applications). The PDBMS functionality we sketch in this paper goes significantly beyond the scope of MauveDB. Our direct [CT06] Thomas M. Cover and Joy A. Thomas. Elements of prediction interface and MauveDB views have similar Information Theory. Wiley-Interscience, 2006. functionality and purpose. However, we believe that automated [DB2] DB2 Intelligent Miner Web Site. model building and maintenance services, such as our declarative predictor interface, are essential for commoditization of predictive functionality. [DB07] S. Duan and S. Babu, Processing Forecasting Queries. VLDB'07. Another closely related system is Fa [DB07], which was designed to support forecasting queries over time-series data. Fa offers [GH05] J. G. De Gooijer and R. J. Hyndman. 25 Years of IIF efficient strategies for model building and selection, making a Time Series Forecasting: A Selective Review. June 2005. solid contribution towards model management and predictive Tinbergen Institute Discussion Papers No. TI 05-068/4. query processing. Longview can leverage many of Fa’s [GKD09] A. Ganapathi, H. Kuno, U. Dayal, J. Wiener, A. Fox, techniques but also aims for deeper, more comprehensive model M. Jordan, D. Patterson: Predicting Multiple Metrics for Queries: management, by treating models as native entities and addressing Better Decisions Enabled by Machine Learning. ICDE 2009. the entire predictive model life cycle. [GP] Google Prediction API, Recently, there have been successful applications of machine [GZ08] Tingjian Ge, Stan Zdonik. A Skip-list Approach for learning techniques to DBMS self-management problems. Query- Efficiently Processing Forecasting Queries. VLDB 2008. plan-level predictions have been studied in [GKD09]. NIMO proposed techniques for accelerating the learning of cost models [HR07] H. Bravo, R. Ramakrishnan. Optimizing mpf queries: for scientific workflows [SBC06]. Performance prediction for decision support and probabilistic inference. SIGMOD 2007. concurrent query workloads was investigated in [AAB08]. [IMT08] Christian Igel, Verena H., and Tobias G. Shark. Journal of Machine Learning Research, 2008. 6. CONCLUDING REMARKS [JJ08] Shantanu Joshi and Chris Jermaine. Materialized Sample We argue that it is high time for the database community to start Views for Database Approximation. IEEE Trans. Knowl. Data building predictive database systems. We discussed how Eng. 20(3): 337-351 (2008) predictive queries could meaningfully leverage and, at the same [JXW08] R. Jampani, F. Xu, M. Wu, L. Perez, C. Jermaine, P. time, contribute to next generation data management. We Haas: MCDB: a monte carlo approach to managing uncertain presented our vision for a predictive DBMS called Longview, data. SIGMOD 2008. outlined the main architectural and algorithmic challenges in building it, and reported experimental results from an early case [MU05] Mitzenmacher, M., Upfal, E. Probability and study of applying the predictive functionality for query Computing: Randomized Algorithms and Probabilistic Analysis. performance prediction. [MWH98] Makridakis, S., Wheelwright S., and Hyndman, R. Forecasting Methods and Applications. Third Edition. John ACKNOWLEDGEMENTS Wiley & Sons, Inc. 1998. This work is supported in part by the NSF under grant IIS- [Ora] Oracle Data Mining Web Site. 0905553. 7. REFERENCES [SBC06] P. Shivam, S. Babu, and J. Chase. Active and Accelerated Learning of Cost Models for Optimizing Scientific [AAB08] Ahmad, M., Aboulnaga, A., Babu, S., and Munagala, K. Applications. VLDB 2006. Modeling and exploiting query interactions in database systems. [SS08] Microsoft SQL Server 2008. CIKM 2008.