A Case for A Collaborative Query Management System

Over the past 40 years, database management systems (DBMSs) have evolved to provide a sophisticated variety of data management capabilities. At the same time, tools for managing queries over the data have remained relatively primitive. In this paper, we argue that, in these new settings, data management systems must provide powerful query management capabilities, from query browsing to automatic query recommendations. We first discuss the requirements for a collaborative query management system. We outline an early system architecture and discuss the many research challenges associated with building such an engine

1. A Case   for A Collaborative Query Management System Nodira Khoussainova, Magdalena Balazinska, Wolfgang Gatterbauer, YongChul Kwon, and Dan Suciu Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA {nodira, magda, gatter, yongchul, suciu}@cs.washington.edu ABSTRACT Large Synoptic Survey Telescope (LSST) [22]. To give an idea for Over the past 40 years, database management systems (DBMSs) the scale of these projects, the LSST is estimated to generate fif- have evolved to provide a sophisticated variety of data manage- teen terabytes of raw data per night [22] for a total of five petabytes ment capabilities. At the same time, tools for managing queries per year. Similarly, analysts and engineers in companies such as over the data have remained relatively primitive. One reason for Google, Microsoft, Amazon, eBay, and Yahoo! process massive this is that queries are typically issued through applications. They data logs collected from the large-scale services that these com- are thus debugged once and re-used repeatedly. This mode of inter- panies provide (e.g., clickstreams, search logs, network flow data, action, however, is changing. As scientists (and others) store and etc.). They analyze these logs to make informed business decisions share increasingly large volumes of data in data centers, they need and to improve their services. In both of these new environments, the ability to analyze the data by issuing exploratory queries. In this and in contrast to traditional database settings, users need to exe- paper, we argue that, in these new settings, data management sys- cute continuously changing queries; as the desired analysis on the tems must provide powerful query management capabilities, from data changes over time, so do the queries. Furthermore, the vol- query browsing to automatic query recommendations. We first dis- ume of these data sets pushes toward a usage model where the data cuss the requirements for a collaborative query management sys- sits in a shared data center and all users execute queries against tem. We outline an early system architecture and discuss the many this shared data store. Due to the increasing size of data sets, the research challenges associated with building such an engine. cost of running a query has escalated, making the traditional trial- and-error method for query development too expensive. Executing queries over the whole data set is still necessary, however, because 1. INTRODUCTION executing queries only on a small data sample can be insufficient Modern database management systems (DBMSs) provide so- for identifying interesting analyses. phisticated features to assist users in organizing, storing, manag- In this paper, we argue that these new environments could greatly ing, and retrieving data in a database. They provide, however, only benefit from sophisticated query management capabilities. When limited capabilities for managing the queries that users issue on the users continuously develop new queries, they need support for for- data. These capabilities are limited to query-by-example [28, 37], mulating these queries. More importantly, they should leverage graphical tools for composing queries [5, 6], and query logging knowledge about queries that they or others have previously exe- aimed at physical tuning [26, 32, 33]. Traditionally, more elab- cuted on the data. Such information can help both in query formu- orate query management was not necessary because applications lation and in deciding what queries to ask. For example, consider would only issue canned queries over the data (e.g., accounting or a scientist who decides to correlate two datasets. If the system inventory management applications). These queries were devel- automatically recommends previous queries correlating the same oped once and used repeatedly. Emerging applications in the area datasets but authored by another member of the research lab, the of large-scale scientific data management and industrial data anal- user could either quickly determine that his analysis has already ysis, however, are challenging this traditional DBMS usage pattern been done or could leverage the existing query to author his own, and, as we argue, could greatly benefit from more advanced query especially if the old query is documented with annotations. In gen- management tools. eral, when many users submit a variety of queries over a shared Scientists in areas such as biology, physics, astronomy and database, logging, organizing, and mining these queries can pro- the geosciences collect, store, retrieve, explore and analyze vast duce a wealth of information. The role of a query management sys- amounts of data. Examples of large-scale scientific databases in- tem is to effectively digest and present such information to users. clude the Sloan Digital Sky Survey (SDSS) [31], the Incorporated We propose to build a Collaborative Query Management Sys- Research Institutions for Seismology (IRIS) [19], and soon the tem (CQMS) targeted at these new, large-scale, shared-data envi- ronments. A CQMS should enable users to perform simple tasks such as browse the log of all queries they submitted and document their queries by annotating them. By doing so, users will be able to quickly find, edit and re-execute past queries. Even better, the sys- This article is published under a Creative Commons License Agreement tem should support sophisticated search capabilities allowing users (http://creativecommons.org/licenses/by/3.0/). to identify queries that operate on specific input data, have desired You may copy, distribute, display, and perform the work, make derivative properties (e.g. small result set, fast execution time), or produce works and make commercial use of the work, but you must attribute the work to the author and CIDR 2009. specific results. The system should also mine its query log and ac- 4th Biennial Conference on Innovative Data Systems Research (CIDR) tively recommend queries to users, thus, helping them further lever- January 4-7, 2009, Asilomar, California, USA. CIDR Perspectives 2009

2.age previously performed analyses. When the data or the schema Query feature relations changes, the system should also monitor and evaluate the validity   in its extensive collection, flagging, automatically of the queries Queries(qid,qText) repairing, or even removing those queries that have become obso- DataSources(qid,relName) Attributes(qid,attrName,relName) lete. Of course, clear access control rules must be set to restrict Predicates(qid,attrName,relName,op,const) knowledge transfer to only group members collaborating with each other. Meta-query Building such a CQMS raises several important technical chal- lenges. First, managing a collection of queries is more akin to man- SELECT Q.qid, Q.qText aging an evolving set of source code snippets rather than managing FROM Queries Q, Attributes A1, Attributes A2 WHERE Q.qid = A1.qid AND Q.qid = A2.qid ordinary data. Queries have elaborate semantics and complex rela- AND A1.attrName = ’salinity’ tionships with each other. The appropriate data model, query lan- AND A1.relName = ’WaterSalinity’ guage, and features for a CQMS must thus be carefully designed to AND A2.attrName = ’temp’ AND A2.relName = ’WaterTemp’ leverage the sophistication of query objects without overwhelming the user. Second, the CQMS must be efficient because it must pro- vide hints and recommendations interactively, as a user types a new Figure 1: Example meta-query for “find all queries that correlate wa- query. Finally, the CQMS should include a client with an intuitive, ter salinity with water temperature data”. easy-to-use interface and effective visualization methods. In this paper, we first explore the various query management functions that a CQMS should provide (Section 2). We sketch an initial, system architecture for a CQMS (Section 3) and discuss the Search. A meta-query is a query that searches for queries. Such many research challenges involved in building such a system (Sec- queries enable users to locate past queries matching specific search tion 4). We describe related work in Section 5 and conclude in conditions. The resulting queries can then be learned from, re- Section 6. Overall, as the database community builds new data executed, or used as a starting point to compose a new query. A management systems for large-scale data repositories [1, 7, 12, CQMS should offer a range of meta-querying techniques. At min- 27], we argue that these systems should include advanced query imum, it should provide substring matching and keyword search, management capabilities in the form of a CQMS engine. like existing systems [29, 33]. Beyond this, we propose three other paradigms for meta-querying that a CQMS should offer. 2. REQUIREMENTS Unlike keyword and substring search, query-by-parse-tree al- In this section, we explore the different features that a Collabo- lows users to formulate conditions on the structure of the query. rative Query Management System should provide to its users. We With this technique, the user can precisely specify conditions on categorize these features according to four modes of user interac- the joined relations, selections, projections, nested subqueries, etc. tion. In the default Traditional Interaction Mode, users simply ask Although powerful in searching for queries based on the query text queries over the data with minimal interference from the CQMS. In only, this technique does not consider the query output, the query the Search and Browse Interaction Mode, users search for queries execution time, or other important information about the query as or browse through a set of queries. In the more advanced Assisted a whole. Interaction Mode, the CQMS interactively assists the user in com- The second technique, query-by-data, enables users to set condi- posing a query. Finally, in the Administrative Interaction Mode, tions on the query output. The user specifies that the query output users and administrators can perform various maintenance tasks on should include or exclude specific tuples. For example, the user stored queries. We now present each mode of interaction, along may remember that there are some properties that distinguish Lake with the required features in more detail. We discuss resulting re- Washington from Lake Union. So they would request “all queries search challenges later in Section 4. whose output includes Lake Washington but not Lake Union”, and see that all matching queries specify ‘temp < 18’. This problem is 2.1 Traditional Interaction related to machine learning; the user specifies positive and negative In the Traditional Interaction Mode, the user submits standard training examples (tuples), and the system finds classifiers (past queries over the data without support from the CQMS. At the same queries) that separate these examples. Supporting query-by-data time, in the background, the CQMS logs and pre-processes these efficiently is a challenging problem. queries for further use. It is essential that the CQMS does not im- Query-by-feature involves extracting and storing specific query pose significant runtime overhead. features in new relations. Features can be syntactic (e.g. relations In addition, the CQMS should support and occasionally even re- in the FROM clause, predicates in the WHERE clause, attributes quest query annotations in this mode, especially for queries that in the SELECT clause), metadata (e.g. author, execution time) or are difficult to re-use without proper documentation (e.g. queries semantic (e.g. cardinality, samples from output). Users are able with more than a specified number of tables, or queries that include to issue SQL meta-queries that search for queries whose features nesting). Users can add annotations to whole queries or query frag- match specific conditions. For example, “find me all queries that ments. For example, a user could explain the choice for using an correlate water salinity with water temperature data” can be ex- outer join in a query, or describe their information goal for the par- pressed as shown in Figure 1. Since such statements may be awk- ticular query. In general, through annotations, users can capture ward to write, the CQMS could automatically generate these state- semantic information about their queries. ments from partially written queries (at least for syntactic features). For example, if a user types: SELECT FROM WaterSalinity, 2.2 Search and Browse Interaction WaterTemperature, the system can automatically issue the query One essential query management feature is the ability for users shown in Figure 1. Due to its efficient storage in relations and flex- to search for and browse through past queries. We refer to this ibility in which features to capture, query-by-features may offer a mode of interaction as the Search and Browse Interaction Mode. good trade-off between expressibility and efficiency. CIDR Perspectives 2009

3. 2:30 2:31 2:32 2:33 2:34 2:35 time Query1 SeattleLakesQuery WaterSalinity ‘temp < 22’ SELECT *   Suggest: ‘tem p FROM WaterSalinity S, WaterTemp T, CityLocations L < 18 WHERE T.temp < 18 AND Completions ’ S.loc_x = T.loc_x AND Corrections ‘ te ‘S.loc_x = …’ ‘S.loc_y = …’ S.loc_y = T.loc_y AND mp Comparisons <1 L.city IN ( SELECT City from Cities WHERE State = ‘WA’ 0’ SELECT City from Cities WHERE State = ‘MI’ SELECT City from Cities WHERE Pop > 10000 Annotate... SELECT * Submit FROM WaterSalinity S, WaterTemp T, CityLocations L WHERE T.temp < 18 Similar Queries Score | Query | Diff | Annotations _ [100% ] |select * from WaterSalinity, … | none |find temp and salinity of [98% ] |select temp from WaterTemp… | -1 col |find temps of seattle lak [75% ] |select temp from watertemp… | -1 col, -1 pred|find temps of michigan l Figure 2: Query session window of the query browser. Figure 3: Query composition in the assisted interaction mode. Browse. After finding the desired queries, the CQMS must allow the user to browse the results. Many systems that provide query logging [11, 15, 26, 32, 33] also allow the user to view the log could similarly suggest predicates in the WHERE clause of a query in a table or a file. However, to make the query log suitable for and even complete subclauses of the query. This capability requires browsing, the CQMS needs to present it in a comprehensible, sum- efficiently mining the query log for association rules. It also raises marized format. One possible method is to present query sessions the important challenge of when and how to present the sugges- instead of individual queries. A query session is a series of (of- tions. The CQMS must not overwhelm the user! ten similar) queries with the same information goal in mind. Such The second feature we envision for the Assisted Interaction query sessions should be automatically identified, highlighted, and Mode is automated query correction. Like a spell checker, while visually summarized. For example, Figure 2 displays one possible a user types a query, the CQMS suggests corrections to relation visualization of a query session; each node represents a query in the and attribute names but also changes to entire query clauses. For session, and the edges indicate the difference between consecutive instance, if a predicate causes a query to return the empty set, the queries. In this figure, left to right, the user has added the ‘Water- CQMS could suggest similar, previously issued predicates that re- Salinity’ relation to the FROM clause, tried different conditions on turn a non-empty set for the query. The challenge lies in defining ‘temp’, picked ‘temp < 18’, and added two more predicates to the what constitutes an error and what corrections are useful to display. WHERE clause. Unlike a listing of the six full SQL queries, this vi- Along similar lines, a CQMS could also perform complete query sualization allows the user to quickly understand and navigate the recommendations, showing logged queries similar to those the user query session. Subject to access control policies, users would also recently issued. Query similarity could be defined in terms of query benefit from browsing queries submitted by other users. In this parse trees, features, or output data. An interesting question is how case, an effective query log visualization tool becomes even more to construct ranking functions that combine similarity measures to- critical to avoid overwhelming the user. gether and with other desired properties (e.g. high popularity, effi- Supporting such browse and search functionality raises impor- cient runtime, small result cardinality, etc). Overall, we posit that tant research challenges related to query representation and model- such functionality could be especially useful when a user first ex- ing, efficient query-log processing (clustering, mining, etc.), main- amines a new dataset and the system guides them from their rough tenance in face of database updates, ranking (e.g., by similarity or query attempts toward similar popular queries asked by other users. popularity score), interface design, and query annotations. We dis- Finally, new users often suffer from a steep learning curve when cuss some of these challenges further in Section 4. trying to articulate queries. In such a case, a tutorial or step-by-step guided instructions would greatly reduce the start up cost. How- 2.3 Assisted Interaction ever, maintaining up-to-date documentation is time consuming. By Composing SQL queries can be a difficult task even for techni- analyzing the set of all queries and the evolution of query sessions, cally savvy users such as scientists or analysts. Furthermore, even we hypothesize that a CQMS may be able to automatically produce if a user is fluent in composing queries, she can benefit from hints a tutorial on the new data set or new analysis task, e.g. the system regarding what queries to ask. Thus, a key role of the CQMS is could introduce each relation and its schema by showing the user to assist users in query composition. For this, we define the As- the most popular queries that include the relation. sisted Interaction Mode where the CQMS monitors the user as she types queries and provides suggestions for completing and correct- 2.4 Administrative Interaction ing the queries. Figure 3 shows an example of assisted interaction Similar to data management, query management will require a where the system suggests several completions for the query being set of administrative capabilities. Users will need the ability to composed. It also shows similar queries at the bottom. delete old queries, define access control rules on their queries (e.g. In its simplest form, query completion consists in providing users sharing them only with members of the same research group), etc. with possible completions for relation and attribute names as the We call this the User Administrative Interaction Mode. user types a query. Although this simple capability can already Furthermore, the system administrator will need to manage the be helpful, query completion can be pushed much further. In par- query log, give preference to ranking functions, exclude irrelevant ticular, the system can make context-aware suggestions. For ex- features from the similarity functions, adjust tunable parameters ample, assume that the most popular table to include in the FROM such as the sample size for the query-by-data approach, mark or clause is CityLocations. However, for queries that also in- delete obsolete queries, run offline processes to compute query ses- clude WaterSalinity, the most popular is WaterTemp. Thus, sions, query clusters, etc. We call this the System Administrative if the user has already included WaterSalinity, the system Interaction Mode. Similarly to automatic physical database tun- should suggest WaterTemp over CityLocations. The CQMS ing [8], the CQMS should provide automatic query maintenance CIDR Perspectives 2009

4. Search & 4.1 Query Profiler Client Traditional Assisted Administrative Browse The two fundamental challenges related to the Query Profiler are   what information to capture in what format (Data Model and Query SQL Meta-Query Storage) and how to achieve this efficiently (Profiling Strategy). Data Model and Query Storage. A query is the primary data Query Meta-query Query Query type in a CQMS. Queries are complex objects with elaborate struc- CQMS Profiler Executor Miner Maintenance tures and semantics. The data model for queries should thus effec- tively capture and expose these advanced query properties. The simplest data model is to leave queries as raw text. String DBMS Query Storage search is then immediately available if the underlying storage com- ponent supports it. With this model, however, it is difficult for users to express complex meta-queries that rely on query structure. This Figure 4: QueryManager system architecture. model also limits or at least complicates the query mining process. At the other extreme, queries could be represented and stored when possible. For example, the CQMS could automatically flag as canonicalized parse trees using, for example, XML. The Meta- queries that have become corrupted due to schema changes query Executor could then be an XQuery engine, thus allowing users to formulate arbitrarily complex XQuery meta-queries. The Query Miner would also benefit from this information-rich repre- 3. SYSTEM ARCHITECTURE sentation. However, this model leads to complex meta-queries. In this section, we sketch a high-level design for a CQMS as An alternate data model is to represent and store queries using shown in Figure 4. We adopt the standard client-server architec- both their raw text and a set of pre-defined features (e.g., names of ture. The CQMS client provides the four interaction modes that joined relations, selection and group-by predicates, and projected VERSION: 15 we outlined in Section 2 and communicates with the CQMS server columns). The extracted features could be shredded and stored in through both standard SQL queries and meta-queries. The CQMS a set of relations. In contrast to using XML, this model simpli- server sits on top of a standard DBMS and comprises four compo- fies the Query Storage and Meta-Query Executor components and nents. offers opportunities for even more efficient indexing and compres- Two of the CQMS components, the Query Profiler and the Meta- sion. The above features capture only syntactic query properties. Query Executor run online: they receive inputs from the CQMS However, runtime query properties can also serve for query man- client and return results while the client waits. They must thus agement. Several common runtime features, including result car- perform all their operations with low latency. The Query Profiler dinality, execution time, and the query execution plan are already receives standard SQL queries as input and forwards them to the incorporated in existing query profilers. In a CQMS, we envision DBMS. Before doing so, however, it logs the queries in the Query that the system also captures the query result. This semantic query Storage. Most modern DBMSs already have their own query profil- feature captures information about the intent of the query. It en- ing systems for performance tuning [3, 10, 29]. The CQMS Query ables comparing queries as black-boxes. Profiler, however, performs more sophisticated pre-processing of Finally, in addition to modeling a query, a CQMS must also the queries. For example, it extracts and stores query features. model query sessions as discussed in Section 2. A query log for It also logs statistics about the query execution and samples from a user then takes the form of a tree of query sessions where ver- its output. The Meta-query Executor handles all queries over the tices correspond to individual queries, but queries are labeled with Query Storage. These queries are issued by the CQMS client the unique ID of the session they belong to. Edges represent rela- during Search and Browse and Assisted Interaction modes. For tions between queries as shown in Figure 2. Examples of relations Assisted Interaction modes, meta-queries may include k-nearest between queries include temporal relations, modification relations neighbors (kNN) queries. The Meta-query Executor also handles and investigation relations (where the latter query investigates why all administrative requests such as changing access control settings. certain tuples are included in the first query’s output). The query The remaining two CQMS components, the Query Miner and log can be stored as a standard normalized edge relation, i.e. a pair Query Maintenance, run in the background. The Query Miner ana- of query identifiers and an edge type. lyzes the query storage. It performs tasks such as clustering queries Profiling Strategy. The Query Profiler should not hinder ordi- based on similarity, association-rule mining, etc. Its goal is to ex- nary data processing. This raises several system design questions. tract useful information from the query log. In order to maintain Profiling queries: If queries are modeled as raw text or as XML all information up-to-date, it runs periodically. The Query Main- parse trees, the profiling process is straightforward. Its overhead tenance component performs automatic maintenance of the query can be almost completely removed by profiling queries during the log. Changes in the database schema or data distribution often normal query-parse phase of the DBMS query execution (although invalidate old queries, statistics, and analysis results. The Query this would tie the CQMS more closely with the underlying DBMS). Maintenance component keeps the Query Storage up-to-date by Even without this optimization, the overhead still remains small. In flagging outdated queries and updating query statistics. The Query contrast, if the profiler needs to extract a variety of query features, Maintenance can also perform physical tunings of the Query Stor- the overhead depends on the selected features and extraction algo- age to improve performance of meta-queries. rithms. Most syntactic features should be easy to log, but if the Next, we discuss the research challenges associated with these system needs to do a variety of additional lookups, such as logging components. the differences with the previous query in a session, the overhead may quickly grow. Such additional tasks are thus best left to the 4. RESEARCH CHALLENGES Query Miner, even if this imposes some latency in when the data Building a CQMS raises several important challenges. In this becomes available for meta-querying. section, we discuss some of the challenges associated with each of Profiling query results: Since it is usually infeasible to log the the four core CQMS components and the CQMS client. whole output of a query, we must find techniques to summarize CIDR Perspectives 2009

5.the output succinctly. This problem is closely related to selectivity completion or correction suggestions. Similarly, by mining com- estimation [16] and standard approaches exist including building mon query evolution patterns and correlating them with query run- histograms  or sampling. An interesting additional option is to ad- time features, a CQMS could automatically generate a tutorial for just the maximum size allowed for the output summary depending new users demonstrating common mistakes and good practices. on the query execution time. For example, if a query takes two Challenges in Mining. First, the definition of similarity between hours to complete and outputs ten rows, then the system should queries should be reconsidered from the perspective of users and store the whole output. However, if a query takes only two seconds query management tasks. To be usable, the system needs to go be- and outputs two million rows, there is no need to store the output. yond string similarity. Better options include parse tree similarity (perhaps after removing the constants from the tree), or looking at 4.2 Meta-Query Executor similarities between query features. Second, more advanced multi- As discussed in Section 2, a CQMS could support many types of relational mining [14, 36] and graph mining [9] could produce use- meta-queries (query-by-parse-tree, query-by-features, and query- ful information since queries and sessions form a graph and can by-data). A key question is what query language to use for these be spread across multiple relations. However, such advanced tech- meta-queries. Are existing languages sufficient to express them? niques could impose a greater runtime cost. Finally, incremental SQL is a viable option when queries are represented as raw text mining algorithms and, in general, incremental maintenance will or as a set of features. XQuery is possible for the parse-tree likely be necessary considering the possibly rapid growth of the model. However, in both cases, the meta-queries can quickly be- query log in a large-scale shared database. come complex. Users may need assistance in authoring the meta- queries themselves! Approaches for addressing this problem in- 4.4 Query Maintenance clude either creating a simpler language for meta-queries or pro- The Query Maintenance component strives to maintain all infor- viding a tool in the CQMS client that will generate meta-queries mation in the Query Storage up-to-date. In this section, we dis- from more intuitive user interactions (e.g. from a partly specified cuss challenges caused by schema evolution and data distribution SQL query). Both approaches can lead to interesting research prob- changes, as well as the challenges of measuring and maintaining lems, although the latter technique is likely superior as it maintains query quality. the power of existing languages while providing good support for Queries logged in the Query Storage operate on an underlying users. large database whose schema can change. Schema evolution can Independent of query language and data model, we expect four cause some of the stored queries to stop working. The CQMS major classes of meta-queries: keyword meta-queries, complex should be able to efficiently identify affected queries and handle meta-queries explicitly stating conditions on query features or them appropriately. Identifying affected queries can be imple- structure, meta-queries with conditions on query outputs, and kNN mented by comparing the timestamp of a query with that of the last queries. The first two classes of queries should be straightforward schema modification on any input relation. Handling potentially in- to support once the query data model, meta-query language, and valid queries is more challenging. A simple option is to drop such storage layer are designed. The latter two types of queries, how- queries from the storage, but this approach may drop significant ever, are significantly more challenging to provide. The first chal- numbers of queries. Another option is to systematically repair the lenge lies in the meta-query semantics, i.e. what it means for two queries by applying appropriate changes, but how to perform such queries or the output of two queries to be similar. The second chal- repairs automatically is an open question. lenge lies in meta-query performance. Meta-querying must be in- Significant changes in data distribution may invalidate runtime teractive, which can be challenging depending on the distance func- query features discussed in Section 4.1. A na¨ıve and overly expen- tions chosen for queries and their outputs. sive solution is to rerun all queries periodically to renew their statis- tics. A better approach is to re-execute queries only when there is 4.3 Query Mining reason to believe their statistics have significantly changed. This The CQMS Query Storage can grow considerably over time. solution may prove to be efficient if there is an accurate method for Mining this archive can benefit all non-traditional interaction detecting such changes. The system could also update the statistics modes. There are various types of query analysis possible. Due more frequently for popular or important queries. to space constraints, we discuss only two prominent ones. In addition, the system must maintain a measure of the query Clustering: By clustering queries, a CQMS can not only save quality because the system is most useful if it can quickly and accu- storage space (through better compression of similar queries) but rately provide users with queries that are not only relevant but also can also provide more sophisticated query management capabili- of high-quality. The definition of high-quality may differ across ties. For example, the CQMS can better deduplicate meta-query various applications. For example, quality can be defined in terms results or at least group results into sets of similar queries. It can of query efficiency, query simplicity, source tables’ quality, etc. also provide better query recommendations and similarity search- Measuring query quality is thus a difficult but important task. ing. The idea of query clustering has been around for years [4, 17, 23]. However, previous tools were not designed to help database 4.5 CQMS Client users but rather database administrators or query optimizers. Sim- We envision the Assisted and the Search and Browse interactions ilarly, if the CQMS clusters entire query sessions, it can provide to become the default interaction modes for scientists, industry ana- better services. For example, a meta-query could be restricted to lysts, and other users working closely with large-scale, shared data return only results from similar query sessions, or query recom- sets, especially if they are new to databases. The challenge for mendations can be limited to queries from users who have similar the CQMS client is to make these modes of interaction intuitive query session patterns as the current user. to use. For this, we need new interaction and visualization tech- Association Rules: By learning association rules [2], which cap- niques, which we intend to adapt from results in software engineer- ture relationships between values in a database, a CQMS could pro- ing and human-computer interaction (HCI) research. In particular, vide more advanced support for query composition. For example, we envision the CQMS client to take the appearance of a modern, by mining common edit patterns, the CQMS could provide better integrated development environment, such as Eclipse [34]. CIDR Perspectives 2009

6. A mixed-initiative interface is one that offers both automated Assisting programming tasks has been a long issue in software reasoning and the ability of direct manipulation to its users. The engineering research. Recent approaches mine open-source soft- CQMS client  is an example of such an interface. Therefore, HCI ware code repositories to extract reusable patterns and provide research on mixed-initiative interfaces [18] can guide us in design- context-sensitive assistance like completion and recommendation ing a simple and intuitive interface for the CQMS client. within the development environment [24, 25, 30, 35]. The goal of Our initial ideas for query completion and correction come from such research closely matches that of a CQMS: help users write existing tools. For example, we propose to initially use drop-down “correct” queries easily. A CQMS, in fact, aids new assistance tool menus, a standard technique from search interfaces and code ed- development because the developer no longer has to consider assis- itors, for auto-complete suggestions. For corrections, we borrow tance logic and data management. They simply use meta-queries. from spell checkers: indicate corrections by coloring the affected text and show a pop-up menu with details of the suggested correc- tion when the user hovers over the text. Visualizing the difference 6. CONCLUSION between a query and its recommended queries is difficult because New environments are emerging where large numbers of users the differences may not be apparent in the query text. As a first need to develop and run complex queries over a very large, shared attempt, we propose a separate sub-window to show the additional data repository. Examples include large scientific databases and information such as the semantic similarity, popularity, date of last Web-related data. These users are not SQL savvy, yet they need to execution, etc. The novel challenge for the client interaction is the perform complex analysis on the data and are further constrained presence of possible completions and corrections at any point in the by the high cost of running and testing their queries, often on a query and at different levels of granularity (e.g. the current token, shared server cluster. We have argued for the need of a Collabora- the current predicate, or the current clause). Clearly, if not designed tive Query Management System that logs and organizes all queries, well, the query completion tool can be a painful work companion. and assists users in formulating new ones. Such a system, however, Its design, implementation, and evaluation thus requires real HCI poses several research challenges, which we have discussed in the expertise. paper. Our goal is to address these challenges, build a CQMS, and Finally, an important aspect of query browsing is that a user test it in a scientific database environment. should be able to quickly identify key differences and dependen- cies between queries. Such differences or dependencies can occur 7. ACKNOWLEDGMENTS on various axes such as differences in data sources, selection pred- This work was partially supported by NSF Grants IIS-0713123, icates, annotations, popularity, execution time, etc. Thus the key IIS-0454425, IIS-0627585 and IIS-0428168. The authors would challenge for the query browser is to effectively present all this in- also like to thank the reviewers for their helpful feedback. formation in a comprehensible and navigable visualization. 5. RELATED WORK 8. REFERENCES Today’s DBMSs provide limited query management capabilities. [1] SciDB. http://confluence.slac.stanford.edu/ The closest to our proposal are the DB2 Query Management Fa- display/XLDB/SciDB. cility [11], Query Patroller [29] and EMS SQL Management Stu- [2] R. Agrawal, T. Imieli´nski, and A. Swami. Mining association dio for Oracle [15]. These tools support graphically composing rules between sets of items in large databases. In Proc. queries, logging queries, sharing query repositories, and viewing SIGMOD, pages 207–216. ACM, 1993. the query plan of stored queries. Query Patroller also analyzes [3] S. Agrawal, S. Chaudhuri, L. Kollar, A. Marathe, queries before execution to ensure good performance. There are V. Narasayya, and M. Syamala. Database tuning advisor for also systems that support only graphical composition [5, 6] and Microsoft SQL Server 2005: demo. In Proc. SIGMOD, those that support query logging [26, 32, 33] but primarily for phys- pages 930–932. ACM, 2005. ical database tuning. Also related are tools that allow ‘query by ex- [4] K. Aouiche, P.-E. Jouve, and J. Darmont. Clustering-based ample’, where the user sets the attributes of a sample object, which materialized view selection in data warehouses. In is then converted into a SQL query [28, 37]. However, none of these ADBIS’06, volume 4152 of LNCS, pages 81–95, 2006. tools allow users to annotate queries or perform advanced searches [5] BaseNow. Database front-end applications. About SQL over stored queries. Query Builder. http://www.basenow.com/help/About_ Many relational mining techniques [13] are directly applicable SQL_Query_Builder.asp. for query mining, especially multi-relational mining [14, 36] since [6] T. Catarci, M. F. Costabile, S. Levialdi, and C. Batini. Visual queries are likely to be stored in multiple database relations. Re- query systems for databases: A survey. Journal of Visual lated also is the work on association rule mining [2]. Languages & Computing, 8(2):215–260, 1997. Query clustering is often used in database performance tuning, [7] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, e.g. Aouiche et al. [4, 23] cluster queries to reduce the search space S. Weaver, and J. Zhou. Scope: Easy and efficient parallel for the materialized view selection problem, and Ghosh et al. [17] processing of massive data sets. In Proc. VLDB, 2008. use clustering for amortizing the cost of query optimization. [8] S. Chaudhuri and V. Narasayya. Self-tuning database Work that may be adapted for query similarity includes the Con- systems: a decade of progress. In Proc. VLDB, pages 3–14, text Distance Measure framework [21] designed for computing 2007. the distance between two objects defined across multiple relations. [9] D. J. Cook and L. B. Holder. Mining Graph Data. John Also relevant is the work in Kim’s thesis [20], which explores the Wiley & Sons, 2006. structure of code change. In this work, the author explores ways [10] B. Dageville, D. Das, K. Dias, K. Yagoub, M. Zait, and of measuring the similarity between two programs’ source codes. M. Ziauddin. Automatic SQL tuning in Oracle 10g. In Proc. Kim also proposes a method for succinctly describing the differ- VLDB, pages 1098–1109, 2004. ence between two source codes which could be helpful for tersely [11] DB2 Query Management Facility. describing the differences between queries in a query session. http://www.ibm.com/software/data/qmf/. CIDR Perspectives 2009

7.[12] D. DeWitt, E. Robinson, S. Shankar, E. Paulson, assistant for reusing open source code on the web. In Proc. of J. Naughton, A. Krioukov, and J. Royalty. Clustera: An ASE, pages 204–213. ACM, 2007.   integrated computation and data management system. In [36] X. Yin. Scalable mining and link analysis across multiple Proc. VLDB, 2008. database relations. PhD thesis, 2007. Adviser: Jiawei Han. [13] S. Dˇzeroski. Relational Data Mining. Springer-Verlag New [37] M. M. Zloof. Query-by-example: the invocation and York, Inc., 2001. definition of tables and forms. In Proc. VLDB, pages 1–24, [14] S. Dˇzeroski. Multi-relational data mining: an introduction. 1975. SIGKDD Explor. Newsl., 5(1):1–16, 2003. [15] EMS SQL Management Studio for Oracle. http: //www.sqlmanager.net/en/products/studio/oracle. [16] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. SIGMOD Rec., 30(2):461–472, 2001. [17] A. Ghosh, J. Parikh, V. S. Sengar, and J. R. Haritsa. Plan selection based on query clustering. In Proc. VLDB, pages 179–190, 2002. [18] E. Horvitz. Principles of mixed-initiative user interfaces. In Proc. SIGCHI, pages 159–166. ACM, 1999. [19] Incorporated Research Institutions for Seismology. http://www.iris.edu. [20] M. Kim. Analyzing and Inferring the Structure of Code Changes. PhD thesis, 2008. Adviser: David Notkin. [21] Y. Kwon, W. Y. Lee, M. Balazinska, and G. Xu. Clustering events on streams using complex context information. In Mining Complex Data, ICDM 2008 Fourth International Workshop, 2008. to appear. [22] Large Synoptic Survey Telescope. http://www.lsst.org/. [23] H. Mahboubi, K. Aouiche, and J. Darmont. Materialized view selection by query clustering in XML data warehouses. CoRR, abs/0809.1963, 2008. [24] D. Mandelin, L. Xu, R. Bod´ık, and D. Kimelman. Jungloid mining: helping to navigate the API jungle. SIGPLAN Notice, 40(6):48–61, 2005. [25] F. McCarey, M. O. Cinneide, and N. Kushmerick. A recommender agent for software libraries: An evaluation of memory-based and model-based collaborative filtering. In Proc. IAT, pages 154–162. IEEE, 2006. [26] Microsoft TechNet. SQL Server TechCenter: Configuring the analysis services query log. http://www.microsoft.com/technet/prodtechnol/ sql/2005/technologies/config_ssas_querylog.mspx. [27] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proc. SIGMOD, pages 1099–1110. ACM, 2008. [28] Reading objects using query by example. Oracle TopLink developer’s guide 10g release 3 (10.1.3). http://www.oracle.com/technology/products/ias/ toplink/doc/1013/main/_html/qrybas002.htm. [29] Query Patroller. http: //www.ibm.com/software/data/db2/querypatroller/. [30] N. Sahavechaphan and K. Claypool. XSnippet: mining for sample code. SIGPLAN Notice, 41(10):413–430, 2006. [31] Sloan Digital Sky Survey. http://www.sdss.org/. [32] Siebel business analytics server administration guide: Administering the query log. http: //download.oracle.com/docs/cd/E12103_01/books/ admintool/admintool_AdministerQuery14.html. [33] Teradata Utility Pack. http://www.teradata.com/t/page/94310/. [34] The Eclipse Foundation. Eclipse. http://www.eclipse.org/. [35] S. Thummalapenta and T. Xie. Parseweb: a programmer CIDR Perspectives 2009