SEEDB: Automatically Generating Query Visualizations

Data analysts operating on large volumes of data often rely on visualizations to interpret the results of queries. However, finding the right visualization for a query is a laborious and time-consuming task. We demonstrate SEEDB, a system that partially automates this task: given a query, SEEDB explores the space of all possible visualizations, and automatically identifies and recommends to the analyst those visualizations it finds to be most “interesting” or “useful”. In our demonstration, conference attendees will see SEEDB in action for a variety of queries on multiple real-world datasets
展开查看详情

1. S EE DB: Automatically Generating Query Visualizations Manasi Vartak Samuel Madden Aditya Parameswaran Neoklis Polyzotis MIT MIT MIT & U. Illinois (UIUC) Google & UCSC mvartak@mit.edu madden@csail.mit.edu adityagp@illinois.edu alkis@cs.ucsc.edu ABSTRACT analyst can use operations such as binning, grouping, and aggrega- Data analysts operating on large volumes of data often rely on vi- tion, and then generate visualizations from the view. For example, sualizations to interpret the results of queries. However, finding the to generate the view ‘total sales by store’, the analyst would group right visualization for a query is a laborious and time-consuming each sales record based on the store where the sale took place and task. We demonstrate S EE DB, a system that partially automates sum up the sale amounts per store. This operation can easily be this task: given a query, S EE DB explores the space of all possible expressed as the familiar aggregation over group-by query: visualizations, and automatically identifies and recommends to the analyst those visualizations it finds to be most “interesting” or “use- Q′ = SELECT store, SUM(amount) FROM Sales WHERE ful”. In our demonstration, conference attendees will see S EE DB Product = “Laserwave” GROUP BY store in action for a variety of queries on multiple real-world datasets. The result of the above query is a two-column table that can then 1. INTRODUCTION be visualized as a bar-chart. Table 1 and Figure 1 respectively show an example of the results of this view and the associated visualiza- Data analysts must sift through very large volumes of data to tion. To explore the query results from different perspectives, the identify trends, insights, or anomalies. Given the scale of data, and analyst generates a large number of views (and visualizations) of the relative ease and intuitiveness of examining data visually, ana- the form described above. (3) The analyst then manually exam- lysts often use visualizations as a tool to identify these trends, in- ines each view and decides which ones are “interesting”. This is sights, and anomalies. However, selecting the “right” visualization a critical and time-consuming step. Naturally, what makes a view often remains a laborious and time-consuming task. interesting depends on the application semantics and the trend we We illustrate the data analysis process using an example. Con- are comparing against. For instance, the view of Laserwave sales sider a dataset containing sales records for a nation-wide chain of by store, as shown in Figure 1, may be interesting if the overall stores. Let’s say the store’s data analyst is interested in examining sales of all products show the opposite trend (e.g. Figure 2). How- how the newly-introduced heating device, the “Laserwave Oven”, ever, the same view may be uninteresting if the sales of all prod- has been doing over the past year. The results of this analysis will ucts follow a similar trend (Figure 3). Thus, we posit that a view inform business decisions for the chain, including marketing strate- is potentially “interesting” if it shows a trend in the subset of data gies, and the introduction of a similar “Saberwave Oven”. selected by the analyst (i.e., Laserwave product-related data) that The analysis workflow proceeds as follows: (1) The analyst poses deviates from the equivalent trend in the overall dataset. Of course, a query to select the subset of data that she is interested in explor- the analyst must decide if this deviation is truly an insight for this ing. For instance, for the example above, she may issue the query: application. (4) Once the analyst has identified interesting views, Q = SELECT ∗ FROM Sales WHERE Product = “Laserwave” the analyst may then either share these views with others, further interact with the displayed views (e.g., by drilling down or rolling Notice that the results for this query may have (say) several million up), or start afresh with a new query. records each with several dozen attributes. Thus, directly perusing Of the four steps in the workflow described above, the ones that the query result is simply infeasible. (2) Next, the analyst stud- are especially repetitive and tedious are steps (2) and (3), where ies various properties of the selected data by constructing diverse the analyst generates a large number of candidate views, and ex- views or visualizations from the data. In this particular scenario, the amines each of them in turn. The goal of our system, S EE DB, is analyst may want to study total sales by store, quantity in stock by to automate these labor-intensive steps of the workflow. Given a region, or average profits by month. To construct these views, the query Q indicating the subset of data that the analyst is interested in, S EE DB automatically identifies and highlights to the analyst the most interesting views of the query results using methods based on This work is licensed under the Creative Commons Attribution- deviation. Specifically, S EE DB explores the space of all possible NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li- views and measures how much each view deviates from the corre- cense, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain per- sponding view on the entire underlying dataset (e.g. Figure 1 vs. mission prior to any use beyond those covered by the license. Contact Figures 2 or 3.) By generating and scoring potential views automat- copyright holder by emailing info@vldb.org. Articles from this volume ically, S EE DB effectively eliminates steps (2) and (3) that the an- were invited to present their results at the 40th International Conference on alyst currently performs. Instead, once S EE DB recommends inter- Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China. Proceedings of the VLDB Endowment, Vol. 7, No. 13 esting views, the analyst can evaluate this small subset of views us- Copyright 2014 VLDB Endowment 2150-8097/14/08. ing domain knowledge and limit further exploration to these views. 1581

2.Table 1: Data: Total Sales by Store for Laser- Total Sales($) 150 wave straightforward rules that dictate how the view in Table 1 can be 100 transformed to give a visualization like Figure 1. Furthermore, we Store Total Sales ($) 50 limit the set of candidate views to those that generate a two-column Cambridge, MA 180.55 0 result via a single-attribute grouping and aggregation (e.g. Table 1). Seattle, WA 145.50 Cambridge, MA New York, NY San Francisco, CA Seattle, WA Store New York, NY 122.00 However, S EE DB techniques can directly be used to recommend San Francisco, CA 90.13 Figure 1: Visualization: Total visualizations for multiple column views (> 2 columns) that are Sales by Store for Laserwave generated via multi-attribute grouping and aggregation. 40000 40000 We consider a database D with a snowflake schema, with di- Total Sales($) Total Sales($) 30000 30000 mension attributes A, measure attributes M , and potential aggre- 20000 20000 gate functions F over the measure attributes. We limit the class of 10000 10000 queries Q posed over D to be those that select one or more rows 0 Cambridge, MA New York, NY San Francisco, CA Seattle, WA 0 Cambridge, MA New York, NY San Francisco, CA Seattle, WA from the fact table, and denote the results as DQ . Store Store Given such a query Q, S EE DB considers all views Vi that per- Figure 2: Scenario A: Total Sales by Figure 3: Scenario B: Total Sales by form a single-attribute group-by and aggregation on DQ . We rep- Store Store resent Vi as a triple (a, m, f ), where m ∈ M, a ∈ A, f ∈ F , i.e., the view performs a group-by on a and applies the aggregation function We described our vision for S EE DB, along with the associated f on a measure attribute m. We call this the target view. research challenges in a companion vision paper [8]. In this demon- stration proposal, we present our first S EE DB prototype addressing SELECT a, f (m) FROM DQ GROUP BY a some of the challenges listed in that vision paper. In particular, our current prototype of S EE DB is built as a “wrapper” that can As discussed in the previous section, S EE DB evaluates whether a be overlaid on any relational database system. Given any query, view Vi is interesting by computing the deviation between the view S EE DB leverages special optimization algorithms and the underly- applied to the selected data (i.e., DQ ) and the view applied to the ing DBMS to generate and recommend interesting visualizations. entire database. The equivalent view on the entire database Vi (D) To do so efficiently and accurately, we must address the following can be expressed as shown below that we call the comparison view. SELECT a, f (m) FROM D GROUP BY a challenges: (a) We must determine metrics that accurately measure the “deviation” of a view with respect to the equivalent view on the entire database (e.g., Figure 1 vs. 2), while simultaneouly ensur- The results of both the above views are tables with two columns, ing that S EE DB is not tied to any particular metric(s); (b) We must namely a and f (m). We normalize each result table into a prob- intelligently explore the space of candidate views. Since the num- ability distribution, such that the values of f (m) sum to 1. For ber of candidate views (or visualizations) increases as the square our example in Table 1, the probability distribution of Vi (DQ ), de- of the number of attributes in a table (we will demonstrate this in noted as P [Vi (DQ )], is: (Jan: 180.55/538.18, Feb: 145.50/538.18, subsequent sections), generating and evaluating all views, even for March: 122.00/538.18, April: 90.13/538.18). A similar probability a moderately sized dataset (e.g. 1M rows, 100 attributes), can be distribution can be derived for P [Vi (D)]. prohibitively expensive; (c) While executing queries corresponding Given a view Vi and probability distributions for the target view to different views, we must share computation as much as possible. (P [Vi (DQ )]) and comparison view (P [Vi (D)]), the utility of Vi For example, we can compute multiple views and measure their is defined as the distance between these two probability distribu- deviation all together in one query. Independent execution, on the tions. Formally, if S is a distance function, other hand, will be expensive and wasteful; (d) Since analysis must happen in real-time, we must trade-off accuracy of visualizations U (Vi ) = S(P [Vi (DQ )], P [Vi (D)]) or estimation of “interestingness” for reduced latency. Section 3 The utility of a view is our measure for whether the target view is describes how we address these challenges. “potentially interesting” as compared to the comparison view: the Related Work: Over the past few years, the research community higher the utility, the more the deviation from the comparison view, has introduced a number of interactive data analytics tools such as and the more likely the view is to be interesting. Computing dis- ShowMe, Polaris, and Tableau [12, 7] as well as tools like Profiler tance between probability distributions has been well studied, and allow analysts to detect anomalies in data. Unlike S EE DB, which S EE DB supports a variety of metrics to compute utility, including recommends visualizations automatically, the tools place the onus Earth Movers Distance, Euclidean Distance, Kullback-Leibler (K- on the analyst to specify the visualization to be generated. Sim- L) Divergence, and Jenson-Shannon Distance. In our demonstra- ilar visualization specification tools have also been introduced by tion, conference attendees can experiment with different distance the database community, including Fusion Tables [5] and the De- metrics and examine how the choice of metric affects view qual- vise [6] toolkit. There has been some work on browsing data cubes ity. Finally, we note that while other definitions of the comparison in OLAP, allowing analysts to find explanations, get suggestions views and utility metrics are possible, for our initial exploration for next cubes to visit, or identify generalizations or patterns start- into visualization recommendations, we chose to focus on the intu- ing from a single cube [9, 11, 10]. While we may be able to reuse itive definitions above. the metrics from that line of work, the same techniques will not directly apply to visualizations. P ROBLEM 2.1. Given an analyst-specified query Q on a database D, a distance function S, and a positive integer k, find k views V ≡ (a, m, f ) that have the largest values of U (V ) among all 2. PROBLEM STATEMENT the views that can be represented using a triple (a, m, f ), while Given a database D and a query Q, S EE DB considers a number minimizing total computation time. of views that can be generated from Q by adding relational opera- tors. For the purposes of this discussion, we will refer to views and 3. S EE DB DESIGN visualizations interchangeably, since it is straightforward to trans- In this section, we present the S EE DB architecture, starting with late views into visualizations automatically. For example, there are an overview followed by a detailed discussion of its components. 1582

3.3.1 S EE DB architecture overview ones (based on utility) to the frontend. For each view delivered by Our S EE DB prototype is designed as a layer on top of a tradi- the backend, the frontend creates a visualization based on parame- tional relational database system. While optimization opportunities ters such as the data type (e.g. ordinal, numeric), number of distinct are restricted by virtue of being outside the database, our design values, and semantics (e.g. geography vs. time series). The result- permits S EE DB to be used in conjunction with a variety of existing ing set of visualizations is displayed to the analyst who can then database systems. S EE DB is comprised of two parts: a frontend easily examine these “most interesting” views at a glance, explore and a backend. The frontend is a “thin client” that is used to issue specific views in detail via drill-downs, and study metadata for each queries and display visualizations. The backend, in contrast, per- view (e.g. size of result, sample data, value with maximum change forms all the computation required to generate and select views to and other statistics). Figure 5 shows a screenshot of the S EE DB be recommended. Figure 4 depicts the architecture of our system. frontend (showing the query builder) in action. User     Selec0on   criteria   SeeDB  Frontend   SeeDB  Backend   Backend  DBMS   Q   Metadata   Query  Builder   Collector   SeeDB  generated   Query  Generator   visualiza0ons   Op9mizer   View  and   interact  with   Most   View  Processor   visualiza0ons   relevant   views   Figure 4: SeeDB Architecture An analyst uses the frontend to issue queries to S EE DB. We provide three mechanisms for the analyst to issue queries (further discussion in Section 3.2). Once the analyst issues a query via the frontend, the backend takes over. First, the Metadata Collec- tor module queries metadata tables (a combination of database- provided and S EE DB specific tables) for information such as ta- ble sizes, column types, data distribution, and table access pat- terns. The resulting metadata along with the analyst’s query is Figure 5: SeeDB Frontend: Query Builder (left) and Example Visualizations (right) then passed to the Query Generator module. The purpose of the 3.3 The Backend Query Generator is two-fold: first, it uses metadata to prune the The S EE DB backend is responsible for all the computations for space of candidate views to only retain the most promising ones; generating and selecting views. To achieve its goal of finding the and second, it generates target and comparison views for each view most interesting views accurately and efficiently, the S EE DB back- that has not been pruned. The SQL queries corresponding to the end must not only accurately estimate the accuracy of a large num- target and comparison views are then passed to the Optimizer mod- ber of views but also design ways in which the total processing ule. We refer to these queries collectively as view queries. Next, time will be minimized. We first describe the basic S EE DB back- the Optimizer module determines the best way to combine view end framework and then briefly discuss our optimizations. queries intelligently so that the total execution time is minimized. (We discuss optimizations performed by S EE DB in Section 3.3.) Basic Framework: Given a user query Q, the basic approach com- Once the Optimizer module has generated the optimized queries, putes all possible two-column views obtained by adding a single- S EE DB runs them on the underlying DBMS. Results of the opti- attribute aggregate and group-by clause to Q. The target and com- mized queries are processed by the View Processor in a streaming parison views corresponding to each view are then computed and fashion to produce results for individual views. Individual view each view query is executed independently on the DBMS. The results are then normalized and the utility of each view is com- query results for each view are normalized, and utility is computed puted. Finally S EE DB selects the top k views with the highest as the distance between these two distributions (Section 2). Finally, utility and returns them to the S EE DB frontend. The frontend gen- the top-k views with the largest utility are chosen to be displayed. erates and displays visualizations for each of these view. We now The basic approach is clearly inefficient since it examines every discuss S EE DB modules in detail. possible view and executes each view query independently. We next discuss how our optimizations fix these problems. 3.2 The Frontend View Space Pruning: In practice, most views for any query Q The S EE DB frontend, designed as a thin client, performs two have low utility since the target view distribution is very similar main functions: it allows the analyst to issue a query to S EE DB, to the comparison view distribution. S EE DB uses this property and it visualizes the results (views) produced by the S EE DB back- to aggressively prune view queries that are unlikely to have high end. To provide the analyst maximum flexibility in issuing queries, utility. This pruning is based on metadata about the table including S EE DB provides the analyst with three mechanisms for specifying data distributions and access patterns. Our techniques include: an input query: (a) directly filling in SQL into a text box, (b) us- ● Variance-based pruning: Dimension attributes with low variance ing a query builder tool that allows analysts unfamiliar with SQL are likely to produce views having low utility (e.g. consider the to formulate queries through a form-based interface, and (c) using extreme case where an attribute only takes a single value); S EE DB pre-defined query templates which encode commonly performed therefore prunes views with grouping attributes with low variance. operations, e.g., selecting outliers in a particular column. ● Correlated attributes: If two dimension attributes ai and aj have Once the analyst issues a query via the S EE DB frontend, the a high degree of correlation (e.g. full name of airport and abbre- backend evaluates various views and delivers the most interesting viated name of airport), the views generated by grouping the table 1583

4. on ai and aj will be very similar (and have almost equal utility). ● Store Orders dataset [4]: This dataset is often used by Tableau [3] We can therefore generate and evaluate a single view representing as a canonical dataset for business intelligence applications. It both ai and aj . S EE DB clusters attributes based on correlation consists of information about orders placed in a store including and evaluates a representative view per cluster. products, prices, ship dates, geographical information, and prof- ● Access frequency-based pruning: In tables with a large number its. Interesting trends in this dataset have been very well studied, of attributes, only a small subset of attributes are relevant to the and attendees will use S EE DB to quickly re-identify these trends. analyst and are therefore frequently accessed for data analysis. ● Election Contribution dataset [1]: This is an example of a dataset S EE DB tracks access patterns for each table to identify the most typically analyzed by non-expert data analysts like journalists or frequently accessed columns and combinations of columns. While historians. With this dataset, we demonstrate how non-experts can creating views, S EE DB uses this information to prune attributes use S EE DB to quickly arrive at interesting visualizations. that are rarely accessed and are thus likely to be unimportant. ● Medical dataset [2]: This real-world dataset exemplifies a dataset View Query Optimizations: The second set of optimizations used that a clinical researcher might use. The schema of the dataset is by S EE DB minimizes the execution time for view queries that haven’t significantly complex and it is of larger size. been pruned using the techniques described above. Since view ● Synthetic data: We provide a set of synthetic datasets with vary- queries tend to be very similar in structure (they differ in the ag- ing sizes, number of attributes, and data distributions to help at- gregation attribute, grouping attribute or subset of data queried), tendees evaluate S EE DB performance on diverse datasets. S EE DB uses multiple techniques to intelligently combine view queries. Scenario 1: Demonstrating Utility. Attendees are provided with The ultimate goal is to minimize scans of the underlying dataset by three diverse, real-world datasets to explore using S EE DB. For sharing as many table scans as possible. Our strategies include: each dataset, attendees can issue ad-hoc or pre-formulated queries ● Combine target and comparison view query: Since the target view to S EE DB. S EE DB will then intelligently explore the view space and comparison views only differ in the subset of data that the and optimize query execution to return the most interesting visu- query is executed on, we can easily rewrite these two view queries alizations with low latency. Attendees can examine the returned as one. This simple optimization halves the time required to com- queries visually, view the associated metadata, and perform drill- pute the results for a single view. downs. To aid the evaluation of visualizations, the demo system ● Combine Multiple Aggregates: A large number of view queries will be configured to also show the user “bad” views (views with have the same group-by attribute but different aggregation attributes. low utility) that were not selected by S EE DB. Similarly, we pro- Therefore, S EE DB combines all view queries with the same group- vide pre-selected queries (and previously known information about by attribute into a single query. This rewriting provides a speed their results) to allow attendees to confirm that S EE DB does indeed up linear in the number of aggregate attributes. reproduce known information about these queries. Attendees will ● Combine Multiple Group-bys: Since S EE DB computes a large also be able to experiment with a variety of distance metrics for number of group-bys, one significant optimization is to combine computing utility and observe the effects on the resulting views. queries with different group-by attributes into a single query with Scenario 2: Demonstrating Performance and Optimizations. multiple group-bys attributes. For instance, instead of executing This scenario will use an enhanced user interface and synthetic queries for views (a1 , m1 , f1 ), (a2 , m1 , f1 ) . . . (an , m1 , f1 ) in- datasets mentioned above. Attendees will be able to easily ex- dependently, we can combine the n views into a single view repre- periment with a range of synthetic datasets and input queries by sented by ({a1 , a2 . . . an }, m1 , f1 ) and post-process results at the adjusting various “knobs” such as data size, number of attributes, backend. Alternatively, if the SQL GROUPING SETS function- and data distribution. In addition, attendees will also be able to se- ality is available in the underlying DBMS, S EE DB can leverage lect the optimizations that S EE DB applies and observe the effect that as well. While this optimization has the potential to signifi- on response times and accuracy. cantly reduce query execution time, the number of views that can Thus, through our demonstration of S EE DB we seek to illus- be combined depends on the correlation between values of group- trate that (a) it is possible to automate labor-intensive parts of data ing attributes and system parameters like the working memory. analysis, (b) aggregate and grouping-based views are a powerful Given a set of candidate views, we model the problem of finding means to identify interesting trends in data, and (c) the right set of the optimal combinations of views as a variant of bin-packing and optimizations can enable real-time data analysis of large datasets. apply ILP techniques to obtain the best solution. ● Sampling: For datasets of large size, an optimization that affects 5. REFERENCES performance significantly is employing sampling: we construct [1] Fec presidential campaign finance. [Online; accessed 3-March-2014]. [2] Mimic ii database. [Online; accessed 3-March-2014]. a sample of the dataset that can fit in memory and run all view [3] Tableau public. [Online; accessed 3-March-2014]. queries against the sample. However, as expected, the sampling [4] Tableau superstore data. [Online; accessed 3-March-2014]. technique and size of the sample both affect view accuracy. [5] H. Gonzalez et al. Google fusion tables: web-centered data management and ● Parallel Query Execution: The final optimization that S EE DB collaboration. In SIGMOD Conference, pages 1061–1066, 2010. employs is taking advantage of parallel query execution at the [6] M. Livny et al. Devise: Integrated querying and visualization of large datasets. In SIGMOD Conference, pages 301–312, 1997. DBMS to reduce total latency. We observe that as the number of [7] J. D. Mackinlay et al. Show me: Automatic presentation for visual analysis. queries executed in parallel increases, the total latency decreases IEEE Trans. Vis. Comput. Graph., 13(6):1137–1144, 2007. at the cost of increased per query execution time. [8] A. Parameswaran, N. Polyzotis, and H. Garcia-Molina. Seedb: Visualizing database queries efciently. In VLDB, volume 7, pages 325–328, 2013. [9] S. Sarawagi. Explaining differences in multidimensional aggregates. In VLDB, 4. DEMO WALKTHROUGH pages 42–53, 1999. We propose to demonstrate the functionality of S EE DB through [10] S. Sarawagi. User-adaptive exploration of multidimensional data. In VLDB, hands-on interaction with a variety of datasets. Our goals are two pages 307–316, 2000. [11] G. Sathe and S. Sarawagi. Intelligent rollups in multidimensional olap data. In fold: (1) demonstrate the utility of S EE DB in surfacing interest- VLDB, pages 531–540, 2001. ing trends for a query and (2) demonstrate that we can return high [12] C. Stolte et al. Polaris: a system for query, analysis, and visualization of quality views efficiently for a range of datasets. We will use four multidimensional databases. Commun. ACM, 51(11):75–84, 2008. different datasets in our demonstration: 1584