SEEDB: Automatically Generating Query Visualizations

下载 0

快召唤伙伴们来围观吧
微博 QQ QQ空间 贴吧
文档嵌入链接
<iframe src="https://www.slidestalk.com/u3791/seedb86897?embed" frame border="0" width="640" height="360" scrolling="no" allowfullscreen="true">复制
微信扫一扫分享
已成功复制到剪贴板

da仔

发布于

6年前

2061

人观看

#信息技术

Data analysts operating on large volumes of data often rely on visualizations to interpret the results of queries. However, finding the right visualization for a query is a laborious and time-consuming task. We demonstrate SEEDB, a system that partially automates this task: given a query, SEEDB explores the space of all possible visualizations, and automatically identifies and recommends to the analyst those visualizations it finds to be most “interesting” or “useful”. In our demonstration, conference attendees will see SEEDB in action for a variety of queries on multiple real-world datasets

展开查看详情

1 . S EE DB: Automatically Generating Query Visualizations Manasi Vartak Samuel Madden Aditya Parameswaran Neoklis Polyzotis MIT MIT MIT & U. Illinois (UIUC) Google & UCSC mvartak@mit.edu madden@csail.mit.edu adityagp@illinois.edu alkis@cs.ucsc.edu ABSTRACT analyst can use operations such as binning, grouping, and aggrega- Data analysts operating on large volumes of data often rely on vi- tion, and then generate visualizations from the view. For example, sualizations to interpret the results of queries. However, finding the to generate the view ‘total sales by store’, the analyst would group right visualization for a query is a laborious and time-consuming each sales record based on the store where the sale took place and task. We demonstrate S EE DB, a system that partially automates sum up the sale amounts per store. This operation can easily be this task: given a query, S EE DB explores the space of all possible expressed as the familiar aggregation over group-by query: visualizations, and automatically identifies and recommends to the analyst those visualizations it finds to be most “interesting” or “use- Q′ = SELECT store, SUM(amount) FROM Sales WHERE ful”. In our demonstration, conference attendees will see S EE DB Product = “Laserwave” GROUP BY store in action for a variety of queries on multiple real-world datasets. The result of the above query is a two-column table that can then 1. INTRODUCTION be visualized as a bar-chart. Table 1 and Figure 1 respectively show an example of the results of this view and the associated visualiza- Data analysts must sift through very large volumes of data to tion. To explore the query results from different perspectives, the identify trends, insights, or anomalies. Given the scale of data, and analyst generates a large number of views (and visualizations) of the relative ease and intuitiveness of examining data visually, ana- the form described above. (3) The analyst then manually exam- lysts often use visualizations as a tool to identify these trends, in- ines each view and decides which ones are “interesting”. This is sights, and anomalies. However, selecting the “right” visualization a critical and time-consuming step. Naturally, what makes a view often remains a laborious and time-consuming task. interesting depends on the application semantics and the trend we We illustrate the data analysis process using an example. Con- are comparing against. For instance, the view of Laserwave sales sider a dataset containing sales records for a nation-wide chain of by store, as shown in Figure 1, may be interesting if the overall stores. Let’s say the store’s data analyst is interested in examining sales of all products show the opposite trend (e.g. Figure 2). How- how the newly-introduced heating device, the “Laserwave Oven”, ever, the same view may be uninteresting if the sales of all prod- has been doing over the past year. The results of this analysis will ucts follow a similar trend (Figure 3). Thus, we posit that a view inform business decisions for the chain, including marketing strate- is potentially “interesting” if it shows a trend in the subset of data gies, and the introduction of a similar “Saberwave Oven”. selected by the analyst (i.e., Laserwave product-related data) that The analysis workflow proceeds as follows: (1) The analyst poses deviates from the equivalent trend in the overall dataset. Of course, a query to select the subset of data that she is interested in explor- the analyst must decide if this deviation is truly an insight for this ing. For instance, for the example above, she may issue the query: application. (4) Once the analyst has identified interesting views, Q = SELECT ∗ FROM Sales WHERE Product = “Laserwave” the analyst may then either share these views with others, further interact with the displayed views (e.g., by drilling down or rolling Notice that the results for this query may have (say) several million up), or start afresh with a new query. records each with several dozen attributes. Thus, directly perusing Of the four steps in the workflow described above, the ones that the query result is simply infeasible. (2) Next, the analyst stud- are especially repetitive and tedious are steps (2) and (3), where ies various properties of the selected data by constructing diverse the analyst generates a large number of candidate views, and ex- views or visualizations from the data. In this particular scenario, the amines each of them in turn. The goal of our system, S EE DB, is analyst may want to study total sales by store, quantity in stock by to automate these labor-intensive steps of the workflow. Given a region, or average profits by month. To construct these views, the query Q indicating the subset of data that the analyst is interested in, S EE DB automatically identifies and highlights to the analyst the most interesting views of the query results using methods based on This work is licensed under the Creative Commons Attribution- deviation. Specifically, S EE DB explores the space of all possible NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li- views and measures how much each view deviates from the corre- cense, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain per- sponding view on the entire underlying dataset (e.g. Figure 1 vs. mission prior to any use beyond those covered by the license. Contact Figures 2 or 3.) By generating and scoring potential views automat- copyright holder by emailing info@vldb.org. Articles from this volume ically, S EE DB effectively eliminates steps (2) and (3) that the an- were invited to present their results at the 40th International Conference on alyst currently performs. Instead, once S EE DB recommends inter- Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China. Proceedings of the VLDB Endowment, Vol. 7, No. 13 esting views, the analyst can evaluate this small subset of views us- Copyright 2014 VLDB Endowment 2150-8097/14/08. ing domain knowledge and limit further exploration to these views. 1581

2 .Table 1: Data: Total Sales by Store for Laser- Total Sales($) 150 wave straightforward rules that dictate how the view in Table 1 can be 100 transformed to give a visualization like Figure 1. Furthermore, we Store Total Sales ($) 50 limit the set of candidate views to those that generate a two-column Cambridge, MA 180.55 0 result via a single-attribute grouping and aggregation (e.g. Table 1). Seattle, WA 145.50 Cambridge, MA New York, NY San Francisco, CA Seattle, WA Store New York, NY 122.00 However, S EE DB techniques can directly be used to recommend San Francisco, CA 90.13 Figure 1: Visualization: Total visualizations for multiple column views (> 2 columns) that are Sales by Store for Laserwave generated via multi-attribute grouping and aggregation. 40000 40000 We consider a database D with a snowflake schema, with di- Total Sales($) Total Sales($) 30000 30000 mension attributes A, measure attributes M , and potential aggre- 20000 20000 gate functions F over the measure attributes. We limit the class of 10000 10000 queries Q posed over D to be those that select one or more rows 0 Cambridge, MA New York, NY San Francisco, CA Seattle, WA 0 Cambridge, MA New York, NY San Francisco, CA Seattle, WA from the fact table, and denote the results as DQ . Store Store Given such a query Q, S EE DB considers all views Vi that per- Figure 2: Scenario A: Total Sales by Figure 3: Scenario B: Total Sales by form a single-attribute group-by and aggregation on DQ . We rep- Store Store resent Vi as a triple (a, m, f ), where m ∈ M, a ∈ A, f ∈ F , i.e., the view performs a group-by on a and applies the aggregation function We described our vision for S EE DB, along with the associated f on a measure attribute m. We call this the target view. research challenges in a companion vision paper [8]. In this demon- stration proposal, we present our first S EE DB prototype addressing SELECT a, f (m) FROM DQ GROUP BY a some of the challenges listed in that vision paper. In particular, our current prototype of S EE DB is built as a “wrapper” that can As discussed in the previous section, S EE DB evaluates whether a be overlaid on any relational database system. Given any query, view Vi is interesting by computing the deviation between the view S EE DB leverages special optimization algorithms and the underly- applied to the selected data (i.e., DQ ) and the view applied to the ing DBMS to generate and recommend interesting visualizations. entire database. The equivalent view on the entire database Vi (D) To do so efficiently and accurately, we must address the following can be expressed as shown below that we call the comparison view. SELECT a, f (m) FROM D GROUP BY a challenges: (a) We must determine metrics that accurately measure the “deviation” of a view with respect to the equivalent view on the entire database (e.g., Figure 1 vs. 2), while simultaneouly ensur- The results of both the above views are tables with two columns, ing that S EE DB is not tied to any particular metric(s); (b) We must namely a and f (m). We normalize each result table into a prob- intelligently explore the space of candidate views. Since the num- ability distribution, such that the values of f (m) sum to 1. For ber of candidate views (or visualizations) increases as the square our example in Table 1, the probability distribution of Vi (DQ ), de- of the number of attributes in a table (we will demonstrate this in noted as P [Vi (DQ )], is: (Jan: 180.55/538.18, Feb: 145.50/538.18, subsequent sections), generating and evaluating all views, even for March: 122.00/538.18, April: 90.13/538.18). A similar probability a moderately sized dataset (e.g. 1M rows, 100 attributes), can be distribution can be derived for P [Vi (D)]. prohibitively expensive; (c) While executing queries corresponding Given a view Vi and probability distributions for the target view to different views, we must share computation as much as possible. (P [Vi (DQ )]) and comparison view (P [Vi (D)]), the utility of Vi For example, we can compute multiple views and measure their is defined as the distance between these two probability distribu- deviation all together in one query. Independent execution, on the tions. Formally, if S is a distance function, other hand, will be expensive and wasteful; (d) Since analysis must happen in real-time, we must trade-off accuracy of visualizations U (Vi ) = S(P [Vi (DQ )], P [Vi (D)]) or estimation of “interestingness” for reduced latency. Section 3 The utility of a view is our measure for whether the target view is describes how we address these challenges. “potentially interesting” as compared to the comparison view: the Related Work: Over the past few years, the research community higher the utility, the more the deviation from the comparison view, has introduced a number of interactive data analytics tools such as and the more likely the view is to be interesting. Computing dis- ShowMe, Polaris, and Tableau [12, 7] as well as tools like Profiler tance between probability distributions has been well studied, and allow analysts to detect anomalies in data. Unlike S EE DB, which S EE DB supports a variety of metrics to compute utility, including recommends visualizations automatically, the tools place the onus Earth Movers Distance, Euclidean Distance, Kullback-Leibler (K- on the analyst to specify the visualization to be generated. Sim- L) Divergence, and Jenson-Shannon Distance. In our demonstra- ilar visualization specification tools have also been introduced by tion, conference attendees can experiment with different distance the database community, including Fusion Tables [5] and the De- metrics and examine how the choice of metric affects view qual- vise [6] toolkit. There has been some work on browsing data cubes ity. Finally, we note that while other definitions of the comparison in OLAP, allowing analysts to find explanations, get suggestions views and utility metrics are possible, for our initial exploration for next cubes to visit, or identify generalizations or patterns start- into visualization recommendations, we chose to focus on the intu- ing from a single cube [9, 11, 10]. While we may be able to reuse itive definitions above. the metrics from that line of work, the same techniques will not directly apply to visualizations. P ROBLEM 2.1. Given an analyst-specified query Q on a database D, a distance function S, and a positive integer k, find k views V ≡ (a, m, f ) that have the largest values of U (V ) among all 2. PROBLEM STATEMENT the views that can be represented using a triple (a, m, f ), while Given a database D and a query Q, S EE DB considers a number minimizing total computation time. of views that can be generated from Q by adding relational opera- tors. For the purposes of this discussion, we will refer to views and 3. S EE DB DESIGN visualizations interchangeably, since it is straightforward to trans- In this section, we present the S EE DB architecture, starting with late views into visualizations automatically. For example, there are an overview followed by a detailed discussion of its components. 1582

3 .3.1 S EE DB architecture overview ones (based on utility) to the frontend. For each view delivered by Our S EE DB prototype is designed as a layer on top of a tradi- the backend, the frontend creates a visualization based on parame- tional relational database system. While optimization opportunities ters such as the data type (e.g. ordinal, numeric), number of distinct are restricted by virtue of being outside the database, our design values, and semantics (e.g. geography vs. time series). The result- permits S EE DB to be used in conjunction with a variety of existing ing set of visualizations is displayed to the analyst who can then database systems. S EE DB is comprised of two parts: a frontend easily examine these “most interesting” views at a glance, explore and a backend. The frontend is a “thin client” that is used to issue specific views in detail via drill-downs, and study metadata for each queries and display visualizations. The backend, in contrast, per- view (e.g. size of result, sample data, value with maximum change forms all the computation required to generate and select views to and other statistics). Figure 5 shows a screenshot of the S EE DB be recommended. Figure 4 depicts the architecture of our system. frontend (showing the query builder) in action. User Selec0on criteria SeeDB Frontend SeeDB Backend Backend DBMS Q Metadata Query Builder Collector SeeDB generated Query Generator visualiza0ons Op9mizer View and interact with Most View Processor visualiza0ons relevant views Figure 4: SeeDB Architecture An analyst uses the frontend to issue queries to S EE DB. We provide three mechanisms for the analyst to issue queries (further discussion in Section 3.2). Once the analyst issues a query via the frontend, the backend takes over. First, the Metadata Collec- tor module queries metadata tables (a combination of database- provided and S EE DB specific tables) for information such as ta- ble sizes, column types, data distribution, and table access pat- terns. The resulting metadata along with the analyst’s query is Figure 5: SeeDB Frontend: Query Builder (left) and Example Visualizations (right) then passed to the Query Generator module. The purpose of the 3.3 The Backend Query Generator is two-fold: first, it uses metadata to prune the The S EE DB backend is responsible for all the computations for space of candidate views to only retain the most promising ones; generating and selecting views. To achieve its goal of finding the and second, it generates target and comparison views for each view most interesting views accurately and efficiently, the S EE DB back- that has not been pruned. The SQL queries corresponding to the end must not only accurately estimate the accuracy of a large num- target and comparison views are then passed to the Optimizer mod- ber of views but also design ways in which the total processing ule. We refer to these queries collectively as view queries. Next, time will be minimized. We first describe the basic S EE DB back- the Optimizer module determines the best way to combine view end framework and then briefly discuss our optimizations. queries intelligently so that the total execution time is minimized. (We discuss optimizations performed by S EE DB in Section 3.3.) Basic Framework: Given a user query Q, the basic approach com- Once the Optimizer module has generated the optimized queries, putes all possible two-column views obtained by adding a single- S EE DB runs them on the underlying DBMS. Results of the opti- attribute aggregate and group-by clause to Q. The target and com- mized queries are processed by the View Processor in a streaming parison views corresponding to each view are then computed and fashion to produce results for individual views. Individual view each view query is executed independently on the DBMS. The results are then normalized and the utility of each view is com- query results for each view are normalized, and utility is computed puted. Finally S EE DB selects the top k views with the highest as the distance between these two distributions (Section 2). Finally, utility and returns them to the S EE DB frontend. The frontend gen- the top-k views with the largest utility are chosen to be displayed. erates and displays visualizations for each of these view. We now The basic approach is clearly inefficient since it examines every discuss S EE DB modules in detail. possible view and executes each view query independently. We next discuss how our optimizations fix these problems. 3.2 The Frontend View Space Pruning: In practice, most views for any query Q The S EE DB frontend, designed as a thin client, performs two have low utility since the target view distribution is very similar main functions: it allows the analyst to issue a query to S EE DB, to the comparison view distribution. S EE DB uses this property and it visualizes the results (views) produced by the S EE DB back- to aggressively prune view queries that are unlikely to have high end. To provide the analyst maximum flexibility in issuing queries, utility. This pruning is based on metadata about the table including S EE DB provides the analyst with three mechanisms for specifying data distributions and access patterns. Our techniques include: an input query: (a) directly filling in SQL into a text box, (b) us- ● Variance-based pruning: Dimension attributes with low variance ing a query builder tool that allows analysts unfamiliar with SQL are likely to produce views having low utility (e.g. consider the to formulate queries through a form-based interface, and (c) using extreme case where an attribute only takes a single value); S EE DB pre-defined query templates which encode commonly performed therefore prunes views with grouping attributes with low variance. operations, e.g., selecting outliers in a particular column. ● Correlated attributes: If two dimension attributes ai and aj have Once the analyst issues a query via the S EE DB frontend, the a high degree of correlation (e.g. full name of airport and abbre- backend evaluates various views and delivers the most interesting viated name of airport), the views generated by grouping the table 1583

4 . on ai and We can therefore generate and evaluate a single view representing both ai and aj . S EE DB clusters attributes based on correlation and evaluates a representative view per cluster. ● Access frequency-based pruning: In tables with a large number of attributes, only a small subset of attributes are relevant to the analyst and are therefore frequently accessed for data analysis. S EE DB tracks access patterns for each table to identify the most frequently accessed columns and combinations of columns. While creating views, S EE DB uses this information to prune attributes that are rarely accessed and are thus likely to be unimportant. View Query Optimizations: The second set of optimizations used by S EE DB minimizes the execution time for view queries that haven’t been pruned using the techniques described above. Since view queries tend to be very similar in structure (they differ in the ag- gregation attribute, grouping attribute or subset of data queried), S EE DB uses multiple techniques to intelligently combine view queries. The ultimate goal is to minimize scans of the underlying dataset by sharing as many table scans as possible. Our strategies include: ● Combine target and comparison view query: Since the target view and comparison views only differ in the subset of data that the query is executed on, we can easily rewrite these two view queries as one. This simple optimization halves the time required to com- pute the results for a single view. ● Combine Multiple Aggregates: A large number of view queries have the same group-by attribute but different aggregation attributes. Therefore, S EE DB combines all view queries with the same group- by attribute into a single query. This rewriting provides a speed up linear in the number of aggregate attributes. ● Combine Multiple Group-bys: Since S EE DB computes a large number of group-bys, one significant optimization is to combine queries with different group-by attributes into a single query with multiple group-bys attributes. For instance, instead of executing queries for views (a1 , m1 , f1 ), (a2 , m1 , f1 ) . . . (an , m1 , f1 ) in- dependently, we can combine the n views into a single view repre- sented by ({a1 , a2 . . . an }, m1 , f1 ) and post-process results at the backend. Alternatively, if the SQL GROUPING SETS function- ality is available in the underlying DBMS, S EE DB can leverage that as well. While this optimization has the potential to signifi- cantly reduce query execution time, the number of views that can be combined depends on the correlation between values of group- ing attributes and system parameters like the working memory. Given a set of candidate views, we model the problem of finding the optimal combinations of views as a variant of bin-packing and apply ILP techniques to obtain the best solution. ● Sampling: For datasets of large size, an optimization that affects performance significantly is employing sampling: we construct a sample of the dataset that can fit in memory and run all view queries against the sample. However, as expected, the sampling technique and size of the sample both affect view accuracy. ● Parallel Query Execution: The final optimization that S EE DB employs is taking advantage of parallel query execution at the DBMS to reduce total latency. We observe that as the number of queries executed in parallel increases, the total latency decreases at the cost of increased per query execution time. 4. DEMO WALKTHROUGH We propose to demonstrate the functionality of S EE DB through hands-on interaction with a variety of datasets. Our goals are two fold: (1) demonstrate the utility of S EE DB in surfacing interest- ing trends for a query and (2) demonstrate that we can return high quality views efficiently for a range of datasets. We will use four different datasets in our demonstration: aj will be very similar (and have almost equal utility). ● Store Orders dataset [4]: This dataset is often used by Tableau [3] as a canonical dataset for business intelligence applications. It consists of information about orders placed in a store including products, prices, ship dates, geographical information, and prof- its. Interesting trends in this dataset have been very well studied, and attendees will use S EE DB to quickly re-identify these trends. ● Election Contribution dataset [1]: This is an example of a dataset typically analyzed by non-expert data analysts like journalists or historians. With this dataset, we demonstrate how non-experts can use S EE DB to quickly arrive at interesting visualizations. ● Medical dataset [2]: This real-world dataset exemplifies a dataset that a clinical researcher might use. The schema of the dataset is significantly complex and it is of larger size. ● Synthetic data: We provide a set of synthetic datasets with vary- ing sizes, number of attributes, and data distributions to help at- tendees evaluate S EE DB performance on diverse datasets. Scenario 1: Demonstrating Utility. Attendees are provided with three diverse, real-world datasets to explore using S EE DB. For each dataset, attendees can issue ad-hoc or pre-formulated queries to S EE DB. S EE DB will then intelligently explore the view space and optimize query execution to return the most interesting visu- alizations with low latency. Attendees can examine the returned queries visually, view the associated metadata, and perform drill- downs. To aid the evaluation of visualizations, the demo system will be configured to also show the user “bad” views (views with low utility) that were not selected by S EE DB. Similarly, we pro- vide pre-selected queries (and previously known information about their results) to allow attendees to confirm that S EE DB does indeed reproduce known information about these queries. Attendees will also be able to experiment with a variety of distance metrics for computing utility and observe the effects on the resulting views. Scenario 2: Demonstrating Performance and Optimizations. This scenario will use an enhanced user interface and synthetic datasets mentioned above. Attendees will be able to easily ex- periment with a range of synthetic datasets and input queries by adjusting various “knobs” such as data size, number of attributes, and data distribution. In addition, attendees will also be able to se- lect the optimizations that S EE DB applies and observe the effect on response times and accuracy. Thus, through our demonstration of S EE DB we seek to illus- trate that (a) it is possible to automate labor-intensive parts of data analysis, (b) aggregate and grouping-based views are a powerful means to identify interesting trends in data, and (c) the right set of optimizations can enable real-time data analysis of large datasets. 5. REFERENCES [1] Fec presidential campaign finance. [Online; accessed 3-March-2014]. [2] Mimic ii database. [Online; accessed 3-March-2014]. [3] Tableau public. [Online; accessed 3-March-2014]. [4] Tableau superstore data. [Online; accessed 3-March-2014]. [5] H. Gonzalez et al. Google fusion tables: web-centered data management and collaboration. In SIGMOD Conference, pages 1061–1066, 2010. [6] M. Livny et al. Devise: Integrated querying and visualization of large datasets. In SIGMOD Conference, pages 301–312, 1997. [7] J. D. Mackinlay et al. Show me: Automatic presentation for visual analysis. IEEE Trans. Vis. Comput. Graph., 13(6):1137–1144, 2007. [8] A. Parameswaran, N. Polyzotis, and H. Garcia-Molina. Seedb: Visualizing database queries efciently. In VLDB, volume 7, pages 325–328, 2013. [9] S. Sarawagi. Explaining differences in multidimensional aggregates. In VLDB, pages 42–53, 1999. [10] S. Sarawagi. User-adaptive exploration of multidimensional data. In VLDB, pages 307–316, 2000. [11] G. Sathe and S. Sarawagi. Intelligent rollups in multidimensional olap data. In VLDB, pages 531–540, 2001. [12] C. Stolte et al. Polaris: a system for query, analysis, and visualization of multidimensional databases. Commun. ACM, 51(11):75–84, 2008. 1584

9点赞

3收藏

0下载