DATA SPREAD:Unifying Databases and Spreadsheets

We demonstrate DATASPREAD, a data exploration tool that holistically unifies databases and spreadsheets. It continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, PostgreSQL.
展开查看详情

1. DATA S PREAD: Unifying Databases and Spreadsheets Mangesh Bendre, Bofan Sun, Ding Zhang, Xinyan Zhou Kevin Chen-Chuan Chang, Aditya Parameswaran University of Illinois at Urbana-Champaign (UIUC) {bendre1 | bsun6 | dzhang13 | xzhou14 | kcchang | adityagp}@illinois.edu ABSTRACT mance, for example, in Microsoft Excel, it is common knowledge Spreadsheet software is often the tool of choice for ad-hoc tabu- that beyond a few 100s of thousands of rows, the software is no lar data management, processing, and visualization, especially on longer responsive [1]. For the latter issue, there are a number of tiny data sets. On the other hand, relational database systems of- common data analytics operations that are either very cumbersome fer significant power, expressivity, and efficiency over spreadsheet or not easy to do in spreadsheet software. To illustrate this, consider software for data management, while lacking in the ease of use a simple example, where a user is studying a spreadsheet contain- and ad-hoc analysis capabilities. We demonstrate DATA S PREAD, a ing course assignment scores and eventual grades for students from data exploration tool that holistically unifies databases and spread- rows 1–100, columns 1–5 in one sheet, and demographic informa- sheets. It continues to offer a Microsoft Excel-based spreadsheet tion for the students from rows 1–100, columns 1–10 in another front-end, while in parallel managing all the data in a back-end sheet. Consider the following operations that the user may want to database, specifically, PostgreSQL. DATA S PREAD retains all the do to compute some intermediate tabular result (The user may then advantages of spreadsheets, including ease of use, ad-hoc analysis visualize or study this result in some way.) : and visualization capabilities, and a schema-free nature, while also • Say the user wants to understand the impact of assignment adding the advantages of traditional relational databases, such as grades on the course grade, for which they want to select the scalability and the ability to use arbitrary SQL to import, filter, or students having points higher than 90 in at least one assign- join external or internal tables and have the results appear in the ment. There is no way for the user to sub-select a set of rows spreadsheet. DATA S PREAD needs to reason about and reconcile of this form for further analysis, except manually identifying differences in the notions of schema, addressing of cells and tu- these rows, and then copy-pasting each one into another area. ples, and the current “pane” (which exists in spreadsheets but not • Say the user wants to plot the average grade by demographic in traditional databases), and support data modifications at both the group (undergrad, MS, PhD). This requires a “join” of the front-end and the back-end. Our demonstration will center on our two sheets of the spreadsheet to generate the desired result, first and early prototype of the DATA S PREAD, and will give the also very cumbersome to do on current spreadsheet software. attendees a sense for the enormous data exploration capabilities of- • Say the course management software outputs actions per- fered by unifying spreadsheets and databases. formed by students into a relational database or a CSV file; there is no easy way for the user to study this data within the spreadsheet, as the data is continuously added. 1. INTRODUCTION There are many other data analysis operations that are similarly Since the early days of computing, spreadsheet software, such very cumbersome on current spreadsheet software. as VisiCalc, Lotus 1-2-3, and more recently Microsoft Excel and Therefore, we propose to bring the power of relational databases Google Sheets, have found ubiquitous use in ad-hoc tabular data to bear on spreadsheets. Relational databases are efficient and ex- analysis, especially by non-programmers; including statisticians, pressive, and are certainly capable of natively handling the opera- finance professionals, consultants, and physical scientists. The main tions described above via SQL. On the contrary, relational databases advantages of spreadsheets include the ability for direct manipula- are not as easy-to-use or as amenable to direct manipulation as tion of data, an intuitive user interface, and a flexible data model spreadsheet software, e.g., seamlessly adding a new column, copy- with the ability to add new rows, columns, or tuples, seamlessly. pasting data. Thus, we unify relational databases with spreadsheet However, spreadsheet software has many limitations, making it software, in order to preserve the benefits of both. unsuitable for present-day big data analysis, primarily due to poor We propose a system, DATA S PREAD, that is a holistic unification performance on large data sets, and the low expressivity of the of relational databases and spreadsheets. Here we use spreadsheet spreadsheet syntax [1, 2]. For the former issue, i.e., poor perfor- as an intuitive user interface and database as a back-end engine. However, designing DATA S PREAD is not trivial since databases This work is licensed under the Creative Commons Attribution- and spreadsheets adopt very different architectures and ideologies. NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li- In particular, we need to deal with the following challenges: cense, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain per- • Schema: databases have a strict schema-first data model, mission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume which is based on tables and tuples, while the spreadsheet were invited to present their results at the 41st International Conference on data model is based on sheets with rows and columns, and Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, no explicitly defined schema. Hawaii. • Addressing: spreadsheets treat rows and columns as identi- Proceedings of the VLDB Endowment, Vol. 8, No. 12 cal, while databases operate on sets of tuples. Copyright 2015 VLDB Endowment 2150-8097/15/08.

2. • Window: spreadsheet have the notion of the current window, to leverage the intuitiveness and the richness of a spreadsheet inter- which is the portion of the spreadsheet that the user is cur- face, rather than changing it significantly, we enhance it with con- rently looking at; there is no such notion in databases. cepts borrowed from databases. Underneath the interface, we pro- • Modifications: spreadsheets support updates at any level and pose to have a relational database that is enhanced to support the granularity: rows or columns, while databases only support spreadsheet interface. Second, to improve the expressivity of the modifications that correspond to a SQL query. interface, we expose some database features, for example, declara- • Computation: spreadsheets support value-at-a-time formulae tive querying, from the underling database to the interface. Using to allow derived computation, while databases support arbi- these two key ideas, we enable users to leverage the strengths of trary SQL queries operating on groups of tuples at once. both spreadsheets and databases for dealing with tabular data. We have identified these as research issues and have build the first version of DATA S PREAD to explore them, among others. Ex- 2.2 Semantics and Syntax ternally, DATA S PREAD retains many of the front-end user interface Although spreadsheets and databases have both been designed aspects of spreadsheets that make it as easy to use, while at the same to manage data in form of tables, their treatment of this data is time enhanced and supported by a back-end relational database, vastly different. Spreadsheets have been developed primarily with providing efficiency and expressivity. In the front-end, in addition presentation of data in mind and hence their design focuses pri- to all the traditional spreadsheet commands, DATA S PREAD sup- marily on simplicity, intuitiveness and a rich user interface. On ports the use of arbitrary SQL via custom DBSQL and DBTable com- the other hand databases have been designed with powerful data mands, enabling the import, and constant updating of data from management capabilities to work with large tables. Hence, certain relational databases, as well as the computation of selections and data manipulation operations, e.g., queries, joins, summarization, joins of data contained in the spreadsheets. Conceptually, these are very naturally expressed as SQL statements in databases. commands, along with other spreadsheet commands, are stored as We propose semantics for DATA S PREAD such that we are able to interface views in the underlying database. In the back-end, an naturally leverage the strengths of both systems. Since we plan to optimizer, optimizes for keeping the user window up-to-date and enrich databases to effectively support interfaces, we use the strong in-sync with the underlying relational database. Even though the points of spreadsheets to motivate our semantics. spreadsheet can only support a few rows, as the user pans through Support for Dynamic Schema. Spreadsheets enable users to ef- the spreadsheet, the burden of supplying or refreshing the current fortlessly create tables and update their schema. A user typically window is placed on the relational database, which is very efficient. structures data on a spreadsheet as tables, with columns and rows, Demonstration. In our demonstration, we will allow conference where columns generally correspond to attributes and rows to tu- attendees to interact with our prototype of DATA S PREAD (built us- ples. Here, adding an attribute, which is essentially a change to ing Microsoft Excel and PostgreSQL), enabling them to interac- schema, is as natural and convenient as adding a tuple. This is due tively analyze a two-way synchronized view of relational data using to the fact that spreadsheets do not treat columns and rows differ- more expressive DBSQL, DBTable commands to filter, join, project, ently when we consider the operations possible on each. On the and export data residing in multiple sheets. other hand, relational databases have a schema-first data model. Related Work. With the goal to achieve the benefits of spread- Relational tables, which belong to a database’s schema, need a pre- sheets and relational databases while dealing with tabular data, our defined structure in terms of attributes. Since changing the structure holistic unification strives to unify the notion of table in both sys- of a table in a database requires an update to all its tuples, it is not tems. Recent works have proposed to enrich spreadsheets and rela- efficient as adding, deleting or updating the tuples of the table. tional databases with features from one another in three orthogonal To make relational table creation as effortless as table creation on directions: a) Use of spreadsheets to mimic the relational database a spreadsheet, we propose the ability for a user to select an arbitrary functionalities [3]: Although this approach achieves expressivity range on the spreadsheet and use it to define the structure and the of SQL, it is unable to leverage the scalability of databases. b) Use data for a table within the database. Once created it should behave of databases to mimic spreadsheet functionalities [4, 5]: Although like a regular table within the database, and the user should be able this approach achieves scalability of databases, it is does not sup- to refer to it and use it in queries. port ad-hoc tabular management provided by spreadsheets. c) Use To streamline the concept of a dynamic schema, we propose that of spreadsheet interface for querying data [6]. This approach pro- a user is able to update a table’s schema and tuples that are dis- vides an intuitive interface to query data, but looses the expressivity played on a spreadsheet, which in turn updates the schema and tu- of SQL as well as ad-hoc data management capabilities. ples of the underlying table in the database. Further, the database Rest of the Paper. In the next section we propose a desired design should be able to handle this schema change with an efficiency sim- by developing a unification semantics. We then use the semantics ilar to tuple updates. This makes table updates within a database as to propose an architecture for DATA S PREAD. Finally, we discuss natural as updating them on a spreadsheet. demonstration scenarios for our DATA S PREAD prototype. Make Databases Interface Aware. Since spreadsheets have been designed with an interface in mind, they very naturally lay out data that is both consumed and manipulated by users. This interface has 2. DESIGN OF DATA S PREAD a very strong influence on functionality offered to the user. Features like laying out a table in a desired format and obtaining the totals In this section, we describe the semantics for DATA S PREAD. In of some attributes beneath the table (using a spreadsheet formula) particular, we discuss some important concepts and challenges that feel natural. Thus, the interface provides a context to the operations arise due to the unification of the two disparate ideologies: spread- performed on a spreadsheet. sheets and databases. Positional addressing, which enable users to address data based on its position on a spreadsheet, is an intuitive and effective way to 2.1 DATA S PREAD Overview refer to presented data. By laying out data on a spreadsheet, a po- With a goal of unifying databases and spreadsheets, we now pro- sition gets implicitly assigned to the displayed data, due to which a pose a framework for DATA S PREAD based on two key ideas. First, spreadsheet is able to use positional referencing, e.g., a cell refer-

3.ence of A2 from cell C2 implies a cell that is two columns left and in the same row. The positional referencing is a commonly used fea- ture while building expressions as it enables us to copy expressions Interface Manager Transactional across cells while still maintaining the relative references. Manager Conversely, databases completely lack interface aspects. Once a Query Processor Compute Engine query result is output, the database is no longer cognizant of how Concurrency Positional Control that result is consumed. This disconnect is a key weakness due Records Indexes Indexes to which a database cannot be used as-is to effectively support a Main spreadsheet interface. For instance, when a user wants to update a Buffer Manager Memory specific attribute of a displayed table, the database is unable to help Buffers: data, Relational Storage Interface Storage index, log, etc. because it is not aware of the tuple or attribute being modified. Manager Manager We propose to make databases aware of the interface’s data lay- out. This enables them to understand interactions on the presented Physical Storage data, e.g., for a join using displayed tuples, the database is able to identify the tuples just based on their implicit context. This further Figure 1: DATA S PREAD Architecture. enables the databases to optimize the query execution by prioritiz- ing the displayed tuples over the ones that are not displayed. visible cells should be prioritized and the remaining long running After making the database interface aware, we propose to lever- computations should be performed in background. age this to enable positional addressing in databases. This implies Challenge. Realizing the unified semantics is not a trivial task, that the user should be able to refer to a value by its location on the since it stretches the capabilities of today’s relational databases be- spreadsheet and use it in any arbitrary query. yond what is available. For example, consider the semantics of Novel Spreadsheet Constructs. We now describe how the po- schema, for today’s databases a table’s schema change requires an sitional addressing is leveraged in the front-end spreadsheet, en- update to all the tuples of the table. Further, the activity is con- abling users to pose rich SQL queries while referring to data in the sidered as “data definition language” and generally cannot partici- spreadsheet as well as the underlying relational table. pate in transactions. This requires us to propose the architecture of We encapsulate SQL references within the spreadsheet using one DATA S PREAD by radically rethinking the databases’ architecture. of two formulae: DBSQL and DBTable. DBSQL enables users to pose arbitrary queries combining data present on the spreadsheet, and 3. PROPOSED ARCHITECTURE data stored in the relational database. DBTable enables users to de- clare a portion of the spreadsheet as being either exported to or im- Since relational databases are not designed to be interface-aware, ported from the relational database, i.e., that portion of the spread- when we unify the presentation layer of spreadsheets with databases, sheet directly reflects the contents of a relational database table. we need to redesign the underlying architecture of the database, as In order to support arbitrary positional addressing or referenc- well as the interaction with the front-end interface. ing of data on the spreadsheet for DBSQL, we add two new con- To enable databases to support the semantics described earlier, structs: RangeValue and RangeTable. This enables users to refer to we propose a redesigned database architecture as shown in Fig- a cell and a table on a sheet respectively relative to the cell where ure 1, where the shaded blocks represent new or enhanced com- the query is entered. RangeValue enables a user to refer to scalar ponents. The interface manager is tasked with the goal of mak- values contained in a cell, e.g., SELECT FROM Actors WHERE ActorId ing databases interface-aware. The query processor is enhanced to = RangeValue(A1), referring to the value in cell A1. RangeTable on the support and optimize the execution for positional addressing, a nat- other hand enables a user to refer to a range, and perform operations ural way to locate data presented on the interface. The compute on it assuming it is a regular database table. This enables any range engine leverages interface aspects, e.g., windowing, to optimize on a spreadsheet to be potentially a table, and all the operations, execution. We introduce a new type of index, positional, which e.g., join, that the database allows on a table can be performed, makes interface-oriented operations, e.g., ordered presentation, ef- e.g., SELECT FROM Actors NATURAL JOIN RangeTable(A1:D100). ficient. The interface storage manager stores data that is presented on the interface but not designated as a relational table. The rela- Other Semantic Issues. Although we have discussed two impor- tional storage manager is enhanced to effectively support interface tant concepts, there are still many semantics that require attention related operations such as schema changes. if we want to realize a complete unification. Due to the space re- While we have identified the extent of modifications needed for striction, rather than discussing them in detail we have listed a few databases to effectively support an interface, our current implemen- of them below: a) SQL support on spreadsheets: To leverage the tation and discussion focuses on enhancing some core components. expressiveness of SQL and the simplicity of formulae we propose Naturally, there are other components that require modification, to support both, and give flexibility to the user to interchangeably such as the transaction manager, and we leave them for future work. use either. b) Real-time sync: Using spreadsheets users are ac- customed to having an always updated copy with them. For this we Interface Storage Manager. In this unified framework, a spread- propose a real time two way synchronization of the displayed on the sheet not only has tabular data, corresponding to relational tables spreadsheet with the underlying database. c) Data typing: Spread- in the underlying database, but also has other interface data, e.g., sheets dynamically type the data stored as cells. To make this work formulae or data entered by the user. This interface data requires with databases, we propose the idea of automatically assigning data special treatment as it does not have a schema. The interface stor- types within the databases based on the tuples. d) Computation op- age component stores this data as a collection of cells. To enable timization: By scaling up the amount of data, which can be pre- efficient retrieval for a given range, the component groups the cells sented on a spreadsheet, efficient computation become a necessity. together by proximity and splits the groups into data blocks as re- We propose to leverage the presentation information for prioritizing quired by the underlying storage. To enable efficient access, the computations for the data that is displayed. e) Lazy Computation: blocks are further indexed by a two-dimensional indexing method. To maintain interactivity, we propose that the calculations of the Relational Storage Manager. Our unification semantics demand that the schema changes to the tabular data, which we persist in the

4. Figure 2: (a) Executing SQL with relative referencing. (b) Table creation. (c) Two-way table sync. database as relational tables, should be very efficient, almost as effi- put of the query is not limited to a single cell, but spans the range cient as changes to tuples. With an insight to reduce the disk blocks B3:B10. This enables the collection of cells to be computed collec- to update during a schema change, the relational storage manager tively in a single pass (as opposed to traditional spreadsheet formu- uses a hybrid of column-store and row-store to physically store the lae that are one-per-cell). This will demonstrate how DATA S PREAD table. Here, data is structured along a collection of attribute groups, provides the ability to naturally query the underlying database, and thereby radically reducing the disk blocks that need an update dur- other data in the spreadsheet. ing a schema change. Feature 2: Import/Export. Consider Figure 2b. Here, on select- Interface Manager. The interface manager keeps close tabs on the ing a range in the sheet and selecting the create table command data presented to the user. For every data item, e.g., the output of from the add-ins menu, we provide the ability to users to trans- a query, a table imported from the database, that is displayed on form it into a relational database table. The schema of this table is the interface, the presentation manager assigns a context; a context automatically inferred using the column heading and the data. Op- comprises a positional address along with a reference to the sheet. tionally, users will be allowed to specify constraints on the table, This context can then be utilized to enable functionalities such as such as primary keys. On completion, the table is created in the two-way sync and relative addressing. underlying database. The data on the sheet is replaced by DBTable, Along with positional addressing, the interface manager allows which is a spreadsheet function that selects data from the database a two-way synchronization for the tables displayed on the inter- and displays it on the spreadsheet. DBTable could also be used to face. Since primary keys are a natural way to identify tuples in a directly import data already present in the relational database into relational database, the interface manager maintains a mapping be- the spreadsheet. This will demonstrate how DATA S PREAD allows tween a tuple’s key attribute and its corresponding location. This us to import or export data to and from the relational database. enables translation of an update on the interface, having a locational Feature 3: Modifications. Consider Figure 2c. Here, after a table context, to the underlying relational database, which requires a key is displayed on the spreadsheet using DBTable, and formatted in cells to uniquely identify a tuple. A3 to B5, as modifications are made to the table on the front-end the Compute Engine. To optimally support interface interactions and data in the relational database is updated, and the data displayed data updates, we introduce a new component termed as “compute in cells from A10 to B12 (corresponding to a DBSql command ref- engine”. By using ideas like shared computation, the compute en- erencing that data) is immediately updated. This will demonstrate gine enables efficient handling of formulae and queries with po- how DATA S PREAD provides the ability to keep data in-sync during sitional referencing, e.g., DBSQL. It performs computations asyn- modifications at both the front-end and back-end chronously, free from a user’s context, as updates are made to ei- Overall, the aforementioned demonstration scenarios will convince ther the interface or the database. It further improves the interface’s attendees that our DATA S PREAD system offers a valuable hybrid interactivity by prioritizing the computation for visible cells. between spreadsheets and databases, retaining the ease-of-use of spreadsheets, and the power of databases. 4. DEMONSTRATION DESCRIPTION Our DATA S PREAD prototype is implemented using Microsoft 5. REFERENCES [1] S. Clemens, “5 Ways To Tell You Have Outgrown Excel.” Excel (that presumably most conference attendees as well as even- http://www.insightsquared.com/2011/06/ tual users are already familiar with) as the front-end spreadsheet 5-ways-to-tell-you-have-outgrown-excel/. application, backed by PostgreSQL as the relational database back- [2] R. Collie, “Big Data is Just Data, Why Excel “Sucks”, and 1,000 end. All the screenshots we depict are from our current prototype. Miles of Data.” http://www.powerpivotpro.com/2012/10/ big-data-is-just-data-why-excel-sucks-and-1000-miles-of-data/. A video demonstrating the features of DATA S PREAD can be found [3] J. Tyszkiewicz, “Spreadsheet as a relational database engine,” in at http://dataspread.cs.illinois.edu. SIGMOD, pp. 195–206, ACM, 2010. We demonstrate the following features of the DATA S PREAD pro- [4] A. Witkowski, S. Bellamkonda, T. Bozkaya, G. Dorman, N. Folkert, totype: a) analytic queries that reference data on the spreadsheet, A. Gupta, L. Shen, and S. Subramanian, “Spreadsheets in rdbms for as well as data in other database relations. b) importing or ex- olap,” in Proceedings of the 2003 ACM SIGMOD International porting data from the relational database. c) demonstrating that Conference on Management of Data, SIGMOD ’03, (New York, NY, DATA S PREAD keeps data in the front-end and back-end in-sync USA), pp. 52–63, ACM, 2003. during modifications at either end. [5] A. Witkowski, S. Bellamkonda, T. Bozkaya, A. Naimat, L. Sheng, S. Subramanian, and A. Waingold, “Query by excel,” in Proceedings Feature 1: Querying. Consider Figure 2a. Here, expressed using of the 31st International Conference on Very Large Data Bases, the DBSQL spreadsheet function, the SQL query in B3 uses data from VLDB ’05, pp. 1204–1215, VLDB Endowment, 2005. three relations in the database (movies, movies2actors, actors), and ref- [6] B. Liu and H. Jagadish, “A spreadsheet algebra for a direct data erences the two cells above (B1 and B2), via special relative ref- manipulation query interface,” in Data Engineering, 2009. ICDE’09. erencing commands (RangeValue(B1) and RangeValue(B2)). The out- IEEE 25th International Conference on, pp. 417–428, IEEE, 2009.