Google Fusion Tables

Google Fusion Tables represents an initial answer to the question of how data management functionality that focussed on enabling new users and applications would look in today’s computing environment. This paper characterizes such users and applications and highlights the resulting principles, such as seamless Web integration, emphasis on ease of use, and incentives for data sharing, that underlie the design of Fusion Tables. We describe key novel features, such as the support for data acquisition, collaboration, visualization, and web-publishing.
展开查看详情

1. Google Fusion Tables: Web-Centered Data Management and Collaboration ∗ Hector Gonzalez, Alon Y. Halevy, Christian S. Jensen , Anno Langen, † Jayant Madhavan, Rebecca Shapley, Warren Shen, Jonathan Goldberg-Kidon Google Inc. ABSTRACT der the assumption that data typically belongs to a single It has long been observed that database management sys- enterprise. tems focus on traditional business applications, and that While there will continue to be significant need for such few people use a database management system outside their systems, an increasing body of evidence points to drasti- workplace. Many have wondered what it will take to enable cally different requirements. First, data management needs the use of data management technology by a broader class to support collaboration among multiple users and multiple of users and for a much wider range of applications. organizations at its very core. Second, data management Google Fusion Tables represents an initial answer to the systems need to appeal to a broader audience of users who question of how data management functionality that focussed are less technically skilled [5]. Third, data management and on enabling new users and applications would look in today’s the Web need to be integrated seamlessly—data collection, computing environment. This paper characterizes such users presentation and visualization should be immediately com- and applications and highlights the resulting principles, such patible with the Web [7, 8]. as seamless Web integration, emphasis on ease of use, and in- This paper describes Google Fusion Tables, a cloud-based centives for data sharing, that underlie the design of Fusion data management and integration service that aims to meet Tables. We describe key novel features, such as the sup- the aforementioned requirements. The Fusion Tables ser- port for data acquisition, collaboration, visualization, and vice was launched in June, 2009, and has since received web-publishing. considerable use (see tables.googlelabs.com). While we have witnessed a wide range of applications for Fusion Tables, Categories and Subject Descriptors: H.3.5 Online In- the original audience, for which the service was designed, formation Services: [Data sharing, Web-based services]; H.2 is organizations that are struggling with making their data Database Management: [Miscellaneous] available online (internally or externally), and communities General Terms: Design of users that need to collaborate on data management across multiple enterprises and organizations. Keywords: Cloud Services, Visualization, Collaboration Fusion Tables enables users to upload tabular data files (in Spreadsheet, CSV, and KML formats) of up to 100MB. 1. INTRODUCTION The data can contain geographical objects such as points, lines and polygons. The system provides several ways of vi- Given the sea-change in computing environments driven sualizing the data (charts, maps, timelines). The table can by the cloud, the Web, and the proliferation of connected, also be exported as KML so it can be viewed on Google powerful personal computing devices, it is tempting to ask Earth. The system supports filters and aggregates as a the following question: How would we design data manage- means of querying the data. Integration of data from mul- ment functionality for today’s connected world? To put this tiple sources is supported by means of joins across tables question in context, recall that the foundations of database that may belong to different users. Users can keep the data management systems were established several decades ago private, share it with a select set of collaborators, or make it when the focus was on high-throughput business transac- public. The discussion feature of Fusion Tables allows col- tions, and the processing of complex SQL queries, and un- laborators to post or respond to comments at the level of ∗On leave from Aalborg University. individual rows, columns, or cells. Users can interact with data through a Web interface or through programs that ac- †On leave from M.I.T. cess the data through an API. The features we currently support represent an initial subset of new and traditional data management functionality that were deemed to be of the highest priority. Permission to make digital or hard copies of all or part of this work for This paper describes the design goals of Fusion Tables personal or classroom use is granted without fee provided that copies are and the functionality we built in support of this design. A not made or distributed for profit or commercial advantage and that copies companion paper [4] provides details of the underlying ar- bear this notice and the full citation on the first page. To copy otherwise, to chitecture and our implementation. republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. We begin by outlining our design goals and the principles SIGMOD’10, June 6–11, 2010, Indianapolis, Indiana, USA. underlying the design of Fusion Tables. Section 3 describes Copyright 2010 ACM 978-1-4503-0032-2/10/06 ...$10.00.

2.selected features that address our goals in the context of 2.2 Underlying Principles data acquisition, sharing, collaboration and visualization. Fusion Tables aims to adhere to a small set of guiding Section 4 describes our API. Finally, Section 5 covers related principles that we believe are important in enabling appli- work, and Section 6 concludes. cations such as those just outlined. Subsequent sections de- scribe how these principles are currently reflected in Fusion Tables. We note that in all cases, following these principles 2. DESIGN FOUNDATIONS offers an agenda for a continuous process of improvements. Fusion Tables is designed with new applications in mind and according to a set of guiding principles that we consider Provide Seamless Integration with the Web important to enabling the intended applications. In today’s computing environment, access to the Internet can increasingly be taken for granted. It is therefore impor- 2.1 New Applications tant that data management functionality is integrated seam- The goal of Fusion Tables is not to replace traditional lessly with the Web, and is able to leverage other properties database management systems and applications, and nei- and services on the Web. Other web properties may serve as ther is the goal to simply move such applications into the entry points into data management as well as venues for pub- cloud. In contrast, the objective is to offer data management lishing and visualizing data. Other services, e.g., geocoding functionality that exploits today’s computing environment of location names, can be used to add value to the data. in order to effectively enable new users and uses of data First, Fusion Tables allows users to publish their visual- management technology. izations on the Web. We enable users to create bar charts, The following are example applications for which Fusion pie charts, timelines, motion charts, etc., and embed them Tables was designed. Each application either mirrors or is on any web page. Especially popular are map visualizations inspired by actual, ongoing uses of Fusion Tables. that enable users to display geo-spatial datasets on Google Maps. • Ecologists in the rain forests of Costa Rica collect spec- Second, public datasets on Fusion Tables can be crawled imens of animal and plant life. They want to maintain by search engines and hence have a chance of showing up as records of the specimens and also want to include the search results. The data is therefore automatically accessible related genetic information that is being produced for through web search, the primary method for locating data them by a laboratory in Canada. on the Web. Third, Google and others already have a powerful collab- • A non-profit that wants to publish datasets about the oration model for documents and spreadsheets, and Fusion availability and usage of water resources. They would Tables is designed to integrate seamlessly with such estab- like to use visualizations as a means of painting a com- lished models. pelling story about the dire state of water availability In a nutshell, we wanted the data management service to and quality around the planet (see www.circleofblue.org [2]). be an integral component of the eco-system of data on the Web. • The Ministry of Health in an African country wants Emphasize Ease of Use to obtain community input on the current status of health clinics dispersed across the country. A fundamental emphasis on ease of use is essential in reach- ing a much broader class of potential users with data man- agement needs, but who are not IT professionals and who • The International Coffee Organization collects and dis- may not have any training in data management. In keeping tributes data about coffee exports and imports, and with this principle, design decisions are made that prioritize wants to make this data available to interested parties ease of use over other requirements. One aspect of ensuring globally. ease of use is to apply pay-as-you-go data management prin- ciples [3], the key idea being that a user should see an imme- • An epidemiologist seeks to turn dry tables of numbers diate benefit of investing time in using the data management into a visual story, creating broader awareness more functionality. As part of this, little initial investment should quickly, this way facilitating faster and more effective be required by the user. responses to disease trends. For example, being a cloud-based service, Fusion Tables requires no initial installation. As another example, Fusion • Congressional staffers want to visualize data to help a Tables does not require the user to declare a schema up senator make an argument. front, but rather tries to automatically determine the data types of columns for common and useful data types. • The administrators of a web site managing a database of biking trails want to publish the contents of their Provide Incentives for Sharing Data database in such a way that they can programmatically Users often desire to share data with others. However, they manage their data while their users explore (browse are faced with a number of disincentives. Data owners are and filter) the trails on a Map (see www.mtbguru.com [6]). afraid of loss of attribution, of misuse and corruption of their data, and of others not being able to find the data easily. • A dairy farm in Brazil that is being managed by its Fusion Tables aims to address such concerns. owner in Thailand and his partner in California wants As already mentioned, when a user specifies that a certain to enable data-based collaboration among the three dataset is public, we make that data crawlable by search en- sites. gines, so it can appear in search results. Likewise, while all

3.datasets can be visualized in different ways on the Fusion Tables web site, only the public ones can have their visual- izations embedded on web pages. Facilitate Collaboration Collaboration among multiple parties on the Web from dif- ferent enterprises and organizations is a key to data manage- ment today. Valuable insight arises when data is combined from multiple sources and when data is scrutinized from the perspectives of multiple users. Fusion Tables facilitates such joining of data and enables collaborators to discuss and comment on the data at several levels of granularity. Figure 1: Collaborating with others. In addition to 3. DATA MANAGEMENT WITH FUSION specifying collaborators with read or write permis- sions, Fusion Tables also allows collaborators who TABLES can add columns to an existing dataset, but still We cover novel aspects of Fusion Tables, covering first the keeping distinct write permissions. acquisition of data, then the support for collaboration and sharing, and finally the support for visualizations. a local copy), they can restrict the ability of other users to do so. 3.1 Data Acquisition Search: Next, making data public in order to share it with Fusion Tables enables users to upload files containing struc- others is useless unless the data can easily be discovered by tured data. The currently supported formats include CSV interested users. Thus, rendering the data in Fusion Tables (Comma Separated Values), different spreadsheet formats searchable is an important ongoing effort. Our main goal (Excel, Open Office, and Google Spreadsheets), and KML is to make public data discoverable by search engines, so (Keyhole Markup Language). they can direct users to the data when this is relevant to To achieve ease of use, the number of steps a user needs their queries. Whenever a user makes a dataset public in to go through before the data is in the systems is reduced Fusion Tables, we create a corresponding HTML page that as much as possible. Rather than having the user declare a is crawlable by search engines. As such, some queries will schema for the tabular data and then having the user also obtain these tables as results. describe how the data to be imported matches the schema, It is also important to enable discovery of data inside Fu- the system tries to detect automatically which row in the sion Tables. Thus, we are pursuing efforts to enable an ad- imported file is the header row (i.e., specifying the names of vanced search for tables from within Fusion Tables. This the columns), and it simply asks the user to verify its guess. service will be known to a much narrower set of users (as In addition, even though some processes (e.g., indexing, type is often the case for specialized search engines); it is meant guessing) are going on in the background, we try to maintain to support those users who have a specific need to explore a responsive import process. the collection of structured data. Searching over a reposi- Further, the system does not ask the user to specify data tory of tables is neither a simple nor a solved problem. The types for the columns identified. Instead, as we describe typical signals that are used for document retrieval may not shortly, it attempts to guess the types from the data (some apply in this context [1]. Our search service is based on an of our users may not even understand the concept of schema extension of the techniques developed by Cafarella et al. [1]. versus instance or that of a data type). In keeping with pay- as-you-go principles, users can always specify data types if Sharing and integration: As a step towards supporting they so desire. The system also encourages the users to collaboration, Fusion Tables enables users to explicitly and specify any other descriptive information about their data easily share data with one another, even if the users are not that may be useful to others. in the same organization; and it enables users to merge data from multiple owners. 3.2 Data Sharing and Collaboration Fusion Tables follows the typical model of document shar- ing in the cloud. A user can decide to invite a set of collab- The data import step also addresses concerns that some orators to either view a table or provide them update access users may have about sharing data. The typical concerns as well. In addition, a user can decide to make a dataset we hear from data owners involve (1) loss of control over public, which enables anyone to view and comment on it. their data once they upload and share it with others, (2) The basic collaboration model is extended with the ability losing credit for creating the data, and (3) the possibility to invite contributors (see Figure 1), who can contribute that others will use or interpret the data incorrectly. their own columns to a table, with the different owners of Attribution and export: Fusion Tables provides several columns maintaining write conditions on their own columns. features to address these concerns. First, users can finely control who they share the data with. Second, users can For example, in the scenario of the ecologists in Costa specify an attribution for the data that is always carried Rica, the ecologists contribute the columns about the spec- around with the data, no matter what transformation gets imens, and the lab in Canada contributes the columns with applied to the data. Third, while the original owners of the the genetic information. This yields a shared, jointly owned data can always export it outside Fusion Tables (e.g., to save dataset that they can explore together. This scenario illus-

4.Figure 2: Merging data from multiple sources. Ta- bles belonging to different owners can be merged by a join on a column containing values from the same domain. Figure 3: Displaying an intensity map after import- ing data about coffee consumption. Fusion Tables trates the ability to merge data from multiple tables with detects a geo-location column in the data and offers different owners. Despite the progress on data integration map visualizations. tools, such sharing of data across multiple enterprises is still very hard in practice. In Fusion Tables, we decided to begin by supporting the simplest kind of data integration possible. view of the data. Specifically, if a user creates two views V1 When multiple parties have data about the same entities, and V2 from a table T , then the discussions on V1 will not the first step in initiating a collaboration is to be able to be visible in V2 and vice-versa nor will they be visible in T see the data side by side. Hence, Fusion Tables supports a itself. The reason for this design is that the views V1 and V2 Merge operation that essentially performs a join on a key col- may be visible to different sets of users, and therefore some umn that is shared by two (or more) sources. For example, discussions may need to be hidden. In addition, discussions Figure 2 shows how we can merge data about coffee produc- may serve different purposes, so even if they are visible by tion and coffee consumption on a common key (country). the same users, we may want to keep them apart. The tables may come from two different sources. In the case of the ecologists in Costa Rica and the lab in Canada, they 3.3 Data Manipulation and Visualization would collaborate by merging their tables on the Specimen Once the data is imported, Fusion Tables enables users to ID column. explore their data with a combination of data visualization Discussions: Sharing with others or integrating data from and SQL-like querying. multiple sources typically just represents an intermediate Fusion Tables makes it easy to visualize the data in dif- step in the lifecycle of the data, where the different collabo- ferent ways. In particular, if the system finds that a certain rators want to study the data. column has values that are mostly geographic locations, then Fusion Tables offers a discussion feature that supports it will offer the user several map viewing options, as exem- in-depth collaboration. Discussions may point out outliers, plified in Figures 3, 4 and 5. Likewise, the presence of date may detect incorrect data, or may question the underlying and time columns enable timelines and motion charts, while assumptions and semantics of the data. Specifically, Fusion numeric columns enable bar charts and pie charts. Tables lets users post and respond to comments at several The applicability of a particular visualization depends on levels of granularity: rows, columns, and individual cells. the data types of the columns in a dataset. Thus, consider- We find that enabling discussions at all these levels of gran- able attention has been given to recognizing when a column ularity is crucial for large datasets because it is otherwise in a dataset is a set of locations that can be plotted on a hard to keep track of the discussions and of their specific map, or when it is a time point that can be shown on a time contexts. line. Discussions are handled in an append-only fashion so that We chose to focus on the geo-location and time data types new comments are simply appended to the existing comment because they by far overwhelm any other data types in terms trails. An interesting aspect of the discussion mechanism is of the opportunities they present for useful visualizations, that if a change is made to the value of a cell then the change and because they occur frequently in practice. Once a user also becomes part of the discussion trail on the cell. This chooses a type of visualization, the system uses the column preserves the context of the comments and enables docu- data type information available to guess how to specifically mentation of the reasons for changing the value. When a visualize the data. user views a table, a discussion panel shows the user which When a visualization has been created, the user can ask parts of the table are being discussed actively. the system for an HTML snippet that can be pasted into We note that discussions are associated with a particular another web property such as a blog. The visualization then

5. An important aspect of being a platform for data man- agement and collaboration is to provide developers with a way to extend the functionality of the site. We accomplish this through an API. The API allows external developers to write applications that use Fusion Tables as a database. For example, the site mtbguru.com, has written an application that synchronizes their collection of bike routes with a table in Fusion Tables. The map visualization in Figure 4 was created over their dataset. The API supports querying of data through select state- ments, update of the data through insert, delete, and update statements, and data definition through a create table state- ment. All access to data through the API is authenticated through pre-existing methods for all Google properties. 5. RELATED WORK Figure 4: Visualization of all bike trails in the San Fusion Tables is inspired in part by ManyEyes (many- Francisco Bay Area that are shorter than 20 miles. eyes.com) that enables users to upload data and visualize it in several ways. We go further by providing data manage- ment capabilities and a sharing model that does not require always making your data public. We strive to preserve the ease of use provided by spreadsheets, but adapt it to larger datasets where the data and the presentation need not nec- essarily be one of the same. Several online database manage- ment tools exist (e.g., DabbleDB (dabbledb.com), Socrata (socrata.com), Factual (factual.com), but Fusion Tables fo- cuses on the collaboration aspects of data management and handles larger datasets. In comparison to other products, Fusion Tables emphasizes the deep integration into a maps infrastructure that is proving immensely popular. Wolfram Alpha is a search engine for structured data, but our focus here is on enabling users to manager their own data. Figure 5: Heatmap for the forest cover in Mexico. There are also several project related to structured data at Google. Google Public Data is an effort to import pub- lic government data and provide high-quality and carefully- appears as a live gadget on that property, meaning that the chosen visualizations of data in response search queries. For visualization acts as a view on the underlying data and thus example, a query on “california unemployment rate” will tracks the changes in the data. The user can also share a yield a thumbnail of a graph with the data that the user visualization with others by sending them a URL (called a can explore in more detail. The Google Squared Service snap) of the visualization. lets users specify categories of objects (e.g., US Presidents, The result is a very short path from data import to a useful espresso machines) and explore attributes of these entity visualization. The visualization feature with its simple task sets. In this case, the data populating the tables is auto- flow has been incredibly popular with our users. matically extracted from various sources on the Web, and A very popular component of Fusion Tables is the render- may not always be accurate. ing of large geographic datasets. We allow users to upload tables with street addresses, points, lines, or polygons. We render these tables as map layers. The rendering is done on the server side, i.e., we send the client a collection of small 6. CONCLUSIONS images (tiles) that contain the rendered map. Figure 4 shows The goal of Fusion Tables is to enable a much larger class an example of rendering the bike trails in the San Francisco of users to manage their data and to do so in a way that is bay area that are shorter than 20 miles. Figure 5 shows an integrated with their other online activities. Fusion Tables example of a heat map created by Fusion Tables, simply by is part of a bigger effort to encourage data owners to publish selecting the option on the map menu. data on the Web and to make it easier for users to discover We currently provide only the most common query facili- data that is relevant to their needs. ties. We support common selection predicates, grouping and There are many obvious extensions that need to be made aggregation, and projecting out a subset of columns. We de- to to Fusion Tables, starting from providing more expres- scribed briefly our join capability in Section 3.2. The query sive data modeling and query capabilities and providing ad- facilities will be expanded over time. Finally, we also have equate performance on larger datasets. However, our strat- an API for inserting, deleting, and updating rows in a table. egy is to engage our users and prioritize their most acute needs. In particular, the API and the advanced manage- ment of geographical data were inspired by frequent user 4. FUSION TABLES API requests.

6.7. REFERENCES [5] H. V. Jagadish, A. Chapman, A. Elkiss, [1] M. J. Cafarella, A. Halevy, Y. Zhang, D. Z. Wang, and M. Jayapandian, Y. Li, A. Nandi, and C. Yu. Making E. Wu. WebTables: Exploring the Power of Tables on Database Systems Usable. In SIGMOD, 2007. the Web. In VLDB, 2008. [6] MTBGuru tracks as seen through Google Fusion [2] Google Brings Water Data to Life. Tables. http://www.circleofblue.org/waternews/2009/world/google- http://blog.mtbguru.com/2010/02/24/mtbguru-tracks- brings-water-data-to-life/. as-seen-through-google-fusion-tables/. [3] M. Franklin, A. Halevy, and D. Maier. From Databases [7] B. Shneiderman. Extreme Visualization: Squeezing a to Dataspaces: A New Abstraction for Information Billion Records into a Million Pixels. In SIGMOD, Management. SIGMOD Record, 34(4), 2005. 2008. [4] H. Gonzalez, A. Halevy, C. Jensen, A. Langen, [8] F. B. Vi´egas and M. Wattenberg. Transforming data J. Madhavan, R. Shapley, and W. Shen. Google Fusion access through public visualization. In SIGMOD, 2009. Tables: Data Management, Integration and Collaboration in the Cloud. In SOCC, 2010.