The Data Civilizer System

在许多组织中,由于数据通常分散在整个企业,而且往往不一致,因此用户通常很难找到特定任务的发布数据。事实上,数据科学家通常报告说,他们的大部分工作都是在寻找、清理、输入和获取与手头任务相关的数据。为了减少“野外”数据分析所需的“粗劣工作”,我们提出了数据文明人,一个端到端的海量数据管理系统。数据文明器有一个链接图形计算模块来为数据构建一个链接图,以及一个利用链接图来帮助识别与用户任务相关的数据的数据计算模块。它还使用链接图形来发现可能的连接路径,然后在查询中使用。对于实际的查询执行,我们使用一个多存储dbms,它在不同的系统中联合查询处理。此外,数据文明器将数据清理操作集成到查询处理中。因为不同的用户需要调用上述的任务无关顺序,数据教化者嵌入了一个工作流引擎,它可以任意组合不同的模块,以及处理数据更新。我们已经在两个机构,麻省理工学院和默克,部署了我们的初步数据文明系统,并描述了初步的积极经验,表明该系统缩短了寻找、准备和分析数据所需的时间和努力。
展开查看详情

1. The Data Civilizer System Dong Deng∞ Raul Castro Fernandez∞ Ziawasch Abedjan‡ ∗ Sibo Wang† Michael Stonebraker∞ Ahmed Elmagarmid Ihab F. Ilyas Samuel Madden∞ Mourad Ouzzani Nan Tang MIT CSAIL ‡ TU Berlin † Nanyang Technological University ∞ Qatar Computing Research Institute, HBKU University of Waterloo {dongdeng, raulcf, stonebraker, madden}@csail.mit.edu abedjan@tu-berlin.de swang@ntu.edu.sg ilyas@uwaterloo.ca {aelmagarmid, mouzzani, ntang}@hbku.org.qa ABSTRACT On-Demand Data Cleaning In many organizations, it is often challenging for users to find rele- Join Path vant data for specific tasks, since the data is usually scattered across Online Selection the enterprise and often inconsistent. In fact, data scientists routinely Data Discovery report that the majority of their effort is spent finding, cleaning, in- Workflow Orchestrator tegrating, and accessing data of interest to a task at hand. In order Data Profiler to decrease the “grunt work” needed to facilitate the analysis of Silo 2 data “in the wild”, we present DATA C IVILIZER, an end-to-end big Offline Graph Builder Linkage Graph data management system. DATA C IVILIZER has a linkage graph computation module to build a linkage graph for the data and a data Cleanliness Silo 1 Estimator discovery module which utilizes the linkage graph to help identify data that is relevant to user tasks. It also uses the linkage graph Polystore (BigDawg) to discover possible join paths that can then be used in a query. Figure 1: Data Civilizer Architecture For the actual query execution, we use a polystore DBMS, which federates query processing across disparate systems. In addition, purpose is to decrease the “grunt work factor” by helping data DATA C IVILIZER integrates data cleaning operations into query pro- scientists to (i) quickly discover data sets of interest from large cessing. Because different users need to invoke the above tasks in numbers of tables; (ii) link relevant data sets; (iii) compute answers different orders, DATA C IVILIZER embeds a workflow engine which from the disparate data stores that host the discovered data sets; enables the arbitrary composition of different modules, as well as (iv) clean the desired data; and (v) iterate through these tasks using the handling of data updates. We have deployed our preliminary a workflow system, as data scientists often perform these tasks in DATA C IVILIZER system in two institutions, MIT and Merck and different orders. describe initial positive experiences that show the system shortens DATA C IVILIZER consists of two major components, as shown the time and effort required to find, prepare, and analyze data. in Figure 1. The offline component indexes and profiles data sets; these profiles are stored in a linkage graph, which is used to process 1. INTRODUCTION workflow queries online. Data sets in DATA C IVILIZER consist An oft-cited statistic is that data scientists spend 80% of their of structured data, which may be stored in relational databases, time finding, preparing, integrating, and cleaning data sets. The spreadsheets, or other structured formats. In the remainder of the remaining 20% is spent doing the desired analytic tasks. In practice, paper, we use the term tables or data sets to refer to these. The online 80% may be a lower bound; for example one data officer, Mark component involves executing a user-supplied workflow that consists Schreiber of Merck, a large pharmaceutical company, estimates that of a mix of discovery, join path selection, and cleaning operations data scientists in Merck spend 98% of their time on “grunt work” on data sets, all supported via interactions with the linkage graph. and only one hour per week on “useful work”. We elaborate on each module in the remainder of this section, using In this paper, we present DATA C IVILIZER, a system we are Merck as an example. building at MIT, QCRI, Waterloo, and TU Berlin, whose main [Linkage Graph Computation.] The linkages amongst attributes ∗Work done while doing an internship at QCRI. and tables are needed so that the user can discover interesting data, assemble relational schemas, and run SQL queries on the discovered data sets. Initially, all the linkages that can be found in linear time are built during the offline stage. As existing data changes and new tables are added, the graph is incrementally updated to reflect This article is published under a Creative Commons Attribution License these changes. All the other linkages, such as primary key-foreign (http://creativecommons.org/licenses/by/3.0/), which permits distribution key (PK-FK) relationships, will be built in the background or on- and reproduction in any medium as well allowing derivative works, pro- demand during the online stage. Our current prototype identifies vided that you attribute the original work to the author(s) and CIDR 2017. PK-FK relationships, using inclusion dependencies. As such, the 8th Biennial Conference on Innovative Data Systems Research (CIDR ‘17) January 8-11, 2017 , Chaminade, California, USA. process of building the linkage graph can be run dynamically and

2.on-demand; we discuss its details in Section 2. accuracy and point solutions neither offer enough functionalities nor [Discovery.] A data scientist at Merck has a hypothesis, for exam- achieve acceptable performance. Therefore an ensemble approach ple, the drug Ritalin causes brain cancer in rats weighing more than is needed. 300 grams. His first job is to identify relevant data sets, both inside and outside of Merck, that might contribute to testing this hypoth- 2. LINKAGE GRAPH COMPUTATION esis. Inside the company alone, Merck has approximately 4,000 In this section, we discuss how to build the linkage graph of Oracle databases and countless other repositories. The discovery DATA C IVILIZER. We first show how to create the data profiles in component in DATA C IVILIZER will assist the scientist in finding linear time — which can then be fed into the linkage graph — and tables of interest from all the Merck tables. Discovery queries are then formally define our graph model in Section 2.2. Finally, as run as a part of the online workflow which is discussed in Section 3. an example of linkages, we focus on how we discover and refine [Polystore Query Processing and Curation.] Since organizations PK-FK relationships in Sections 2.3 and 2.4, respectively. such as Merck have a variety of massive-scale data storage systems, it is not feasible to move all data to a central data warehouse. Also, 2.1 Data Profiling at Scale it is neither economically nor technically practical to perform data The key idea is to summarize each column of each table into a processing on all of the thousands of databases in advance. DATA profile. A profile consists of one or more signatures; each signature C IVILIZER is built using a polystore architecture [12] that federates summarizes the original contents of the column into a domain- query processing across disparate systems inside an enterprise. Our dependent, compact representation of the original data. By default, plan is to leverage the BigDAWG polystore system [13] to pull data signatures for numerical values consist of a histogram representing from multiple underlying storage engines to compute the final result the data distribution and for textual values, they consist of a vector (or the view) that satisfies the user’s specifications. of the most significant words in the column, indexed using a hash Obviously, data cleaning, data transformation and entity consoli- table. Our profiles also contain information about data cardinality, dation must be integrated with querying the polystore and construct- data type, and numerical ranges if applicable [5]. In addition to the ing the desired user view. This is an expensive process with the profiles, we maintain a global keyword index to facilitate keyword human effort required to validate cleaning decisions being the most queries. The index maps keywords to columns containing the word important cost. Since there may be multiple views that “solve” a in their names or content. data scientist’s query, each with different accuracy and human vali- To run at scale, data profiling relies on sampling to estimate the dation cost, DATA C IVILIZER must estimate the cost of curating the cardinality of columns and the quantiles for numerical data. This possible views, given a scientist’s time budget. The above aspects avoids having to sort each column, which is not a linear computation. will be discussed in Section 4. This also allows us to avoid making a copy of the data, reducing memory pressure. [Updates.] If a source data set is updated, these updates must The profiler consists of a pipeline of stages. The first stage per- be incrementally propagated through the data curation pipeline to forms basic de-noising of the data, such as removing empty values update downstream materialized views. In some cases, the human and dealing with potential formatting issues of the underlying files, effort involved may be daunting and the materialized view should e.g., figuring out the separator for a CSV file. The second stage be discarded rather than updated. In addition, if a scientist updates determines the data type of each column, information that is propa- a view, we need to propagate changes to other derived views, as gated along with the data to the next stages. The rest of the stages well as back upstream to data sources, if this is possible. Section 5 are responsible for computing cardinalities, and performing quantile discusses these issues. estimation, e.g., to determine numerical ranges. [Workflow.] DATA C IVILIZER offers a workflow engine whereby The profiler has connectors to read data from different data data scientists can iterate over its components in whatever order they sources, such as HDFS files, RDBMS, and CSV files. Data sets wish. Moreover, they need to be able to undo previous workflow are streamed through the profiler pipeline; the pipeline outputs the steps and perform alternate branching from the result of a workflow results to a profile store where it is later consumed by the graph step. Section 6 discusses our workflow management ideas. builder, which will be explained in Section 2.2. The profiler works in a distributed environment using a lightweight coordinator that We describe the state of the current implementation of DATA splits the work among multiple processors, helping to scale the C IVILIZER in Section 7. We also report on initial user experience computation. for two use cases: the MIT data warehouse and Merck. We conclude with final remarks in Section 8. 2.2 Graph Builder The main contribution of DATA C IVILIZER is that it is an “end- The linkage graph is a graph with both simple nodes, which rep- to-end” system. In contrast, there has been much work on “point resent columns, and hyper-nodes which are multiple simple nodes solutions” that solve small pieces of the overall problem. For ex- that represent tables or compound keys. Each edge represents a rela- ample, Data Wrangler [16] and DataXFormer [6] automate some tionship between two nodes (either simple nodes or hyper-nodes) aspects of cleaning and transforming data, Data Tamer [21] mainly in the graph. Note that in this way the model can express rela- performs schema mappings and record linkage, NADEEF [10] and tionships involving individual columns as well as groups of them. BigDansing [17] provide generic programming interface for users This is necessary to capture relationships among tables, e.g., PK-FK to define their data quality rules, and DeepDive [20] extracts facts relationship between compound keys. and structured information from large corpora of text, images and Examples of relationships are column similarity, schema similar- other unstructured sources. However, no solution performs dis- ity, structure similarity, inclusion dependency, PK-FK relationship, covery, linkage graph computation, and polystore operations in and table subsumption. The node label contains table metadata, concert. In addition, we have recently studied several representa- such as table name, cardinality and other information computed tive data cleaning systems on a collection of real world data sets by the profiler. The edge label includes metadata about the rela- “from the wild” [4] and we found out that there was no cleaning tionship it represents, e.g., type and score, if any. Computing the Esperanto. Hence multiple tools are required to achieve reasonable different relationships requires different time complexities. We thus

3.categorize them into light relationships, which can be computed When dealing with compound columns, we can utilize different in sub-quadratic time, and heavy relationships, which need at least text similarity functions on different fields. Also, we must compose quadratic time. The light relationships contain column similarity the individual column scores to achieve an overall strength. and schema similarity. The heavy relationships include primary key-foreign key (PK-FK), inclusion dependency, and structure simi- 2.4 Refine Candidate PK-FK Relationships larity. The error-robust inclusion dependency algorithm returns a col- The graph builder first computes the light relationships amongst lection of candidate PK-FK relationships. We use the algorithms pairs of nodes in advance during offline processing. In particular, from [19] to refine our candidate selection. The authors proposed the graph builder first finds column similarity and schema similarity 10 different features to distinguish foreign key constraints from relationships. Each edge is assigned a weight that represents the spurious inclusion dependencies. Consider two columns R[X] and strength of the relationship: the particular meaning depends on S[Y ] with an inclusion dependency from the first one to the second the edge semantics. Also, edges with a similarity less than a user- one. Examples of features include coverage, the ratio of distinct defined threshold are discarded to avoid having a graph with too values in R[X] that are contained in S[Y ], column name similarity, many edges that are not significant. A straightforward computation the similarity between the attribute names, and out-of-domain range, of such relationships requires an O(n2 ) computation, which will not the percentage of values in S[Y ] not within [min(R[X]), max(R[X])] scale. However, for light-relationships, which depend on similarity where min(R[X]) and max(R[X]) are respectively the minimum and metrics such as edit distance and Jaccard similarlity, we can employ maximum values in R[X]. locality sensitive hashing (LSH) [11] to yield sub-quadratic runtime; Using the defined features, we implemented the four machine we have found LSH to run quite well in practice, even for data sets learning classification algorithms from Rostin et al. [19]. These up to several Terabytes. allow us to distinguish spurious inclusion dependencies from real It is usually expensive to compute the heavy relationships. To ones with high accuracy. We add the highest confidence candidate address this issue, the graph builder adopts the following general PK-FK pairs as edges in the linkage graph. approaches: (i) the computation of heavy relationships is performed In Section 3, we will discuss how to utilize the linkage graph to in the background, (ii) data profiles and light relationships are used help the user discover his interesting data. The PK-FK relationships as much as possible to prune the search space for the heavy relation- can be used to link the tables of interest from the discovery module. ships, and (iii) after the user chooses the data of interest, if the heavy However, there may exist multiple subgraphs in the linkage graph relationships are not ready amongst this data, the graph builder can that can connect all the interesting tables, which we call join paths. construct the heavy relationships online. In Section 4, we discuss the choice of which join paths to use to We use the PK-FK relationships as an example of how we effi- materialize a view for the user. ciently compute a heavy relationship and deal with the complexities that arise in real data where errors and noise make finding such 3. DISCOVERY relationships tricky. The graph builder utilizes an implementation The data discovery module’s goal is to find relevant data from per- of inclusion dependency (i.e., column A includes all the values haps hundreds to thousands of data sets spread across the different of column B) to find candidate PK-FK relationships, since a PK storage systems of an enterprise, using the linkage graph. column should include all values in an FK column. We then refine Current solutions to finding relevant data include asking an expert the candidates using machine learning methods. However, as it is (if one is available) or performing manual exploration by inspect- well known, data in the wild is quite dirty, which makes discovering ing data sets one by one. Obviously, both approaches are time- inclusion dependencies difficult. Specifically, foreign keys may consuming and prone to missing relevant data sources. Recently, not match a primary key, because of errors in either the PK or the there have been efforts to automate this process, some examples are FK. To tolerate errors, we extend traditional inclusion dependency Goods [14] and InfoGather [23]. Goods permits users to inspect discovery by both key coverage and text similarity and propose an datasets and find others similar to datasets of interest. Infogather error-robust inclusion dependency approach. We describe this ap- permits to extend the attributes of a dataset with attributes from proach in Section 2.3 and the machine learning techniques to refine other datasets. These systems are designed to solve these particular the candidate PK-FK relationships in Section 2.4. use cases, and so their indexes are built for this specific purpose. In contrast, our discovery module aims to support general discovery 2.3 Error-Robust Inclusion Dependency queries that can then be used for unanticipated discovery needs. As noted above, PK-FK relationships are usually identified using Hence, the discovery module uses a linkage graph (Section 2.2) to inclusion dependency techniques. To overcome the presence of permit a broader range of queries. dirty data, we propose an error-robust inclusion dependency scheme. The data discovery module narrows down the search for relevant Consider two columns (or compound columns) from two tables R data from the entire space of available data sources to a handful, and S denoted by R[X] and S[Y ]. If there is a foreign key constraint on which finer-grained relationships can be computed efficiently. on R[X] with reference S[Y ], all the values in R[X] must appear DATA C IVILIZER discovery supports a user-level API to search for in S[Y ], which yields an inclusion dependency from R[X] to S[Y ]. relevant data using a variety of techniques such as searching for However, in the real world, values in foreign key fields may not tables with specific substrings in the schemas (schema search) or exist in the primary key fields due to errors. In this case an inclusion table values (content search). dependency from the foreign key to the primary key does not hold. To address this, for each distinct value in R[X], we calculate 3.1 Discovery Queries the text similarity to values in S[Y ] and use the similarity as the Users can submit discovery queries to find data sets whose schemas strength of a value matching. We then compute the total strength are relevant to the task at hand. For example, to find all data sets of the maximum matching between the two columns divided by that contain names of employees in a company, the user issues a the number of values in R[X]. This is the overall strength of the discovery query that searches for all attributes referring to an em- inclusion dependency. If this number exceeds a set threshold δ , we ployee or a name. The user can then write a SQL query to filter add an inclusion dependency from R[X] to S[Y ] in our linkage graph. from a list. Here are some examples of functions that the discovery

4.module provides to users; each of these queries can be answered In DATA C IVILIZER, a user has to decide how to trade off data using data collected as a part of profiling as well as the information quality and cleaning cost. DATA C IVILIZER defines two parameters, in the linkage graph: both under the user’s control. ◦ Fill-in schema. Given a set of names, this function retrieves 1. Minimize cost for a specific cleanliness metric. In this case, tables or groups of tables that contain the desired attributes. the user requires the data to be a certain percentage, P, correct The core of the operation consists of finding tables with simi- and will spend the least it takes to get to that point. lar attribute names to provided ones, using the keyword index. 2. Maximize accuracy for a specific cost. In this case, the user ◦ Extend attribute. This is similar to the extend operator of is willing to spend M and wishes to make the data as clean as Octopus [7], or the ABA operation of Infogather [23]. Using possible. this function users can extend tables of interest with addi- Sometimes the user is the one actually cleaning the data. In this tional attributes. The core of the operation consists of finding case, she can use P and M to quantify the value of her time. In other matches to the current table using the linkage graph, and then cases, cleaning is performed by other domain experts, who generally retrieving attributes that do not appear in this table. need to be paid. In this case, P and M are statements about budget ◦ Subsumption relationships. Given some reference table, priorities. this function provides a list of tables or groups of tables that Sometimes the user may prefer the join path that yields the largest have some form of subsumption relationship (i.e., are con- view size, i.e., the number of rows in the view. For example, the tained in or contain) with respect to the reference table; this user may want a view with 1,000 rows and 90% cleanliness rather can be done directly from the linkage graph. than a view with 10 rows and 98% cleanliness. We can combine all these facets as the criteria for join path selection. Estimating result ◦ Similarity and lookup. Discovery can also be used to find set sizes has been extensively studied in the literature [15] and we schemas that contain data similar to some provided schema just utilize the cardinality estimates produced by BigDAWG. (attribute or table), as well as to find schemas that contain As a result, DATA C IVILIZER must make the following decisions. certain values, i.e., keyword search for textual data and range First, it needs to assess the cleanliness of the result of any given join search for numerical data. path. We discuss this issue in Section 4.2. Then, we need to choose Discovery works similarly to an information retrieval system: where to place, in the resulting query plan, data cleaning operations schema retrieval queries do not return a specific answer, but in- to be the most efficient. This is the topic of Section 4.3. stead return a ranked set of results. We are currently exploring 4.2 Cleanliness Model several ranking options, and a model that decouples discovery query To assess the cleanliness of join paths and eventually full queries, processing from ranking evaluation seems the most promising. we need to devise an accuracy metric for each cell, namely an estimate for the probability of the cell to be erroneous. To this end, 4. POLYSTORE QUERY PROCESSING DATA C IVILIZER first applies automatic error detection tools and DATA C IVILIZER uses the BigDAWG polystore [13] to federate then uses a composite metric to accumulate detected errors. access to multiple storage systems inside an organization. BigDAWG While there are several types of data errors that we could consider, consists of a middleware query optimizer and executor, and shims to in DATA C IVILIZER we limit to the following three types: various local storage systems. We assume that a user has run discov- (i) Outliers – these are data values that are distant from the distribu- ery and that the corresponding linkage graph has been computed. If tion of values in the same column. the user is interested in a composite table containing a particular set of columns that are not available in any single data source, we can (ii) Duplicates – these are groups of tuples that refer to the same use the PK-FK mining techniques described in the previous section real-world entity. The mismatching values between these duplicate to find a set of possible join paths that can be used to materialize tuples is considered to be an error. the table of interest to the user. Furthermore, for any particular (iii) Integrity Constraint (IC) Violations – these include functional join path, it is straightforward to construct the BigDAWG query that dependency (FD) violations and inclusion dependency (IND) viola- materializes the view specified by each join path. However, we still tions. There is a violation between two tuples in a table R of an FD need to choose which of several possible join paths is the best to ϕ : R(X → Y ) if their values on the LHS of ϕ are the same while compute the table of interest to the user. their values on the RHS are not. There is a violation between tuples 4.1 Join Path Selection in tables R and S of an IND: R[X] ⊆ S[Y ] if a value on the LHS does not exist in the RHS. To detect more errors, we can further conduct The conventional data federation wisdom is to choose the join schema mapping, merge semantically equivalent tables, and detect path that minimizes the query processing cost. However, this ignores the IC violations in the merged table. data cleaning issues, to which we now turn. To achieve high quality results, one has to clean the data in conjunction with querying the We have recently studied several representative data cleaning sys- table. For example, if one has a data value, New Yark, and wants tems on a collection of real world data sets “from the wild” [4] to transform it to one of its airport codes (JFK, LGA), then one and we found out that there was no cleaning Esperanto. Hence must correct the data to New York, prior to the airport code lookup. multiple tools are required to achieve reasonable accuracy and point Cleaning usually entails a human specification of the correct value solutions neither offer enough functionalities nor achieve acceptable or a review of the value produced by an automatic algorithm. Such performance. Therefore an ensemble approach is required. For this human costs can easily dominate the total query execution time and purpose, we model the errors detected from multiple tools (we use represent a significant financial cost in many data cleaning scenarios. one tool for each of the above error category) as a conflict hyper- As such, one goal of DATA C IVILIZER is to choose the join path graph [9] and compose an overall cleanliness score for each cell. that produces the highest quality answer, rather than the one that is Next we discuss how to estimate the cleanliness of a join path, based easiest to compute. on the cleanliness estimation of each cell participating in the join.

5. For each join path, we first construct a query corresponding to DATA C IVILIZER uses three corresponding strategies to cater to this join path and execute it. Then we extend the techniques in data the above updates. lineage [22] to find the cell level lineage λ . Given a cell in the (i) MV maintenance. DATA C IVILIZER will need to incrementally result, λ indicates all the cells in the source table that can affect the propagate updates through the data curation pipeline to update down- probability of the result cell to be clean. To obtain the lineage λ , stream MVs, as well as from intermediate results that are cleaned we propagate for each join predicate the cells in the attributes that back to data sources, if possible. To perform this propagation, we participate in the predicate to the other cells within the same tuple. plan to leverage the techniques in DBRx [8], a system developed by Then for each result cell, we multiply the cleanliness values of all QCRI and Waterloo. In some cases, the human effort to update and the source cells in its lineage to get its cleanliness. In this way, we clean MVs may be daunting, and automatic methods may fail. In can propagate the cleanliness of the source cells to the result cells. such cases, MVs should be discarded rather than updated. The cleanliness of a join path is measured by the average clean- liness of all the cells in the result. DATA C IVILIZER selects the (ii) Provenance management. To perform update propagation, join paths based on the criteria that linearly combines the sizes and DATA C IVILIZER will need to keep track of the relationships be- the cleanlinesses of their corresponding query results. Note that tween data sets (their provenance). DATA C IVILIZER will leverage DATA C IVILIZER also uses schema mapping [21] to cluster the join Decibel [18], a system developed at MIT for this purpose. paths based on their semantics and ranks the join paths within each (iii) Incremental graph updates. As data changes, the linkage categories for users to choose. graph and profiles will also need to be updated. To support in- cremental updates to profiles we use a simple reference counting 4.3 Query Planning with Data Cleaning scheme, where each keyword has a count associated with it that is In a query plan with cleaning operations, the cleaning cost and decremented when a word is removed. A word is removed from the the result quality are different based on the position of the data profile when its count reaches zero. To support updates to the graph, cleaning operation. For example, cleaning a whole column before rather than recomputing relationships every time a table changes, we the selection on this column yields a higher cleaning cost and query maintain an estimate of the fraction of rows in a column that have accuracy than putting the cleaning operation after the selection changed since we last updated its node and edges in the graph, and operation. Obviously expensive cleaning should be performed on re-index a column when this estimate goes above a set threshold. as few records as possible and have as most impact as possible. In DATA C IVILIZER, we assume that given a query and the cleanliness estimation value on each result cell we are give a set budget B 6. TRACTABLE CURATION WORKFLOW of user’s actions to check and eventually change a value if it is The process of data curation entails multiple iterations of discov- erroneous. Based on this input, we need to determine which cells ery, linking, querying, and curation. Cleaning and transformation need to be sent to the user for feedback in order to achieve the procedures must be guided by the user. DATA C IVILIZER will use highest quality gain. a fairly conventional workflow system that will allow a human to More specifically, we first execute the query and obtain its result. construct sequences of operations, undo ones that are unproductive, We then map the cells in the query result back to the cells in the and utilize branching to try multiple processing operations. source tables using the data lineage as discussed in the previous In addition, we plan to build a workflow orchestrator which will section. To achieve the maximum quality and be able to reuse user’s retain previous operation sequences and propose those that best fit cleaning efforts, we focus only on the source cells that are involved a new situation. We expect this tactic to be successful because the in the erroneous result cells. As each source cell can contribute to types of data curation and preparation steps in an enterprise often fol- multiple erroneous result cells, for each source cell, we accumulate low repetitive patterns. Specifically, our workflow orchestrator will its impact on the result and clean them in the decreasing order of store the user query, the sequence of operations in a central workflow their impacts until the budget is exhausted. This is not only effective registry together with the metadata and signatures provided by the in cleaning the current join path results, but also help enhance the data discovery component. quality of the whole system and subsequent queries. The orchestrator will evaluate a new user query and the initial re- sults of the data discovery component against the workflow registry. 5. UPDATES Similar queries and similar data profiles are likely to demand simi- lar cleaning procedures. Previous workflows will be ranked based Real-world data is rarely static and hence some way to deal new on the similarity of the user query and retrieved discovery results. datasets and updates is needed. We consider three types of updates. Accordingly, the related curation and preparation procedures will be (1) Insertions/deletions on source tables. This happens when shown in ranked order to the user. there is a change to a table, e.g., the insertion of a new procure- ment record in the MIT Data Warehouse. This may also happen when data sources are cleaned (see Section 4). 7. DATA CIVILIZER IN THE WILD We have built an initial prototype of DATA C IVILIZER containing (2) Replacement of source tables. Large companies typically rely the linkage graph builder and the data discovery module as discussed on both internal and external information to build their knowledge in Sections 2 and 3, respectively. Ultimately, we will integrate DATA bases. For instance, Merck collects published standard medical C IVILIZER with BigDawg, our implementation of a polystore, to names from the World Health Organization (WHO) to help construct realize the end-to-end system described in this paper. We have been their own ontology. These data sources are updated periodically by working closely with end users to adapt our prototype to relevant WHO. Sometimes, even the format may be changed, e.g., from a real world problems. In particular, we have deployed a preliminary JSON file to a CSV file. prototype of discovery in two different organizations. The MIT (3) Updating Materialized Views (MV). MVs might be created Data warehouse (MIT DWH) is a group at MIT responsible for based on other MVs. Since new data can arrive at any time, and building and maintaining a data warehouse that integrates data from cleaning can be done at any time to any record, MVs may need to multiple source systems. Merck, a big pharmaceutical company, be updated. manages large volumes of data, which are handled by different

6.storage systems. In the following, we first outline the current state ical databases is that each puts an emphasis on different information. of the system and the modules that are currently deployed. We then Analysts typically face situations in which they are interested in a provide details on each organizations’ requirements, and how we set of attributes that are spread across different tables on different are using DATA C IVILIZER to help them. databases. DATA C IVILIZER helps to detect such attributes and bring them together on-demand to serve the users’ purpose. 7.1 Current Data Civilizer Prototype Identify entities. One single chemical entity may be referred to with The current prototype runs as a server and can access data that different identifiers and formats in different databases. Chemical resides in a file system or one or more databases. Clients connect identifiers have been a subject of research in the bioinformatics to DATA C IVILIZER through a RESTful API. The current deploy- community: multiple different formats have been proposed with ment of DATA C IVILIZER includes both the graph builder with the different properties according to the scenario. DATA C IVILIZER error-robust inclusion dependency discovery from Section 2 and the helps on this task by discovering datasets that contain schemas with discovery component described in Section 3. Thus a user is able to the entities of interest. submit queries that consist of attribute names, attribute values, or We have been collaborating with 4 engineers and bioinformatics tables, and receive different types of results based on the selected experts at Merck during the last 2 months. During this period we similarity function. The similarity function depends on the specific have learned about their use cases and we have used discovery on use case, which we will discuss based on our two use cases MIT chemical databases that are publicly available. It is interesting that and Merck. We plan to demo our DATA C IVILIZER prototype at the DATA C IVILIZER was used to discover publicly available chemical conference. datasets first, and that internal datasets will be future work. 7.2 MIT Data Warehouse 8. CONCLUSIONS One of the key tasks of the MIT DWH team is to assist its cus- In this paper, we presented DATA C IVILIZER, an end-to-end big tomers – generally MIT administrators – to answer any of their data management system. DATA C IVILIZER aims to support the questions about MIT data. For example, staff usually want to create discovery of data that is relevant to specific user tasks, the link- reports, for which they need access to various kinds of data. The age of the relevant data for allowing complex polystore queries, warehouse contains around 1TB of data spread across approximately the cleaning of the data under limited budget, the handling of data 3K tables. updates, and the iterative processing of the data. As data in large In the current workflow, a DWH customer presents a question, companies are usually scattered across multiple data storage plat- which a DWH team members addresses by manually searching for forms, we proposed the use of a polystore architecture to deploy our tables containing relevant data. Once they have determined the system. A key characteristic of DATA C IVILIZER is that it places tables of interest, they create a view that is accessed by the customer the data cleaning operations in the query plan while trading-off the to solve the question at hand. Below are some of the common use query result quality with the cleaning costs. We have deployed our cases we have encountered: preliminary system at two different institutions, MIT and Merck, Fill in virtual schema. When a customer arrives with a question and obtained positive feedback from the users. such as: I need to create a report with the gender distribution of the faculty per department and year, the data warehouse personnel can use DATA C IVILIZER to find all the tables that contain schema 9. REFERENCES [1] Chembl. https://www.ebi.ac.uk/chembl/. names similar to the attributes exposed by the query, e.g., gender, faculty name, department, and year. [2] Drugbank. http://www.drugbank.ca/. [3] Pubchem. https://pubchem.ncbi.nlm.nih.gov/. Table redundancy. Multiple views are created for different cus- [4] Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, tomers. Many of them contain very similar data, as multiple cus- M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. tomers are interested in similar items. To reduce the redundancy Detecting data errors: Where are we and what needs to be of data, DATA C IVILIZER helps detect complementary as well as done? PVLDB, 9(12):993–1004, 2016. repeated sources. This sheds light on the status of the warehouse and helps to maintain it tidy and minimal. [5] Z. Abedjan, L. Golab, and F. Naumann. Profiling relational Our prototype is deployed in an Amazon EC2 instance managed data: a survey. 24(4):557–581, 2015. by the MIT DWH team with access to their data. We have been [6] Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, and working with them for 3 months, iterating over priorities and learn- M. Stonebraker. Dataxformer: A robust transformation ing about the problems they are facing. In the future, we aim to discovery system. In ICDE, pages 1134–1145, 2016. deploy the entire DATA C IVILIZER system to enable running queries [7] M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data directly over the hundreds of data sources, by using the polystore integration for the relational web. PVLDB, 2(1):1090–1101, query processing and curation module and linkage graph builder to 2009. create the necessary views on demand. [8] A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In SIGMOD, pages 7.3 Merck 445–456, 2014. Merck is a large pharmaceutical company that manages massive [9] X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: volumes of data spread across around 4K databases in addition to Putting violations into context. In 29th IEEE International several data lakes. One of the data assets of any pharmaceutical com- Conference on Data Engineering, ICDE 2013, Brisbane, pany are internal databases of chemical compounds and structures. Australia, April 8-12, 2013, pages 458–469, 2013. Usually, these are more valuable when integrated with external, well- [10] M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. known and curated databases, such as PubChem [3], ChEMBL [1], Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data or DrugBank [2]. We describe two common use cases that occur in cleaning system. In SIGMOD, 2013. this context: [11] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Enrich data. One of the reasons for the existence of multiple chem- Locality-sensitive hashing scheme based on p-stable

7. distributions. In Symposium on Computational Geometry, J.-A. Quiane-Ruiz, P. Papotti, N. Tang, and S. Yin. pages 253–262, 2004. BigDansing: a system for big data cleansing. In SIGMOD, [12] J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, 2015. B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and [18] M. Maddox, D. Goehring, A. J. Elmore, S. Madden, A. G. S. B. Zdonik. The bigdawg polystore system. SIGMOD Parameswaran, and A. Deshpande. Decibel: The relational Record, 44(2):11–16, 2015. dataset branching system. PVLDB, 9(9):624–635, 2016. [13] A. J. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, [19] A. Rostin, O. Albrecht, J. Bauckmann, F. Naumann, and U. Çetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, U. Leser. A machine learning approach to foreign key T. Kraska, S. Madden, D. Maier, T. G. Mattson, discovery. In WebDB, 2009. S. Papadopoulos, J. Parkhurst, N. Tatbul, M. Vartak, and [20] J. Shin, S. Wu, F. Wang, C. D. Sa, C. Zhang, and C. Ré. S. Zdonik. A demonstration of the bigdawg polystore system. Incremental knowledge base construction using deepdive. PVLDB, 8(12):1908–1911, 2015. PVLDB, 8(11):1310–1321, 2015. [14] A. Y. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, [21] M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, S. Roy, and S. E. Whang. Goods: Organizing google’s M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data datasets. In SIGMOD, pages 795–806, 2016. curation at scale: The data tamer system. In CIDR, 2013. [15] Y. E. Ioannidis and V. Poosala. Balancing histogram [22] J. Widom. Trio: A system for integrated management of data, optimality and practicality for query result size estimation. In accuracy, and lineage. In CIDR, pages 262–276, 2005. SIGMOD, pages 233–244, 1995. [23] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri. [16] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Infogather: entity augmentation and attribute discovery by Interactive visual specification of data transformation scripts. holistic matching with web tables. In SIGMOD, pages 97–108, In ACM Human Factors in Computing Systems (CHI), 2011. 2012. [17] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani,