Arnold: Declarative Crowd-Machine Data Integration

We have developed a declarative approach to data cleaning and integration that balances when and where to apply crowd-sourcing and machine computation using a new type of data independence that we term Labor Independence. Labor Independence divides the logical operations that should be performed on each record from the physical implementations of those operations. Using this layer of independence,the data cleaning process can choose the physical operator for each logical operation that yields the highest quality for the lowest cost. We introduce Arnold, a data cleaning and integration architecture that utilizes Labor Independence to efficiently clean and integrate large amounts of data.
展开查看详情

1. Arnold: Declarative Crowd-Machine Data Integration Shawn R. Jeffery Liwen Sun Matt DeLand Groupon, Inc. UC Berkeley Groupon, Inc. sjeffery@groupon.com liwen@cs.berkeley.edu mdeland@groupon.com Nick Pendar Rick Barber Andrew Galdi Groupon, Inc. Stanford University Stanford University npendar@groupon.com barber5@cs.stanford.edu agaldi@stanford.edu ABSTRACT cel at finding and correcting errors in data, manual input, The availability of rich data from sources such as the World or crowd-sourcing, can be used to provide the high quality Wide Web, social media, and sensor streams is giving rise to data required by these applications. a range of applications that rely on a clean, consistent, and integrated database built over these sources. Human input, ID Name Address Category or crowd-sourcing, is an effective tool to help produce such 1 SFO San Francisco, CA Airport high-quality data. It is infeasible, however, to involve hu- 2 Washington Park Burlingame, CA Park mans at every step of the data cleaning process for all data. 3 French Laundry Yountville, CA ——— We have developed a declarative approach to data clean- ing and integration that balances when and where to apply Table 1: An example geographic location database crowd-sourcing and machine computation using a new type of data independence that we term Labor Independence. La- As a motivating example, consider a data integration sys- bor Independence divides the logical operations that should tem designed to produce a database of geographic locations be performed on each record from the physical implementa- such as businesses, parks, and airports. Such a database is tions of those operations. Using this layer of independence, necessary to drive many location-based services. Input data the data cleaning process can choose the physical operator for this database may come from a variety of sources, includ- for each logical operation that yields the highest quality for ing licensed datasets [5], governmental sources [26], and user- the lowest cost. We introduce Arnold, a data cleaning and generated content. Each data source needs to be individu- integration architecture that utilizes Labor Independence to ally cleaned then integrated to create a single database. A efficiently clean and integrate large amounts of data. simplified version of such a database is shown in Table 1 that consists of a set of records, each of which describes a phys- 1. INTRODUCTION ical location containing a unique identifier, name, address, and category. The quality requirements for these databases Our world is becoming increasingly data-driven; vast amounts are high; for instance, an incorrect address could cause a of data provide unprecedented insights into our own lives [6], user of the database to go to the wrong place. Figure 1 our connections to others [4], and the world around us [3]. shows a simplified cleaning pipeline necessary to build such These data, however, suffer from many data quality issues: a database. First, the address for each record is cleaned they are incorrect, incomplete, and duplicated [24]. Thus, to correct for spelling errors and formatted in a standard in order to leverage these data in applications, they need to manner. Second, each record is classified to determine the be cleaned and integrated to create a correct, complete, and category of the location. Finally, each input record is de- comprehensive database. Furthermore, applications that use duplicated with existing records in the database. these data tend to require a high-level of data quality. Achieving such data quality, however, is a challenge as there is typically a long-tail of “corner cases”: quality is- Clean sues in the data that do not easily fit into any pattern, and Categorize De-duplicate Address as such are not easily handled by existing “good-enough” data cleaning solutions. For instance, state-of-the-art entity resolution algorithms have many challenges in determining duplicates in real-world scenarios [15]. Since humans ex- Figure 1: An example cleaning pipeline to produce a location database. Each node represents a clean- ing operation that the system applies to each input record. In this example, the crowd can be a powerful tool to help This article is published under a Creative Commons Attribution License produce a high-quality database. For instance, a classifica- (http://creativecommons.org/licenses/by/3.0/), which permits distribution tion algorithm may categorize French Laundry as a laun- and reproduction in any medium as well allowing derivative works, pro- dromat based on the name; however, a human would easily vided that you attribute the original work to the author(s) and CIDR 2013. 6th Biennial Conference on Innovative Data Systems Research (CIDR ’13) discover that French Laundry is in fact a restaurant. January 6-9, 2013, Asilomar, California, USA. Of course, human involvement does not easily scale; it is

2.expensive to involve humans in the data cleaning process. is a new type of data independence we term Labor Indepen- Given that it is not feasible to consult the crowd for every dence. Labor Independence provides a high-level declarative operation and for every record, one must decide when and interface that allows applications to define the quality and how to utilize crowd-sourcing. In a typical data integration cost requirements of the database, without needing to spec- system, the complexity of such a decision is compounded by ify how that database is produced. a multitude of factors. While declarative approaches to crowd-sourcing have been Heterogeneous data quality. Since data are integrated extensively studied in the literature (e.g., [8, 19, 23]), Labor from heterogeneous sources, they arrive with various quality Independence is the first approach that hides all details of levels. For the good records, it is often sufficient to process the underlying crowd-sourcing mechanisms from the applica- them with automated algorithms; they would only benefit tion. Existing approaches apply declarative principles at the to a small degree from crowd-sourcing since the quality is operator, task, or crowd level, but still force the application already high. The bad records are typically not easily han- to specify some portion of the crowd-based execution. For dled by automated mechanisms and need to be curated by example, in order to use CrowdDB’s crowd-sourcing func- a human. For example, an address cleaning algorithm will tionality, an application must speficy crowd-sourcing opera- fail if the address is very dirty and cannot be parsed. tions in the DDL and DML [8], requiring intimate knowledge Non-uniform quality requirements. There is typi- of the operations that the crowd is capable of perfoming and cally a wide variance in the importance to the application of their efficacy. When utilizing crowd-sourcing at large scale, the records and attributes in the database. For example, the however, such knowledge is largely inaccessible to an appli- address attribute is critical to power location-based applica- cation developer. In contrast, Labor Independence provides tions, while the category attribute is merely a helpful guide. to applications an interface in which they only need to spec- Similarly, SFO (San Francisco Airport) is a popular loca- ify the data they want, the quality they desire, and how tion in the San Francisco Bay Area, while Washington Park much they are willing to pay, without any knowledge of the is only known to a small community. For these more impor- crowd-sourcing mechanisms that produced those data. In tant records or attributes, applications desire high-quality fact, such a system could not utilize crowd-sourcing at all data and thus more resources should be invested in cleaning as long as it met the application’s desires, all without inter- them. vention from the application developer. As a result, systems Multiple cleaning operations. A typical data integra- can seamlessly integrate crowd-sourcing without having to tion system consists of tens to hundreds of cleaning opera- understand the details of the crowd(s). tions. For each task, the system must decide how and when Beneath that interface, Labor Independence abstracts data to use human input. cleaning tasks such that they can be performed by machines Different expertise. Human and machines excel at dif- or the crowd. It does this by dividing logical operations ferent tasks. Machine algorithms are good at executing rou- that should be performed on each record from the physi- tine tasks efficiently and scalably. Humans, on the other cal implementations of that operation. Using this layer of hand, are able to utilize external data or infer patterns independence, the data integration system can choose the through human judgement that may not be evident in the physical operator for each logical operation that yields the data. highest quality for the lowest cost. Different costs. Machines, in general, are inexpensive, In order to realize Labor Independence, we define a unified while human involvement is expensive. Furthermore, the re- mechanism for expressing and tracking quality and cost in a sponse time of machine algorithms is generally much faster data integration system. Our quality model is based on the than that of the crowd. However, the converse may be true. intuition that the closer the database matches the reality it For some complex machine learning tasks, it may take large represents, the better quality it is. Using this intuition, we amounts of computing resources and time to train and ex- derive a multi-granularity quality model that quantifies the ecute a model while a human could perform the task easily closeness to reality at the attribute, record, and database and cheaply. levels. We also define a cost function that distills all costs, Multiple crowds. While generally crowd-sourcing has including time and resources, into a single monetary cost. referred to a single amorphous crowd, in reality there are We present Arnold, a system that utilizes Labor Inde- typically many different crowds. Basic crowds range from pendence to clean and integrate data in an efficient man- the general-purpose crowds (e.g., Mechanical Turk [20]) to ner. Arnold processes streams of input records from multiple more full-featured crowd platforms [2]. For specialized tasks, sources through set of logical operations, each implemented crowds can be highly-trained workers, such as an internal by one or more physical operators. In order to drive its editorial staff, or groups designed for certain purposes (e.g., optimization, Arnold continuously estimates the quality of physical data collection [10]). Each of these crowds has a the data it is processing as well as the quality and cost of different level and type of expertise. Workers with domain its physical operators. Using these estimates, Arnold opti- knowledge of data (e.g., an editorial staff) are more expen- mizes the selection of physical operators for each record and sive to use but can provide better results. The system must operation to maximize quality or minimize cost. decide between different crowds and expertise. This paper is organized as follows. In Section 2 we present In general, applications desire a high-quality, integrated Labor Independence. We provide an overview of Labor Inde- database with minimal cost; they do not want to under- pendence’s quality and cost models in Section 3. In Section 4 stand the complex trade-offs presented by each of the fac- we outline the Arnold Architecture that instantiates Labor tors above. To this end, we have developed a declarative ap- Independence. Section 5 describes the optimization algo- proach to data cleaning and integration that balances when rithms that enable Arnold to efficiently blend machine and and where to apply crowd-sourcing and machine computa- human labor. Finally, we highlight related work in Section 6 tion. The primary mechanism we use to drive our system and conclude in Section 7.

3.2. LABOR INDEPENDENCE ried out using a regular expression or by utilizing an external In this section we describe Labor Independence, the fun- service [29], whereas de-duplication could be done either al- damental mechanism we use to seamlessly blend machine gorithmically or by two different crowd-based mechanisms. and human labor. We now describe the fundamental components of Labor In this work, we consider large-scale, pay-as-you-go data Independence in more detail. cleaning and integration systems [7, 17]. In such systems, streams of input records are cleaned and integrated into a 2.2 Declarative Interface single database using multiple cleaning operations. These An application interacts with a system implementing La- cleaning operations are expressed as a pipeline of potentially bor Independence through a declarative interface using two hundreds of fine-grained operations that successively process parameters, quality and cost. Quality is quantified as a nu- input records to produce an integrated database. Such a merical score in [0, 1]. Cost is a single number that unifies database is a constantly evolving collection of records, each all types of costs, including money, time, and resources, into of which is represented by a set of attribute-value pairs. The a monetary cost. In Section 3, we will discuss the quality number of sources in these scenarios typically number in and cost models in detail. Using the quality and cost pa- the hundreds or thousands and have high variance in cor- rameters, an application can specify the level of quality it rectness, completeness, and sparsity. In total, such a sys- needs and the amount of money it is willing to pay. In some tem may process billions of records. A prerequisite for any instantiations of Labor Independence, an application may data-driven application built on top of these systems is a only specify one of the two parameters, in which case the high-quality, integrated database. system will minimize or maximize the other. For example, a user may request that the system maximize the quality of 2.1 Overview the data utilizing less than $1.00 per record. Here we present a new type of data independence, Labor 2.3 Logical Layer Independence, that provides applications with a high-level declarative interface to specify the quality requirements of The primary abstraction at the logical layer is a logical the data and the cost the application is willing to pay. Be- operation, which defines a semantic operation that is to be neath that interface, Labor Independence separates logical performed on an input record. Each logical operation is cleaning operations from the physical mechanisms that per- specified via the input, output, and the set attributes of a form those tasks. A system implementing Labor Indepen- record can be mutated by this operation. For example, the dence has a set of one or more physical operators for each CleanAddress operation takes as input a record, outputs a logical operation, each of which has different cost and qual- record, and mutates the address attribute but not others. ity. Utilizing this layer of independence, the system is then Importantly, a logical operation does not specify how that free to choose, for each logical operation and for each record, operation is to be implemented. the appropriate physical operator. Logical operations are combined into a logical operation graph that defines the order in which logical operations should be applied to input records. This ordering may be a fixed ordering of logical operations, or a dependency graph where Clean Address Categorize De-duplicate each logical operation depends on zero or more other logical operations. Logical Layer 2.4 Physical Layer Physical Layer At the physical layer, each logical operation has one or Algorithmic more corresponding physical operators: a concrete imple- Regex Classifier Matching mentation that specifies how to carry out the logical oper- ation. A physical operator may utilize algorithmic mecha- USPS API Mechanical Turk Mechanical Turk nisms or crowd-sourcing techniques, but it must conform to the specifications of the logical operation that it implements. Editorial Staff Each physical operator has an associated quality and cost: the expected quality of data produced by the operator and the monetary cost of executing it. Figure 2: The logical and physical layers of Labor Independence for a location data cleaning pipeline. 3. QUALITY AND COST In order to make informed decisions about when and where We illustrate Labor Independence in Figure 2 as applied to apply manual input, Labor Independence depends an un- to the location database example. In this example, we rep- derstanding of the quality and cost of the records it pro- resent each of the cleaning tasks (Clean Address, Categorize, cesses. In this section we detail our quality and cost models. and De-duplicate) as a logical operation; that is, it specifies the action that should occur for each record, not how to per- 3.1 Quality Model form that action. For each of these operations, there is one The intuition underlying our quality model is that appli- or more physical operators. For example, the Categorize log- cations need data that closely represents reality; thus, we ical operation could be carried out by either an algorithmic quantify data quality by the degree to which the database classifier or by asking the crowd. Note that there may be matches the reality it represents. We describe the salient multiple algorithmic or crowd-based operators for a single features of our quality model here and defer a detailed dis- logical operation. For example, Clean Address could be car- cussion to future work.

4. The primary building block of our quality model is a dis- quality model to capture both missing and duplicate records sonance function, that measures the difference between an at the aggregate-level. Conceptually, a missing record has attribute value and the reality it represents. a quality of 0. Of course, it is unknown what records are missing from a database. Therefore, we can use standard Definition 1 (Dissonance Function). Let r be a record population estimation techniques or even crowd-based esti- in our database and r.a be an attribute a of record r. A disso- mation [28] to estimate the number of missing records. nance function for attribute a, denoted as za , takes as input Similarly, for duplicates we assign a uniqueness score to a the record r and the true value of r.a, T (r.a), and returns set of records corresponding to the percentage of duplicates a numerical score za ∈ [0, 1] that quantifies the distance be- in that set. The set may be the entire database or subsets of tween r.a and T (r.a). the records (e.g., the records in each postal code). Unique- ness scores can also be estimated using sampling techniques The smaller the value for za , the closer the value r.a is to (e.g., [12]). reality. If r.a has a missing value, then za = 1. For each attribute a in an application, the system must define a dis- sonance function. The implementation of these dissonance 3.1.2 Weighted Quality functions are application-specific. For example, we can de- In most data-driven applications, there is a wide variance fine dissonance functions for the example location database: in the importance of different attributes and records: some zname is the normalized string edit distance between the data is queried more frequently, more valuable to a business, name string and the true name string, and zaddress is the or drives critical application features. This observation can normalized euclidean distance between the address and the help realize further efficiency in our data cleaning efforts by true address. focusing on the important data. To quantify importance, we We assert that defining a dissonance function for every define importance weights at both the attribute and record attribute is a tractable problem because the number of at- level. We define for each attribute and record in an applica- tributes is at human scale for any given application (e.g., tion an importance weight in [0, 1] that denotes the relative in the tens to hundreds). Furthermore, since a dissonance importance of the attribute or record to the application. function is at the granularity of a single attribute value, it Since there are typically a large number of records, record is easy for a human to reason about and derive a dissonance weights can be set at an aggregate level. For instance, in the function. location database example record weights can be set based In order to compute a dissonance function, we need to on postal code. Importance weights are normalized across know the true value for the attribute T (r.a). Thus, we rely all attributes or records. We use these weights to compute a on an oracle that is capable of providing the value of T (r.a). weighted average of quality at the record and database level. An oracle may be a trained domain expert or a sophisticated crowd-sourcing algorithm designed to attain high-fidelity an- 3.1.3 Quality of a Physical Operator swers (e.g., [25]). Given r.a and the oracle-provided T (r.a), In order to choose physical operators based on quality, a computing a dissonance function is trivial (e.g., compute the system implementing Labor Independence needs to under- edit distance between r.a and T (r.a)). stand the quality of each physical operator. We define the Using the dissonance value za , we define the quality of an quality of a physical operator as the average quality score of attribute value a as: its output records. By definition, this quality is a numerical score in [0, 1]. We discuss in the next section how we track qa = 1 − za quality for physical operators. Furthermore, using the quality definition for a single at- tribute value, we can derive aggregate quality scores for 3.2 Cost Model coarser-granularity objects such as an entire record and the Similar to quality, Labor Independence needs a unified entire database by simply using an average of the single qual- mechanism for tracking the cost of each record as well as ity scores. For instance, the quality of a record is the average each physical operator. Costs can come in many forms: time of its individual attribute qualities, and the quality of the for an operation to complete, monetary cost for hardware to entire database is the average quality of all records. run algorithms, and payment to crowd workers. We reduce Of course, querying the oracle to discover the true value of all of these costs to a single monetary cost. an attribute is expensive. For example, if we want to assess the average quality of an attribute (e.g., address) in the en- 3.2.1 Cost of a Physical Operator tire database, we cannot afford to send all the records to the Each physical operator incurs a cost when it is executed. oracle. Instead, we use sampling estimation. First, we draw The primary challenge in tracking the cost of physical opera- some uniform samples from the database. For each record tors is that the cost may not be known a priori, especially for r in the sample, we send it to the oracle that provides the crowd-based operators. For example, the completion time of true value of its address T (r.address). We then compute a finding missing data can vary widely. As described in the quality score for r.address using the appropriate dissonance next section, we monitor the cost of each operator and use function. Given the quality scores of all the samples, we use average cost to predict future costs. their average to infer the average quality of the entire pop- ulation. The inaccuracy of this estimation is be bounded by standard error [16]. 4. THE ARNOLD ARCHITECTURE In this section, we describe the Arnold Architecture that 3.1.1 Missing and Duplicate Records instantiates Labor Independence to provide holistic crowd- In any data integration system, both missing and dupli- machine data integration. Figure 3 shows the components cate records are a common challenge. We can extend our of Arnold.

5. Quality and Cost Constraints, Logical Operation Graph Algorithm 1 Execution Engine’s algorithm for processing Attribute and Record Weights records Precondition: input records enqueued into processQueue Declarative Interface Logical Operation Manager 1: while “monkey” = “banana” do Quality and Cost 2: record ← processQueue.dequeue() Budget Manager 3: op ← operatorOptimizer.bestOperator(record) 4: record ← op.process(record) Execution Engine 5: qca.updateQualityAndCost(record) Operator Optimizer 6: if budgetM anager.meetsConstraints(record) then Input Data Output Data 7: outputQueue.enqueue(record) 8: else Physical Operator Library 9: processQueue.enqueue(record) Quality and Cost Assessor (QCA) 10: end if 11: end while Crowd-sourcing Platform used by the Operator Optimizer in Section 5. Operator Attribute Quality Regex Cleanup Address 0.73 (± 0.10) Figure 3: The Arnold Architecture USPS API Address 0.87 (± 0.05) Category Classifier Category 0.81 (± 0.07) M. Turk Classifier Category 0.92 (± 0.02) Declarative Interface. A user interacts with Arnold ... ... ... through a declarative interface. As per Labor Independence, the user specifies cost and quality constraints for each record Table 2: Example quality estimates tracked by the processed by Arnold. Additionally, a user can define impor- QCA tance weights for attributes or records. If unspecified, every attribute and record is treated uniformly. Quality and Cost Assessor (QCA). This module is The user also specifies the logical operations that should responsible for continuously tracking the quality and cost of occur for each record and in what order. The domain of each record. As described in Algorithm 1, the QCA receives logical operations is defined as part of system set-up, as well records after they have been processed by each operator (and as the corresponding set of physical operators (stored in the the actual cost to process that record by that operator, if Physical Operator Library). known). It is then responsible for assessing and updating the These specifications are stored in a Budget Manager and a quality and cost for that record. To do so, the QCA tracks, Logical Operation Manager, respectively, that are consulted for every operator and for every attribute, an estimate of the as the system runs to ensure it is meeting the user’s require- quality (and error bound) that can be expected from that ments. operator for that attribute. It continuously monitors the Execution Engine. This component is responsible for quality of the record stream output by the operator using processing records given the user’s declaration. We describe reservoir sampling [30]. the operation of the Execution Engine in Algorithm 1. As These estimates are stored in a quality estimate table. records stream in from input sources, they are placed on a For example, Table 2 shows a truncated estimate table for processing queue. The Execution Engine consumes records the location database scenario. For each physical operator from this queue and consults the Operator Optimizer to de- from Figure 2 and for each attribute in the application (e.g., termine the next physical operator for each record. After Name, Address, Category), there is an entry in the table a record is processed by a physical operator, the Execution with the estimated quality that that operator can produce Engine passes it to the Quality and Cost Assessor module for that attribute. Using the estimation, the QCA updates to assess the quality of the record. Finally, the engine asks the quality of each of the attributes in the record given the the Budget Manager if the record has met the user’s cost previous operator. Of course, if a record was selected as and quality constraints. If so (or if the record has been pro- a sample to learn the true quality, it will be assigned that cessed by all logical operations), the engine sends the record quality. to the output data stream. If the record does not meet the The QCA is also responsible for tracking the cost of pro- constraints, the engine enqueues the record back on to the cessing each record by each operator. Similar to quality queue for further processing. tracking, the QCA utilizes the cost estimate or the exact cost Operator Optimizer. This component determines the from the operator to update the record’s remaining budget. appropriate next physical operator to process a record. After Crowd-Sourcing Platform. Crowd-sourcing may be checking which logical operations have already been applied utilized by many physical operators as well as by the QCA. to the record, the optimizer consults the Logical Operation Thus, Arnold includes a crowd-sourcing platform that in- Manager to determine which are the possible next logical terfaces with one or more crowds and hides the details of operations. It then runs an optimization algorithm given the crowd from the rest of the system. The details of this the set of physical operators available for the possible logical platform are outside the scope of this paper. operations as well as the the current and desired quality and cost of the record. We discuss the optimization algorithms 5. ADAPTIVE OPERATOR OPTIMIZATION

6. Algorithmic Regex Classifier Matching Having outlined Arnold’s overall functionality, we now ad- dress how Arnold optimizes the set of operators to apply to Clean De- Mechanical Input Categorize Output a given record and in what order. Address duplicate Turk There are many questions to consider when approaching Mechanical Editorial USPS API this problem. We first outline the scope of design consid- Turk Staff erations, and then propose a solution to one version of the problem. Choice of Objective Function. There are a range of Figure 4: Visualization of a linear operator graph objective functions for which the system can be designed to optimize. On one hand, the system can be given a cost budget (a strict limit) and try to maximize the quality of is to chose which physical operator to employ for each log- data it produces. Conversely, it could be given a quality ical operation. In order to solve this problem, we create a threshold it must meet and then try to minimize cost. For combined graph with each logical operation represented as ease of exposition, we focus on maximizing quality given a a node in the graph and each physical operator represented fixed budget, but these two problems are essentially duals of as a directed edge connecting two nodes. An example of this each other and our solutions are equally effective for both. graph is depicted in Figure 4. The problem now becomes one Fixed vs. Dynamic Budget. As discussed above, each of dynamically choosing the best path through this graph for record gets a fixed budget upon entering the system and each any given record that maximizes quality subject to budget operator subtracts from that budget. This budget could be constraints. set once at the start of processing, or it could be dynami- The simplest solution to the problem would be to enumer- cally adjusted as the system processes the record. Such an ate all possible paths with their associated cost and quality, adjustment could occur as the system gains a deeper un- and at any given point in the graph, choose the path which derstanding of the data in the record. For example, if after maximizes quality according to the currently available bud- cleaning a record’s address the system learns that the record get. This method, though effective for small graphs, will refers to a location in a popular area (e.g., downtown San quickly become unreasonable as more operators are added. Francisco), then the system may increase that record’s bud- Next, we present a more efficient solution based on dynamic get. programming. Ordering of Operations. We have presented the log- ical operator graph as a set of logical operations occurring 5.1.1 The Optimization Problem in a fixed order, e.g., Clean Address followed by Categorize We formulate this problem as an optimization problem followed by De-duplicate. An alternative is to choose any where the input is a linear operator graph G, record R, and logical operator at any time, given that a set of prerequisites budget b; and the objective is to choose a path P that passes is met. For example, the system can apply Clean Address or through every node of G according to Categorize in either order, but Clean Address is a prerequi- maximize Quality(P ) site for De-duplicate, which relies on accurate addresses. If P we allow alternate orderings, it transforms the linear graph subject to Cost(P ) ≤ b into a directed acyclic graph. Working with this additional complexity guarantees at least as good and possibly bet- The resulting optimization problem requires us to be able ter quality, but the resulting optimization problem becomes to choose the best bath from any node given any budget. more difficult. If we assume that the per record budget is an integer that Repetition or Omission of Logical Operators. So is not very large and has a known maximum value B, the far, we have discussed the system applying each logical op- problem can be efficiently solved offline using dynamic pro- eration once and only once to each record. However, it may gramming. Working backwards from the output node of the be beneficial to allow the system to re-execute some logical graph, at each node and for any given budget, choose the operations with costlier physical operators. For example, if highest quality path that can be afforded. If n is the num- a record that had its address cleaned using a low-cost and ber of nodes, the problem can be solved in O(nB) time and low-quality operator was later discovered to have high im- space. Note that this algorithm runs only once and material- portance, it may be desirable to re-clean its address using a izes an optimal path for each possible budget value at every higher-quality address cleaning operator. Conversely, some step. When a record with a budget b enters the system, we operators may not be necessary for some records and can be simply lookup the optimal path for budget b. If the budget omitted to save costs. For example, if a record comes from changes during execution of the chosen path, an additional a source that is known to have high-quality addresses, the lookup is needed to pick the optimal path moving forward. address cleanup can be skipped. The design choices described above conform to the most Different choices for these design considerations raise dif- common practical requirements that we have seen in the ferent questions as to how to allow a record to move through real-world. More flexibility of operator selection may lead to the graph. We now formalize one variant of the problem and better results, but increases the complexity of the optimiza- present a solution. Other variants of the problem will be ad- tion. For example, if a record can go through the logical dressed in future research. operators in arbitrary order, the optimization problem re- quires a O(2n B) dynamic programming algorithm. We will 5.1 Optimization on a Linear Operator Graph explore such complexity in future work. We focus on a linear operator graph, where operators are statically ordered in a path. Thus, the primary challenge 6. RELATED WORK

7. There has been much recent interest in crowd-sourcing bor Independence in more detail, and develop more sophis- applied to data management and data cleaning. At the ticated optimization algorithms. We also look to provide systems-level, many groups propose a declarative approach more insight into the real-world data integration challenges to utilizing crowd-sourcing in a database [8, 19, 23]. Labor we face at Groupon and demonstrate Arnold’s effectiveness Independence and Arnold are closely related to these sys- at addressing these issues. tems; similar to these systems, Labor Independence hides the details of where and how crowd-sourcing is used. How- ever, a primary distinction is that the declarative interface 8. REFERENCES of Labor Independence is at the application-level rather than [1] Avnur, R., and Hellerstein, J. M. Eddies: at the operator-level; that is, applications specify the level continuously adaptive query processing. In Proc. of of quality they want for their data overall (and the price ACM SIGMOD 2000, pp. 261–272. they are willing to pay), rather than specifying constraints [2] Crowdflower. http://www.crowdflower.com. at the individual operator level. On the other hand, the [3] Culler, D., Estrin, D., and Srivastava, M. granularity of optimization in Arnold is on a per-record ba- Overview of Sensor Networks. Computer 37, 8 (Aug. 2004), 41–49. sis, similar to Eddies [1] for optimizing relational queries. [4] Facebook. http://www.facebook.com. Arnold is focused specifically on providing high-quality data [5] Factual. http://www.factual.com. cleaning instead performance optimization of general data [6] Fitbit. http://www.fitbit.com. processing scenarios. [7] Franklin, M., Halevy, A., and Maier, D. From At a more detailed level, there is a wealth of work on how databases to dataspaces: a new abstraction for to use the crowd to accomplish individual tasks including fil- information management. SIGMOD Rec. 34, 4 (Dec. tering [22], graph search [21], max discovery [11], sorting and 2005), 27–33. joining [18], and entity resolution [31]. These approaches [8] Franklin, M. J., Kossmann, D., Kraska, T., could be incorporated into the Arnold architecture as phys- Ramesh, S., and Xin, R. CrowdDB: Answering ical operators. Roomba [14] is a precursor to our work in Queries with Crowdsourcing. In Proc. of ACM SIGMOD 2011. that it is focused on optimizing how and when to apply hu- [9] Galhardas, H., Florescu, D., Shasha, D., Simon, man involvement in an individual task; Arnold extends that E., and Saita, C.-A. Declarative Data Cleaning: work to optimize human input across multiple tasks. Language, Model, and Algorithms. In Proc. of VLDB Data cleaning and integration is a well-studied problem [13, 2001. 24]. The work closest to ours is the AJAX system [9], which [10] Gigwalk. http://gigwalk.com/. separates the logical and physical operations for data clean- [11] Guo, S., Parameswaran, A., and Garcia-Molina, ing. Due to the incorporation of crowdsourcing, our work is H. So Who Won? Dynamic Max Discovery with the different than AJAX in several ways. First, we incorporate Crowd. In Proc. of ACM SIGMOD 2012, pp. 385–396. [12] Haas, P. J., Naughton, J. F., Seshadri, S., and both quality and cost as a first-class elements of our model Stokes, L. Sampling-based estimation of the number and define our declarative interface using these parameters. of distinct values of an attribute. In VLDB (1995), Second, we consider a much broader range of logical and pp. 311–322. physical operations, which can be algorithmic or crowd- [13] Hellerstein, J. M. Quantitative Data Cleaning for based. Finally, our optimization problem involves a mul- Large Databases. United Nations Economic titude of new factors that were not present in AJAX, such Commission for Europe (UNECE), 2008. as the trade-off between crowds and machine algorithms. [14] Jeffery, S. R., Franklin, M. J., and Halevy, A. Y. Pay-as-you-go User Feedback for Dataspace Systems. In Proc. of ACM SIGMOD 2008. 7. CONCLUSION [15] K¨opcke, H., Thor, A., and Rahm, E. Evaluation of As applications come to depend on data to a greater de- entity resolution approaches on real-world match gree, so too does their expectation of the quality of those problems. PVLDB 3, 1 (2010), 484–493. data. While traditional data cleaning and integration sys- [16] Lohr, S. Sampling: Design and Analysis. Advanced Series. Brooks/Cole, 2009. tems are capable of producing “good” data, human input, [17] Madhavan, J., Jeffery, S., Cohen, S., Dong, X., or crowd-sourcing, is required in order to deal with the Ko, D., Yu, G., and Halevy, A. Web-scale Data long-tail of quality issues in large-scale data integration sys- Integration: You can only afford to Pay As You Go. tems. However, human involvement is expensive, and the In CIDR 2007. choice of how utilize humans is non-trivial. Fundamentally, [18] Marcus, A., Wu, E., Karger, D., Madden, S., a data-driven application desires high-quality data, and does and Miller, R. Human-powered Sorts and Joins. not want to be concerned with the complexities of crowd- Proc. of VLDB Endow. 5, 1 (Sept. 2011), 13–24. sourcing. [19] Marcus, A., Wu, E., Madden, S., and Miller, R. C. Crowdsourced Databases: Query Processing In this work, we developed Labor Independence, a declar- with People. In CIDR 2011. ative approach to data cleaning and integration that enables [20] Mechanical Turk. https://www.mturk.com/. applications to specify their quality and cost requirements [21] Parameswaran, A., Sarma, A. D., for their data without needing to understand how that data Garcia-Molina, H., Polyzotis, N., and Widom, are produced. We have described Arnold, a system that im- J. Human-Assisted Graph Search: It’s Okay to Ask plements Labor Independence to clean and integrate large Questions. Proc. of VLDB Endow. 4, 5 (Feb. 2011), numbers of input records while holistically using machine 267–278. and crowd labor. [22] Parameswaran, A. G., Garcia-Molina, H., Park, H., Polyzotis, N., Ramesh, A., and Widom, J. Arnold provides a rich test-bed for us to further research Crowdscreen: Algorithms for Filtering Data with the union of machines and humans. In the future, we intend Humans. In Proc. of ACM SIGMOD 2012, to explore the cost and quality trade-offs provided by La- pp. 361–372.

8.[23] Park, H., Pang, R., Parameswaran, A., ACM SIGKDD 2002 (New York, NY, USA), Garcia-Molina, H., Polyzotis, N., and Widom, pp. 269–278. J. Deco: A System for Declarative Crowdsourcing. In [28] Trushkowsky, B., Kraska, T., Franklin, M. J., Proc. of VLDB 2012. and Sarkar, P. Getting it all from the crowd. CoRR [24] Rahm, E., and Do, H. H. Data Cleaning: Problems abs/1202.2335 (2012). and Current Approaches. IEEE D. E. Bull. 23, 4 [29] USPS. www.usps.com. (2000), 3–13. [30] Vitter, J. S. Random sampling with a reservoir. [25] Ramesh, A., Parameswaran, A., Garcia-Molina, ACM Trans. Math. Softw. 11, 1 (Mar. 1985), 37–57. H., and Polyzotis, N. Identifying Reliable Workers [31] Wang, J., Franklin, M., and Kraska, T. Swiftly. Technical report, Stanford University, 2012. CrowdER: Crowdsourcing entity resolution. PVLDB [26] San Francisco Data. data.sfgov.org. 5, 11 (2012). [27] Sarawagi, S., and Bhamidipaty, A. Interactive Deduplication Using Active Learning. In Proc. of