Potter’s Wheel: An Interactive Data Cleaning System
Vijayshankar Raman and Joseph M. Hellerstein
University of California at Berkeley
{rshankar, jmh}@cs.berkeley.edu

Proceedings of the 27th VLDB Conference, Roma, Italy, 2001

Abstract

Cleaning data of errors in structure and content is important for data warehousing and integration. Current solutions for data cleaning involve many iterations of data "auditing" to find errors, and long-running transformations to fix them. Users need to endure long waits, and often write complex transformation scripts. We present Potter's Wheel, an interactive data cleaning system that tightly integrates transformation and discrepancy detection. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations, or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains, and accordingly checks for constraint violations. Thus users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.

1 Introduction

Organizations accumulate much data that they want to access and analyze as a consolidated whole. However the data often has inconsistencies in schema, formats, and adherence to constraints, due to many factors including data entry errors and merging from multiple sources [6, 13]. The data must be purged of such discrepancies and transformed into a uniform format before it can be used. Such data cleaning is a key challenge in data warehousing [6]. Data transformation is also needed for extracting data from legacy data formats, and for Business-to-Business Enterprise Data Integration [26].

1.1 Current Approaches to Data Cleaning

Data cleaning has three components: auditing data to find discrepancies, choosing transformations to fix these, and applying the transformations on the dataset. There are currently many commercial solutions for data cleaning (e.g., see [9] for an overview). They come in two forms: auditing tools and transformation tools. The user first audits the data to detect discrepancies using an auditing tool like Unitech Systems' ACR/Data or Evoke Software's Migration Architect. Then she either writes a custom script or uses an ETL (Extraction/Transformation/Loading) tool like Data Junction or Ascential Software's DataStage to transform the data, fixing errors and converting it to the format needed for analysis. The data often has many hard-to-find special cases, so this process of auditing and transformation must be repeated until the "data quality" is good enough. This approach has two problems.

• Lack of interactivity: Transformation is typically done as a batch process, operating on the whole dataset without any feedback. This leads to long, frustrating delays during which users have no idea if a transformation is effective. Such delays are compounded by a decoupling of transformation and discrepancy detection – these are often done as separate steps, with separate software. This forces users to wait for a transformation to finish before they can check if it has fixed all anomalies. More importantly, some nested discrepancies arise only after others have been fixed. E.g., a typo in a year field such as "19997" can be found (by running a suitable algorithm on the year values) only after all dates have been converted to a uniform date type – until then, the year values cannot be isolated from the date strings. Thus the decoupling makes it hard to find multiple discrepancies in one pass, leading to many unnecessary iterations.

• Need for much user effort: Both transformation and discrepancy detection need significant user effort, making each step of the cleaning process painful and error-prone. Commercial ETL tools typically support only some restricted transforms¹ between a small set of formats via a GUI, and provide ad hoc programming interfaces for general transforms (these are essentially libraries of conversions between standard formats: e.g. Data Junction's DJXL). Even system-supported transforms often need to be specified in sophisticated ways that involve regular expressions or grammars (Section 4.3). The discrepancy detection technique must match the data domain – it may be a standard method like spell-checking, or a specialized one like spotting non-standard names for automobile parts. Unfortunately data values are often composite structures of values from different domains, like "Rebecca by Daphne du Maurier, et. al Hardcover (April 8, 1948) $22.00" (from Amazon.com search results for Daphne Du Maurier).

¹ We use transform as a noun to denote a single operation, and transformation to denote a sequence of operations.
Hence users must either write custom programs for each such structure, or design transforms to parse data values into atomic components for anomaly detection, and then back into unified wholes for output.

1.2 Potter's Wheel Approach

Data cleaning is intrinsically a complex, knotty task, with many interrelated problems. Any solution must support transformation and discrepancy detection in an integrated fashion. On one hand, the transforms provided must be general and powerful enough to do most tasks without explicit programming, and the system must extensibly support the variety of discrepancy detection algorithms applicable in different domains. On the other hand, since the cleaning process involves user interaction, the system must support transformation and discrepancy detection through simple specification interfaces and with minimal delays.

Potter's Wheel is an interactive data cleaning system that integrates transformation and discrepancy detection in a single interface. The software is publicly available from Berkeley [22], and some of the main ideas are also used in Cohera's Content WorkBench [8].

Users gradually build transformations in Potter's Wheel by composing and debugging transforms, one step at a time, on a spreadsheet-like interface (see Figure 1; the details will be explained in later sections). Transforms are specified graphically, their effect is shown immediately on records visible on screen, and they can be undone easily if their effects are undesirable. Discrepancy detection is done automatically in the background, on the latest transformed view of the data, and anomalies are flagged as they are found. This pipelining of transformation and discrepancy detection makes data cleaning a tight, closed loop where users can gradually develop and refine transformations as discrepancies are found.

[Figure 1: A snapshot of the Potter's Wheel user interface on flight delay data from FEDSTATS (www.fedstats.gov).]

1.2.1 Interactive Transformation

From the literature on transformation languages (e.g., [1, 7, 16]) we have adapted a small set of transforms that support common transformations without explicit programming. Most of these are simple and easy to specify graphically. However some transforms used to parse and split values into atomic components are quite complex. Their specification requires users to enter regular expressions or grammars, and in some cases write custom programs (Section 4.3). Instead Potter's Wheel lets users specify the desired results on example values, and automatically infers a suitable transform, using the structure extraction techniques described below. We describe such graphical specification, and the incremental application of these transforms, in Section 4.

Potter's Wheel compiles a sequence of transforms into a program after the user is satisfied, instead of applying them piecemeal over many iterations. Users specify or undo these transforms in orders they find natural, often only when discrepancies are found, and this exploratory behavior could result in redundant or sub-optimal transforms. In addition, the main cost in the transformation is that of memory allocation and copying. In the full version of the paper [22], we discuss how the final sequence of transforms can be converted to a more optimal form, including ways of pipelining transforms to minimize memory allocations and copies.

1.2.2 Extensible Discrepancy Detection

Potter's Wheel allows users to define custom domains, and corresponding algorithms to enforce domain constraints. However, since data values are often composite structures, the system needs to automatically parse a string value into a structure composed of user-defined domains, and then apply suitable discrepancy detection algorithms.

This is similar to the problem of inferring regular expression structures from examples, which has been addressed in the machine learning literature (e.g., [20, 5]). We are however not interested in abstract structures like regular expressions, but rather in structures in terms of user-defined domains. For example, parsing flight records like "Taylor, Jane, JFK to ORD on April 23, 2000 Coach" as "[A-Za-z, ]* [A-Z]{3} to [A-Z]{3} on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*" does not help much with detecting anomalies that satisfy the basic pattern, whereas parsing it as "[A-Za-z, ]* <Airport> to <Airport> on <Date> <Class>" would allow us to detect logical errors like false airport codes or dates.

We believe that application developers will specify useful application-specific domains and corresponding domain constraints (like date, airport code, construction part name), provided Potter's Wheel can automatically infer patterns in terms of these domains and apply suitable algorithms. The difficulty is that these domains are typically not specified as explicit patterns but rather as encapsulated set-membership functions that Potter's Wheel cannot understand. A second unique feature of pattern learning in the data cleaning context is that the values will have discrepancies in their structure itself; hence Potter's Wheel can only detect approximate structures. There is a tradeoff here between choosing structures that match most of the values in a column and choosing structures that do not overfit the data values. Section 3 describes how the Minimum Description Length principle [24] can be used to extract approximate structures for values in a way that balances this tradeoff.

2 Potter's Wheel Architecture

The main components of the Potter's Wheel architecture (Figure 2) are a Data Source, a Transformation Engine that applies transforms along two paths, an Online Reorderer to support interactive scrolling and sorting at the user interface [23, 21], and an Automatic Discrepancy Detector.
2.1 Data Source

Potter's Wheel accepts input data as a single, pre-merged stream that can come from an ODBC source or any ASCII file descriptor (or pipe). The ODBC source can be used to query data from DBMSs, or even from distributed sources via middleware. In practice, schematic differences between sources will restrict the tightness of the integration via a query (even Figure 1 shows poor mapping in the Source and Destination columns). Potter's Wheel will flag areas of poor integration as errors, and the user can transform the data, moving values across columns to unify the data format.

[Figure 2: Potter's Wheel Architecture – tuples from the input stream flow through the Transformation Engine to both the Online Reorderer backing the spreadsheet display and the Automatic Discrepancy Detector; the transforms the user specifies or undoes are finally compiled into an optimized program.]

When reading from ASCII files, each record is viewed as a single wide column. The user can identify column delimiters graphically and split the record into constituent columns. Such parsing is more complex and time-consuming on poorly structured data (such as from web pages). Potter's Wheel helps this process through a Split transform that can be specified by example (Section 4). Column types and delimiters can also be specified in a metadata file. Once a dataset has been parsed, the transformation can be stored as a macro for easy application on similar datasets.

2.2 Interface used for Displaying Data

Data read from the input is displayed on a Scalable Spreadsheet interface [21] that allows users to interactively re-sort on any column, and scroll in a representative sample of the data, even over large datasets. When the user starts Potter's Wheel on a dataset, the spreadsheet interface appears immediately, without waiting until the input has been completely read. This is important when transforming large datasets or never-ending data streams.

The interface supports this behavior using an Online Reorderer [23] that continually fetches tuples from the source and divides them into buckets based on a (dynamically computed) histogram on the sort column, spooling them to disk if needed. When the user scrolls to a new region, the reorderer picks a sample of tuples from the bucket corresponding to the scrollbar position and displays them on screen. Thus users can explore large amounts of data along any dimension. Exploration helps users spot simple discrepancies by observing the structure of data as values in the sort-column change.

2.3 Transformation Engine

Transforms specified by the user need to be applied in two scenarios. First, they need to be applied when records are rendered on the screen. With the spreadsheet user interface this is done when the user scrolls or jumps to a new scrollbar position. Since the number of rows that can be displayed on screen at a time is small, users perceive transformations as being instantaneous (this clearly depends on the nature of the transforms; we return to this issue in Section 4.2). Second, transforms need to be applied to records used for discrepancy detection because, as argued earlier, we want to check for discrepancies on transformed versions of data.

2.4 Automatic Discrepancy Detector

While the user is specifying transforms and exploring the data, the discrepancy detector runs in the background, applying appropriate algorithms to find errors in the data. Hence tuples fetched from the source are transformed and sent to the discrepancy detector, in addition to being sent to the Online Reorderer. The discrepancy detector first parses values in each field into sub-components according to the structure inferred for the column. The structure of a column is a sequence of user-defined domains, and is inferred as soon as the column is formed (i.e., either when the input stream is started or when a new column is formed by a transform), as we describe in Section 3.2. Then suitable algorithms are applied for each sub-component, depending on its domain. For example, if the structure of a column is <number><word><time> and a value is "19 January 06:45", the discrepancy detector finds 19, January, and 06:45 as sub-components belonging to the <number>, <word>, and <time> domains, and applies the detection algorithms specified for those domains.

2.5 Compiling a Sequence of Transforms

After the user is satisfied with the sequence of transforms, Potter's Wheel can compile it into a transformation, and export it as either a C or Perl program, or a Potter's Wheel macro – the latter can be invoked on other datasets to reapply the transformation without respecifying each transform. In future we want to support compilation into declarative languages like SQL or XSLT, so that a database system could perform further optimizations.
3 Extensible Discrepancy Detection

Potter's Wheel allows users to define arbitrary domains, along with corresponding discrepancy detection algorithms. We describe the API for specifying domains in Section 3.1. Based on these domains, the system automatically infers appropriate structures for values in each column (Section 3.2). Some of the domains in this structure are then parameterized for the specific column values, as discussed in Section 3.3. Once this detailed structure is inferred, the system parses values and sends individual components to suitable discrepancy detection algorithms (Section 2.4).

3.1 Domains in Potter's Wheel

Domains in Potter's Wheel are defined through the interface shown in Figure 3. The only function required to be implemented is an inclusion function, match, to identify values in the domain. The optional cardinality function is helpful in structure extraction, as we describe in Section 3.2.

    public abstract class Domain {
      /** Required inclusion function -- checks if a value satisfies
          the domain constraints. (Section 3.1) */
      public abstract boolean match(char *value);

      /** Optional function -- finds the number of values in this domain
          with the given length. This could vary based on
          parameterization. (Sections 3.2 and 3.3) */
      public int cardinality(int length);

      /** Optional function -- updates any state for this domain using
          the given value. (Sections 3.1 and 3.3) */
      public void updateStats(char *value);

      /** Optional function -- checks if a given value is a discrepancy,
          with a certain probability. Typically needs to know the total
          number of tuples in the data set (e.g. see [14]). (Section 3.1) */
      public float matchWithConfidence(char *value, int dataSize);

      /** Optional function -- checks if one pattern is redundant after
          another. (Section 3.2) */
      public boolean isRedundantAfter(Domain d);
    }

Figure 3: API for user-defined domains. These functions are explained in the sections indicated.

updateStats is mainly used to parameterize the domains (Section 3.3). It can also be used by a discrepancy detection algorithm to accumulate state about the data. This accumulated state can also be used to catch multi-row anomalies where a set of values are individually correct, but together violate some constraint. For example, a duplicate elimination algorithm could use updateStats to build an approximate hash table or Bloom filter of the values seen so far. The matchWithConfidence method is helpful for probabilistic and incremental discrepancy detection algorithms, such as sampling-based algorithms (e.g. [14]). The isRedundantAfter method is used while enumerating structures, as described in Section 3.2.

Potter's Wheel provides the following default domains: arbitrary ASCII strings (henceforth called ξ*), character strings (called Words; likewise AllCapsWords and CapWords refer to words with all capitals and capitalized words respectively), Integers, sequences of Punctuation, C-style Identifiers, floating point values (henceforth called Doubles), English words checked according to ispell (IspellWords), common Names (checked by referring to the online 1990 census results), Money, and a generic regular-expression domain that checks values using the PCRE library.

3.2 Structure Extraction

A given value will typically be parseable in terms of the default and user-defined domains in multiple ways. For example, March 17, 2000 can be parsed as ξ*, as [A-Za-z]* [0-9]*, [0-9]*, or as [achrM]* [17]*, [20]*, to name a few possible structures. Structure extraction involves choosing the best structure for values in a column. Formally, given a set of column values v1, v2, ..., vn and a set of domains d1, d2, ..., dm, we want to extract a suitable structure S = ds1 ds2 ... dsp, where 1 ≤ s1, ..., sp ≤ m.

All that we know about these domains is from the functions defined in Figure 3 – even among these, only the set-membership function (match) may be available. In general the inferred structure must be approximate, since the data could have errors in the structure itself. We first describe how to evaluate the appropriateness of a structure for a set of values, and then look at ways of enumerating all structures so as to choose the best one.

3.2.1 Evaluating the Suitability of a Structure

There are three characteristics that we want in a structure for the column values.

• Recall: The structure should match as many of the column values as possible.
• Precision: The structure should match as few other values as possible.
• Conciseness: The structure should have minimum length.

The first two criteria are standard IR metrics for evaluating the effectiveness of a pattern [27]. We need to consider recall because the values might be erroneous even in structure; all unmatched values are considered as discrepancies. Considering precision helps us avoid overly broad structures like ξ* that do not uniquely match this column.

The last criterion of conciseness is used to avoid over-fitting the structure to the example values. For instance, we want to parse March 17, 2000 as [A-Za-z]* [0-9]*, [0-9]* rather than as M a r c h 1 7 , 2 0 0 0.

This last example highlights the importance of allowing user-defined domains in the alphabet from which we create the structure. For instance, if we did not have Word and Integer as domains in the alphabet, M a r c h 1 7 , 2 0 0 0 would be the better structure than [A-Za-z]* [0-9]*, [0-9]*, since it has the same recall (100%), better precision (since it avoids matching any other date), and smaller pattern length. Intuitively, [A-Za-z]* [0-9]*, [0-9]* is the more concise pattern, but this is only because we think of [A-Za-z]* as the domain Word, rather than as the Kleene closure of a set of 56 characters.

These three criteria are typically conflicting, with broad patterns like ξ* having high recall and conciseness but low precision, and specific patterns having high precision but low conciseness. An effective way to make the tradeoff between over-fitting and under-fitting is through the Minimum Description Length (MDL) principle [24], which minimizes the total length required to encode the data using a structure.
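Before deriving the description length, it helps to see what a concrete domain might look like. The following is a hedged sketch written against the spirit of the Figure 3 API, with signatures adapted to plain Java (String in place of char*); the AirportCodeDomain class and its stand-in code list are our own illustration, not part of Potter's Wheel.

    import java.util.Set;

    // Minimal stand-in for the Figure 3 Domain API, adapted to plain Java.
    abstract class Domain {
        public abstract boolean match(String value);                 // required
        public int cardinality(int length) { return -1; }            // optional; -1 = unknown
        public void updateStats(String value) {}                     // optional
        public boolean isRedundantAfter(Domain d) { return false; }  // optional
    }

    /** Hypothetical user-defined domain for three-letter airport codes. */
    class AirportCodeDomain extends Domain {
        // Stand-in list; a real deployment would load the full IATA table.
        private static final Set<String> CODES =
            Set.of("SFO", "LAX", "JFK", "ORD", "SEA", "OAK", "MAA", "SIN");

        @Override public boolean match(String value) {
            return value.length() == 3 && CODES.contains(value);
        }

        /** An exact cardinality makes the precision term of the description
         *  length cheap to evaluate: there are |CODES| valid strings of
         *  length 3 and none of any other length. */
        @Override public int cardinality(int length) {
            return length == 3 ? CODES.size() : 0;
        }
    }

A domain like this would let a value such as "JFK to OAK" (an erroneous entry in Figure 5) be caught as a logical error even where it satisfies a purely syntactic structure like AllCapsWord.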
Description Length: A metric for structure quality

We now derive the description length (DL) for encoding a set of values with a structure, as a measure of the appropriateness of the structure; better structures result in smaller DLs. According to the MDL principle, the DL for using a structure to describe a set of column values is defined as: the length of the theory (the structure definition) plus the length required to encode the values given the structure.

We need a DL that can encapsulate the goals of Recall, Precision, and Conciseness (as penalties). Conciseness is directly captured by the length of theory for the structure. For values that match the structure, the length required for encoding the data values captures the Precision. We tackle erroneous data values by positing that values not matching the structure are encoded explicitly by writing them out, i.e. using the structure ξ*. The latter encoding is typically more space-intensive since it assumes no structure, forcing values to be written out explicitly. Thereby we capture Recall.

Example: Consider a structure of Word Integer Integer and a value of May 17 2025. The number of bits needed to encode the structure is 3 log(number of domains). Then we encode the value by first specifying the length of each sub-component and then, for each sub-component, specifying the actual value from all values of the same length. In this case, the sub-component lengths are 3, 2, and 4. The domains are strings over alphabets of size 52, 10, and 10 ([a-zA-Z] and [0-9]). Thus the description length is: 3 log(number of domains) + 3 log(maximum length of values in each sub-component) + log 52^3 + log 10^2 + log 10^4.

In the above example, we are able to calculate the lengths of the value encodings for integers and words because we know the properties of the domains. We now look at encodings for structures of arbitrary domains. Consider a structure S of p domains, ds1 ds2 ... dsp. Let |T| denote the cardinality of any set T. The description length of a string vi of length len(vi) using S is

    DL(v_i, S) = \text{length of theory for } S + \text{length to encode } v_i \text{ given } S.

Given m domains, we can represent each domain with log m bits. Let f be the probability that vi matches the structure S. If vi does not match, we encode it explicitly. Thus,

    DL(v_i, S) = p \log m + (1 - f) \log|\xi^{len(v_i)}| + f \cdot (\text{space to express } v_i \text{ with } S)

with the three parts of the right hand side representing penalties for Conciseness, Recall, and Precision respectively. Let v1, v2, ..., vn be the values in the column, and let AvgValLen = (1/n) \sum_{1 \le j \le n} len(v_j) be the average length of the values in the column. It is easy to see that the average space needed to encode the values is

    DL(S) = p \log m + (1 - f) \log|\xi^{AvgValLen}| + f \cdot (\text{avg. space to express } v_1, \ldots, v_n \text{ using } S).

Just as in the example, we express the values vi using S by first encoding the lengths of their components in each domain and then encoding the actual values of the components. For any string w that falls in a domain du, let len(w) be its length, and let sp(w | du) be the space required to uniquely encode w among all the len(w)-length strings in du. Suppose that value vi matches the structure S = ds1 ds2 ... dsp through the concatenation of sub-components vi = wi,1 wi,2 ... wi,p, with wi,j ∈ dsj for all 1 ≤ j ≤ p (this parsing of vi is itself not easy; we discuss efficient parsing in Section 4.3.2). Let MaxLen be the maximum length of the values in the column. Then the average space required to encode the values in the column is

    DL(S) = p \log m + \frac{f}{n} \sum_{i=1}^{n} \Bigl( p \log MaxLen + \sum_{j=1}^{p} sp(w_{i,j} \mid d_{s_j}) \Bigr) + (1 - f) \log|\xi^{AvgValLen}|.

After some transformation this becomes

    DL(S) = p \log m + AvgValLen \log|\xi| + f \, p \log MaxLen + \frac{f}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \log \frac{|\{\text{values of length } len(w_{i,j}) \text{ that satisfy } d_{s_j}\}|}{|\{\text{values of length } len(w_{i,j})\}|}.

The best way to compute the cardinality in the above expression is using the int cardinality(int length) function for the domain dsj, if it has been defined. For other domains we approximate the fraction directly by repeatedly choosing random strings of appropriate length and checking if they satisfy dsj. Since these fractions are independent of the actual values, they can be pre-computed and cached.

If the length is too high we may need to check many values before we can estimate this fraction. Hence in the absence of a user-defined cardinality function, we compute the fraction of matches for a few small lengths, and extrapolate it to larger lengths assuming that the number of matches is a strict exponential function of the string length.

3.2.2 Choosing the best structure

We have seen how to compute a description length that measures the suitability of a structure for a given set of values. We now want to enumerate all structures that can match the values in a column and choose the most suitable one. This enumeration needs to be done carefully since the structures are arbitrary strings from the alphabet of domains, and it would be too expensive to enumerate all such strings.

We apply the algorithm of Figure 4 on a set of sample values from the column, and take the union of all structures enumerated thus. We use 100 values as a default; this has proven adequate in all the cases we have encountered. During this enumeration, we prune the extent of recursion by not handling structures with meaningless combinations of domains such as <word> <word> or <integer> <decimal>. These are unnecessarily complicated versions of simpler structures like <word> and <decimal>, and will result in structures with identical precision and recall but lesser conciseness. We identify such unnecessary sequences using the isRedundantAfter(Domain d) method of Domain, which determines whether this domain is redundant immediately after the given domain.

Even though this is an exponential algorithm, pruning reduces the number of structures we enumerate for a column considerably. As shown in Figure 5, the number of enumerated structures is typically less than 10.

    /** Enumerate all structures of domains ds1 ... dsp
        that can be used to match a value vi. */
    void enumerate(vi, d1, ..., dp) {
        Let vi be a string of characters w1 ... wm
        for all domains d matching a prefix w1 ... wk of vi do
            enumerate(wk+1 ... wm, ds1, ..., dsp)
            -- avoid structures beginning with domains d'
               that satisfy d'.isRedundantAfter(d)
            prepend d to all structures enumerated above
    }

Figure 4: Enumerating various structures for a set of values

    Example Column Value                 # Structures   Final Structure Chosen
    (example erroneous values)           Enumerated     (Punc = Punctuation)
    -60                                  5              Integer
    UNITED, DELTA, AMERICAN etc.         5              IspellWord
    SFO, LAX etc. (JFK to OAK)           12             AllCapsWord
    1998/01/12                           9              Int Punc(/) Int Punc(/) Int
    M, Tu, Thu etc.                      5              Capitalized Word
    06:22                                5              Int(len 2) Punc(:) Int(len 2)
    12.8.15.147 (ferret03.webtop.com)    9              Double Punc(.) Double
    "GET\b (\b)                          5              Punc(") IspellWord Punc(\)
    /postmodern/lecs/xia/sld013.htm      4              ξ*
    HTTP                                 3              AllCapsWord(HTTP)
    /1.0                                 6              Punc(/) Double(1.0)

Figure 5: Structures extracted for different kinds of columns, using the default domains listed in Section 3.1. Structure parameterizations are given in parentheses.
3.3 Structures with Parameterized Domains

So far the structures that we have extracted are simply strings of domains. But the column values are often much more restricted, consisting only of certain parameterizations of the domains. For example, all the sub-components from a domain might have a constant value, or might be of constant length, as shown in the examples of Figure 5.

Potter's Wheel currently detects two parameterizations automatically: domains with constant values and domains with values of constant length. Such parameterized structures are especially useful for automatically parsing the values in a column, when inferring Split transforms by example (Section 4.3).

In addition, users can define domains that infer custom parameterizations, using the updateStats method. These domains could use specialized algorithms to further refine the structure of the sub-components that fall within their domain. For example, the default Integer domain in Potter's Wheel computes the mean and standard deviation of its values and uses these as parameters, to flag values that are more than 2 standard deviations away as potential anomalies. Likewise a domain can accept all strings by default, but parameterize itself by inferring a regular expression that matches the sub-component values.

The description length for values using a structure often reduces when the structure is parameterized. For the default parameterizations of constant values and constant lengths it is easy to adjust the formulas given in the previous section. For custom parameterizations like the regular expression inference discussed above, the user must define the cardinality function based on the parameterization.

3.4 Example Structures Extracted

Consider the snapshot shown in Figure 1 containing flight delay statistics. Figure 5 shows the structures extracted for some of its column values, and also for some columns from a web access log. We see that the dominant structure is chosen even in the face of inconsistencies; thereby the system can flag these structural inconsistencies as errors to the user, and parse and apply suitable detection algorithms for the other values that match the structure.

Using these structures the system flags several discrepancies that we had earlier added to the data. For example, the system flags dates such as 19998/05/31 in the date column of Figure 1 as anomalies, because the Integer domain for the year column parameterizes with a mean of 2043.5 and a standard deviation of 909.2. It finds the poor mapping in the Source and Destination columns of Figure 1 as structural anomalies.

Figure 5 also shows that a column of IP addresses with values like 12.8.15.147 has its structure inferred as Double.Double, rather than Integer.Integer.Integer.Integer. This arises because Double is a more concise structure than Integer.Integer. This could be avoided either by defining a Short domain for values less than 255 (to form Short.Short.Short.Short), or even by allowing a parameterization of the form Integer(len ≤ 3).

An interesting example of over-fitting is the choice of IspellWord for flight carriers. Although most flight carrier names occur in the ispell dictionary, some like TWA do not. Still IspellWord is chosen, because it is cheaper to encode TWA explicitly with a ξ* structure than to encode all carriers with the next best structure, AllCapsWord. The system flags TWA as an anomaly – the user could choose to ignore this, or specify a minimum Recall threshold to avoid over-fitting. In any case, this example highlights the importance of involving the user in the data cleaning process.

Figure 10 gives more examples of inferred structures.

4 Interactive Transformation

Having seen how Potter's Wheel infers structures and identifies discrepancies, we turn our attention to its support for interactive transformation. We want users to construct transformations gradually, adjusting them based on continual feedback. This breaks down into the following sub-goals:

Ease of specification: Transforms must be specifiable through graphical operations rather than custom programming. Moreover, in these operations, we want to avoid the use of regular expressions or grammars and instead allow users to specify transforms by example as far as possible.

Ease of interactive application: Once the user has specified a transform, they must be given immediate feedback on the results of its application so that they can correct it.

Undos and Data Lineage: Users must be able to easily undo transforms after seeing their effect. In addition, the lineage of errors must be clear – i.e., errors intrinsic to the data must be differentiable from those resulting from other transforms.

4.1 Transforms supported in Potter's Wheel

The transforms used in Potter's Wheel are adapted from existing literature on transformation languages (e.g. [16, 7]). We describe them briefly here before proceeding to discuss their interactive application and graphical specification. Table 1 gives formal definitions for these transforms. Additional illustrative examples and proofs of expressive power are given in the full version of the paper [22].
    Transform   Definition
    Format      φ(R, i, f) = {(a1, ..., ai−1, ai+1, ..., an, f(ai)) | (a1, ..., an) ∈ R}
    Add         α(R, x) = {(a1, ..., an, x) | (a1, ..., an) ∈ R}
    Drop        π(R, i) = {(a1, ..., ai−1, ai+1, ..., an) | (a1, ..., an) ∈ R}
    Copy        κ(R, i) = {(a1, ..., an, ai) | (a1, ..., an) ∈ R}
    Merge       µ(R, i, j, glue) = {(a1, ..., ai−1, ai+1, ..., aj−1, aj+1, ..., an, ai ⊕ glue ⊕ aj) | (a1, ..., an) ∈ R}
    Split       ω(R, i, splitter) = {(a1, ..., ai−1, ai+1, ..., an, left(ai, splitter), right(ai, splitter)) | (a1, ..., an) ∈ R}
    Divide      δ(R, i, pred) = {(a1, ..., ai−1, ai+1, ..., an, ai, null) | (a1, ..., an) ∈ R ∧ pred(ai)} ∪
                {(a1, ..., ai−1, ai+1, ..., an, null, ai) | (a1, ..., an) ∈ R ∧ ¬pred(ai)}
    Fold        λ(R, i1, i2, ..., ik) = {(a1, ..., ai1−1, ai1+1, ..., ai2−1, ai2+1, ..., aik−1, aik+1, ..., an, ail) |
                (a1, ..., an) ∈ R ∧ 1 ≤ l ≤ k}
    Select      σ(R, pred) = {(a1, ..., an) | (a1, ..., an) ∈ R ∧ pred((a1, ..., an))}

    Notation: R is a relation with n columns. i, j are column indices and ai represents the value of a column in a row.
    x and glue are values. f is a function mapping values to values. x ⊕ y concatenates x and y. splitter is a position
    in a string or a regular expression; left(x, splitter) is the left part of x after splitting by splitter. pred is a
    function returning a boolean.

Table 1: Definitions of the various transforms. Unfold is defined in the full paper [22].

Value Translation

The Format transform applies a function to every value in a column. We provide built-in functions for common operations like regular-expression based substitutions and arithmetic operations, but also allow user-defined functions. Column and table names can be demoted into column values using special characters in regular expressions; these are useful in conjunction with the Fold transform described below.

[Figure 6: Using Format, Merge and Split to clean name format differences – e.g., Format with '(.*), (.*)' to '\2 \1' turns "Stewart,Bob" into "Bob Stewart", after which Merge and Split unify the two name formats.]

One-to-one Mappings of Rows

One-to-one transforms are column operations that transform individual rows. As illustrated in Figures 6 and 7, they can be used to unify data collected from different sources.

The Merge transform concatenates values in two columns, optionally interposing a constant (the delimiter) in the middle, to form a single new column. Split splits a column into two or more parts, and is used typically to parse a value into its constituent parts. The split positions are often difficult to specify if the data is not well structured. We allow splitting by specifying character positions, regular expressions, or by interactively performing splits on example values (Section 4.3).

Drop, Copy, and Add allow users to drop or copy a column, or add a new column. Occasionally, logically different values (maybe from multiple sources) are bunched into the same column, and we want to transform only some of them. Divide conditionally divides a column, sending values into one of two new columns based on a predicate.

Many-to-Many Mappings of Rows

Many-to-many transforms help to tackle higher-order schematic heterogeneities [18] where information is stored partly in data values, and partly in the schema, as shown in Figure 8. Fold "flattens" tables by converting one row into multiple rows, folding a set of columns together into one column and replicating the rest. Conversely Unfold "unflattens" tables; it takes two columns, collects rows that have the same values for all the other columns, and unfolds the two chosen columns. Values in one column are used as column names to align the values in the other column. Figures 8 and 9 show an example with student grades where the subject names are demoted into the row via Format, grades are Folded together, and then Split to separate the subject from the grade. Fold and Unfold are adapted from the restructuring operators of SchemaSQL [16], and are discussed in more detail in the full paper [22].

Power of Transforms: As we prove in the full paper [22], these transforms can be used to perform all one-to-many mappings of rows. Fold and Unfold can also be used to flatten tables, converting them to a form where column and table names are all literals and do not have data values. For a formal definition of (un)flattening and an analysis of the power of Fold and Unfold, see [16].

4.2 Interactive Application of Transforms

We want to apply the transforms on tuples incrementally, as they stream in, so that the effects of transforms can be immediately shown on tuples visible on the screen of the UI. It also lets the system pipeline discrepancy detection on the results of the transforms, thereby giving the interactivity advantages described in the introduction.

Among the transforms discussed above, all the one-to-one transforms as well as the Fold transform are functions on a single row. Hence they are easy to apply incrementally, as the sketch after this section illustrates.

However Unfold operates on a set of rows with matching values. Since this could potentially involve scanning the entire data, we do not allow Unfold to be specified graphically. For displaying records on the screen we can avoid this problem by not showing a complete row but instead showing more and more columns as distinct values are found, and filling in data values in these columns as the corresponding input rows are read. Such progressive column addition in the spreadsheet interface could confuse the user; hence we plan to implement an abstraction interface where all newly created columns are shown as one rolled up column. When the user clicks to unroll the column, it expands into a set of columns corresponding to the distinct values found so far.
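Returning to incremental application: the reason one-to-one transforms pipeline so easily is that each is a pure function from one row to one row. Below is a hedged sketch of Merge and Split from Table 1 as such row functions. The helper names are ours; for brevity the splitter is a literal delimiter rather than the paper's general positions, regular expressions, or inferred structures, and the source columns are kept rather than dropped.

    import java.util.Arrays;
    import java.util.function.UnaryOperator;

    /** Sketches of two one-to-one transforms from Table 1 as row functions. */
    class RowTransforms {
        /** Merge: concatenate columns i and j with a glue string, appending
         *  the result as a new column (simplified: i and j are kept). */
        static UnaryOperator<String[]> merge(int i, int j, String glue) {
            return row -> {
                String[] out = Arrays.copyOf(row, row.length + 1);
                out[row.length] = row[i] + glue + row[j];
                return out;
            };
        }

        /** Split: divide column i at the first occurrence of a literal
         *  delimiter, appending the left and right parts as new columns. */
        static UnaryOperator<String[]> split(int i, String delim) {
            return row -> {
                int at = row[i].indexOf(delim);
                String[] out = Arrays.copyOf(row, row.length + 2);
                out[row.length] = at < 0 ? row[i] : row[i].substring(0, at);
                out[row.length + 1] = at < 0 ? "" : row[i].substring(at + delim.length());
                return out;
            };
        }
    }

Because each such function touches only the row in hand, it can be applied as tuples stream in, both for screen rendering and for the discrepancy detection path.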
[Figure 7: Divide-ing to separate various name formats.]
[Figure 8: Fold-ing to fix higher-order variations.]
[Figure 9: Unfold-ing into three columns.]

4.3 Graphical Specification of Transforms

Transforms like Add, Drop, Copy, Fold, and Merge are simple to specify graphically. Users can highlight the desired columns and pick the appropriate transform.

However Split is often hard to specify precisely. Splits are needed to parse values in a column into constituent parts, as illustrated in Figure 10. This is an important problem for commercial integration products. Tools like Microsoft SQL Server and Access have wizards for parsing ASCII data files with constant delimiters. There are many research and commercial "wrapper-generation" tools (e.g. Araneus [12], Cohera Net Query (CNQ) [8], Nodose [2]) that tackle this problem for "screen-scraping" unstructured data found on web pages. However these tools often require sophisticated specification of the split, ranging from regular-expression split delimiters to context-free grammars. But even these cannot always be used to split unambiguously. For instance, in the first entry of Figure 10, commas occur both in delimiters and in the data values. As a result, users often have to write custom scripts to parse the data.

4.3.1 Split by Example

In Potter's Wheel we want users to be able to parse and split values without specifying complex regular expressions or writing programs. Instead we want to allow users to specify most splits by performing them on example values.

The user selects a few example values v1, v2, ..., vn and in a graphical, direct-manipulation [25] way shows the positions at which these values are to be split into sub-components (x1,1 x1,2 ... x1,m), (x2,1 ... x2,m), ..., (xn,1 ... xn,m) respectively. As is done during discrepancy detection, the system infers a structure for each of the m new columns using MDL, and uses these structures to split the rest of the values. These structures are general, ranging from simple ones like constant delimiters or constant-length delimiters, to structures involving parameterized user-defined domains like "Word(len 3) Word(len 3) Integer(len 2) Time Integer(len 4)" (for the output of the UNIX date command). Therefore they are better than simple regular expressions at identifying split positions.

Figure 10 contains some sample structures that Potter's Wheel extracts from example splits on different datasets. We see that even for the ambiguous-delimiter case described earlier, it can extract good structures that can be used to split unambiguously.

Some values may still not be unambiguously parseable using the inferred structures. Other values may be anomalous and not match the inferred structure at all. We flag all such values as errors to the user, who can apply other transforms to split them, or clean the data further.

4.3.2 Splitting based on inferred structures

Since the structures inferred involve domains with Turing-complete match functions, splitting a value based on these is not easy. The first algorithm of Figure 11, LeftRight, is a simple recursive algorithm for parsing a value that considers the inferred structures from left to right, and tries to match them against all prefixes of the unparsed value. This algorithm is potentially very expensive for "imprecise" structures that match many prefixes. Quick parsing is particularly needed when the split is to be applied on a large dataset after the user has chosen the needed sequence of transforms.

Therefore we use an alternative algorithm called DecSpecificity (the second algorithm of Figure 11) that matches the inferred structures in decreasing order of specificity. It first tries to find a match for the most specific structure, and then recursively tries to match the remaining parts of the data value against the other structures. The motivation is that in the initial stages the most specific structures (typically constant delimiters) will match only a few substrings, and so the value will be quickly broken down into smaller pieces that can be parsed later. The specificity of a structure is computed as the sum of the description lengths of the (appropriate substrings) of the example values using the structure.

    /** Split a string q (of characters w1 ... wm) using structures
        S1, S2, ..., Sk. */
    void LeftRight(q, S1, ..., Sk) {
        If k == 0, check if q is empty
        for all prefixes w1 ... wj of q satisfying S1 do
            LeftRight(wj+1 ... wm, S2 S3 ... Sk)
    }

    void DecSpecificity(q, S1, ..., Sk) {
        Let v1, v2, ..., vn be the example values used
            to infer the structures for the split.
        Let xi,1 xi,2 ... xi,k be the user-specified split for each vi.
        As in Section 3.2, compute for all structures Sj
            spj = space needed to express x1,j, ..., xn,j using Sj
        Choose the structure Sj with the least value of spj.
        for all substrings wa ... wb of q satisfying Sj do
            DecSpecificity(w1 ... wa−1, S1 ... Sj−1)
            DecSpecificity(wb+1 ... wm, Sj+1 ... Sk)
    }

Figure 11: Two methods of splitting a value.
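To show how the Figure 11 pseudocode might look in runnable form, here is a sketch of the simpler LeftRight parser over the Domain stand-in from the Section 3.1 sketch. It is our illustration: it only reports whether a decomposition exists, and it omits the specificity ordering that makes DecSpecificity fast.

    import java.util.List;

    /** Runnable sketch of LeftRight (Figure 11): try every prefix of the
     *  remaining string against the current structure element, then recurse
     *  on the rest. Worst case is exponential for imprecise structures. */
    class LeftRightParser {
        static boolean parse(String value, List<Domain> structure) {
            return parse(value, structure, 0);
        }

        private static boolean parse(String rest, List<Domain> structure, int k) {
            if (k == structure.size()) return rest.isEmpty();
            for (int cut = 0; cut <= rest.length(); cut++) {
                if (structure.get(k).match(rest.substring(0, cut))
                        && parse(rest.substring(cut), structure, k + 1))
                    return true;
            }
            return false;
        }
    }

DecSpecificity improves on this by locating the most specific structure first (anywhere in the string), so the remaining pieces handed to the recursion are short; the throughput gap in Figure 12 shows how much this matters.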
    Example Values Split By User              Inferred Structure                  Comments
    (| is the user-specified split position)

    Taylor, Jane |, $52,072                   (<ξ*> <',' Money>)                  Parsing is doable despite no good
    Blair, John |, $73,238                                                        delimiter. A regular-expression domain
    Tony Smith |, $1,00,533                                                       can infer a structure of $[0-9,]* for
                                                                                  the last component.

    MAA |to| SIN                              (<len 3 Identifier> <ξ*>            Parsing is possible despite multiple
    JFK |to| SFO                               <len 3 Identifier>)                delimiters.
    LAX |–| ORD
    SEA |/| OAK

    321 Blake #7 |, Berkeley |, CA 94720      (<number ξ*> <',' Word>             Parsing is easy because of a consistent
    719 MLK Road |, Fremont |, CA 95743        <',' (2-letter word)               delimiter.
                                               (5-letter integer)>)

Figure 10: Parse structures inferred from various split-by-examples

The less specific structures need to be used only after the value has been decomposed into much smaller substrings, and the splitting is not too expensive on these.

To study the effect of parsing according to specificity, we ran DecSpecificity, LeftRight, and IncSpecificity on a few structures. IncSpecificity is the exact opposite of DecSpecificity and considers structures starting with the least specific one; it illustrates how crucial the choice of starting structure is. Figure 12 compares the throughput at which one can split values using these methods. We see that DecSpecificity performs much better than the others, with the improvement being dramatic at splits involving many structures.

    Example Values                                     Structure to Split by                     Split Throughput (µsecs/value)
                                                       (Int = Integer, Dbl = Double)             DecSpec    LeftRight   IncSpec
    8:45                                               <Int> <:> <Int>                           5.96       9.18        9.18
    1997/10/23                                         <Int> </> <Int> </> <Int>                 11.52      17.95       57.89
    12.8.15.14 - - [01/May/2000 ... "GET ... 404 306   <Dbl '.' Dbl> <'- - ['> <ξ*>              144.8      539.8       27670
    12.8.15.14 - - [01/May/2000 ... "GET ... 404 306   (<Dbl '.' Dbl> <'- - ['> <ξ*>             219.38     943.8       1525590
                                                        <Int ' ' Int>)
    12.8.15.14 - - [01/May/2000 ... "GET ... 404 306   (<Dbl '.' Dbl> <'- - ['> <Int/Word/Int>   233.55     1960.95     1036090
                                                        <"GET> <ξ*> <Int ' ' Int>)

Figure 12: Comparison of split throughputs (in µsecs/value) using three methods.

4.4 Undoing Transforms and Tracking Data Lineage

The ability to undo incorrect transforms is an important requirement for interactive transformation. However, if the specified transforms are directly applied on the input data, many transforms (such as regular-expression-based substitutions and some arithmetic expressions) cannot be undone unambiguously – there exist no "compensating" transforms. Undoing these requires "physical undo", i.e., the system has to maintain multiple versions of the (potentially large) dataset.

Instead Potter's Wheel never changes the actual data records. It merely collects transforms as the user adds them, and applies them only on the records displayed on the screen, in essence showing a view using the transforms specified so far. Undos are done "logically", by removing the concerned transform from the sequence and "redoing" the rest before repainting the screen.

This approach also solves the ambiguous data lineage problem of whether a discrepancy is due to an error in the data or because of a poor transform. If the user wishes to know the lineage of a particular discrepancy, the system only needs to apply the transforms one after another, checking for discrepancies after each transform.
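A hedged sketch of this "logical undo" idea follows; the names are ours, not Potter's Wheel's. The system keeps only the transform list, so an undo is a list removal followed by a repaint, and lineage questions are answered by replaying the sequence one step at a time.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.UnaryOperator;

    /** Sketch of logical undo (Section 4.4): raw records stay immutable;
     *  the UI always shows the composition of the current transform list. */
    class TransformHistory {
        private final List<UnaryOperator<String[]>> seq = new ArrayList<>();

        void add(UnaryOperator<String[]> t) { seq.add(t); }

        /** Undo = drop a transform and redo the rest on repaint; no
         *  compensating transform or versioned dataset copy is needed. */
        void undo(int index) { seq.remove(index); }

        /** Replay for lineage: return the row after each step, so a
         *  discrepancy can be attributed to the input or to a transform. */
        List<String[]> replay(String[] rawRow) {
            List<String[]> stages = new ArrayList<>();
            String[] row = rawRow.clone();
            stages.add(row);
            for (UnaryOperator<String[]> t : seq) {
                row = t.apply(row);
                stages.add(row);
            }
            return stages;
        }
    }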
5 Related Work

The commercial data cleaning process is based on ETL tools and auditing tools, as described in the introduction. [6, 9] give good descriptions of the process and some popular tools.

There is much literature on transformation languages, especially for performing higher-order operations on relational data [1, 7, 16, 18]. Our horizontal transforms are very similar to the restructuring operators of SchemaSQL [16]. However our focus is on the ease of specification and incremental application, and not merely on expressive power.

The research literature on finding discrepancies in data has focused on two main things: general-purpose algorithms for finding outliers in data (e.g. [3]), and algorithms for finding approximate duplicates in data [13, 17, 10]. There has also been some work on finding hidden dependencies in data, and correspondingly their violations [14]. Such general-purpose algorithms are useful as default algorithms for Potter's Wheel's discrepancy detector. However we believe that in many cases the discrepancies will be domain-specific, and that data cleaning tools must handle these domains extensibly.

A companion problem to data cleaning is the integration of schemas from various data sources. We intend to extend Potter's Wheel with a system that handles interactive specification of schema mappings (such as Clio [19]).

Extracting structure from poorly structured data is increasingly important for "wrapping" data from web pages, and many tools exist in both the research and commercial world (e.g. [2, 12, 8]). As discussed in Section 4.3, these tools typically require users to specify regular expressions or grammars; even these are often not sufficient to unambiguously parse the data, so users have to write custom scripts. There have also been some learning-based approaches for automatic text wrapping and segmentation [15, 4]. We believe, however, that a semi-automatic, interactive approach using a combination of graphical operations and statistical methods is more powerful.

There has been some work in the machine learning literature [20, 5] and the database literature [11] on inferring regular expressions from a set of values. However, as argued before, for detecting discrepancies it is important to infer structures in terms of generic user-defined domains, in a way that is robust to structural data errors.

6 Conclusions and Future Work

Data cleaning and transformation are important tasks in many contexts such as data warehousing and data integration. The current approaches to data cleaning are time-consuming and frustrating, due to long-running noninteractive operations, poor coupling between analysis and transformation, and complex transformation interfaces that often require user programming.

We have described Potter's Wheel, an interactive system for data transformation and cleaning. By integrating discrepancy detection and transformation, Potter's Wheel allows users to gradually build a transformation to clean the data by adding transforms as discrepancies are detected. Users can specify transforms through graphical operations or through examples, and see the effect instantaneously, thereby allowing easy experimentation with different transforms.

We have seen that parsing strings using structures of user-defined domains results in a general and extensible discrepancy detection mechanism for Potter's Wheel. Such domains also provide a powerful basis for specifying Split transformations through example values. In future we would like to investigate specification of other complex transforms, such as the Format transform, through examples.

Our focus with Potter's Wheel has so far been on flat, tabular data. However, nested data formats like XML are becoming increasingly common. While there are many research efforts on transformation and query languages for such data, it would be interesting to investigate graphical and example-based approaches for specifying these.

While we have so far looked at Potter's Wheel as a data cleaning tool, we would like to investigate its effectiveness as a client interface to an interactive query processing system. The transformations applied at the client interface can be viewed as refinements to the ongoing query, and can be fed back into the query processor, thereby combining query specification, execution, and result browsing.

Acknowledgments

The scalable spreadsheet interface that we used was developed along with Andy Chou. Renee Miller and Subbu Subramanian gave us pointers to related work on handling schematic heterogeneities. This work was supported by a grant from IBM Corporation, a California MICRO grant, NSF grants IIS-9802051 and RI CDA-9401156, a Microsoft Fellowship, and a Sloan Foundation Fellowship.

References

[1] S. Abiteboul et al. Tools for data translation and integration. Data Engineering Bulletin, 22(1), 1999.
[2] B. Adelberg. NoDoSE – a tool for semi-automatically extracting structured and semistructured data from text documents. In SIGMOD, 1998.
[3] A. Arning et al. A linear method for deviation detection in large databases. In Proc. KDD, 1996.
[4] V. Borkar, K. Deshmukh, and S. Sarawagi. Automatic segmentation of text into structured records. In SIGMOD, 2001.
[5] A. Brazma. Efficient algorithm for learning simple regular expressions from noisy examples. In Intl. Workshop on Algorithmic Learning Theory, 1994.
[6] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 1997.
[7] W. Chen, M. Kifer, and D. S. Warren. HiLog: A foundation for higher-order logic programming. Journal of Logic Programming, 15:187–230, 1993.
[8] Cohera Corp. http://www.cohera.com.
[9] Data extraction, transformation, and loading tools (ETL). www.dwinfocenter.org/clean.html.
[10] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. AJAX: An extensible data cleaning tool. In SIGMOD, 2000.
[11] M. N. Garofalakis et al. A system for extracting document type descriptors from XML documents. In SIGMOD, 2000.
[12] S. Grumbach and G. Mecca. In search of the lost schema. In ICDT, 1999.
[13] M. Hernandez and S. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1), 1997.
[14] J. Kivinen et al. Approximate dependency inference from relations. Theoretical Computer Science, 149(1), 1995.
[15] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118, 2000.
[16] V. S. Lakshmanan et al. SchemaSQL: A language for interoperability in relational multi-database systems. In VLDB, 1996.
[17] M. Lee et al. Cleansing data for mining and warehousing. In DEXA, 1999.
[18] R. J. Miller. Using schematically heterogeneous structures. In SIGMOD, 1998.
[19] R. J. Miller, L. Haas, and M. A. Hernandez. Schema mapping as query discovery. In VLDB, 2000.
[20] L. Pitt. Inductive inference, DFAs, and computational complexity. In Analogical and Inductive Inference, 1989.
[21] V. Raman et al. Scalable spreadsheets for interactive data analysis. In DMKD Workshop, 1999.
[22] V. Raman and J. M. Hellerstein. Potter's Wheel A-B-C. At control.cs.berkeley.edu/abc; also UCB CSD-00-1110, 2000.
[23] V. Raman, B. Raman, and J. M. Hellerstein. Online dynamic reordering for interactive data processing. In VLDB, 1999.
[24] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[25] B. Shneiderman. The future of interactive systems and the emergence of direct manipulation. Behavior and Information Technology, 1(3):237–256, 1982.
[26] M. Stonebraker and J. M. Hellerstein. Content integration for E-Commerce. In SIGMOD, 2001.
[27] C. van Rijsbergen. Information Retrieval. Butterworths, 1975.