Wrangler: Interactive Visual Specification of Data Transformation Scripts

We introduce Wrangler, an interactive system for creating data transformations. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. Wrangler leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. Interactive histories support review, refinement, and annotation of transformation scripts. User study results show that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing.

CHI 2011 • Session: Developers & End-user Programmers • May 7–12, 2011 • Vancouver, BC, Canada

Wrangler: Interactive Visual Specification of Data Transformation Scripts

Sean Kandel*, Andreas Paepcke*, Joseph Hellerstein† and Jeffrey Heer*
*Stanford University   †University of California, Berkeley

Author Keywords: Data cleaning, programming by demonstration.
ACM Classification Keywords: H.5.2 Information Interfaces and Presentation: UI
General Terms: Human Factors

INTRODUCTION

Despite significant advances in technologies for data management and analysis, it remains time-consuming to inspect a data set and mold it to a form that allows meaningful analysis to begin. Analysts must regularly restructure data to make it palatable to databases, statistics packages, and visualization tools. To improve data quality, analysts must also identify and address issues such as misspellings, missing data, unresolved duplicates, and outliers. Our own informal interviews with data analysts have found that these types of transforms constitute the most tedious component of their analytic process. Others estimate that data cleaning is responsible for up to 80% of the development time and cost in data warehousing projects [4]. Such "data wrangling" often requires writing idiosyncratic scripts in programming languages such as Python and Perl, or extensive manual editing using interactive tools such as Microsoft Excel. Moreover, this hurdle discourages many people from working with data in the first place. Sadly, when it comes to the practice of data analysis, "the tedium is the message."

Part of the problem is that reformatting and validating data requires transforms that can be difficult to specify and evaluate. For instance, analysts often split data into meaningful records and attributes—or validate fields such as dates and addresses—using complex regular expressions that are error-prone and tedious to interpret [2, 24]. Converting coded values, such as mapping FIPS codes to U.S. state names, requires integrating data from one or more external tables. The effects of transforms that aggregate data or rearrange data layout can be particularly hard to conceptualize ahead of time. As data sets grow in size and complexity, discovering data quality issues may be as difficult as correcting them.

Of course, transforming and cleaning a data set is only one step in the larger data lifecycle. Data updates and evolving schemas often necessitate the reuse and revision of transformations. Multiple analysts might use transformed data and wish to review or refine the transformations that were previously applied; the importance of capturing data provenance is magnified when data and scripts are shared. As a result, we contend that the proper output of data wrangling is not just transformed data, but an editable and auditable description of the data transformations applied.

This paper presents the design of Wrangler, a system for interactive data transformation. We designed Wrangler to help analysts author expressive transformations while simplifying specification and minimizing manual repetition. To do so, Wrangler couples a mixed-initiative user interface with an underlying declarative transformation language.

With Wrangler, analysts specify transformations by building up a sequence of basic transforms. As users select data, Wrangler suggests applicable transforms based on the current context of interaction. Programming-by-demonstration techniques help analysts specify complex criteria such as regular expressions. To ensure relevance, Wrangler enumerates and rank-orders possible transforms using a model that incorporates user input with the frequency, diversity, and specification difficulty of applicable transform types. To convey the effects of data transforms, Wrangler provides short natural language descriptions—which users can refine via interactive parameters—and visual previews of transform results. These techniques enable analysts to rapidly navigate and assess the space of viable transforms.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CHI 2011, May 7–12, 2011, Vancouver, BC, Canada. Copyright 2011 ACM 978-1-4503-0267-8/11/05...$10.00.

Figure 1. The Wrangler Interface. The left panel contains (from top to bottom) a history of transforms, a transform selection menu, and automatically suggested transforms based on the current selection. Bold text within the transform descriptions indicates parameters that can be clicked and revised. The right panel contains an interactive data table; above each column is a data quality meter.

As analysts transform data, their steps are recorded in a script to facilitate reuse and provide documentation of data provenance. Wrangler's interactive history viewer supports review, refinement, and annotation of these scripts. Wrangler's high-level language supports a variety of runtime platforms: Wrangler scripts can be run in a web browser using JavaScript or translated into MapReduce or Python code.

We also present a controlled user study comparing Wrangler and Excel across a set of data wrangling tasks. We find that Wrangler significantly reduces specification time and promotes the use of robust transforms rather than manual editing. Wrangler is one piece of a larger effort to address bottlenecks in the data lifecycle by integrating insights and methods from the HCI and database communities.

RELATED WORK

Most closely related to Wrangler is prior work on interactive data cleaning. Potter's Wheel [22] provides a transformation language for data formatting and outlier detection. Wrangler extends the Potter's Wheel language with key differences discussed later. Ajax [8] also provides an interface to specify transforms, with advanced facilities for entity resolution. Neither tool provides much support for direct manipulation: interaction is largely restricted to menu-based commands or entering programming statements. Google Refine [13] (formerly Freebase GridWorks) leverages Freebase to enable entity resolution and discrepancy detection. It provides summarization and filtering support through faceted histograms. Though users can specify some commands graphically, others must be written in a command language. Moreover, the system assumes that input data arrives in a proper tabular format, limiting the forms of data to which it can be applied.

Many data cleaning applications apply direct manipulation and programming-by-demonstration (PBD) methods to specific cleaning tasks. Users of SWYN [2] build regular expressions by providing example text selections and can evaluate their effect in visual previews. Potluck [14] applies simultaneous text editing [19] to merge data sources. Karma [26] infers text extractors and transformations for web data from examples entered in a table. Vegemite [18] applies PBD to integrate web data, automates the use of web services, and generates shareable scripts. Other interfaces [15] apply PBD to data integration via copy and paste actions. Wrangler applies a number of these techniques: it infers regular expressions from example selections [2] and supports mass editing [14, 19]. Wrangler uses semantic roles akin to Topes [24] and provides natural language descriptions of transforms [18]. However, Wrangler differs in important ways. PBD data tools support text extraction or data integration, but lack operations such as reshaping, aggregation, and missing value imputation. Prior tools (except for Vegemite [18]) also do not generate scripts to document provenance.

The database and machine learning communities have contributed a number of algorithmic techniques for aiding data cleaning and integration. These techniques include methods for detecting erroneous values [10, 11], information extraction [1, 25], entity resolution [6], type inference [7], and schema matching [9, 21]. In the Wrangler interface we seek to surface such techniques in an accessible manner.

A number of commercial and research systems provide graphical interfaces leveraging the above methods. Many of these tools provide interfaces for schema matching or entity resolution [3, 9, 16, 23]. Toped++ [24] is an interface for creating Topes, objects that validate and transform data. Topes support transformations such as text formatting and lookups, but provide little support for filtering, reshaping, or aggregation. Bellman [5] helps users understand the structure and quality of a database, but does not enable transformations.
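Inferring a regular expression from an example text selection, as in SWYN and Wrangler, can be imagined along the following lines. This is a minimal illustrative sketch, not the actual algorithm of either system; the function names and the two generalization heuristics are invented for this example.

```javascript
// Generalize a string into a coarse token-class pattern: runs of letters,
// digits, or whitespace become character-class quantifiers.
function generalize(text) {
  return text
    .replace(/[A-Za-z]+/g, "[A-Za-z]+")
    .replace(/\d+/g, "\\d+")
    .replace(/\s+/g, "\\s+");
}

// Given a cell value and a selected span [start, end), propose candidate
// regexes that capture the selection: an escaped literal, and a pattern
// that generalizes the text preceding the selection.
function inferPatterns(value, start, end) {
  const before = value.slice(0, start);
  const sel = value.slice(start, end);
  const candidates = [
    new RegExp("(" + sel.replace(/[.*+?^${}()|[\]\\]/g, "\\$&") + ")"),
    new RegExp(generalize(before) + "(" + generalize(sel) + ")"),
  ];
  // Keep only candidates that actually reproduce the example selection.
  return candidates.filter(re => {
    const m = value.match(re);
    return m !== null && m[1] === sel;
  });
}
```

The second, generalized candidate is the useful one: it also captures "Arizona" in "Reported crime in Arizona", whereas the literal candidate only ever matches the original example.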

Wrangler builds on this prior work to contribute novel techniques for specifying data transforms. Wrangler provides an inference engine that generates and rank-orders suggested transforms in response to direct manipulation of a data table. Analysts can navigate the space of transforms by directly selecting data, indicating a desired transform via menus, or by modifying a related transform; each of these actions leads Wrangler to further refine the set of suggestions. To help analysts understand the effects of an operation before they commit to it, Wrangler's natural language transform descriptions are augmented by novel transform previews that visualize transform results. In concert, these techniques help analysts iteratively hone in on a desired transformation.

Figure 2. Row deletion. The analyst selects an empty row and chooses a delete transform. Red highlights preview which rows will be deleted.
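A transform preview of the kind described above can be approximated by evaluating a transform's predicate without applying it, returning the affected row indices for the UI to highlight. A minimal sketch, with invented names, assuming rows are arrays of cell strings:

```javascript
// Predicate used by the "delete empty rows" suggestion in this sketch.
function isEmptyRow(row) {
  return row.every(cell => cell === null || String(cell).trim() === "");
}

// Returns the indices of rows that WOULD be deleted, leaving the table
// untouched; the UI can highlight these (e.g., in red) as a preview.
function previewDelete(table, predicate) {
  const affected = [];
  table.forEach((row, i) => { if (predicate(row)) affected.push(i); });
  return affected;
}

const table = [
  ["Reported crime in Alabama", ""],
  ["2004", "4029.3"],
  ["", ""],
  ["2005", "3900"],
];
const highlighted = previewDelete(table, isEmptyRow); // -> [2]
```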
USAGE SCENARIO

Consider an example wrangling task, using housing crime data from the U.S. Bureau of Justice Statistics. The data were downloaded as a CSV (comma-separated values) file, but are not immediately usable by other tools: the data contains empty lines, U.S. states are organized in disjoint matrices, and the state names are embedded in other text. We describe how an analyst can use Wrangler to transform the data into more usable formats (Figures 1–7).

The analyst begins by pasting the text of the file into an input box; alternatively, she could upload the file. The interface now shows a data table (Fig. 1). To the left of the table is a panel containing an interactive history, a transform menu, and a transform editor. The history already contains three transforms, as Wrangler inferred that the data was in CSV format and so split the text into rows on newline characters, split the rows into columns on commas, and promoted the first row to be the table header. Note that the analyst could undo any transform by clicking the red undo button (which appears upon mouse-over of a transform), or could modify transform parameters in place. In this case, she has no need.

Figure 3. Text extraction. The analyst selects state names to extract them into a new column. Yellow highlights show a preview of the result.

Figure 4. Filling missing values. The analyst populates empty cells by clicking the gray bar in the data quality meter above the "State" column, and then selecting a fill transform.

The analyst then begins wrangling the file into a usable form.
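The three transforms Wrangler infers for CSV input might look as follows in plain JavaScript. This is an illustrative sketch only: the function names are ours, not Wrangler's, and a real CSV parser would also handle quoting and escaping.

```javascript
// Split raw text into rows on newlines, then rows into columns on commas.
function splitRows(text) {
  return text.split("\n").map(line => line.split(","));
}

// Promote the first row to be the table header.
function promoteHeader(rows) {
  return { header: rows[0], data: rows.slice(1) };
}

const raw = "Year,Property_crime_rate\n2004,4029.3\n2005,3900";
const table = promoteHeader(splitRows(raw));
// table.header -> ["Year", "Property_crime_rate"]
```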
The analyst could specify transforms explicitly by selecting a transform type from the menu and then assigning values to parameters; however, she instead opts to use direct manipulation along with Wrangler's suggestion engine. First, she clicks a row header for an empty row (7) to select it; the transformation editor suggests possible operations in response (Fig. 1). The first suggestion is to delete just the selected row. The analyst can navigate the suggestions using the keyboard up and down arrows or by mousing over the description in the editor pane. As she navigates the suggestions, Wrangler previews the effects of the transforms in the data table. For deletions, the preview highlights the candidate deleted rows in red (Fig. 2). The analyst mouses over the suggestion to delete all empty rows in the table and clicks the green add button to execute the transform. The system then adds the deletion operation to the history view.

The analyst would like to compare data across states, so she now needs to extract the state names and add them to each row of the data. She selects the text 'Alaska' in row 6 of the "Year" column. Wrangler initially interprets this as selecting text at positions 18-24. The analyst updates Wrangler's inference by selecting 'Arizona' in the cell six rows below. Wrangler now suggests extracting text occurring after the string "in " (Fig. 3). The analyst executes this transform and renames the resulting column "State". She notices that the column is sparsely populated. These missing values are indicated by the gray bar in the data quality meter above the column. The analyst clicks the gray bar and Wrangler suggests transforms for missing values. The analyst chooses to fill empty cells with the value from above (Fig. 4).

Looking at the "Year" column, the analyst notices a red bar in the data quality meter indicating inconsistent data types. Wrangler has inferred that the column primarily contains numbers, and so has flagged non-numeric values as potential errors. She decides to remove the rows containing the text 'Reported'. She selects the text 'Reported' in row 0. Wrangler suggests split, extract, and cut transforms, but no delete operations. In response, the analyst selects the Delete command from the Rows menu in the transform editor. This action reorders the suggestions so that delete commands have higher ranking.

Figure 5. Deleting rows. The analyst selects text in an unwanted row and selects a delete operation within the "Rows" menu. Red highlighting previews which rows will be deleted.

Figure 6. Table reshaping. The analyst selects two columns, and then elects to unfold them to create a cross-tabulation. A ghosted table overlay previews the result. Color highlights show the correspondence of data between the start and end states.

Figure 7. The result of a data wrangling session is a declarative data cleaning script, shown here as generated JavaScript code. The script encodes a step-by-step description of how to operate on input data; a Wrangler runtime evaluates the script to produce transformed data. The generated script:

    split('data').on(NEWLINE).max_splits(NO_MAX)
    split('split').on(COMMA).max_splits(NO_MAX)
    columnName().row(0)
    delete(isEmpty())
    extract('Year').on(/.*/).after(/in /)
    columnName('extract').to('State')
    fill('State').method(COPY).direction(DOWN)
    delete('Year starts with "Reported"')
    unfold('Year').above('Property_crime_rate')
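The extract, fill, and delete steps described above can be sketched in plain JavaScript. This approximates the semantics of the corresponding transforms under simplifying assumptions (rows as objects, no error handling); it is not Wrangler code.

```javascript
function wrangle(rows) {
  // extract: text occurring after the string "in " (null when absent)
  rows = rows.map(r => {
    const m = r.Year.match(/in (.*)/);
    return { ...r, State: m ? m[1] : null };
  });
  // fill: copy the nearest preceding non-null State value downward
  let last = null;
  rows = rows.map(r => {
    if (r.State !== null) last = r.State;
    return { ...r, State: r.State !== null ? r.State : last };
  });
  // delete: rows where Year starts with "Reported"
  return rows.filter(r => !r.Year.startsWith("Reported"));
}

const out = wrangle([
  { Year: "Reported crime in Alabama", Property_crime_rate: "" },
  { Year: "2004", Property_crime_rate: "4029.3" },
  { Year: "2005", Property_crime_rate: "3900" },
]);
// out[0] -> { Year: "2004", State: "Alabama", ... }
```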
She finds the suggestion that deletes the un- .push(dwsplit('split').on(dw.Regex.TAB).max_splits(dwsplit.NO_ DESIGN PROCESS .push(dw.filter(dw.isEmpty())) wanted rows (Fig. 5) and executes the transform. We based Wrangler on a transformation language with a .push(dw.extract('Year')..on(/.*/)).after(/in /) At this point the analyst has wrangled the data into a proper handful of operators. Originally we thought that each of .push(dw.columnName('match').to('State')) these operators might correspond to a single interaction with relational format, sufficient for export to database and vi- sualization tools. But now suppose she would like to cre- .push(dw.fill('State').method(dw.Fill.COPY).direction(dw.Fill.DOW example data in a table view. However, after considering dif- ate a cross-tabulation of crime rates by state and year for .push(dw.unfold('Year').above('Property_crime_rate')) ferent mappings and evaluating their implications, we were subsequent graphing in Excel. She selects the “Year” and unable to devise an intuitive and unambiguous mapping be- “Property_crime_rate” columns, previews the suggested un- tween simple gestures and the full expressiveness of the lan- fold operation (Fig. 6), then executes it to create the de- guage. A given interaction could imply multiple transforms sired cross-tab. The unfold operation creates new columns and multiple interactions might imply the same transform. for each unique value found in the “Year” column, and reor- ganizes the “Property_crime_rate” values by placing each in Although this many-to-many relationship between the lan- the appropriate cell in the resulting matrix. guage and interaction might complicate our interface, we found the relationship to be relatively sparse in practice: the The analyst’s process results in a transformation script writ- number of likely transforms for a given gesture is small. As a ten in a declarative transformation language. 
The script pro- result, we adopted a mixed-initiative approach [12]; instead vides an auditable description of the transformation enabling of mapping an interaction to a single transform, we surface later inspection, reuse, and modification. The analyst can likely transforms as an ordered list of suggestions. We then also annotate these transformations with her rationale. By focused on rapid means for users to navigate—prune, refine, clicking the Export button above the transformation history, and evaluate—these suggestions to find a desired transform. the analyst can either save the transformed data or generate runnable code implementing the transformation (Fig. 7). Wrangler is a browser-based web application, written in Java- Script. In the next section we describe the Wrangler trans- 3366

5.CHI 2011 • Session: Developers & End-user Programmers May 7–12, 2011 • Vancouver, BC, Canada formation language. We then present the Wrangler interface To aid data validation and transformation, Wrangler sup- and its techniques for navigating suggestion space. Next, we ports standard data types (e.g., integers, numbers, strings) describe Wrangler’s mechanisms for verification. We go on and higher-level semantic roles (e.g., geographic location, to discuss the technical details of our inference engine. classification codes, currencies). Data types comprise stan- dard primitives and associated parsing functions. Semantic THE WRANGLER TRANSFORMATION LANGUAGE roles consist of additional functions for parsing and format- Underlying the Wrangler interface is a declarative data trans- ting values, plus zero or more transformation functions that formation language. Both prior work [8, 17, 22] and em- map between related roles. As an example, consider a se- pirical data guided the language design. As our starting mantic role defining a zip code. The zip code role can check point we used the Potter’s Wheel transformation language that a zip code parses correctly (i.e., is a 5 digit number) and [22] (which in turn draws from SchemaSQL [17]). Informed that it is a valid zip code (checking against an external dic- by a corpus of data sets gathered from varied sources (e.g., tionary of known zipcodes). The zip code role can also reg-, NGOs, log files, web APIs), we then extended the ister mapping functions, e.g., to return the containing state language with additional operators for common data clean- or a central lat-lon coordinate. Wrangler leverages types and ing tasks. These include features such as positional oper- roles for parsing, validation, and transform suggestion. The ators, aggregation, semantic roles, and complex reshaping Wrangler semantic role system is extensible, but currently operators (e.g., using multiple key rows for cross-tabs). 
Wrangler supports a limited set of common semantic roles, such as geographic locations, government codes, currencies, and dates. We also introduced conditional mapping operators (e.g., update country to "U.S." where state="California"). Language statements manipulate data tables with numbered rows and named columns of data. Wrangler treats raw text as a "degenerate" table containing one row and one column. The language consists of eight classes of transforms, described below.

Map transforms map one input data row to zero, one, or multiple output rows. Delete transforms (one-to-zero) accept predicates determining which rows to remove. One-to-one transforms include extracting, cutting, and splitting values into multiple columns; reformatting; simple arithmetic; and value updates. One-to-many transforms include operations for splitting data into multiple rows, such as splitting a text file on newlines or unnesting arrays and sets.

Lookups and joins incorporate data from external tables. Wrangler includes extensible lookup tables to support common types of transformations, such as mapping zip codes to state names for aggregation across states. Currently Wrangler supports two types of joins: equi-joins and approximate joins using string edit distance. These joins are useful for lookups and for correcting typos in known data types.

Reshape transforms manipulate table structure and schema. Wrangler provides two reshaping operators: fold and unfold. Fold collapses multiple columns into two or more columns containing key-value sets, while unfold creates new column headers from data values; see [22] for an extended discussion. Reshaping enables higher-order data restructuring and is common in tools such as R and Excel Pivot Tables.

Positional transforms include fill and lag operations. Fill operations generate values based on neighboring values in a row or column, and so depend on the sort order of the table. For example, an analyst might fill empty cells with preceding non-empty values. The lag operator shifts the values of a column up or down by a specified number of rows.

The language also includes functions for sorting, aggregation (e.g., sum, min, max, mean, standard deviation), and key generation (a.k.a. skolemization). Finally, the language contains schema transforms to set column names, specify column data types, and assign semantic roles.

The Wrangler language design co-evolved with the interface described in subsequent sections. We sought a consistent mapping between the transforms shown in the interface and statements in the language. Disconnects between the two might cause confusion [20], particularly when analysts try to interpret code-generated scripts. As a result, we chose to introduce redundancy in the language by adding operators for high-level actions that are commonly needed but have unintuitive lower-level realizations (e.g., positional operators can be realized using key transforms, self-joins, and scalar functions). The result is a clear one-to-one mapping between transforms presented in the interface and statements in output scripts. Prior work [17, 22] proves that our basic set of transforms is sufficient to handle all one-to-one and one-to-many transforms. Through both our own practice and discussions with analysts, we believe our extended language is sufficient to handle a large variety of data wrangling tasks.

THE WRANGLER INTERFACE DESIGN

The goal of the Wrangler interface is to enable analysts to author expressive transformations with minimal difficulty and tedium. To this aim, our interface combines direct manipulation, automatic suggestion, menu-based transform selection, and manual editing of transform parameters. This synthesis of techniques enables analysts to navigate the space of transforms using the means they find most convenient.

Both novices and experts can find it difficult to specify transform parameters such as regular expressions. While direct manipulation selections can help, inference is required to suggest transforms without programming. To reduce this gulf of execution [20], Wrangler uses an inference engine that suggests data transformations based on user input, data type or semantic role, and a number of empirically-derived heuristics. These suggestions are intended to facilitate the discovery and application of more complicated transforms.

However, suggested transforms (and their consequences) may be difficult to understand. To reduce this gulf of evaluation [20], Wrangler provides natural language descriptions and visual transform previews.
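To make the semantics concrete, here is a minimal sketch of a fold and a fill over a table modeled as a list of dicts. The function names and signatures are our own illustration, not Wrangler's actual language syntax:

```python
# Illustrative sketches of two transform classes described above, over a
# table modeled as a list of dicts. Names and signatures are ours, not
# Wrangler's actual language.

def fold(table, keys):
    """Reshape: collapse the given columns into key-value rows."""
    out = []
    for row in table:
        fixed = {c: v for c, v in row.items() if c not in keys}
        for k in keys:
            out.append({**fixed, "key": k, "value": row[k]})
    return out

def fill(table, column):
    """Positional: replace empty cells with the preceding non-empty value."""
    out, last = [], None
    for row in table:
        value = row[column] if row[column] is not None else last
        out.append({**row, column: value})
        last = value
    return out

rows = [{"state": "New York", "2004": 87, "2003": 82}]
print(fold(rows, ["2004", "2003"]))
# [{'state': 'New York', 'key': '2004', 'value': 87}, {'state': 'New York', 'key': '2003', 'value': 82}]
```

Note that, as stated above for positional transforms, fill's result depends on the table's current row order.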

Natural language descriptions are intended to enhance analysts' ability to review and refine transformation steps. Textual annotations enable communication of analyst intent. Wrangler also couples verification (run in the background as data is transformed) with visualization to help users discover data quality issues.

Basic Interactions
The Wrangler interface supports six basic interactions within the data table. Users can select rows, select columns, click bars in the data quality meter, select text within a cell, edit data values within the table (for mass editing [14, 19]), and assign column names, data types, or semantic roles. Users can also choose transforms from the menu or refine suggestions by editing transform descriptions, as described below.

Automated Transformation Suggestions
As a user interacts with data, Wrangler generates a list of suggested transforms. In some cases the set of possible suggestions is large (in the hundreds), but we wish to show only a relevant handful to avoid overload. Instead of enumerating the entire suggestion space, users can prune and reorder the space in three ways. First, users can provide more examples to disambiguate input to the inference engine. Providing examples is especially effective for the text selections needed for splitting, extraction, and reformatting; two or three well-chosen examples typically suffice. Second, users can filter the space of transforms by selecting an operator from the transform menu. Third, users can edit a transform by altering its parameters to a desired state.

Wrangler does not immediately execute a selected suggestion. Instead, Wrangler makes it the current working transform. The user can edit this transform directly; as a user edits parameters, the suggestion space updates to reflect these edits. Alternatively, a user can interact with the table to generate new suggestions that use the working transform as context.

Natural Language Descriptions
To aid apprehension of suggested transforms, Wrangler generates short natural language descriptions of the transform type and parameters. These descriptions are editable, with parameters presented as bold hyperlinks (Fig. 8). Clicking a link reveals an in-place editor appropriate to the parameter (Fig. 8b). Enumerable variables (such as the direction of a fill) are mapped to drop-down menus, while free-form text parameters are mapped to text editors with autocomplete.

Figure 8. Editable Natural Language Descriptions. (a) An example of an editable description; highlighted text indicates editable parameters. (b) Clicking on a parameter reveals an in-place editor. (c) After editing, the description may update to include new parameters. In this case, a new window size parameter is displayed for the moving average.

We designed these descriptions to be concise; default parameters that are not critical to understanding may be omitted. For example, the unless between parameter for split operations indicates regions of text to ignore while splitting. In most cases this parameter is left undefined, and including it would bloat the description. To edit hidden parameters, users can click the expansion arrow to the left of the description, revealing an editor with entries for all possible parameters.

We also sought to make parameters within descriptions readable by non-experts. For instance, we translate regular expressions into natural language via pattern substitution (e.g., (\d+) to 'number'). This translation can make some descriptions less concise but increases readability. Translation is only performed on regular expressions generated by the Wrangler inference engine; if a user types in a custom expression, Wrangler reflects their input verbatim.

Visual Transformation Previews
Wrangler uses visual previews to enable users to quickly evaluate the effect of a transform. For most transforms, Wrangler displays these previews in the source data, not as a separate visualization (e.g., side-by-side before and after views). In-place previews provide a visual economy that serves a number of goals. First, displaying two versions of a table inherently forces both versions to be small, which is particularly frustrating when the differences are sparse. Second, presenting in-place modifications draws user attention to the effect of the transformation in its original context, without requiring a shift in focus across multiple tables. As we discuss next, in-place previews also better afford direct manipulation for users to revise the current transform.

Wrangler maps transforms to at least one of five preview classes: selection, deletion, update, column, and table. In defining these mappings, we attempted to convey a transform's effect with minimum displacement of the original data. This stability allows users to continue interacting with the original data, e.g., to provide new selection examples.

Selection previews highlight relevant regions of text in all affected cells (Fig. 3). Deletion previews color to-be-deleted cells in red (Fig. 2). Update previews overwrite values in a column and indicate differences with yellow highlights (Fig. 4). Column previews display new derived columns, e.g., as results from an extract operation (Fig. 3). We show a side-by-side display of table versions when previewing fold and unfold transforms; these alter the structure of the table to such an extent that the best preview is to show another table (Fig. 6, 9). These table previews use color highlights to match input data to their new locations in the output table. Some transforms map to multiple classes; e.g., extract transforms use both selection and column previews.

When possible, previews also indicate where the user can modify the transform through either direct manipulation or description refinement. Highlighting selected text or cells works well for certain transformations.

Figure 9. Visual preview of a fold operation. For transforms that rearrange table layout, Wrangler previews the output table and uses color highlights to show the correspondence of values across table states. [The figure shows the Wrangler interface with a transform script (split on newlines, split on commas, promote row 0 to header, delete rows, fill) applied to a table of state-by-state SAT participation rates and mean scores, which is then folded into key-value rows.]

For example, by highlighting the text selected by a regular expression in each cell, users can determine which examples they need to fix. For reshape transforms, Wrangler highlights the input data in the same color as the corresponding output in the secondary table. For instance, in a fold operation, data values that will become keys are colored to match the keys in the output table (Fig. 9). Wrangler also highlights the parameters in the transform description using the same colors as those generated in previews (Fig. 3–6). The consistent use of colors allows users to associate clauses in a description with their effects in the table.

Transformation Histories and Export
As successive transforms are applied, Wrangler adds their descriptions to an interactive transformation history viewer. Users can edit individual transform descriptions and selectively enable and disable prior transforms. Upon changes, Wrangler runs the edited script and updates the data table. Toggling or editing a transform may result in downstream errors; Wrangler highlights broken transforms in red and provides an error message to aid debugging.

Wrangler scripts also support lightweight text annotations. Analysts can use annotations to document their rationale for a particular transform; annotations may also help future users better understand data provenance. To annotate a transform, users can click the edit icon next to the desired transform and write their annotation in the resulting text editor. Users can view an annotation by mousing over the same edit icon. These annotations appear as comments in code-generated scripts. Users can export both generated scripts and transformed data; clicking the Export button in the transform history invokes export options. Analysts can later run saved or exported scripts on new data sources, modifying the script as needed.

TYPES, ROLES, AND VERIFICATION

It is often difficult to discover data quality issues and therefore difficult to address them by constructing the appropriate transform. Wrangler aids discovery of data quality issues through the use of data types and semantic roles.

As users transform data, Wrangler attempts to infer the data type and semantic role for each column. Wrangler applies validation functions to a sample of a column's data to infer these types, assigning the type that validates for over half of the non-missing values. When multiple types satisfy this criterion, Wrangler assigns the more specific one (e.g., integer is more specific than double). Wrangler infers semantic roles analogously. An icon in the column header indicates the semantic role of the column, or the underlying data type if no role has been assigned. Clicking the icon reveals a menu with which users can manually assign a type or role.

Above each column is a data quality meter: a divided bar chart that indicates the proportion of values in the column that verify completely. Values that parse successfully are indicated in green; values that match the type but not the role (e.g., a 6-digit zip code) are shown in yellow; those that do not match the type (e.g., 'One' does not parse as an integer) are shown in red; and missing data are shown in gray. Clicking a bar generates suggested transforms for that category. For instance, clicking the missing values bar will suggest transforms to fill in missing values or delete those rows. Clicking the fails-role bar will suggest transforms such as a similarity join on misspelled country names.

THE WRANGLER INFERENCE ENGINE

We now present the design of the Wrangler inference engine, which is responsible for generating a ranked list of suggested transforms. Inputs to the engine consist of user interactions; the current working transform; data descriptions such as column data types, semantic roles, and summary statistics; and a corpus of historical usage statistics. Transform suggestion proceeds in three phases: inferring transform parameters from user interactions, generating candidate transforms from inferred parameters, and finally ranking the results.

Usage Corpus and Transform Equivalence
To generate and rank transforms, Wrangler's inference engine relies on a corpus of usage statistics. The corpus consists of frequency counts of transform descriptors and initiating interactions. We built our initial corpus by wrangling our collection of gathered data sets. The corpus updates over time as more analysts use Wrangler.

For any given transform, we are unlikely to find an exact match in the corpus. For instance, an analyst may perform a fold operation over a combination of columns and rows that does not appear in the corpus. In order to get useful transform frequencies, we define a relaxed matching routine: two transforms are considered equivalent in our corpus if (a) they have an identical transform type (e.g., extract or fold) and (b) they have equivalent parameters, as defined below.

Wrangler transforms accept four basic types of parameters: row, column, or text selections and enumerables. We treat two row selections as equivalent if they both (a) contain filtering conditions (either index- or predicate-based) or (b) match all rows in a table. Column selections are equivalent if they refer to columns with the same data type or semantic role. We based this rule on the observation that transforms that operate on identical data types are more likely to be similar. Text selections are equivalent if both (a) are index-based selections or (b) contain regular expressions.
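The relaxed matching rules above can be sketched as a predicate over simplified transform records; the record format is our own devising, as the paper does not show Wrangler's internal representation:

```python
# Sketch of the relaxed transform-equivalence test described above, over
# a simplified parameter representation of our own devising.

def params_equivalent(p, q):
    if p["kind"] != q["kind"]:
        return False
    if p["kind"] == "row":      # both filter rows, or both match all rows
        return p["all_rows"] == q["all_rows"]
    if p["kind"] == "column":   # same data type or semantic role
        return p["type_or_role"] == q["type_or_role"]
    if p["kind"] == "text":     # both index-based, or both regex-based
        return p["regex"] == q["regex"]
    return p["value"] == q["value"]  # enumerables must match exactly

def equivalent(t1, t2):
    """Equivalent iff same transform type and pairwise-equivalent parameters."""
    return (t1["type"] == t2["type"]
            and len(t1["params"]) == len(t2["params"])
            and all(params_equivalent(p, q)
                    for p, q in zip(t1["params"], t2["params"])))

fold_a = {"type": "fold",
          "params": [{"kind": "column", "type_or_role": "integer"}]}
fold_b = {"type": "fold",
          "params": [{"kind": "column", "type_or_role": "integer"}]}
print(equivalent(fold_a, fold_b))  # True
```

Under these rules, two folds over different integer columns count as the same corpus entry, which is exactly what makes the frequency counts useful despite sparse exact matches.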

We consider enumerable parameters equivalent only if they match exactly. We chose these equivalency classes based on exploratory analysis of our corpus, and they seem to work well in practice. As our corpus of transforms grows with more use, we plan to explore more principled approaches (such as clustering) to refine our matching routines.

Inferring Parameter Sets from User Interaction
In response to user interaction, Wrangler attempts to infer three types of transform parameters: row, column, or text selections. For each type we enumerate possible parameter values, resulting in a collection of inferred parameter sets. We infer a parameter's values independent of the other parameters. For example, we infer regular expressions for text selection based solely on the selected text, a process otherwise independent of which rows or columns are selected.

We infer row selections based on row indices and predicate matching. We list predicates of the form "row is empty" and "column [equals | starts with | ends with | contains] selected-value", then emit the selections that match the rows and text currently selected in the interface. For column selections we simply return the columns that users have interacted with.

Emitted text selections are either simple index ranges (based directly on selections in the interface) or inferred regular expressions. To generate regular expressions, we tokenize the text within a cell and extract both the selected text and any surrounding text within a 5-token window. We annotate tokens with one or more labels of the form number, word, uppercase word, lowercase word, or whitespace. We then enumerate label sequences that match the text before, within, and after the selection range (see Fig. 10); sequences can contain either an annotation label or the exact token text. Next we emit all possible combinations of before, within, and after sequences that match all current text selection examples in the interface. It is then straightforward to translate matching label sequences into regular expressions.

Figure 10. Regular Expression Inference. (a) The user selects text in a cell. (b) We tokenize selected and surrounding text; for clarity, the figure only includes two neighboring tokens. For each token, we generate a set of matching labels. (c) We enumerate all label sequences matching the text. (d) We then enumerate all candidate before, selection, and after combinations; patterns that do not uniquely match the selected text are filtered. (e) Finally, we construct regular expressions for each candidate pattern. [In the figure's example, selecting 'Alabama' in the cell "Reported crime in Alabama" yields, among others, the before-pattern (lowercase, ' ') and the regular expression /[a-z]+ (Alabama)/.]

Generating Suggested Transforms
After inferring parameter sets, Wrangler generates a list of transform suggestions. For each parameter set, we loop over each transform type in the language, emitting the types that can accept all parameters in the set. For example, a split transform can accept a parameter set containing a text selection, but an unfold transform cannot. Wrangler instantiates each emitted transform with parameters from the parameter set. To determine values for missing parameters, we query the corpus for the top-k (default 4) parameterizations that co-occur most frequently with the provided parameter set. During this process we do not infer complex criteria such as row predicates or regular expressions; we do infer enumerable parameters, index-based row selections, and column inputs. We then filter the suggestion set to remove "degenerate" (no-op) transforms that would have no effect on the data.

Ranking Suggested Transforms
Wrangler then rank-orders transform suggestions according to five criteria. The first three criteria rank transforms by their type; the remaining two rank transforms within types. Ensuring that transforms of the same type are adjacent helps users compare varying parameterizations more easily.

First, we consider explicit interactions: if a user chooses a transform from the menu or selects a current working transform, we assign higher rank to transforms of that type. Second, we consider specification difficulty. We have observed that row and text selection predicates are harder to specify than other parameters. We thus label row and text selections as hard and all others as easy, and sort transform types according to the count of hard parameters they can accept. Third, we rank transform types based on their corpus frequency, conditioned on their initiating user interaction (e.g., text or column selection). In the case of text selection, we also consider the length of the selected text: if a user selects three or fewer characters, split transforms are ranked above extract transforms; the opposite is true for longer selections.

We then sort transforms within type. We first sort transforms by the frequency of equivalent transforms in the corpus. Second, we sort transforms in ascending order using a simple measure of transform complexity. Our goal is to preferentially rank simpler transforms because users can evaluate their descriptions more quickly. We define transform complexity as the sum of complexity scores for each parameter. The complexity of a row selection predicate is the number of clauses it contains (e.g., "a=5 and b=6" has complexity 2). The complexity of a regular expression is defined to be the number of tokens (described previously) in its description. All other parameters are given complexity scores of zero.

Finally, we attempt to surface diverse transform types in the final suggestion list. We filter the transforms so that no type accounts for more than one third of the suggestions, unless the transform type matches the working transform or the filter would result in fewer suggestions than can appear in the interface.
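The final step of the regular expression inference, translating a matching label sequence into a pattern, can be sketched as follows. The label-to-fragment table is simplified relative to the engine's full label set, and the function name is ours:

```python
import re

# Sketch of step (e) in Figure 10: translating a label sequence into a
# regular expression. The fragment table is simplified; the real engine
# also handles uppercase words, numbers, and within/after sequences.
FRAGMENTS = {"number": r"\d+", "word": r"[A-Za-z]+",
             "lowercase": r"[a-z]+", "whitespace": r" "}

def to_regex(before, selection):
    """Concatenate 'before' labels or literal tokens, then capture the selection."""
    def frag(token):
        return FRAGMENTS.get(token, re.escape(token))
    return "".join(frag(t) for t in before) + "(" + re.escape(selection) + ")"

# Figure 10's example: before = (lowercase, whitespace), selection = 'Alabama'.
pattern = to_regex(["lowercase", "whitespace"], "Alabama")
print(pattern)  # [a-z]+ (Alabama)
print(re.search(pattern, "Reported crime in Alabama").group(1))  # Alabama
```

Literal tokens (e.g., 'in') pass through escaped, so a candidate can mix exact text with generalizing labels, mirroring panel (d) of the figure.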

COMPARATIVE EVALUATION WITH EXCEL

As an initial evaluation of Wrangler, we conducted a comparative user study with Microsoft Excel. Subjects performed three common data cleaning tasks: value extraction, missing value imputation, and table reshaping. Our goal was to compare task completion times and observe data cleaning strategies. We chose Excel because it is the most popular data manipulation tool and provides an ecologically valid baseline for comparison: all subjects use it regularly, and half self-report as experts. Excel also supports our chosen tasks; neither Potter's Wheel [22] (no support for fill) nor Google Refine [13] (no reshaping) supports the full set. In contrast, Excel includes specific tools for each task (text-to-columns, goto-special, and pivot tables) in addition to manual editing.

Participants and Methods
We recruited 12 participants, all professional analysts or graduate students who regularly work with data. Subjects rated their prior experience with Excel on a 10-point scale (1 being basic knowledge and 10 being expert); the median score was 5. Participants had never used the Wrangler interface.

We first presented a 10-minute Wrangler tutorial describing how to create, edit, and execute transforms. We then asked subjects to complete three tasks (described below) using both Wrangler and Excel. We randomized the presentation of tasks and tools across subjects. In each task, we asked subjects to transform a data set into a new format, presented to them as a picture of the final data table.

Task 1: Extract Text. In this task, we asked users to extract the number of bedrooms and housing price from housing listings on craigslist. The original data set contained one cell for each listing, with all the information in a text string. The target data set consisted of two columns: one for the number of bedrooms and one for the housing price.

Task 2: Fill Missing Values. We gave users data containing year-by-year agricultural data for three countries. Some of the values in the data set were blank. The target data set contained the same data with all missing values replaced with the closest non-empty value from a previous year.[1]

Task 3: Reshape Table Structure. Users started with three columns of housing data: year, month, and price. The target data set contained the same data formatted as a cross-tab: the data contained one row for each year, with the 12 months as column headers and housing prices as cell values.

When using Excel, we allowed subjects to ask for references to functions they could describe concretely (e.g., we would answer "how do I split a cell?" but not "how do I get the number of bedrooms out?"). For Wrangler tasks, we did not respond to user inquiries. We permitted a maximum of 10 minutes per task. Each data set had at most 30 rows and 4 columns; complete manual manipulation in Excel was easily attainable within the time limits. Afterwards, each user completed a post-study questionnaire.

Wrangler Accelerates Transform Specification
We performed a repeated-measures ANOVA of completion times with task, tool, and Excel novice/expert[2] as independent factors; we log-transformed responses to better approximate a normal distribution. We found a significant main effect of tool (F(1,54) = 23.65, p < 0.001), but no main effect of task (F(1,54) = 0.01, p = 0.943) or expertise (F(1,54) = 0.30, p = 0.596). We found a significant interaction effect of task and expertise (F(1,54) = 11.10, p < 0.002), driven by improved performance by experts (regardless of tool) in the reshaping task (T3). No other interactions were significant.

Across all tasks, median performance in Wrangler was over twice as fast as Excel (Fig. 11). Users completed the cleaning tasks significantly more quickly with Wrangler than with Excel, and this speed-up benefited novice and expert Excel users alike. Moreover, the user study tasks involved small data sets amenable to manual manipulation. As data set size grows, we expect the benefits of Wrangler to come into even sharper relief. Of course, larger data sets might complicate the process of assessing transform effects and so may benefit from additional validation and visualization techniques.

Figure 11. Task completion times for tasks T1–T3, in minutes, by tool. Black bars indicate median values. Median Wrangler performance is over twice as fast in all tasks.

Strategies for Navigating Suggestion Space
When working with Wrangler, users applied different navigation strategies for different tasks. These strategies were largely consistent across users. For text selection, users frequently provided multiple examples. For other operations, users performed an initial selection and then previewed each suggestion. One subject noted, "I just look at the picture." Users with a programming background spent time reading transform descriptions, whereas the other users relied almost entirely on the previews. When users did not find a transform among the initial suggestions, they most often filtered the suggestions by selecting a transform type from the menu. If only imperfect matches were found, users then selected the nearest transform and edited its parameters. In other words, users turned to manual parameterization only as a last resort.

Our post-study questionnaire asked users to rate automated suggestions, visual previews, and direct editing of transforms on a scale from 1 (not useful) to 5 (most useful). We performed an ANOVA and found a significant difference among the ratings (F(2,33) = 17.33, p < 0.001). Users rated previews (µ = 4.8) and suggestions (µ = 4.3) significantly more useful than direct editing (µ = 2.5) (p < 0.001 in both cases by Tukey's HSD).

[1] We acknowledge that this is not an ideal cleaning solution for the data, but it nonetheless served as a useful test.
[2] We divided subjects into "novices" and "experts" according to their median self-reported expertise rating (5).
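Task 3's cross-tab target is essentially an unfold: one row per year, with months promoted to column headers. A minimal sketch under that reading (the function name and table representation are ours):

```python
# Minimal sketch of the unfold/pivot at the heart of Task 3: one row per
# year, months become column headers. Function name is ours.

def unfold(table, row_key, col_key, value):
    out = {}
    for r in table:
        out.setdefault(r[row_key], {row_key: r[row_key]})[r[col_key]] = r[value]
    return list(out.values())

data = [{"year": 2010, "month": "Jan", "price": 350},
        {"year": 2010, "month": "Feb", "price": 352}]
print(unfold(data, "year", "month", "price"))
# [{'year': 2010, 'Jan': 350, 'Feb': 352}]
```

In Excel this corresponds to building a pivot table; in Wrangler it is a single unfold statement.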

Users' preference for suggestions and previews over direct editing provides evidence that these novel user interface features have merit.

Users' navigation strategies worked well when they understood the nature of the desired transform, even if they did not know how to specify it. However, we found that users of both tools experienced difficulty when they lacked a conceptual model of the transform. For instance, Task 3 exhibited an uneven distribution of completion times: 7 of the 10 fastest times and 3 of the 4 slowest times were in Wrangler. Wrangler does not provide the recourse of manual editing, hence users who got stuck fared slightly better in Excel. However, those familiar with pivot operations in Excel uniformly performed the task more quickly with Wrangler.

We also observed one recurring pitfall: a few users got stuck in a "cul-de-sac" of suggestion space by incorrectly filtering (e.g., by selecting a specific transform type from the menu). When this happened, some users kept searching and refining only these filtered transforms. By design, Wrangler does not afford users the same flexibility to lay out data as in Excel; since users cannot perform arbitrary editing in Wrangler, the recourse is less obvious when they get stuck. This pitfall was most common in Task 3, where a user might mistakenly filter all but fold operations when an unfold operation was needed. One solution may be to suggest non-matching transforms related to the selected transform type, in effect treating filtering criteria as guidelines rather than strict rules.

CONCLUSION AND FUTURE WORK

This paper introduced Wrangler, an interface and underlying language for data transformation. The system provides a mixed-initiative interface that maps user interactions to suggested data transforms and presents natural language descriptions and visual transform previews to help assess each suggestion. With this set of techniques, we find that users can rapidly navigate to a desired transform.

Our user study demonstrates that novice Wrangler users can perform data cleaning tasks significantly faster than in Excel, an effect shared across both novice and expert Excel users. We found that users are comfortable switching navigation strategies in Wrangler to suit a specific task, but can sometimes get stuck, in either tool, if they are unfamiliar with the available transforms. Future work should help users form data cleaning strategies, perhaps through improved tutorials.

Looking forward, Wrangler addresses only a subset of the hurdles faced by data analysts. As data processing has become more sophisticated, there has been little progress on improving the tedious parts of the pipeline: data entry, data (re)formatting, data cleaning, etc. The result is that people with highly specialized skills (e.g., statistics, molecular biology, micro-economics) spend more time in tedious "wrangling" tasks than they do in exercising their specialty, while less technical audiences such as journalists are unnecessarily shut out. We believe that more research integrating methods from HCI, visualization, databases, and statistics can play a vital role in making data more accessible and informative.

ACKNOWLEDGEMENTS

The first author was supported by a Stanford Graduate Fellowship. We also thank the Boeing Company, Greenplum, and Lightspeed Venture Partners for their support. This work was partially funded by NSF Grant CCF-0963922.

REFERENCES
1. A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In ACM SIGMOD, pages 337–348, 2003.
2. A. F. Blackwell. SWYN: A visual representation for regular expressions. In Your Wish is my Command: Programming by Example, pages 245–270, 2001.
3. L. Chiticariu, P. G. Kolaitis, and L. Popa. Interactive generation of integrated schemas. In ACM SIGMOD, pages 833–846, 2008.
4. T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, Inc., New York, NY, 2003.
5. T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In ACM SIGMOD, pages 240–251, 2002.
6. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE TKDE, 19(1):1–16, 2007.
7. K. Fisher and R. Gruber. PADS: a domain-specific language for processing ad hoc data. In ACM PLDI, pages 295–304, 2005.
8. H. Galhardas, D. Florescu, D. Shasha, and E. Simon. AJAX: an extensible data cleaning tool. In ACM SIGMOD, page 590, 2000.
9. L. M. Haas, M. A. Hernández, H. Ho, L. Popa, and M. Roth. Clio grows up: from research prototype to industrial tool. In ACM SIGMOD, pages 805–810, 2005.
10. J. M. Hellerstein. Quantitative data cleaning for large databases. White Paper, United Nations Economic Commission for Europe, 2008.
11. V. Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 22(2):85–126, 2004.
12. E. Horvitz. Principles of mixed-initiative user interfaces. In ACM CHI, pages 159–166, 1999.
13. D. Huynh and S. Mazzocchi. Google Refine.
14. D. F. Huynh, R. C. Miller, and D. R. Karger. Potluck: Data mash-up tool for casual users. In ISWC, pages 903–910, 2007.
15. Z. G. Ives, C. A. Knoblock, S. Minton, M. Jacob, P. P. Talukdar, R. Tuchinda, J. L. Ambite, M. Muslea, and C. Gazen. Interactive data integration through smart copy & paste. In CIDR, 2009.
16. H. Kang, L. Getoor, B. Shneiderman, M. Bilgic, and L. Licamele. Interactive entity resolution in relational data: A visual analytic tool and its evaluation. IEEE TVCG, 14(5):999–1014, 2008.
17. L. V. S. Lakshmanan, F. Sadri, and S. N. Subramanian. SchemaSQL: An extension to SQL for multidatabase interoperability. ACM Trans. Database Syst., 26(4):476–519, 2001.
18. J. Lin, J. Wong, J. Nichols, A. Cypher, and T. A. Lau. End-user programming of mashups with Vegemite. In IUI, pages 97–106, 2009.
19. R. C. Miller and B. A. Myers. Interactive simultaneous editing of multiple text regions. In USENIX Tech. Conf., pages 161–174, 2001.
20. D. A. Norman. The Design of Everyday Things. Basic Books, 2002.
21. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10:334–350, 2001.
22. V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB, pages 381–390, 2001.
23. G. G. Robertson, M. P. Czerwinski, and J. E. Churchill. Visualization of mappings between schemas. In ACM CHI, pages 431–439, 2005.
24. C. Scaffidi, B. Myers, and M. Shaw. Intelligently creating and recommending reusable reformatting rules. In ACM IUI, pages 297–306, 2009.
25. S. Soderland. Learning information extraction rules for semi-structured and free text. Mach. Learn., 34(1-3):233–272, 1999.
26. R. Tuchinda, P. Szekely, and C. A. Knoblock. Building mashups by example. In ACM IUI, pages 139–148, 2008.