Profiler:用于数据质量评估的综合统计分析和可视化

We present Profiler, a visual analysis tool for assessing quality issues in tabular data. Profiler applies data mining methods to automatically flag problematic data and suggests coordinated summary visualizations for assessing the data in context. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction with millions of data points. We present Profiler’s architecture — including modular components for custom data types, anomaly detection routines and summary visualizations— and describe its application to motion picture, natural disaster and water quality data sets.
展开查看详情

1. Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment Sean Kandel∗ , Ravi Parikh∗ , Andreas Paepcke∗ , Joseph M. Hellerstein† , Jeffrey Heer∗ ∗ Stanford University † University of California, Berkeley {skandel, rparikh, paepcke, jheer}@cs.stanford.edu, hellerstein@cs.berkeley.edu ABSTRACT Discovering a potential error is only the first step towards clean Data quality issues such as missing, erroneous, extreme and dupli- data. Before manipulating the data, an analyst may investigate why cate values undermine analysis and are time-consuming to find and an anomaly has occurred to inform possible fixes. The analyst must fix. Automated methods can help identify anomalies, but determin- place the anomaly in context by scrutinizing its relationship with ing what constitutes an error is context-dependent and so requires other dimensions of the data. Appropriately-chosen visualizations human judgment. While visualization tools can facilitate this pro- can help reveal and contextualize these anomalies. Histograms and cess, analysts must often manually construct the necessary views, scatter plots, for instance, may reveal outlying values in a distribu- requiring significant expertise. We present Profiler, a visual analy- tion. Analysts typically have to choose which views to construct: sis tool for assessing quality issues in tabular data. Profiler applies they must determine which subset of data columns and rows to vi- data mining methods to automatically flag problematic data and sualize, how to transform the data, choose visual encodings, and suggests coordinated summary visualizations for assessing the data specify other criteria such as sorting and grouping. Determining in context. The system contributes novel methods for integrated which visualizations to construct may require significant domain statistical and visual analysis, automatic view suggestion, and scal- knowledge and expertise with a visualization tool. able visual summaries that support real-time interaction with mil- In response we present Profiler, a visual analysis system to aid lions of data points. We present Profiler’s architecture — including discovery and assessment of data anomalies. Profiler uses type in- modular components for custom data types, anomaly detection rou- ference and data mining routines to identify potential data quality tines and summary visualizations — and describe its application to issues in tabular data. Profiler then suggests coordinated, multi- motion picture, natural disaster and water quality data sets. view visualizations to help an analyst assess anomalies and contex- tualize them within the larger data set. Categories and Subject Descriptors Our first contribution is an extensible system architecture that enables integrated statistical and visual analysis for data quality as- H.5.2 [Information Interfaces and Presentation]: User Interfaces sessment. This modular architecture supports plug-in APIs for data types, anomaly detection routines and summary visualizations. We Keywords populate this framework with commonly-needed data types and de- Data analysis, visualization, data quality, anomaly detection tection routines. We focus primarily on univariate anomalies due to their frequency, tractability, and relative ease of explanation. 1. INTRODUCTION We demonstrate how coupling automated anomaly detection with linked summary visualizations allows an analyst to discover and Data sets regularly contain missing, extreme, duplicate or er- triage potential causes and consequences of anomalous data. roneous values that can undermine the results of analysis. These Our architecture also introduces novel visual analysis compo- anomalies come from various sources, including human data entry nents. We contribute a technique for automatic view suggestion error, inconsistencies between integrated data sets, and sensor inter- based on mutual information. Profiler analyzes the mutual infor- ference. Flawed analyses due to dirty data are estimated to cost bil- mation between table columns and the output of anomaly detection lions of dollars each year [6]. Discovering and correcting data qual- to suggest sets of coordinated summary visualizations. Our model ity issues can also be costly: some estimate cleaning dirty data to recommends both table columns and aggregation functions to pro- account for 80 percent of the cost of data warehousing projects [5]. duce visual summaries that aid assessment of anomalies in context. The statistics and database communities have contributed a num- We also contribute the design of scalable summary visualiza- ber of automated routines for detecting dirty data, such as finding tions that support brushing and linking to assess detected anoma- outliers or duplicate records. While these techniques can reveal po- lies. Through linked selections, analysts can project anomalies in tential issues, human judgment is required to determine if the issues one column onto other dimensions. Our aggregate-based summary are in fact errors and how they should be treated. For example, out- views bin values to ensure that the number of visual marks de- lier detection might flag a high temperature reading; an analyst then pends on the number of groups, not the number of data records. We needs to assess if the reading is an exceptional event or an error. provide optimizations for query execution and rendering to enable real-time interaction with data sets in excess of a million rows. Permission to make digital or hard copies of all or part of this work for 2. RELATED WORK personal or classroom use is granted without fee provided that copies are Profiler draws on three areas of related work: anomaly detection, not made or distributed for profit or commercial advantage and that copies data cleaning tools, and visual analysis systems. bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific 2.1 Classifying Data Anomalies permission and/or a fee. AVI ’12, May 21-25, 2012, Capri Island, Italy The database and statistics literature includes many taxonomies Copyright 2012 ACM 978-1-4503-1287-5/12/05 ...$10.00. of anomalous data [5, 7, 10, 19, 24, 33]. These taxonomies inform

2.Major Genre 0 240 0 100 Based on Short Film Distributor Production Budget Spin-Off Worldwide Gross 3K Based on Factual Book/... 3K Release Location Disney Ride Error (2) Compilation Based on Toy 0 0 Extreme (7) 0 300M Musical Group Movie 0 3B Inconsistent (3) Distributor (Levenshtein) Source (Levenshtein) Title (Levenshtein) Schema (1) Schema Browser Transform: Creative Type Distributor MPAA Rating Creative Type Major Genre US DVD Sales R Contemporary Fiction Drama IMDB Rating PG-13 Historical Fiction Comedy PG Fantasy Action IMDB Votes Not Rated Science Fiction Adventure G Dramatization Thriller/Suspense 0 MPAA Rating NC-17 Kids Fiction Horror Major Genre Open Factual Romantic Comedy Super Hero Musical Production Budget Multiple Creative Types Documentary Black Comedy Western Related Views: Anomalies Concert/Performance Release Date Source 220 Original Screenplay Anomaly Browser Based on Book/Short Story Based on Real Life Events Missing (6) Remake 0 MPAA Rating 1911 2010 Based on TV Based on Comic/Graphic... Running Time (min) Rotten Tomatoes Rating Creative Type Based on Play 2K 900 Based on Game Source Traditional/Legend/Fai... Major Genre Based on Magazine Article 0 Based on Musical/Opera 0 Distributor 0 240 Based on Short Film 0 100 Release Location Production Budget Spin-Off Worldwide Gross 3K Based on Factual Book/... 3K Error (2) Disney Ride Extreme (7) Compilation Based on Toy Inconsistent (3) 0 Musical Group Movie 0 0 300M 0 3B Distributor (Levenshtein) Source (Levenshtein) Title (Levenshtein) Figure 1: The Profiler User Interface. The UI contains (clockwise from top-left): (a) schema browser, (b) formula editor, (c) canvas of Schema (1) linked summary visualizations, and (d) anomaly browser. Profiler generates a set of linked views for each identified anomaly. Here, we investigate possible causes of missing MPAA movie ratings. The grey bar above the MPAA rating chart indicates missing values; we select it to highlight matching records. The Release Date chart shows that missing ratings correlate with earlier release dates. a variety of algorithms for detecting outliers [4, 10, 12, 33], dupli- 2.2 Data Cleaning Tools cate records [7], and key violations [13]. While these routines flag Motivated by the issues above, database and HCI researchers potential issues, most types of error require some form of human in- have created interactive systems for data cleaning. Many of these tervention to assess and correct [19]. Here, we focus on errors that interfaces focus on data integration [9, 15, 21, 27, 34] or entity res- arise within a single relational table. Guided by prior taxonomies, olution [17]. Here we focus on data quality issues in a single table. we identified five categories of anomalies to address in Profiler: Profiler does include detectors for duplicate values, but we do not Missing data results from a number of sources, including in- attempt to address the general problem of entity resolution. complete collection or redaction due to privacy concerns. Miss- Other interfaces support mass reformatting of raw input data [14, ing data can take the form of missing records or missing attributes. 16, 25, 29]. A common form of discrepancy detection is provided These issues can lead to a loss of statistical power if too many cases by data type definitions that specify constraints for legal values [16, are unobserved and can introduce bias into model estimates, espe- 25, 29]. These systems are usually limited to finding formatting cially when data is not missing at random [1]. discrepancies for individual values. Profiler’s data types are similar Erroneous data can arise because of error during data entry, to domains in Potter’s Wheel [25] and Scaffidi et al.’s Topes [29]. measurement, or distillation [10]. Obviously, analysis of incorrect However, Profiler detects a broader range of discrepancies, includ- data can lead to incorrect conclusions. ing distribution-dependent outliers and duplicate values. Unlike Inconsistent data refers to variable encodings of the same value. these prior tools, Profiler also generates scalable interactive visual Examples include variations in spelling or formatting, measure- summaries to aid anomaly assessment. ment units, or coding schemes (e.g., names vs. abbreviations). Perhaps most comparable to Profiler is Google Refine [14], which Extreme values such as outliers can undermine robust analysis supports both faceted browsing and text clustering to identify data and may be erroneous. Extreme values may be standard univariate quality issues. Refine users must manually specify which facets outliers, or may be type specific. For example, time-series outliers and clusters to create. In contrast, Profiler automatically suggests generally take two forms [33]: an additive outlier is an unexpected, visualizations to aid discovery and assessment of discrepancies. transient movement in a measured value over time, whereas an in- Profiler is integrated with the Wrangler [16] data transformation novation outlier is an unexpected movement that persists over time. tool. An analyst can transform raw data using Wrangler. Once the Key violations refer to data that violate primary key constraints. data is properly formatted as a relational table, Profiler can leverage For example, having two employees with the same social security type information to automate anomaly detection and visualization. number violates the assumption that SSN is a key. Observed issues can fall into multiple categories: a numeric out- 2.3 Visual Analysis Systems lier may result from an accurate measurement of an extreme value, Visualization can support discovery of patterns in data, includ- a data entry error, or from inconsistent units (feet vs. meters). ing anomalies [18]. Aggregation, clustering and sorting have been used in various contexts to support scalable visualization for large

3. Creative Type Transform: Distributor data sets [3, 20, 28, 35]. Through linked IMDB Rating highlighting (“brushing Worldwide Gross Release Date Creative Type 2800 350 Contemporary Fict & linking”), coordinated multiple views enable assessment of rela- Historical Fiction IMDB Votes tionships between data dimensions [22, 36]. Profiler’s visualization Fantasy 0 0 0 2800000000 Jan Dec Science Fiction layer extends this prior work with a set MPAA Rating of type-specific aggregate Dramatization US Gross Major Genre Kids Fiction visualizations that aid assessment of data quality issues. 2600 Drama Factual Visual analytic tools such as Tableau [31], GGobi Related Views:[32], and Im- Context Comedy Super Hero provise [36] enable analysts to construct multi-dimensional views Action Multiple Creative T 0 Adventure of data. However, these tools generallyAnomalyrequireBrowser users to choose 0 800000000 Thriller/Suspense Production Budget Horror Missingof which variables to visualize. As the number (5)data subsets ex- 2400 Romantic Comedy MPAA rely plodes combinatorially, analysts must often Rating on significant do- Musical Documentary main expertise to identify variables thatCreative may contain Type or help explain 0 Black Comedy 0 300000000 anomalies. To facilitate the view selection Sourceprocess, Profiler auto- Western IMDB Votes Running Time (min) Concert/Performance matically suggests both data subsets and appropriate summary vi- Major Genre Figure 2: Automatically generated views to help assess World- 2600 1800 sualizations based on identified anomalies and inferred data types. wide Gross. Worldwide Gross correlates with high US Gross Distributor While other tools support general exploratory analysis, Profiler pro- and Production Budgets. High0 gross also coincides with Action0 Error (1) 0 600000 0 240 vides guided analytics to enable rapid quality assessment. &USAdventure movies and the Summer & Winter seasons. Pro- Extreme (7) Production Budget DVD Sales Others have explored interfaces for guiding analysis and suggest- Transform: filer chose to bin Release Date IMDB Rating Release Date 3000by month instead of by year. ing appropriate views. Social Action [23] uses a wizard-like inter- IMDB Votes Release Location face to guide users through social network analysis. Seo and Shnei- Worldwide Gross vs US Gross 22 Release Location 0 Production Budget Rotten Tomatoes Rating 0 400000000 derman’s rank-by-feature framework [30] sorts histograms and scat- Running Time terplots of numeric data according to user-selectedRunning (min) criteria.Time (min) Oth- ers have used dimensionality reduction, US clustering DVD Sales andViews: Related sortingNoneto aid visualization of multidimensional data [8, 11,Anomaly US Gross 37]. InBrowserProfiler, we use anomaly detection followed by mutual Worldwide information Missing (6)analysis Gross to suggest a set of coordinated summary views forMPAA Inconsistent Rating assessing (3) Type data Creative 0 quality issues. Our suggestion engine automates Schema (1) theSource choice of data 0 22 Major Genre columns, aggregation functions and visual encodings. Distributor Figure 3: Map assessing 2D outliers in a binned scatter plot. Se- Release Location lected in the scatter plot are movies with high Worldwide Gross 3. USAGE SCENARIOSchema Browser Error (2) Extreme (7) but low US Gross (in orange). Linked highlights on the map Inconsistent (3) Transform: Before describing Profiler’s architecture, we begin with a repre- confirm that the movies were released outside of the US. Creative Type Schema (1) sentative usage scenario. Consider an example task, using movie Title Title | 2001 data compiled from IMDB, Rotten Tomatoes Distributorand The Numbers. The Texas Chainsaw Massacre The Brothers This data set contains 16 columns and overIMDB3,000 Ratingmovies. The data The Texas Chainsaw Massac... The Others includes metadata such as the title, primary production location, di- Final Destination IMDB Votes Final Destination 2 Title | 2004 rector, MPAA rating, and release date; financial information such Final Destination 3 Dawn of the Dead as DVD sales and worldwide gross; andMPAA IMDB Rating ratings. Halloween Shaun of the Dead Major An analyst is interested in which factors Genre affect a movie’s eventual Halloween 2 Halloween II Title | 2008 revenue. She first loads the data into Profiler to assess Production Budgetoverall data Day of the Dead Mission: Impossible quality. The interface shows a schema browser, anomaly browser, Mission: Impossible 2 Diary of the Dead formula editor and an empty canvas Related (Figure 1).Views: The schemaNone browser Mission: Impossible III The Reader Predator The Rocker shows the column names in the data set; the analyst could double- Predator 2 click column names or drag them into Anomaly Browser the canvas to visualize the Predators Title | 2009 The Informant Missingthe corresponding column. Instead, she examines (6)anomaly browser. Spider-Man The Informers The anomaly browser displays potential Spider-Man 2 MPAA quality Rating issues, grouped Spider-Man 3 by issue type and sorted by severity. For each issue, Title | 2003 Creative Type Profiler lists the Figure 4: Conditioned duplicate detection. Left: Movie titles The Grudge columns containing the issue and the name Source of the detection routine The Grudge by clustered 2 Levenshtein distance reveal over 200 potential du- that flagged the anomaly. The analyst clicks the MPAA Rating The Pledge Major Genre plicates. Right: Conditioning the clustering routine on ‘Release label in the missing values group. In response, Profiler displays the Toy Story Distributor Year’ reduces Toy Story 2 the number of potential duplicates to 10. MPAA Rating data as a categorical bar chart showing the counts Toy Story 3 Release Location for each rating type. The chart title includes a data summary bar: A25th histogram Hour reveals a small number of high grossing movies. To Errorindicate green bars indicate parsed values, red bars (3) type verification Rush Hour explanatory visualizations, the analyst selects ‘Data Val- generate Rush Hour 2 errors, and grey bars indicate missingExtreme values. (7) ues’ Rush from Hour 3 the related views menu — this operation requests views Inconsistent Curious why so many values are missing, (3) the analyst adds re- that Scarymight Movie help explain the total distribution of Worldwide Gross, lated visualizations by selecting the Distributor ‘Anomaly’ option in the re- (Levenshtein) Scary not Movie just 2 flagged anomalies. She mouses over the bars at the high Scary Movie 3 lated views menu — this operation requests Source views that might ex- (Levenshtein) end Scaryof the4Worldwide Gross histogram and sees that these values Movie plain the observed anomaly. She then selects Title the grey bar in the (Levenshtein) correlate American Pie with high values in other financial figures, such as U.S. MPAA Rating chart to see how missing values project across other American Gross Pie 2 (Figure 2). She notices that Action and Adventure movies Schema (1) columns (Figure 1). She finds that missing ratings correlate with Big Momma's account forHouse a disproportionate number of highly grossing movies. Big Momma's House 2 early release dates. While this is interesting, she determines that The time-series view reveals that these films spike during the sum- Cheaper by the Dozen the missing values don’t have a strong relationship with any finan- mer andbyholiday Cheaper the Dozenseasons. 2 The view groups release dates by month cial figures. This result holds for other columns with missing data. rather Child's than Play year, as binning by month produces a stronger relation- The analyst next decides to look at extreme values in financial Child's ship Play Worldwide with 2 Gross. The analyst is now confident that the Creepshow figures and clicks Worldwide Gross in the ‘Extreme’ anomaly list. outliers represent Creepshow 2 exceptional performance, not errors in the data. Doctor Dolittle Doctor Dolittle 2 Friday the 13th Part 2 Friday the 13th Part 3 Kill Bill: Volume 1 Kill Bill: Volume 2 Men in Black Men in Black 2

4. (a) Data Table (d) Anomaly Detection (e) View Recommendation The analyst decides to explore the seemingly strong relationship t1 t2 t3 [...][...][...] a([...]) [][] [...][...][...][...][...][][] anomalies derived anomalies between Worldwide Gross and U.S. Gross. The analyst first se- 1 .8 1 .8 Æ 0 .1 0 .1 lects ‘None’ in the related views menu to de-clutter the canvas. She 1 .7 1 .7 drags U.S. Gross from the schema viewer onto the histogram dis- playing Worldwide Gross to create a binned scatterplot. The data t1 t2 t3 t1 t2 t3 [...][...][...] [...][...][...][...][...] derived appear to be log-normally distributed so she uses the chart menu to set log scales for the axes. She notes outlying cells containing very low U.S Gross values compared to Worldwide Gross. She adds a (b) Type Inference (c) Feature Generation (f) Interactive Visualization map visualization by dragging Release Location to the canvas and confirms that most of these movies were released outside the U.S Figure 5: The Profiler Architecture. An (a) input table is an- (Figure 3). The analyst decides to filter these data points from the alyzed to (b) infer types for each column. Type information is data set so she chooses a filter transform from the transformation used to (c) generate features prior to running (d) anomaly de- menu. The formula editor shows a predicate based on the current tection routines. The results of anomaly detection and mutual selection criteria and the analyst hits return to filter the points. information analysis are used to perform (e) view recommen- The analyst notices that the Release Location map contains a red dation and populate a set of (f) interactive visualizations. bar indicating erroneous country values. She decides to toggle the ture extractors and a set of data mining routines. Using detected map visualization to a bar chart to inspect the erroneous values. She anomalies and the mutual information between columns, the Rec- clicks the small arrow at the top-right of the chart to open the chart ommender suggests visualizations to help an analyst assess poten- menu and changes the visualization type. She filters the bar chart tial issues. The View Manager presents linked summary visualiza- to only show erroneous values and sees a few ‘None’ and ‘West tions; it generates type-specific visualizations and executes coor- Germany’ values. To fix these errors, the analyst selects a replace dinated queries across views to support brushing and linking. We transform in the formula editor menu and then specifies parameters; now describe each of these components in detail. e.g., replace(Release Location, ‘West Germany’, ‘Germany’). Next, the analyst inspects the ‘Inconsistency’ list in the anomaly 4.1 Type Registry browser. The analyst clicks on Title in order to spot potential dupli- The Type Registry consists of a set of type definitions and rou- cate records. Profiler responds by showing a grouped bar chart with tines for type inference. Each column in a data table is assigned a movie titles clustered by textual similarity (Figure 4). Unsurpris- type, whether automatically via inference or manually by the user. ingly, the analyst sees that movies and their sequels are clustered At minimum, a Profiler type is defined by a binary verification together. There also appear to be potential remakes of classic films. function: given an input value, the function returns true if the value The analyst worries that there might also be misspellings of some is a member of the type and false otherwise. Verification functions films, but does not want to verify all the clusters by hand. The include regular expression matches, set membership (e.g., dictio- analyst reasons that true duplicates are likely to have the same Re- nary lookup of country names) and range constraints (e.g., pH be- lease Date and so decides to condition the text clustering anomaly tween 0-14). Profiler associates a type with an entire column, but detector on Release Date. The analyst clicks ‘Levenshtein’ next not all values in the column necessarily satisfy the type definition. to Title in the anomaly browser. A menu appears which includes Profiler includes built-in support for primitive types — boolean, selection widgets for conditioning anomaly detection on another string, and numeric (int, double) — and higher-order types such column. After rerunning the detector, there are significantly fewer as dates and geographic entities; e.g., state/country names, FIPS anomalies to check. The analyst is satisfied that there are no dupli- codes, zip codes. Profiler’s detector and view manager components cate entries and continues with her analysis. require that all columns be assigned to a data type. The type system is extensible: as new types are defined, anomaly detection and vi- 4. SYSTEM ARCHITECTURE sualization methods can be specified in terms of pre-existing types Underlying the Profiler application is an extensible architecture or new components (e.g., a novel type-specific visualization) that that combines statistical algorithms and coordinated visualizations. plug-in to the Profiler architecture. The system is implemented in JavaScript, and is intended to run A type definition may also include a set of type transforms and inside browsers with optimized JavaScript execution engines. The group-by functions. A type transform is a function that maps be- architecture consists of five major components. tween types (e.g., zip code to lat-lon coordinate). These functions First, Profiler represents data tables using a memory-resident form a graph of possible type conversions, some of which may be column-oriented relational database. The database supports stan- lossy. User-defined types can include type transforms to built-in dard SQL-style queries for filtering, aggregation, and generating types to leverage Profiler’s existing infrastructure. Group-by func- derived columns. Unlike standard SQL databases, Profiler uses a tions determine how values can be grouped to drive scalable visu- relaxed type system: values can deviate from their column’s de- alizations. For instance, numeric types can be binned at uniform fined type. Profiler flags these values as inconsistent; they appear intervals to form histograms, while dates may be aggregated into in red within a chart’s quality summary bar. The same database sys- meaningful units such as days, weeks, months or years. tem also powers the Wrangler [16] data transformation tool. Pro- Type inference methods automatically assign a type to each col- filer has access to the Wrangler data transformation language and umn in a data table based on the Minimum Description Length extends it with additional transforms, including more advanced ag- principle (MDL) [26]. MDL selects the type that minimizes the gregation operations such as binning numeric data to compute his- number of bits needed to encode the values in a column. MDL has tograms and mathematical operations for deriving new columns. been used effectively in prior data cleaning systems, such as Pot- The rest of the Profiler architecture consists of four modular ter’s Wheel [25]. We use the same MDL formulation in Profiler. components (Figure 5). The Type Registry contains data type def- initions and a type inference engine. Profiler uses types to choose 4.2 Detector appropriate anomaly detection routines and visualizations. The De- Profiler’s Detector applies a collection of type-specific data min- tector performs anomaly detection by combining type-aware fea- ing routines to identify anomalies in data.

5.Type Issue Detection Method(s) Visualization tines. The following list focuses on the most common needs and Missing Missing record Outlier Detection | Residuals then Histogram, Area Moving Average w/ Hampel X84 Chart demonstrates the diversity of routines that the system supports. Frequency Outlier Detection | Hampel Histogram, Area Missing value detection identifies cells that do not contain data. X84 Chart Missing value Find NULL/empty values Quality Bar Type verification functions identify values inconsistent with a Inconsistent Measurement units Clustering | Euclidean Distance Histogram, Scatter given column type (Sec. 4.1). Verification can flag incorrect use of Plot Outlier Detection | z-score, Hampel X84 Histogram, Scatter physical types (e.g., strings vs. integers) or constraint violations. Plot Misspelling Clustering | Levenshtein Distance Grouped Bar Chart Clustering is used to detect a variety of errors relative to a cho- Ordering Clustering | Atomic Strings Grouped Bar Chart sen distance metric. Euclidean distance is useful for detecting nu- Representation Clustering | Structure Extraction Grouped Bar Chart Special characters Clustering | Structure Extraction Grouped Bar Chart meric outliers and inconsistent measurement units. Character-based Incorrect Erroneous entry Outlier Detection | z-score, Hampel X84 Histogram (Levenshtein distance), token-based (Atomic Strings), and phonetic- Extraneous data Type Verification Function Quality Bar Misfielded Type Verification Function Quality Bar based (soundex) distances are useful for detecting inconsistencies Wrong physical data Type Verification Function Quality Bar in text such as misspellings, different term orderings, and phonet- type Extreme Numeric outliers Outlier Detection | z-score, Hampel Histogram, Scatter ically similar words [7]. We use nearest neighbor agglomerative X84, Mahalanobis distance Plot hierarchical clustering with each distance metric. Time-series outliers Outlier Detection | Residuals vs. Moving Area Chart Average then Hampel X84 Univariate outlier detection routines identify extreme and pos- Schema Primary key violation Frequency Outlier Detection | Unique Bar Chart Value Ratio sibly incorrect values for numeric and time-based data. We apply both z-scores and Hampel X84 — a routine based on median abso- Figure 6: Taxonomy of Data Quality Issues. We list classes lute deviation — to detect univariate quantitative outliers [10]. of methods for detecting each issue, example routines used in Frequency outlier detection identifies values that appear in a set Profiler, and visualizations for assessing their output. more or less often then expected. Frequency outliers are commonly used to detect primary key violations. Profiler uses the unique value 4.2.1 The Detection Pipeline ratio to detect primary keys [10]. We use numerical outlier routines on aggregated counts to detect other types of anomalies, such as The Detector determines which anomaly detection routines to gaps in ranges which may indicate missing observations. apply, runs them, and produces output for visualization. This pro- Profiler supports two methods of multivariate outlier detection. cess has two phases: feature generation and anomaly detection. First, detection routines can accept multiple columns as input. For During feature generation, the Detector derives features of the example, Mahalanobis distance can be used to detect multivariate input columns to use as input to anomaly detection routines. Fea- numeric outliers [10]. Second, conditioning is a general method tures are extracted using unary transformations called generators. for converting any routine into a multivariate routine. Conditioning For example, a generator might compute the lengths of string val- applies an existing routine to subsets of data, grouped by one or ues; an anomaly detector might then compute z-scores to flag ab- more variables (e.g., categorical or binned quantitative values). For normally long strings. The Detector maintains a list of appropri- instance, conditioning the z-score routine on genre calculates the ate generators (including the identity function) for each type in the scores for values within each genre separately. To support condi- Type Registry. Given an input table, the Detector applies genera- tioning, Profiler uses a partitioner that applies any transformation tors to each input column according to its type signature. The result to data subsets formed by applying specified group-by functions. is a set of feature columns that serve as input to anomaly detectors. The space of possible routines is combinatorially large and the Detection routines then analyze the feature columns. Detection results of these routines can be difficult to interpret. As a result, routines accept columns as input and output two columns: a class Profiler does not automatically run multivariate outlier detection column and a certainty column. The class column contains inte- routines by default. Users can initiate multivariate outlier detection gers; 0 indicates that no anomaly was found in that row. Non-zero by adding conditioning columns to existing univariate detectors. values indicate the presence of an anomaly and distinct integers in- dicate distinct classes of anomaly. For example, the z-score routine outputs a class column where each value is either 0 (within 2 stan- 4.3 View Recommendation dard deviations from the mean), -1 (< 2 stdev), or 1 (> 2 stdev). The For a given anomaly, the Recommender automatically populates certainty column represents the strength of the routine’s prediction. the View Manager (discussed next) with relevant visual summaries. For z-scores, these values indicate the distance from the mean. Generating summary views requires recommending a view specifi- The Detector organizes detection routines by the data types they cation — a set of columns to visualize and type-appropriate group- can process. After feature generation, the system visits each col- by functions for aggregation. A view specification can also include umn in the data table (including derived columns) and runs all rou- anomaly class and certainty columns to parameterize a view. The tines with a compatible type. For instance, the z-score routine is recommender always specifies a primary view that visualizes the applied to all numeric columns. The standardized output of class column(s) that contain the anomaly. The recommender also deter- and certainty columns is then handled in a general fashion by the mines a set of related views using a model based on mutual infor- downstream Recommender and View Manager components. mation. The Recommender supports two types of related views. The Detector’s output appears in the anomaly browser. This Anomaly-oriented views show columns that predict the presence of browser lists any result of a detection routine that contains at least anomalies. Value-oriented views show columns that best explain one anomalous value (i.e., a non-zero value in the class column), the overall distribution of values in the primary column(s). Users grouped by the type of detection routine and sorted by decreasing select which type of view to show with the related view menu. anomaly count. The browser displays the columns containing the anomaly and which routines detected the anomaly. When a user 4.3.1 Mutual Information clicks an item, relevant views appear in the canvas. The mutual information of two variables quantifies how much knowing the value of one variable reduces the uncertainty in pre- 4.2.2 Detection Routines dicting a second variable. It is equivalent to the reduction in entropy Profiler incorporates a variety of detection routines to flag data attained by knowing a second variable. Mutual information is non- anomalies (Figure 6), and can easily be extended with new rou- negative and has a minimum of 0 when variables are independent.

6.To compare mutual information across pairs of variables, we de- fine a distance metric D that is 0 for completely dependent vari- ables and increases as the mutual information between variables decreases. For variables X and Y with mutual information I(X,Y ) and entropies H(X) and H(Y ), we define D as: I(X,Y ) D(X, Y ) = 1 − (1) max(H(X), H(Y )) 4.3.2 Recommendation We use the metric D to recommend both the primary view and related views. A view specification determines how data is aggre- gated for a visual summary by assigning each row of input a group id (e.g., a bin in a histogram or binned scatterplot). In this way, we Figure 7: Adding perceptual discontinuity to summary views. can derive a column of group ids from a view specification. We de- Left: A binned scatter plot using a naive opacity ramp from fine ViewToColumn as a function that converts a view specification 0-1. Right: An opacity scale with a minimum non-zero opacity into a column of group ids. For a set of columns C, we use V SC ensures perception of bins containing relatively few data points. to refer to the set of all possible view specifications containing one column from C and a type-appropriate group-by function. scatter plot difficult to read due to overplotting. Profiler’s sum- The primary view always displays the set of columns that contain mary visualizations use aggregation to scale to a large number of the anomaly. Our goal is to produce a summary view with bins that records [3, 14, 20, 28, 35]: the number of marks in each view de- minimize the overlap of anomalies and non-anomalies so that ana- pends primarily on the number of bins, not the number of records. lysts can better discriminate them. Recall that the class column out- To compute aggregates, each view requires a group-by function put by the Detector indicates the presence of anomalies. We enu- that specifies a binning strategy. For automatically generated views, merate pairs of {column, group-by f unctions} and select the pair bins are determined by the Recommender. When a user manually that best predicts the class column. More formally, if A is the set of selects columns to visualize, Profiler chooses a group-by function columns containing the anomaly, we recommend the view specifi- based on the range of data values. Users can also select group-by cation vs ∈ V SA that minimizes the quantity D(ViewToColumn(vs), functions or type transformations through a view’s context menu. class). This primary view specification (denoted pvs) is assigned Histograms (numeric data), area charts (temporal data), choro- the class and certainty columns as parameters. pleth maps (geographic data) and binned scatter plots (2D nu- To suggest anomaly-oriented views, we find other columns that meric or temporal data) visualize counts of values within binned best predict the class column. We consider the set of all columns R ranges. Though Carr [3] recommends hexagonal binning of scatter that exclude the columns in C. We then choose view specifications plots for improved density estimation, we currently use rectangular from V SR that predict the class column. We sort specifications vs ∈ binning to enable better query and rendering performance. V SR by increasing values of D(ViewToColumn(vs), class). The Profiler uses bar charts to visualize the frequencies of distinct Recommender populates the View Manager with the corresponding nominal values. Sorting the bars by frequency helps analysts assess visual summaries in sort order until the canvas is full, discarding the distribution of values within a set. Grouped bar charts display summaries that contain columns already visualized. the frequencies of clustered values (e.g., clusters of possible dupli- We use a similar process to recommend value-oriented views. cate values). For columns with high cardinality, it is not feasible to Value-oriented views show visualizations related to the entire dis- show all the bars at once. In response, Profiler also visualizes the tribution of values in the primary view, not just anomalies. In- distribution in the chart’s scroll bar. We perform windowed aggre- stead of predicting the class column, we predict the group ids gen- gation over contiguous bars to form summary counts; the window erated by the primary view specification. We sort specifications size is adjusted as needed to match the available pixel resolution. vs ∈ V SR by D(ViewToColumn(vs), ViewToColumn(pvs)). Be- Data quality bars summarize column values as valid, type er- cause V SR only contains view specifications with one column, only rors, or missing. Profiler annotates each visualization with one univariate summaries are suggested. Our approach extends to mul- or more quality bars to indicate missing or non-conforming data. tiple columns if we augment R to include larger subsets of columns. Quality bars also act as input regions for brushing and linking. Higher-dimensional views are depicted using small multiples. 4.4 View Manager Any Profiler visualization can be used in a trellis plot, with subplots The View Manager converts view specifications into a set of showing data subdivided by one or more conditioning variables. linked visual summaries. The View Manager creates type-specific Finally, Profiler’s table display presents the raw data. Analysts views to reveal patterns such as gaps, clusters and outliers. A query can filter table rows by brushing summary visualizations. engine for filtering and aggregating data supports rapid brushing Profiler visualizations also incorporate design considerations for and linking across summaries, allowing an analyst to determine varying levels of scale. Naïve scaling of bar heights and color how subsets of data project across other dimensions. In addition to ramps can result in low-frequency bins that are essentially invisible automatic view recommendation, analysts can manually construct due to minuscule bars or near-white colors. This is unacceptable, views through drag-and-drop and menu-based interactions. Pro- as outliers often reside in low-frequency bins. We induce a percep- filer visualizations are implemented in SVG using the D3 library tual discontinuity in these scales so that low-frequency bins remain [2]. We now detail the design of the View Manager, including op- identifiable: we give small bars a minimum height and make col- timizations for rendering and query performance. ors for any non-zero values suitably distinguishable from the back- ground (Figure 7). In addition, different tasks may require visu- 4.4.1 Summary Visualizations alizing data at different levels of detail. Profiler time-series charts Visualizing “raw” data is increasingly difficult with even mod- support binning by a variety of time spans (day, month, year, etc). erately sized data — even a few hundred data points may make a Maps include panning and zooming controls.

7. 5HVXO W V 'L PV 0HDVXU H1DPHV %L QV 5RZV [ 63/20 [ 63/20 [ 63/20 U W  . TW dependently normally distributed. The others are linearly or log- . linearly dependent with the first column. We visualize the data 0 as a SPLOM with univariate histograms along the diagonal. We  . . then programmatically brush the bins in each univariate histogram. 0 This approach provides a conservative estimate of performance, as  . . brushing scatter plot bins results in smaller selections and hence 0 faster processing. We varied the number of visualized columns (3,          4, 5), bin count (10, 20, 30), and data set size (10K, 100K, 1M 0L O OL VHFRQGV 0L O OL VHFRQGV 0L O OL VHFRQGV rows). For each condition, we averaged the query and render times 4W DQG UW I RU HDFK 5RZV EURN HQ GRZQ E\ 'LPV YV %L QV  &RO RU VKRZV GHW DLO VDERXW TW DQG UW  across 5 runs. The benchmarks were performed in Google Chrome Figure 8: Performance (in ms) of linked highlighting in a scat- v.16.0.912.75 on a quad-core 2.66 GHz MacPro (OS X 10.6.8) with ter plot matrix (SPLOM). Orange bars represent query pro- per-core 256K L2 caches, a shared 8MB L3 cache and 8GB RAM. cessing time, blue bars represent rendering time. We varied the Our results demonstrate interactive rates with million-element number of dimensions, bins per dimension and data set size. In data sets (Figure 8). We see that the number of columns and num- most cases we achieve interactive (sub-100ms) response rates. ber of bins have a greater impact on performance than the num- ber of data points. We also performed an analysis of variance Each view can be parameterized using the class and certainty (ANOVA) to assess the contribution of each factor to the average columns generated by an anomaly detector. The bar chart and small response time. We found significant effects for SPLOM dimension multiples views enable sorting by class and certainty. By default (F2,20 = 21.4, p < 0.0001) and bin count (F2,20 = 14.8, p = 0.0001). we sort in descending order to reveal anomalies with higher cer- However, we failed to find a significant effect of data set size (F2,20 = tainty; e.g., a grouped bar chart will sort clusters of similar words 1.2, p = 0.3114), lending credence to our claim of scalability. by the certainty that the cluster contains misspelled items, with groupings determined by the class column. 5. INITIAL USAGE 4.4.2 Scalable Linked Highlighting We have conducted informal evaluations of Profiler on a variety When a user selects a range of values (e.g., by mouse hover), of data sets — including water quality data, a disasters database, Profiler highlights the projection of that data across all views. To obesity data, a world wide quality-of-life index, and public govern- do so, Profiler first filters the backing data table to include only ment data. We now describe two concrete examples of how Profiler the selected range. For each view Profiler then computes an ag- has enabled rapid assessment of these data sets. gregate query to determine the count of selected items in each bin. The disasters database contains 11 columns, including the type, These data points are visualized as orange highlights overlaid over location, and consequences (cost, number affected) of the disaster. the original view (see Figure 1). Linked selection extends to all Profiler identified 13 data quality issues. These include 2 columns visualizations, including quality bars, scrollbars, and table views. containing duplicates due to inconsistent capitalization, 6 columns To support scalable, real-time interaction we addressed two per- with missing values, and 3 columns with extreme values. For ex- formance challenges: query execution and rendering. To reduce the ample, Profiler detected disasters with extremely high monetary query load we first simplify the data. Multiple records in the input cost. The recommended views include the Type column. Upon data often map to the same bin. In response we pre-aggregate the selecting large values in the Cost histogram, it became evident that data, grouping by the bins for all visualized attributes. With a suit- the vast majority of these outliers were floods, storms or droughts. able number of bins per dimension (typically 10-30) this step can By selecting these values in the Type bar chart, we confirmed that reduce the number of records by one to two orders of magnitude. these disaster types typically lead to higher cost. For columns with To further reduce query time, we encode non-numeric types as missing values, Profiler primarily recommends columns with co- zero-based integers. Integer codes speed element comparisons and occurrences of missing values. For instance, rows missing values simplify the calculation of dimensional indices during aggregation. in a Killed column also tend to have missing values in the Cost, Sub The original values are encoded in sort order and stored in a lookup Type, and Affected columns. Because of this, the recommended table for reference. To facilitate optimization by the browser’s just- views for each of these anomalies were very similar. Assessing in-time compiler, the inner loop of the query executor avoids func- data quality in this data set took just a few minutes. tion calls. We also cache query results to eliminate redundant com- We also tested Profiler on World Water Monitoring Day data. putation. For example, in a scatter plot matrix (SPLOM) cross- Each year, thousands of people collect water quality data using test diagonal plots visualize the same data, only transposed. kits; they manually record the name and location of the body of Rendering bottlenecks also limit performance. Even with aggre- water as well as measurements such as Turbidity and pH. The data gated views, the number of marks on-screen can grow large: a 4 × 4 contains 34 columns. Profiler identifies 23 columns with missing SPLOM containing plots with 50 × 50 bins requires rendering up data, 2 with erroneous values, 5 containing outliers and 5 contain- to 40,000 marks. To speed rendering we minimize modifications ing duplicates. For instance, the Air Temperature column contains to the Document Object Model (DOM) in each interactive update. extremely low temperatures. Profiler recommends a world map and To avoid churn, we introduce all SVG DOM elements (including a visualization of the date of collection, revealing that the extreme highlights) upon initialization. Each update then toggles a minimal lows were collected in Russia during winter. The data set also con- set of visibility and style properties. We also try to take advantage tains many duplicates. Data collectors often refer to the same city of optimized rendering pathways, for example by using squares in- by slightly different names, resulting in hundreds of potential du- stead of hexagons in binned scatter plots. plicates. After inspecting a few duplicate sets, we conditioned text clustering on the State column to simplify the clustered bar charts 4.4.3 Performance Benchmarks significantly. However, Profiler also flagged possible duplicates in We benchmarked query and render times during interactive brush- the State column itself, prompting us to resolve duplicates there ing and linking. For our test data, we sample from random dis- first. Profiler also flagged the Site Country name for containing er- tributions for up to five columns. Three of the columns are in- roneous country names; a recommended bar chart shows that peo-

8.ple enter extra specifics, such as “Congo, Republic of (Brazaav- [13] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Tane: An ille).” We then corrected these values to enable proper aggregation. efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999. [14] D. Huynh and S. Mazzocchi. Google Refine. 6. CONCLUSION http://code.google.com/p/google-refine/. In this paper we presented Profiler, an extensible system for data [15] Z. G. Ives, C. A. Knoblock, S. Minton, M. Jacob, P. Pratim, T. R. quality assessment. Our system architecture can support a flexible Tuchinda, J. Luis, A. Maria, and M. C. Gazen. Interactive data set of data types, anomaly detection routines and summary visual- integration through smart copy & paste. In Proc. CIDR, 2009. izations. Our view recommendation model facilitates assessment of [16] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In data mining routines by suggesting relevant visual data summaries Proc. ACM CHI, pages 3363–3372, 2011. according to the mutual information between data columns and de- [17] H. Kang, L. Getoor, B. Shneiderman, M. Bilgic, and L. Licamele. tected anomalies. We demonstrated how the appropriate selection Interactive entity resolution in relational data: A visual analytic tool of linked summary views aids evaluation of potential anomalies and and its evaluation. IEEE TVCG, 14(5):999–1014, 2008. their causes. We also discussed optimizations for scaling query [18] D. Keim. Information visualization and visual data mining. IEEE and rendering performance to provide interactive response times TVCG, 8(1):1–8, 2002. for million element data sets within modern web browsers. By in- [19] W. Kim, B.-J. Choi, E.-K. Hong, S.-K. Kim, and D. Lee. A taxonomy tegrating statistical and visual analysis, we have found that Profiler of dirty data. Data Mining & Knowl. Discovery, 7(1):81–99, 2003. can reduce the time spent diagnosing data quality issues, allowing [20] R. Kosara, F. Bendix, and H. Hauser. TimeHistograms for large, time-dependent data. In Proc. VisSym, pages 45–54, 2004. domain experts to discover issues and spend more time performing [21] J. Lin, J. Wong, J. Nichols, A. Cypher, and T. A. Lau. End-user meaningful analysis. programming of mashups with Vegemite. In Proc. Intelligent User In future work, we plan to evaluate Profiler through both con- Interfaces, pages 97–106, 2009. trolled studies and public deployments on the web. We intend to [22] C. North and B. Shneiderman. Snap-together visualization: A user develop a tool for end users to define custom types (c.f., [29]) and interface for coodinating visualizations via relational schemata. In to incorporate detectors and visualizations for additional data types Proc. Advanced Visual Interfaces, pages 128–135, 2000. such as free-form text. Our query engine is currently limited to [23] A. Perer and B. Shneiderman. Systematic yet flexible discovery: data that fits within a browser’s memory limit. Future work might guiding domain experts through exploratory data analysis. In Proc. Intelligent User interfaces, pages 109–118, 2008. examine hybrid approaches that combine server-side aggregation [24] E. Rahm and H. H. Do. Data cleaning: Problems and current with client-side interactive querying. Our model for view recom- approaches. IEEE Data Engineering Bulletin, 23, 2000. mendation currently uses pairwise mutual information, which is in- [25] V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data sensitive to redundant dependencies between data. Other methods, cleaning system. In VLDB, pages 381–390, 2001. such as structure learning of Bayesian networks, might account for [26] J. Rissanen. Modeling by shortest data description. Automatica, conditional dependencies between sets of columns to side-step re- 14(5):465–471, 1978. dundancy and further improve view ranking. [27] G. G. Robertson, M. P. Czerwinski, and J. E. Churchill. Visualization of mappings between schemas. In Proc. ACM CHI, pages 431–439, 2005. 7. ACKNOWLEDGEMENTS [28] G. E. Rosario, E. A. Rundensteiner, D. C. Brown, M. O. Ward, and The first author was supported by a Stanford Graduate Fellow- S. Huang. Mapping nominal values to numbers for effective ship. This work was partially funded by NSF grant CCF-0964173. visualization. Information Visualization, 3(2):80–95, 2004. [29] C. Scaffidi, B. Myers, and M. Shaw. Intelligently creating and recommending reusable reformatting rules. In Proc. Intelligent User 8. REFERENCES Interfaces, pages 297–306, 2009. [1] P. D. Allison. Missing Data. Sage Publications, 2001. [30] J. Seo and B. Shneiderman. A rank-by-feature framework for [2] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. interactive exploration of multidimensional data. Information IEEE TVCG, 17(12):2301–2309, 2011. Visualization, 4(2):96–113, 2005. [3] D. B. Carr, R. J. Littlefield, W. L. Nicholson, and J. S. Littlefield. [31] C. Stolte, D. Tang, and P. Hanrahan. Polaris: a system for query, Scatterplot matrix techniques for large N. Journal of the American analysis, and visualization of multidimensional relational databases. Statistical Association, 82(398):424–436, 1987. IEEE TVCG, 8(1):52–65, 2002. [4] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A [32] D. F. Swayne, D. T. Lang, A. Buja, and D. Cook. GGobi: evolving survey. ACM Comput. Surv., 41(3):1–58, July 2009. from XGobi into an extensible framework for interactive data [5] T. Dasu and T. Johnson. Exploratory Data Mining and Data visualization. Comp. Stat. & Data Analysis, 43(4):423–444, 2003. Cleaning. John Wiley & Sons, Inc., 2003. [33] R. S. Tsay. Outliers, level shifts, and variance changes in time series. [6] W. Eckerson. Data quality and the bottom line: Achieving business Journal of Forecasting, 7(1):1–20, 1988. success through a commitment to high quality data. Technical report, [34] R. Tuchinda, P. Szekely, and C. A. Knoblock. Building mashups by The Data Warehousing Institute, 2002. example. In Proc. Intelligent User Interfaces, pages 139–148, 2008. [7] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate [35] A. Utwin, M. Theus, and H. Hofmann. Graphics of Large Datasets: record detection: A survey. IEEE TKDE, 19(1):1–16, 2007. Visualizing a Million. Springer, 2006. [8] D. Guo. Coordinating computational and visual approaches for [36] C. Weaver. Building highly-coordinated visualizations in Improvise. interactive feature selection and multivariate clustering. Information In Proc. IEEE InfoVis, pages 159–166, 2004. Visualization, 2(4):232–246, 2003. [37] J. Yang, M. O. Ward, E. A. Rundensteiner, and S. Huang. Visual [9] L. M. Haas, M. A. Hernández, H. Ho, L. Popa, and M. Roth. Clio hierarchical dimension reduction for exploration of high dimensional grows up: from research prototype to industrial tool. In ACM datasets. In Proc. VisSym, pages 19–28, 2003. SIGMOD, pages 805–810, 2005. [10] J. M. Hellerstein. Quantitative data cleaning for large databases, 2008. White Paper, U.N. Economic Commission for Europe. [11] A. Hinneburg, D. Keim, and M. Wawryniuk. HD-Eye: visual mining of high-dimensional data. IEEE CG&A, 19(5):22 –31, 1999. [12] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Rev., 22(2):85–126, 2004.