VisComplete: Automating Suggestions for Visualization Pipelines

In this paper, we propose techniques to help users construct pipelines by consensus—automatically suggesting completions based on a database of previously created pipelines. In particular, we compute correspondences between existing pipeline subgraphs from the database, and use these to predict sets of likely pipeline additions to a given partial pipeline. By presenting these predictions in a carefully designed interface, users can create visualizations and other data products more efficiently because they can augment their normal work patterns with the suggested completions.

1. VisComplete: Automating Suggestions for Visualization Pipelines David Koop Carlos E. Scheidegger Steven P. Callahan Juliana Freire ´ Claudio T. Silva Abstract—Building visualization and analysis pipelines is a large hurdle in the adoption of visualization and workflow systems by domain scientists. In this paper, we propose techniques to help users construct pipelines by consensus—automatically suggesting completions based on a database of previously created pipelines. In particular, we compute correspondences between existing pipeline subgraphs from the database, and use these to predict sets of likely pipeline additions to a given partial pipeline. By presenting these predictions in a carefully designed interface, users can create visualizations and other data products more efficiently because they can augment their normal work patterns with the suggested completions. We present an implementation of our technique in a publicly-available, open-source scientific workflow system and demonstrate efficiency gains in real-world situations. Index Terms—Scientific Workflows, Scientific Visualization, Auto Completion ✦ 1 I NTRODUCTION Data exploration through visualization is an effective means to un- and can be helpful in situations when the user knows in advance what derstand and obtain insights from large collections of data. Not sur- they want the end result to be. However, during pipeline creation, it prisingly, visualization has grown into a mature area with an es- is not always the case that the user has an analogy template readily tablished research agenda [26], and a number of software systems available for the visualization that is desired. In these cases, the user have been developed that support the creation of complex visualiza- is relegated to manually searching for examples. tions [18, 6, 23, 35, 19, 29, 13, 36]. However, a wider adoption of In this paper, we present VisComplete, a system that aids users in visualization systems has been greatly hampered due to the fact that the process of creating visualizations by using a database of previously these systems are notoriously hard to use, in particular, for users who created visualization pipelines. The system learns common paths used are not visualization experts. in existing pipelines and predicts a set of likely module sequences that Even for systems that have sophisticated visual programming inter- can be presented to the user as suggestions during the design process. faces, such as DX, AVS and SCIRun, the path from the raw data to The quality and nature of the suggestions depend on the data they are insightful visualizations is laborious and error-prone. Visual program- derived from. Whereas in a single-user environment, suggestions are ming interfaces expose computational components as modules and al- derived based on pipelines created by a specific user, in a multi-user low the creation of complex visualization pipelines which combine environment, the “wisdom of the crowds” can be leveraged to derive these modules in a dataflow, where connections between modules ex- a richer set of suggestions that includes examples the user is not fa- press the flow of data through the pipeline. They have been shown miliar with. User collaboration and social data reuse has proven to to be useful for comparative visualization and efficient exploration be a powerful mechanism in various domains, such as recommenda- of parameter spaces [2]. Through the use of a simple programming tion systems in commercial settings (e.g., Amazon, e-Bay, Netflix), model (i.e., dataflows) and by providing built-in constraint checking knowledge sharing on open Web sites (e.g., Wikipedia), image label- mechanisms (e.g., that disallow a connection between incompatible ing for computer vision (e.g., ESPGame) and visualization creation module ports), they ease the creation of pipelines. Notwithstanding, (e.g., ManyEyes). The underlying theme shared by these systems without detailed knowledge of the underlying computational compo- is that they use information provided by many users to solve prob- nents, it is difficult to understand what series of modules and connec- lems that would be difficult otherwise. We apply a similar concept to tions ought to be added to obtain a desired result. In essence, there is pipeline creation: pipelines created by many users enable the creation no “roadmap”; systems provide very little feedback to help the user of visualizations by consensus. For the user, VisComplete acts as an figure out which modules can or should be added to the pipeline. A auto-complete mechanism for pipelines, suggesting modules and con- novice user (i.e., an experienced programmer that is unfamiliar with nections in a manner similar to a Web browser suggesting URLs. The the modules and the dataflow of the system), or even an advanced user completions are presented graphically in a way that allows the user to performing a new task, often resorts to manually searching for exist- easily explore and accept suggestions or disregard them and continue ing pipelines to use as examples. These examples are then adapted working as they were. Figure 1 shows an example of VisComplete and iteratively refined until a solution is found. Unfortunately, this incorporated into a visual programming interface and Figure 2 shows manual, time-consuming process is the current standard for creating some example completions for a single module. visualizations rather than the exception. Recent work has shown that provenance information (the metadata Contributions and Outline. We propose a recommendation system required for reproducibility) can be used to simplify the process of that leverages information in a collection of pipelines to provide advice pipeline creation by allowing pipelines to be refined and queried by to users of visualization systems and aid them in the construction of example [30]. For example, a pipeline refinement can act as an anal- pipelines. By modeling pipelines as graphs, we develop an algorithm ogy template for creating new visualizations. This is a powerful tool for predicting likely completions that searches for common subgraphs in the collection. We also present an interface that displays the recom- mended completions in an intuitive way. Our preliminary experiments • David Koop and Juliana Freire are with the School of Computing at the show that VisComplete has the potential to reduce the effort and time University of Utah. email: {dakoop, juliana} required to construct visualizations. We found that the suggestions de- • Carlos E. Scheidegger, Steven P. Callahan, and Cl´audio T. Silva are with rived by VisComplete could have reduced the number of operations the Scientific Computing and Imaging (SCI) Institute at the University of performed by users to construct pipelines by an average of over 50%. Utah. email: {cscheid, stevec, csilva} Note that although in this paper we focus on the use of VisComplete for visualization pipelines, the techniques we present can be applied Manuscript received 31 March 2008; accepted 1 August 2008; posted online to general workflows. 19 October 2008; mailed on 13 October 2008. For information on obtaining reprints of this article, please send The rest of this paper is organized as follows. In Section 2 we dis- cuss related work. In Section 3 we present the underlying formalism for generating pipeline suggestions, and in Section 4 we describe a

2.Fig. 1: The VisComplete suggestion system and interface. (a) A user starts by adding a module to the pipeline. (b) The most likely completions are generated using indexed paths computed from a database of pipelines. (c) A suggested completion is presented to the user as semi-transparent modules and connections. The user can browse through suggestions using the interface and choose to accept or reject the completion. practical implementation that has been integrated into the VisTrails In graphics and visualization, recommendation systems have been system [35]. We then detail the use cases we envision in Section 5, proposed to simplify the creation of images and visualizations. Design report our experiments and results in Section 6, and provide a discus- Galleries [22] were introduced to allow users to explore the space of sion of our algorithm in Section 7. We conclude in Section 8, where rendering parameters by suggesting a set of automatically generated we outline directions for future work. thumbnails. Igarashi and Hughes [14] proposed a system for creating 3D line drawings that uses rules to suggest possible completions of 3D 2 R ELATED W ORK objects. Suggestions have also been used for view point selection in Visualization systems have been successfully used to bring powerful volume rendering. Bordoloi and Shen [38] and Takahashi et al. [33] visualization techniques to a wide audience. Seminal workflow-based present methods that analyze the volume from various view points to visualization systems such as AVS Explorer [36], Iris Explorer [27], suggest the view that best shows the features within the volume. Like and Visualization Data Explorer [13] have paved the way for more these systems, we provide the user with prioritized suggestions that the recent systems designed using an object-oriented approach such as user may choose to utilize. However, our suggestions are data-driven SciRun [29] for computational steering and the Visualization Toolkit and based on examples of previous interactions. (VTK) [19] for visualization. Systems that incorporate standard point- An emerging trend in image processing is to enhance images based and-click interfaces and operate on data at a larger scale, such as on a database of existing images. Hays and Efros [10] recently pre- VisIt [6] and ParaView [18], still use workflows as their underlying sented a system for filling in missing regions of an image by searching execution engine. Development in workflow systems for visualization a database for similar images. Along similar lines Lalonde et al. [21] is ongoing, as seen in projects such as MeVisLab [24] for medical vi- recently introduced Photo Clip Art, a method for intelligently insert- sualization and VisTrails [35] for incorporating existing visualization ing clip art objects from a database to an existing image. Properties libraries with other tools in a provenance capturing framework. Our of the objects are learned from the database so that they may be sized completion strategy can be combined with and enhance workflow and and oriented automatically, depending on where they are inserted into workflow-based visualization systems. the image. The use of databases for completion has also been used for Recommendation systems have been used in different settings. Like 3D modeling. Tsang et al. [34] proposed a modeling technique that VisComplete, these are based on methods that predict users’ actions utilizes previously created geometry stored in a database of shapes to based solely on the history of their previous interactions [12]. Exam- suggest completions of objects. Like these methods, our completions ples include Unix command-line prediction [20], prediction of Web are computed by learning from a database to find similarities. But requests [8, 28], and autocompletion systems such as IntelliSense [25]. instead of images, our technique relies on workflow specifications to Senay and Ignatius have proposed incorporating expert knowledge derive predictions. into a set of rules that allow automated suggestions for visualization Another important trend is that of social visualization. Web-based construction [31], while Gilson et al. incorporate RDF-based ontolo- systems such as VisPortal [3, 15] provide the means for collaborative gies into an information visualization tool [9]. However, these ap- visualization from disjoint locations. Web sites such as [11], proaches necessarily require an expert that can encode the necessary Swivel [32], and ManyEyes [37] allow many users to create, share, knowledge into a rule set or an ontology. and discuss visualizations. One key feature of these systems is that Fu et al. [8] applied association rule mining [1] to analyze Web they leverage the knowledge of a large group of people to effectively navigation logs and discover pages that co-occur with high frequency understand disparate data. Similarly, VisComplete uses a collection of in navigation paths followed by different users. This information is pipelines possibly created by many users to derive suggestions. then used to suggest potentially interesting pages to users. VisCom- plete also derives predictions based on user-derived data and does so 3 G ENERATING DATA - DRIVEN S UGGESTIONS in an automated fashion, without the need for explicit user feedback. VisComplete suggests partial completions (i.e., a set of structural However, the data it considers is fundamentally different from Web changes) for pipelines as they are being created by a user. These sug- logs: VisComplete bases its predictions on a collection of graphs and gestions are derived using structural information obtained from a col- it leverages the graph structure to make these predictions. Because lection G of already-completed pipelines. association rule mining computes rules over sets of elements, it does Pipelines are specified as graphs, where nodes represent modules not capture relationships (other than co-occurrence) amongst these el- (or processes) and edges determine how data flows through the mod- ements. ules. More formally, a pipeline specification is a directed acyclic graph

3.Fig. 2: Three of the first four suggested completions for a “vtkDataSetReader” are shown along with corresponding visualizations. The visual- izations were created using these completions for a time step of the Tokamak Reactor dataset that was not used in the training data. G(M,C), where M consists of a set of modules and C is a set of con- More concretely, we extract all possible paths of length N, and split nections between modules in M. A module is a complex object which them into a path of length N − 1 and a single vertex. Note that we do contains a set of input and output ports through which data flows in this in both forward and reverse directions with respect to the directed and out of the module. A connection between two modules ma and mb edges. This allows us to offer completions for pipeline pieces when connects an output port of ma to an input port of mb . they are built top-down and bottom-up. The path summary Gpath is Problem Definition. The problem of deriving pipeline completions stored as a set of (path, vertex) pairs sorted by the number of occur- can be defined as follows. Given a partial graph G, we wish to find a set rences in the database and indexed by the last vertex of the path (the of completions C(G) that reflect the structures that exist in a collection anchor). Since predictions begin at the anchor vertex, indexing the of completed graphs. A completion of G, Gc , is a supergraph of G. path summary by this vertex leads to faster access to the predictions. Our solution to this problem consists of two main steps. First, we As an example of the path summary generation, consider the graph pre-process the collection of pipelines G and create Gpath , a compact shown in Figure 3. We have the following upstream paths ending with representation of G that summarizes relationships between common D: A → C → D, B → C → D, C → D, and D. In addition, we also have structures (i.e., sequences of modules) in the collection (Section 3.1). the following downstream vertices: E and F. The set of correlations Given a partial pipeline p, completions are generated by querying between the upstream paths and downstream vertices is shown in Fig- Gpath to identify modules and connections that have been used in con- ure 3. As we compute these correlations for all starting vertices over junction with p in the collection (Section 3.2). all graphs, some paths will have higher frequencies than others. The frequency (or support) for the paths is used for ranking purposes: pre- 3.1 Mining Pipelines dictions derived from paths with higher frequency are ranked higher. To derive completions, we need to identify graph fragments that co- Besides paths, we also extract additional information that aid in the occur in the collection of pipelines G . Intuitively, if a certain fragment construction of completions. Because we wish to predict full pipeline always appears connected to a second fragment in our collection, we structures, not just paths, we compute statistics for the in- and out- ought to predict one of those fragments when we see the other. degrees of each vertex type. This information is important in deter- Because we are dealing with directed acyclic graphs, we can iden- mining where to extend a completion at each iteration (see Figure 4). tify potential completions for a vertex v in a pipeline by associating We also extract the frequency of connection types for each pair of subgraphs downstream from v with those that are upstream. A sub- modules. Since two modules can be connected through different pairs graph S is downstream (upstream) of a vertex v if for every v ∈ S, of ports, this information allows us to predict the most frequent con- there exists a path from v to v (v to v). In many cases where we wish nection type. to complete a graph, we will know either the downstream or upstream structure and wish to complete the opposite direction. Note that this 3.2 Generating Predictions problem is symmetric: we can change one problem to the other by Predicting a completion given the path summary and an anchor module simply reversing the direction of the edges. v is simple: given the set of paths associated with v, we identify the However, due to the (very) large number of possible subgraphs in G , vertices that are most likely to follow these paths. As shown in the generating predictions based on subgraphs can be prohibitively expen- following algorithm, we iteratively develop our list of predictions by sive. Thus, instead of subgraphs we use paths, i.e., a linear sequence adding new vertices using this criteria. of connected modules. Specifically, we compute the frequencies for each path in G . Completions are then determined by finding which path extensions are likely given the existing paths. Generating the Path Summary. To efficiently derive completions path vertex from a collection of pipelines G , we begin by generating a summary A B A→C →D E of all paths contained in the pipelines. Because completions are de- A→C →D F rived for a specific vertex v in a partial pipeline (we call this vertex C B→C →D E the completion anchor), we extract all possible paths that end or begin B→C →D F with v and associate them with the vertices that are directly connected D C→D E downstream or upstream of v. Note that this leads to many fewer en- C→D F tries than the alternative of extracting all possible subgraph pairs. And E F D E as we discuss in Section 6, paths are effective and lead to good predic- D F tions. Fig. 3: Deriving a path summary for the vertex D.

4.Fig. 4: Predictions are iteratively refined. At each step, a prediction can be extended upstream and downstream; in the second step, the algorithm only suggests a downstream addition. Also, predictions in either direction may include branches in the pipeline, as shown in the center. G ENERATE -P REDICTIONS(P) For computational stability, our implementation uses log-confidences predictions ← F IRST-P REDICTION(P) so the products are actually sums. result ← [ ] Because we wish to derive predictions that are not just paths, our re- while | predictions | > 0 finement step begins by identifying the vertex in the current prediction do prediction ← R EMOVE - FIRST(predictions) that we wish to extend our prediction from. Recall that we computed new-predictions ← R EFINE(prediction) the average in- and out-degree for each vertex type in the mining step. if | new-predictions | = 0 Then, for each vertex, we can compute the difference between the av- then result ← result + prediction erage degree for its type and its current degree for the current predic- else predictions ← predictions + new-predictions tion direction. We choose to extend completions at vertices where the current degree is much smaller than the average degree. We also in- At each step, we refine existing predictions by generating new predic- corporate this measure into our vertex confidence so that predictions tions that add a new vertex based on the path summary information. that contain vertices with too many edges are ranked lower: Note that because there can be more than one possible new vertex, we may add more than one new prediction for each existing prediction. cd (v) = c(v) + degree-difference(v) Figure 4 illustrates two steps in the prediction process. To initialize the list of predictions, we use the specified anchor mod- We stop iteratively refining our predictions after a given number ules (provided as input). At this point, each prediction is simply a base of steps or when no new predictions are generated. At this point, we prediction that describes the anchor modules and possibly how they sort all of the suggestions by confidence and return them. Note that connect to the pipeline. After initialization, we iteratively refine the if we have too many suggestions, we can choose to prune our set of list of predictions by adding to each suggestion. Because there are a predictions at each step by eliminating those which fall below a certain large number of predictions, we need some criteria to order them so threshold. that users can easily locate useful results. We introduce confidence to measure the goodness of the predictions. 3.3 Biasing the Predictions Given the set of upstream (or downstream depending on which di- The prediction mechanism described above relies primarily on the fre- rection we are currently predicting) paths, the confidence of a single quency of paths to rank the predictions. There are, however, other fac- vertex c(v) is the measure of how likely that vertex is, given the up- tors that can be used to influence the ranking. For example, if a user stream paths. To compute the confidence of a single vertex, we need has been working on volume rendering pipelines, completions that em- to take into account the information given by all upstream paths. For phasize modules related to that technique could be ranked higher than this reason, the values in Gpath are not normalized; we use the exact those dealing with other techniques. In addition, some users will pre- counts. Then, as illustrated by Figure 5, we combine the counts from fer certain completions over others because they more closely mirror each path. This means we do not need any weighting based on the their own work or their own pipeline structures. Again, it makes sense frequency of paths; the formula takes this into account automatically. to bias completions toward user preferences. We can adapt our algo- Specifically, rithm to include such bias by incorporating a weighting factor in the ∑P∈upstream(v) count(v | P) confidence computation. Specifically, we adjust our counts by weight- c(v) = ∑P∈upstream(v) count(P) ing the contribution of each path according to a pipeline importance factor determined by a user’s preferences. Then, the confidence of a graph G is the product of the confidences of each of its vertices: c(G) = ∏ c(v) 4 I MPLEMENTATION v∈G Our implementation is split into three specific steps: determining when While each vertex confidence is not entirely independent, this measure completion should be invoked, computing the set of possible comple- gives a reasonable approximation for the total confidence of the graph. tions, and presenting these suggestions to the user. Computing the Because we perform our predictions iteratively, we calculate the con- possible completions requires the machinery developed in the previ- fidence of the new prediction pi+1 as the product of the confidence of ous section. The other steps are essential to make the approach usable. the old prediction pi and the confidence of the new vertex v: The interface, in particular, plays a significant role in allowing users to make use of suggestions while also being able to quickly dismiss them c(pi+1 ) = c(pi ) · c(v) when they are not desired.

5.Fig. 5: At each iteration, we examine all upstream paths to suggest a new downstream vertex. We select the vertex that has the largest frequency given all upstream paths. In this example, “vtkDataSetMapper” would be the selected addition. 4.1 Triggering a Completion 4.3 The Suggestion Interface We want to provide an environment where suggestions are offered au- In concert with our goal of unobtrusiveness, we provide an intuitive tomatically but do not interfere with a user’s normal work patterns. and efficient interface that enables users to explore the space of pos- There are two circumstances in pipeline creation where it makes sense sible completions. Auto-complete interfaces for text generally show to automatically trigger a completion: when a user adds a new module a set of possible completions in a one-dimensional list that is refined and when a user adds a new connection. In each of these cases, we are as the user types. For pipelines this task is more difficult because it is given new information about the pipeline structure that can be used to not feasible to show multiple completions at once, as this would result narrow down possible completions. Because users may also wish to in visual clutter. The complexity of deriving the completion is also invoke completion without modifying the pipeline, we also provide an greater. For this reason, our interface is two-dimensional: the first di- explicit command to start the completion process. mension is a list of maximal completions while the second dimension In each of the triggering situations, we begin the suggestion process provides the ability to increase or decrease the extent of the comple- by identifying the modules that serve as anchors for the completions. tion. For new connections, we use both of the newly connected modules, Current text completion interfaces defer to the user by showing and for a user-requested completion, we use the selected module(s). completions but allowing the user to continue to type if he does not However, when a user adds a new module, it is not connected to the wish to use the completions. We strive for similar behavior by auto- rest of the existing pipeline. Thus, it can be difficult to offer mean- matically showing a completion along with a simple navigation panel ingful suggestions since we have no surrounding structure to leverage. when a completion is triggered. The user can choose to interact with We address this issue by first finding the most probable connection to the completion interface or disregard it completely by continuing to the existing pipeline, and then continue with the completion process. work, which will cause the completion interface to automatically dis- Finding the initial connection for an added module may be difficult appear. The navigation interface contains a set of arrows for selecting when there are multiple modules in the existing pipeline than can be different completions (left and right) and depths of the current com- connected to the new module. However, because visual programming pletion (up and down). In addition, the rank of the current completion interfaces allow users to drag and place new modules in the pipeline, is displayed to assist in the navigation and accept and cancel buttons we can use the initial position of the module to help infer a likely con- are provided (see Figure 1(c)). All of these completion actions, along nection. To accomplish this, we compute the user’s layout direction with the ability to start a new completion with a selected module, are based on the existing pipeline, and locate the module that is nearest to also available in a menu and as shortcut keys. the new module and can be connected to it. The suggested completions appear in the interface as semi- transparent modules and connections, so that they are easy to distin- guish from the existing pipeline components. The suggested modules 4.2 Computing the Suggestions are also arranged in an intuitive way using a set of simple heuristics As outlined in the previous section, we compute possible completions that respect the layout of the current pipeline. The first new suggested that emanate from a set of anchor modules in the existing pipeline module is always placed near the anchor module. The offset of the using path summaries derived from a database of pipelines, and rank new module from the anchor module is determined by averaging the them by their confidence values. Depending on the anchor modules, direction and distance of each module in the existing pipeline. The a very large set of completions can be derived and a user is unlikely offset for each additional suggested module is calculated by applying to examine a long list of suggestions. Therefore, we prune our pre- this same rule to the module it is appended to. Branches in the sug- dictions to avoid rare cases. This both speeds up computation and gested completion are simply offset by a constant factor. These heuris- reduces the likelihood that we provide meaningless suggestions to the tics keep the spacing uniform and can handle upstream or downstream user. Specifically, because our predictions are refined iteratively, we completions whether pipelines are built top-down or left-right. prune a prediction if its confidence is significantly lower than its par- ent’s confidence. Currently, this is implemented as a constant thresh- 5 U SE C ASES old, but we can use knowledge of the current distribution or iteration We envision VisComplete being used in different ways to simplify the to improve our pruning. task of pipeline construction. In what follows, we discuss use cases VisComplete provides the user with suggestions that assist in the which consider different types of tasks and different user experience creation of the pipeline structure. Parameters are also essential com- levels. The types of tasks performed by a user can range from the ponents in visualizations, but because the choice of parameters is fre- very repetitive to the unique. Obviously, if the user performs tasks that quently data-dependent, we do not integrate parameter selection with are very similar to those in the database of pipelines, the completions our technique. Instead, we focus on helping users complete pipelines, that are suggested are very full—almost the entire pipeline can be cre- and direct them to existing techniques [17, 16, 2] to explore the pa- ated using one or two modules (see Figure 2 for examples). On the rameter space. Note that it might be beneficial to extend VisComplete other hand, if the task that is being performed is not often repeated to identify commonly used parameters that a user might consider ex- and nothing similar in the database can be found, VisComplete will ploring, but we leave this for future work. only be able to assist with smaller portions of the pipeline at a time.

6.Fig. 6: One of the test visualization pipelines applied to a time step of the Tokamak Reactor dataset. VisComplete could have made many completions that would have reduced the amount of time creating the pipeline. In this case about half of the modules and completions could have been completed automatically. This can still be of great help for the user because it will show them nance of the pipeline design process: the series of the actions a user possible directions to proceed in their pipeline construction, albeit at a followed to create and refine a set of related pipelines [7]. smaller scale. The first four tasks were straightforward and required little exper- The experience level of users that could take advantage of VisCom- imentation, but the final task was open-ended; users were given a plete also varies. For a novice user, VisComplete replaces the process dataset without any restrictions on the use of available visualization of searching for and tweaking an example that will perform their de- techniques. As these users learned about various techniques over the sired visualization. For example, a user who is new to VTK and desires semester, their proficiency in the area of visualization presumably pro- to compute an isosurface of a volume might consult documentation gressed from a novice level toward the expert level. to determine that a “vtkContourFilter” module is necessary and then To predict the performance gains VisComplete might attain, we search online for an example pipeline using this module. After down- created user models based on the provenance logs captured by Vis- loading the example, they may be able to manipulate it to produce the Trails. User modeling has been used in the HCI community for many desired visualization. Using VisComplete, this process is simplified— years [4, 5], and we employed a low-level model for our evaluation. the user needs only to start the pipeline by adding a “vtkContourFilter” Specifically, we assumed that at each step of the pipeline construction module and their pipeline will be constructed for them (see Figure 1). process, a VisComplete user would either modify the pipeline accord- Multiple possible completions can easily be explored and unlike ex- ing to the current action from the log or select a completion that adds amples downloaded from the Web, VisComplete can customize the a part of the pipeline they would eventually need. We assumed that a suggestions by providing completions that more closely reflect a spe- user would examine at most ten completions, and because our inter- cific user’s previous or more current work. face is two-dimensional, the user could also select a subgraph of any For experienced users, VisComplete still offers substantial benefits. completion. Because experts may not wish to see full pipelines as completions, Because VisComplete requires a collection of pipelines to derive the default depth of the completions can be adjusted as a preference suggestions, we divided our dataset into training and test sets. The so that only minor modifications are suggested at each step. Thus, training sets were used to construct the path summaries while the test at the smallest completion scale, a user can leverage just the initial sets were used with the user models to measure performance. connection completion to automatically connect new modules to their We note that this model presumes a user’s foreknowledge of the pipeline. The user could also choose to ignore suggested completions completed pipeline, and this certainly is not always the case. Still, as they add modules until the pipeline is specific enough to shrink we believe this simple model approximates user behavior well enough the number of suggestions. Unlike the novice user who may iterate to gauge performance. We also assumed a greedy approach in our through many suggestions at each step, the experienced user will likely model; a user would always take the largest completion that matched choose to ignore the suggestions until they provide the desired com- their final pipeline. Note that this might not always yield the best pletion on the first try. performance because the quality of the suggestions may improve as the pipeline is further specified. 6 E VALUATION 6.2 Results 6.1 Data and Validation Process Figure 6 shows one of the test pipelines with the components that Vis- To evaluate the effectiveness of our completion technique, we used a Complete could have completed highlighted along with its resulting set containing 2875 visualization pipelines along with logs of the ac- visualization. To evaluate the situation where a set of users create tions used to construct each pipeline. These pipelines were constructed pipelines that all tend to follow a similar template, we performed a by 30 students during a scientific visualization course.1 Throughout leave-one-out test for each task in our dataset. Figure 7 shows that our the semester, the students were assigned five different tasks and carried suggestion algorithm could have eliminated over 50%, on average, of them out using the VisTrails system, which captures detailed prove- the pipeline construction operations for each task. Because Task 1 was more structured than the other tasks, it achieved a higher percent- 1 age of reduction. Because Task 4 was more open-ended, although the

7.Fig. 7: Box plot of the percentages of operations that could be com- Fig. 9: Box plot of the average prediction index that was used for the pleted per task (higher is better). The statistics were generated for each completions in Figure 7 (lower is better). These statistics provide a user by taking them out of the training data. measure of how many suggestions the user would have to examine before the correct one was found. model. Finally, the parameters (e.g., pruning threshold, degree weight- ing) for the completion algorithms were not tuned. We plan to evaluate these settings to possibly improve our results. The completion examples shown in the figures of this paper, with the exception of Figure 6, used the entire collection of pipelines to generate predictions. Figure 6 used only the pipelines from Tasks 1–4. 7 D ISCUSSION Fig. 8: Box plot of the percentages of operations that could be com- To our knowledge, VisComplete is the first approach for automatically pleted given two types of tasks, novice and expert. The statistics were suggesting pipeline completions using a database of existing pipelines. generated by evaluating the novice tasks using the expert tasks as train- As large volumes of data continue to be generated and stored and as ing data (novice) and by evaluating the expert tasks using the novice analyses and visualizations grow in complexity, the creation of new tasks as training data (expert). content by consensus and the ability to learn by example are essential to enable a broader use of data analysis and visualization tools. The major difference between our automatic pipeline completion average percentage is also high, the results show a wider variation (be- technique and the related work on creating pipelines by analogy [30] tween 30% and 75%). This indicates that the completion interface can is that instead of using a single, known sequence of pipeline actions, be faster and more intuitive than manually choosing a template. our method uses an entire database of pipelines. Thus, instead of com- Because it is much more likely that our collection will contain pleting a pipeline based on a single example, VisComplete uses many pipelines from a variety of tasks, we also evaluated two cases that ex- examples. A second important difference is that instead of predict- amined the type of knowledge captured by the pipelines. Since Task 5 ing a new set of actions, our method currently predicts new structure was more open-ended and completed after the other four tasks, we ex- regardless of the ordering of the additions. This also means that Vis- pected that most users would be proficient using the tool and closer Complete only adds to the structure while analogies will delete from to the expert user described in Section 5. We ran the completion re- the structure as well. By incorporating more provenance information, sults using Tasks 1 through 4 as the training data (2250 pipelines) and as in analogies, VisComplete might be able to leverage more informa- Task 5 (625 pipelines) as the test data to represent a case where novice tion about the order in which additions to a pipeline are made. This users are helping expert users, but we also ran this test in reverse to could improve the quality of the suggested completions. determine if pipelines from expert users can aid beginners. Figure 8 We note that there will be situations where data about the types shows that both tests achieved similar results; this implies that the va- of completions that should occur are not available. Also, some sug- riety of pipelines from the four novice tasks balanced the knowledge gestions might not correspond to the user’s desires. If there are no captured in the expert pipelines. completions, VisComplete will not derive any suggestions. If there Our testing assumed that users would examine up to ten full com- are completions that do not help, the user can dismiss them by either pletions before quitting. In reality, it is likely that users would give up continuing their normal work or by explicitly canceling completion. even quicker. To evaluate how many predictions a user might need to Currently we determine the completions in an offline step (by pre- examine before finding the desired completion, we recorded the index computing the path summary, Section 3). We could update the path of the chosen completion in our tests. Figure 9 shows that the the cho- summary as new pipelines are added to the repository, incorporating sen completion was almost always among the first four. Note that we new pipelines as they are created. In addition, we could learn from excluded completions that only specified the connection between the user feedback by, for example, allowing users to remove suggestions new module and the existing pipeline because these trivial completions that they do not want to see again. The completions could be further are possible at each prediction index. refined to give higher weights to completions that more closely mirror Our results show that VisComplete can significantly reduce the the current user’s actions, even if they are not the most likely in the number of operations required during pipeline construction. In ad- database. dition, the completion percentages might be higher if our technique One important aspect of our technique is that it leverages the visual were available to the users because it would likely change user’s work programming environment available in many visualization systems. In patterns. For example, a user might select a completion that contains fact, it would be difficult to offer suggestions without a visual environ- most of the structure they require plus some extraneous components ment in which to display the structural changes. In addition, the in- and then delete or replace the extra pieces. Such a completion would formation for the completions comes from the fact that we have struc- almost certainly save the user time but was not captured with our user tural pipelines from previous work. Without an interface to construct

8.pipeline structures, it would be more difficult to process the data used [11] J. Heer, F. B. Vi´egas, and M. Wattenberg. Voyagers and voyeurs: support- to generate completions. However, we should note that turnkey appli- ing asynchronous collaborative information visualization. In Proceedings cations that are based on workflow systems, such as ParaView [18], of the SIGCHI conference on Human factors in computing systems (CHI), may also be able to take advantage of completions in a more limited pages 1029–1038, 2007. way by providing a more intelligent set of default settings for the user [12] H. Hirsh, C. Basu, and B. D. Davison. Learning to personalize. Commu- during their explorations. nications of ACM, 43(8):102–106, 2000. [13] IBM. OpenDX. 8 C ONCLUSIONS AND F UTURE W ORK [14] T. Igarashi and J. F. Hughes. A suggestive interface for 3d drawing. We have described VisComplete, a new method for aiding in the de- In ACM symposium on User interface software and technology (UIST), pages 173–181, 2001. sign of visualization pipelines that leverages a database of existing [15] T. J. Jankun-Kelly, O. Kreylos, K.-L. Ma, B. Hamann, K. I. Joy, J. M. pipelines. We have demonstrated that suitable pipeline fragments can Shalf, and E. W. Bethel. Deploying web-based visual exploration tools be computed from the database and used to complete new pipelines in on the grid. IEEE Computer Graphics and Applications, 23(2):40–50, real-time. Furthermore, we have shown how these completions can be 2003. presented to the user in an intuitive way that can potentially reduce the [16] T. J. Jankun-Kelly and K.-L. Ma. Visualization exploration and encapsu- time required to create pipelines. Our results indicate that substantial lation via a spreadsheet-like interface. IEEE Transactions on Visualiza- effort can be saved using this method for both novice and expert users. tion and Computer Graphics, 7(3):275–287, July/September 2001. There are several areas of future work that we would like to pursue. [17] T. J. Jankun-Kelly, K.-L. Ma, and M. Gertz. A model and framework As described above, we would like to update the database of pipelines for visualization exploration. IEEE Transactions on Visualization and incrementally, thus allowing the completions to be refined based on Computer Graphics, 13(2):357–369, March/April 2007. current information and feedback from the user. We plan to refine the [18] Kitware. Paraview. quality of the results by formally investigating the confidence measure [19] Kitware. VTK. and its parameters. We would also like to explore suggesting finished [20] B. Korvemaker and R. Greiner. Predicting unix command lines: Adjust- pipelines from the database in addition to the constructed completions ing to user patterns. In Proceedings of the Seventeenth National Con- we currently generate. For finished pipelines, we could display not ference on Artificial Intelligence and Twelfth Conference on Innovative only the completed pipeline structure but also a thumbnail of the result Applications of Artificial Intelligence, pages 230–235, 2000. from an execution of that pipeline. Finally, we plan to conduct a user [21] J.-F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Cri- study to further evaluate our technique. minisi. Photo clip art. ACM Transactions on Graphics (Proceedings of SIGGRAPH), 26(3), 2007. ACKNOWLEDGMENTS [22] J. Marks, B. Andalman, P. A. Beardsley, W. Freeman, S. Gibson, J. Hod- gins, T. Kang, B. Mirtich, H. Pfister, W. Ruml, W. Ryall, J. Seims, and We acknowledge the generous help of many colleagues and collab- S. Shieber. Design galleries: A general approach to setting parameters for orators. Specifically, we wish to thank the VisTrails team for their computer graphics and animation. In ACM SIGGRAPH, pages 389–400, support, the VTK developers for their software, and the students of 1997. CS 5630/6630 at the University of Utah for their data. This work [23] Mercury Computer Systems. Amira. is partially supported by the NSF (under grants CNS-0751152, IIS- [24] MeVis Research. MeVisLab. 0746500, IIS-0513692, CCF-0401498, EIA-0323604, CNS-0514485, [25] Microsoft. Intellisense. IIS-0534628, CNS-0528201, OISE-0405402), the DOE, an IBM Fac- us/library/hcw1s69b(VS.80).aspx. ulty Award, and a Graduate Research Fellowship. [26] T. Munzner, C. Johnson, R. Moorhead, H. Pfister, P. Rheingans, and T. S. Yoo. NIH-NSF visualization research challenges report summary. IEEE R EFERENCES Computer Graphics and Applications, 26(2):20–24, 2006. [1] R. Agrawal, T. Imieli´nski, and A. Swami. Mining association rules be- [27] NAG. Iris Explorer. iec.asp. tween sets of items in large databases. SIGMOD Rec., 22(2):207–216, [28] A. Nanopoulos, D. Katsaros, and Y. Manolopoulos. A data mining algo- 1993. rithm for generalized web prefetching. IEEE Transactions on Knowledge [2] L. Bavoil, S. Callahan, P. Crossno, J. Freire, C. Scheidegger, C. Silva, and Data Engineering, 15(5):1155–1169, Sept.-Oct. 2003. and H. Vo. Vistrails: Enabling interactive, multiple-view visualizations. [29] S. G. Parker and C. R. Johnson. SCIRun: a scientific programming envi- In IEEE Visualization, pages 135–142, 2005. ronment for computational steering. In Supercomputing, 1995. [3] W. Bethel, C. Siegerist, J. Shalf, P. Shetty, T. J. Jankun-Kelly, O. Krey- [30] C. E. Scheidegger, H. T. Vo, D. Koop, J. Freire, and C. T. Silva. Querying los, and K.-L. Ma. VisPortal: Deploying grid-enabled visualization tools and creating visualizations by analogy. IEEE Transactions on Visualiza- through a web-portal interface. In Third Annual Workshop on Advanced tion and Computer Graphics (Proceedings of Visualization), 13(6):1560– Collaborative Environments, 2003. 1567, 2007. [4] S. K. Card, T. P. Moran, and A. Newell. The keystroke-level model [31] H. Senay and E. Ignatius. A knowledge-based system for visualization for user performance time with interactive systems. Commun. ACM, design. IEEE Comp. Graph. and Appl., 14(6), 1994. 23(7):396–410, 1980. [32] Swivel. [5] S. K. Card, A. Newell, and T. P. Moran. The Psychology of Human- [33] S. Takahashi, I. Fijishiro, Y. Takeshima, and T. Nishita. A feature-driven Computer Interaction. Lawrence Erlbaum Associates, Inc., Mahwah, NJ, approach to locating optimal viewpoints for volume visualization. In USA, 1983. IEEE Visualization, pages 495–502, 2005. [6] H. Childs, E. S. Brugger, K. S. Bonnell, J. S. Meredith, M. Miller, B. J. [34] S. Tsang, R. Balakrishnan, K. Singh, and A. Ranjan. A suggestive in- Whitlock, and N. Max. A contract-based system for large data visualiza- terface for image guided 3d sketching. In Proceedings of the SIGCHI tion. In IEEE Visualization, pages 190–198, 2005. conference on Human factors in computing systems (CHI), pages 591– [7] J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and 598, 2004. H. T. Vo. Managing rapidly-evolving scientific workflows. In Interna- [35] University of Utah. VisTrails. tional Provenance and Annotation Workshop (IPAW), LNCS 4145, pages [36] C. Upson et al. The application visualization system: A computational 10–18, 2006. Invited paper. environment for scientific visualization. IEEE Computer Graphics and [8] X. Fu, J. Budzik, and K. J. Hammond. Mining navigation history for Applications, 9(4):30–42, 1989. recommendation. In IUI ’00: Proceedings of the 5th international con- [37] F. B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, and M. McKeon. ference on Intelligent user interfaces, pages 106–112, 2000. ManyEyes: a site for visualization at internet scale. IEEE Transac- [9] O. Gilson, N. Silva, P. Grant, M. Chen, and J. Rocha. VizThis: Rule- tions on Visualization and Computer Graphics (Proceedings of InfoVis), based semantically assisted information visualization. Poster, in Proceed- 13(6):1121–1128, 2007. ings of SWUI 2006, 2006. [38] U. D. Vordoloi and H.-W. Shen. View selection for volume rendering. In [10] J. Hays and A. A. Efros. Scene completion using millions of pho- IEEE Visualization, pages 487–494, 2005. tographs. ACM Transactions on Graphics (Proceedings of SIGGRAPH), 26(3), 2007.