imMens: Real-time Visual Querying of Big Data

We first describe a design space of scalable visual summaries that use data reduction methods (such as binned aggregation or sampling) to visualize a variety of data types. We then contribute methods for interactive querying (e.g., brushing & linking) among binned plots through a combination of multivariate data tiles and parallel query processing. We implement our techniques in imMens, a browser-based visual analysis system that uses WebGL for data processing and rendering on the GPU. In benchmarks imMens sustains 50 frames-per-second brushing & linking among dozens of visualizations, with invariant performance on data sizes ranging from thousands to billions of records.

1.Eurographics Conference on Visualization (EuroVis) 2013 Volume 32 (2013), Number 3 B. Preim, P. Rheingans, and H. Theisel (Guest Editors) imMens: Real-time Visual Querying of Big Data Zhicheng Liu , Biye Jiang‡ and Jeffrey Heer Department of Computer Science, Stanford University ‡ Department of Computer Science and Technology, Tsinghua University Abstract Data analysts must make sense of increasingly large data sets, sometimes with billions or more records. We present methods for interactive visualization of big data, following the principle that perceptual and interactive scalability should be limited by the chosen resolution of the visualized data, not the number of records. We first describe a design space of scalable visual summaries that use data reduction methods (such as binned aggregation or sampling) to visualize a variety of data types. We then contribute methods for interactive querying (e.g., brushing & linking) among binned plots through a combination of multivariate data tiles and parallel query processing. We implement our techniques in imMens, a browser-based visual analysis system that uses WebGL for data processing and rendering on the GPU. In benchmarks imMens sustains 50 frames-per-second brushing & linking among dozens of visualizations, with invariant performance on data sizes ranging from thousands to billions of records. Categories and Subject Descriptors (according to ACM CCS): H.5.2 [Information Interfaces]: User Interfaces—; 1. Introduction alized data, not the number of records. We realize our tech- niques in imMens† , a browser-based system for real-time in- Traditional data visualization tools are often inadequate to teraction with scalable visual summaries of big data. handle big data. While it is debatable what is meant by “big”, visualization researchers have regularly used one million or To support perceptual scalability, we first review applica- more data cases as a threshold [FP02, UTH06]. More gener- ble data reduction methods, including filtering [AS94], sam- ally, many data sets are too large to fit in memory and may pling [DSLG∗ 12,MTS91] and aggregation [CLNL87,JS98]. be distributed across a cluster; modern data warehouses of- We select binned aggregation as our primary data reduction ten include tables with billions or more records. Most visual strategy and describe a design space of binned plots for nu- analysis tools are not designed to work at this scale, let alone meric, ordinal, temporal and geographic variables. support real-time interaction [KPHH12]. We then address interactive scalability for panning, zoom- Research on big data visualization must address two ma- ing and brushing & linking in binned plots. As the number jor challenges: perceptual and interactive scalability. Given of visualized dimensions increases, the size of the support- the resolution of conventional displays (~1-3 million pixels), ing data can explode combinatorially (c.f., [KPP∗ 12]). In re- visualizing every data point can lead to over-plotting and sponse, we develop methods for real-time visual querying: may overwhelm users’ perceptual and cognitive capacities. Precompute Multivariate Data Tiles. Precomputed im- On the other hand, reducing the data through sampling or age tiles, as in Google Maps and Hotmap [Fis07], are a com- filtering can elide interesting structures or outliers. Big data mon solution for scalable panning and zooming. Rather than also impose challenges for interactive exploration. Query- generate image tiles intended for direct display, we instead ing large data stores can incur high latency, disrupting fluent precompute multivariate data tiles: projections correspond- interaction. Even with data reduction methods like binned ing to materialized database views [GM99]. By decompos- aggregation, high dimensionality or fine-grained bins can re- ing a data cube into a set of 3- and 4-dimensional projec- sult in data cubes too large to process in real-time. In this paper we present techniques to address perceptual and interactive scalability, following the principle that scala- † The name imMens stems from our desire to visualize immense bility should be limited by the chosen resolution of the visu- data in a manner that our minds (Latin: mens) can apprehend. © 2013 The Author(s) Computer Graphics Forum © 2013 The Eurographics Association and Blackwell Publish- ing Ltd. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

2. Zhicheng Liu, Biye Jiang & Jeffrey Heer / imMens: Real-time Visual Querying of Big Data tions, imMens flexibly manages data subsets as needed. Us- 2.2. Scalable Visualization Systems ing multidimensional projections, imMens can compute ag- In addition to work on visualization design, both researchers gregations across dimensions to support brushing & linking. and companies have developed large scale visualization sys- Parallelize Data Processing and Rendering. Depend- tems. Commercial products such as Tableau [Tab] and Spot- ing on binning resolution, data tiles may still be quite large, fire [Spo] translate user interactions into database queries, with millions or more values. To speed aggregation, imMens and can push processing of big data to dedicated databases. uses a dense indexing scheme that simplifies parallel query The query results are typically aggregates, such that the vi- processing. To realize this approach in contemporary web sualizations are perceptually scalable. This approach has two browsers, imMens uses WebGL to leverage parallel process- potential issues: query latency can be high for large data sets, ing on the GPU. We present a two-pass approach that uses and there is no guarantee that the result size has a reason- WebGL shader programs to first compute aggregate queries able upper bound. To reduce long query times, some sys- and then render updated visualizations. tems prefetch based on the user’s current context [CXGH08, The contributions of our work are two-fold. First, we in- DRW03]. To improve performance, other researchers exploit troduce the use of multivariate data tiles for pre-processing modern hardware such as multi-core [RWC∗ 08, PTMB09, and dynamic loading of data to enable scalable interaction. HB10b] and GPU [FP02, ME09, ZBDS12] computing. Second, the imMens system implements a novel synthesis The system most similar to imMens is Kandel et al.’s Pro- of binned aggregation, data representation and parallel pro- filer [KPP∗ 12]. Profiler employs binned plots to enable in- cessing (here using GPU computation) to support interac- teraction with over a million data points. However, Profiler tive visualization of big data. In performance benchmarks uses a single in-memory data cube and sequential query pro- imMens sustains near 50 frames-per-second brushing and cessing to support brushing & linking. As we show in our linking among dozens of visualizations. This performance benchmarks, this approach does not scale to larger data sets. is invariant on data sets ranging from thousands to billions of records. To our knowledge, imMens is the first system to Our work extends and integrates these approaches. We as- enable real-time interactive brushing of data sets this large. sume a client-server architecture in which scalable databases can be used to precompute multivariate data tiles. imMens then requests and manages data tiles to address issues of data 2. Related Work: Visualizing Big Data size and transfer, and uses parallel processing on the GPU A number of prior research projects have focused on improv- for real-time querying and rendering on the client. ing the scalability of visualization systems. 2.1. Scalability of Visual Encodings 3. Data Reduction Methods In many visualizations, each data record maps to a visual Our approach to visualizing big data follows an overarch- item, resulting in occlusion and cluttering for high data ing principle: perceptual and interactive scalability should densities. In response, researchers have proposed a num- be limited by the chosen resolution of the visualized data, ber of approaches. Pixel-oriented visualization techniques not the number of records. We now survey data reduction plot data points as single pixels to maximize screen uti- methods that can be used to realize this principle, including lization [Kei00]. Spatial displacement techniques like jitter- filtering, sampling, aggregation and modeling. ing [TGC03] and topological distortion [KHD∗ 10] reduce occlusion but do not preserve spatial information. For paral- lel coordinates and scatterplot matrices, dimension reorder- 3.1. Filtering & Sampling ing can also reduce clutter [PWR04]. Alpha blending (trans- Filtering and sampling techniques select a subset of data, parency) is often used to encode density and thus combat to which standard visualization techniques can be applied. over-plotting [JS98,JLJC05,UTH06]. Alpha blending effec- The selected subset, however, may still be too large to visu- tively performs aggregation in image space, rather than data alize effectively and may omit elements of interest. In sim- space. Still, each of these techniques requires drawing every ple random sampling [Loh09], every data point has the same data record, which imposes inherent scalability limits. probability of being selected. The sample resulting may not An alternative is to reduce big data to smaller, derived be representative and can miss important structures or out- data more amenable to visualization [DBC∗ 13]. Data reduc- liers. Systematic sampling sorts data points in a particular tion strategies used in information visualization include fil- order and selects data points at regular intervals with a ran- tering [AS94], sampling [BS06, DSLG∗ 12, Raf05], binned dom start. Stratified sampling divides a data set into disjoint aggregation [BBSBA08, CLNL87, EF10, Fis07, HDS∗ 10, subgroups or “strata”, and applies simple or systematic sam- KMSH12, KPP∗ 12, RWC∗ 08], and model-fitting. We com- pling within each stratum. However, these methods require pare these approaches in more detail in the next section and that specific dimensions be chosen ahead of time, requiring describe a design space for binned plots. prior knowledge and often costly pre-processing. © 2013 The Author(s) © 2013 The Eurographics Association and Blackwell Publishing Ltd.

3. Zhicheng Liu, Biye Jiang & Jeffrey Heer / imMens: Real-time Visual Querying of Big Data (a) (b) Figure 1: A symbol map (a) and heatmap (b) visualizing a dataset of Brightkite user checkins. The symbol map visualizes a sample of the data, and the heatmap shows the density of checkins by aggregation. Compared to the heatmap, sampling misses important structures such as inter-state highway travel and Hurricane Ike, while dense regions still suffer from over-plotting. 3.2. Binned Aggregation over this reduced data. However, this approach still suffers from the same problems with sampling discussed above. On- Binning aggregates data and visualizes density by counting line aggregation [HHW97, FPDs12] shows continuously up- the number of data points falling within each predefined bin. dating aggregates and confidence intervals in response to a For a numeric variable, one can define bins as adjacent inter- stream of samples. Our approach is compatible with these vals over a continuous range. For categorical variables, one two methods: one could compute approximate data tiles us- can simply treat each value as a bin. Aggregation can also ing the BlinkDB approach, or update data tiles in a streaming be defined at multiple scales over a hierarchy [EF10], with fashion via online aggregation. Though not explored here, nested, potentially non-uniform, bins. For example, tempo- these methods may provide low-latency results when com- ral values can be aggregated by day, week, month, quarter, plete data tiles have not yet been precomputed. year, and so on. In terms of visualization, histograms and heatmaps are exemplary 1D and 2D binned plots. 4. Designing Binned Plots 3.3. Model-based Abstraction In imMens we focus on binned aggregation as our primary data reduction strategy. Here we present our rationale for us- Another reduction strategy is to describe data in terms of ing binned aggregation, and discuss the corresponding visu- mathematical models or statistical summaries. For example, alization design space for binned plots. one might fit a model and visualize the resulting parameters or theoretical density. For scatter plots one can use regres- sion models to fit trend lines; examples for time series data 4.1. Why Binned Aggregation? include moving averages and auto-regressive models. We use binned aggregation because it conveys both global patterns (e.g., densities) and local features (e.g., outliers), 3.4. Hybrid Reduction Methods while enabling multiple levels of resolution via the choice of bin size. Consider Figure 1, which visualizes a data set The above data reduction methods can also be combined. For of over four million user checkins on Brightkite, a location- example, a box plot with outliers applies both modeling and based user checkin service, from April 2008 to October 2010 filtering. In this vein, Novotný and Hauser [NH06] perform [CML11]. Figure 1(a) shows a symbol map of stratified sam- two dimensional binning for parallel coordinates and show ples generated by Google Fusion Tables [DSLG∗ 12]. Figure specific data outliers along with the bins. To visually sum- 1(b) shows a binned heatmap in imMens, color-coded by the marize large networks, both Bagrow et al. [BBSBA08] and density of checkins at different locations. One can see richer Kairam et al. [KMSH12] combine modeling and aggregation information in the heatmap, including patterns on inter-state through multi-scale histograms of network statistics. highways, events outside the US, and a long trail of checkins Database researchers have explored the combination of spanning the coast of Texas and Gulf of Mexico. These are sampling and aggregation. To provide fast approximate checkins made by Brightkite user account “Hurricane Ike” queries, BlinkDB [APM∗ 12] builds multi-dimensional and that report the location of the hurricane along its path in multi-resolution stratified samples and computes aggregates 2008. Sampling can fail to show such interesting outliers. © 2013 The Author(s) © 2013 The Eurographics Association and Blackwell Publishing Ltd.

4. Zhicheng Liu, Biye Jiang & Jeffrey Heer / imMens: Real-time Visual Querying of Big Data 4.2. Visualization Design for Binned Plots We consider binning schemes for four data types commonly found in databases: ordinal, numeric, temporal and geo- graphic. For ordinal and (sorted) categorical data, each dis- tinct value is a bin. We group numeric data into adjacent intervals over a continuous range. Temporal values can be binned at various levels of granularity: year, month, week, day, hour, etc. For geographic data, we treat 1D nominal units such as states or countries as unique bins. If locations (a) (b) are specified as latitude and longitude points, we bin their projected spatial coordinates. We primarily use the Mercator projection for consistency with existing map tile providers. Table 1 provides a summary of relevant visualization de- signs organized by data type and number of dimensions. One-dimensional plots for ordinal, numeric, temporal and geographical dimensions take the form of bar charts, his- tograms, line graphs and choropleth maps. Two-dimensional binned plots are heatmap variants; plots with heterogeneous (c) (d) data types (e.g., a temporal and an ordinal variable) are also possible. We focus on binned plots with up to two data di- Figure 2: Scatter plots with 100,000 data points: (a) tradi- mensions, encoded spatially. Color is used to encode data tional, (b) hexagonal bins, (c) rectangular bins and (d) rect- density and indicate highlights for brushing & linking. We angular bins with perceptual (cube root) color adjustment. refrain from additional visual variables such as texture or size, as they might interfere with visualization interpretation. significantly and their applicability to big data is unclear. Multidimensional displays can be constructed in the form of In imMens, we treat bin count as an adjustable parameter, multiple coordinated views and trellis plots [BCS96]. bounded by the screen pixels allocated to a plot and avail- able resources. At the limit, we can map one bin to one pixel Figure 2 shows binned scatter plots [CLNL87] of two and include as many bins as memory constraints allow. numeric dimensions. Unwin et al. describe three binning schemes that can tessellate a plot: triangular, rectangular and hexagonal [UTH06]. Carr et al. [CLNL87] argue for hexago- 4.3. Color Encoding nal bins due to reduced bias in density estimation compared to rectangular bins; however, Scott [Sco92] shows that the For color encoding, one can map density values to hues, lu- differences are marginal. In imMens, we choose rectangu- minance or opacity. We map a non-zero density value x to a lar binning for consistency and efficiency of query process- luminance (or opacity) value Y ∈ [0, 1] using the formula: ing. Applying consistent binning schemes over 1D (e.g., his- γ xˆ − xmin tograms) and 2D plots ensures compatibility when perform- Y = α+ (1 − α) (1) xmax − xmin ing linked selections between plots. Statisticians have proposed various heuristics to select bin Here xˆ denotes the value of x, bounded from above and be- sizes for a numeric range (e.g., Sturges’ formula [Stu26] and low by the range parameters xmax and xmin . These parameters Scott’s reference rule [Sco79]). These heuristics can vary can be determined from the data, or adjusted interactively to explore value ranges at finer resolutions. Numeric Ordinal Temporal Geographic Our color encoding employs two techniques to enhance perception. First, linear changes in Y values may not corre- spond to perceptually linear changes. The γ parameter can be used to introduce a non-linearity. By default we set γ = 13 , as 1D Histogram Bar Chart Line Graph / Choropleth Map the cube root approximates perceptual linearity, akin to the Area Chart lightness channel in the CIELAB color space [Sto03] (com- pare Figures 2(c) and 2(d)). Second, when visualizing big data the maximum density value in a binned plot may be or- ders of magnitude greater than the minimum non-zero den- sity value. A naïve color ramp can render such bins invisible. 2D Binned Heatmap Temporal Geographic To ensure the visibility of outliers, we set a minimum value Scatter Plot Heatmap Heatmap of α = 0.15 for non-zero densities, based on prior experi- Table 1: Example visualization designs for binned plots. mental results for luminance contrast [SB09, HB10a]. © 2013 The Author(s) © 2013 The Eurographics Association and Blackwell Publishing Ltd.

5. Zhicheng Liu, Biye Jiang & Jeffrey Heer / imMens: Real-time Visual Querying of Big Data Figure 3: Panning and zooming in a binned plot: initial view (left), zooming in (middle), panning to the lower-left (right). 5. Enabling Interaction in Binned Plots Interaction is essential to exploratory visual analysis [HS12], Figure 4: Multiple coordinated views of Brightkite user but big data imposes challenges to real-time response rates. checkins in North America. Cyan lines in the heatmap in- While each binned chart type in the previous section visual- dicate data tile boundaries. Each visualization region is an- izes one or two aggregated dimensions, more data resolution notated by its backing data dimensions and indices. is needed to support interaction. Panning and zooming may require finer grained bins, as in Figure 3. 5.2. From Data Cubes to Multivariate Data Tiles Brushing & linking, in which selections in one view high- A full data cube is often too big to fit in memory and query light the corresponding data in other views, requires com- in real-time. The size of a cube is ∏i bi , where bi is the bin puting aggregates filtered by an initial data selection. These count for dimension i. As the number of dimensions or bins queries require partially de-aggregated data over which to increases, the data cube size may become unwieldy. To ad- compute the filtered aggregation (or “roll-up”). Sending dress this issue, we decompose the full cube into sub-cubes these queries to a server incurs latency due to both process- with at most four dimensions. ing and networking delays, and can easily exceed a 100 mil- lisecond threshold for interactive response [CMN83]. Fur- The primary contributor to data cube size is the combina- thermore, multiple clients might overload the server. torial explosion of multiple dimensions. However, for any pair of 1D or 2D binned plots, the maximum number of In this section, we present our method for enabling real- dimensions needed to support brushing & linking is four time visual querying in imMens. We use brushing & linking (e.g., between two binned scatterplots that do not share a over the Brightkite data set as a running example. The raw dimension). As a result, we can safely decompose the full Brightkite data has five dimensions: User, Date, Time, Lat cube into a collection of smaller 3- or 4-dimensional projec- and Lon. Figure 4 shows four linked visualizations depict- tions. For example, four 3-dimensional cubes can cover all ing binned data from different perspectives. The geographi- the possible brushing and linking scenarios shown in Figure cal heatmap (X, Y) is based on Mercator-projected Lon, Lat 4: (X, Y, Hour), (X, Y, Day), (X, Y, Month), (Hour, Day, Month). coordinates; the three histograms show monthly (Month), If we assume a uniform bin count b, this decomposition re- daily (Day) and hourly (Hour) checkin distributions derived duces the total data record count from b5 to 4b3 ; when b=50, from the Date and Time fields. The Jan bin is selected in the the reduction is from 312.5M to 0.5M records. Month histogram. In response, corresponding data are high- lighted in orange in the other histograms, and the geographic After decomposition, individual sub-cubes may still be heatmap shows only checkins in the month of January. prohibitively large if the bin count is high. In some plots, we can treat the bin count as a free parameter, and adjust ac- cordingly. For others – particularly geographic heatmaps – 5.1. Data Cube Queries to Support Interaction we may wish to zoom in to see fine-grained details, requir- ing an exponentially increasing number of bins across zoom Applying binned aggregation to X, Y, Month, Day and Hour, levels. To handle large bin counts, we segment the bin ranges we form a 5-dimensional data cube (Figure 5(a)). The data to form multivariate data tiles, as illustrated in Figure 5(b). cube contains the lowest level of data resolution in the linked visualizations. To perform brushing & linking from Data tiles are inspired by the notion of map tiles used in the Month histogram to the Day histogram, we can filter the systems such as Google Maps and Hotmap [Fis07]. How- data cube to only the rows with bin value 0 in the Month di- ever, data tiles differ in two important ways. First, they pro- mension (corresponding to January; highlighted in yellow in vide data for dynamic visualization, not pre-rendered im- Figure 5(a)) and perform a roll-up by summing data along ages. Second, they contain multidimensional data to support the Hour, X and Y dimensions. To zoom out, we can aggre- querying as well as rendering. Given a set of data tiles and gate adjacent bins to compute a coarser-grained projection. a query selection (bin range), we can dynamically compute Panning at the most zoomed-in level involves querying the roll-up queries and render projected data. Figure 4 shows ge- bins visible in the current viewport. ographic tile boundaries highlighted in cyan. We label each © 2013 The Author(s) © 2013 The Eurographics Association and Blackwell Publishing Ltd.

6. Zhicheng Liu, Biye Jiang & Jeffrey Heer / imMens: Real-time Visual Querying of Big Data (a) (b) Figure 5: (a) A 5-dimensional data cube of Brightkite check-ins; (b) Decomposing a full cube into sub-cubes and data tiles. tile dimension as Dbs -be -z, where D is the binned data di- mension, bs represents the starting bin index, be represents the ending bin index, and z represents the zoom level. The Brightkite visualization in Figure 4 uses 13 data tiles: one tile representing the 3-dimensional projection of month, day and hour (i.e., Month0-11-0 × Day0-30-0 × Hour0-23-0), and twelve tiles containing 3-dimensional projections for all combinations of the four geographic segments and three histograms (e.g., X256-511-4 × Y512-767-4 × Month0-11-0). Multivariate data tiles are precomputed on a server and then loaded on demand to support client-side visualization. Brushing & linking involves aggregating these data tiles. Figure 6: Brushing & linking from the geographic heatmap For example, when the user selects a region in the geo- to the Day histogram. We aggregate four data tiles along the graphic heatmap, we need to highlight the corresponding X and Y dimensions and sum up the projections. checkin distributions in the three histograms. In the worst case, the selected geographic region covers bins in all four map tiles. To render the highlight in the Day histogram we need to roll-up the four data tiles containing the X × Y × Day dimensions and sum the results. Figure 6 shows this process. For interactions like panning and zooming, we dynamically fetch data tiles precomputed at different levels of binning resolution, similar to existing mapping services. 5.3. Dense vs. Sparse Data Tile Storage Figure 7: Sparse and dense representations of a data tile. Data tiles can use either sparse or dense packing schemes. A sparse representation stores indices and values only for However, as the number of data records increases, the den- non-zero bins (Figure 7(b)). A dense representation includes sity of the data typically also increases. Once the proportion zero values, but can store all the data as a simple array if the of non-zero bins exceeds a threshold (20% for 4D tiles, 25% bin counts for all dimensions are known (Figure 7(c)).‡ If a for 3D tiles), a dense representation is more efficient because data tile has many empty bins, a sparse representation can re- we can omit bin indices. In imMens we use dense tiles to ex- duce storage costs. For example, a sparse packing is used in ploit these space savings, safeguard worst-case performance, Profiler [KPP∗ 12] for full data cubes of up to 5 dimensions. and simplify parallel query processing. 5.4. Parallel Query Processing ‡ We treat row indices as numbers in a mixed-radix number sys- tem [Knu06]. The row index in a k-dimensional data tile can be ex- A dense representation scheme supports simple, efficient pressed as V (k − 1)R(k−1) |V (K − 2)R(K−2) |...|V (0)R(0) , where V (k) parallel processing when aggregating data tiles. Dense tiles is the value of the kth digit, and R(k) is the radix of the kth digit. provide a consistent indexing scheme that enables direct © 2013 The Author(s) © 2013 The Eurographics Association and Blackwell Publishing Ltd.

7. Zhicheng Liu, Biye Jiang & Jeffrey Heer / imMens: Real-time Visual Querying of Big Data lookup of any multidimensional bin using a predictable in- scheme in Equation 2 to pack it into the 8-bit RGBA chan- teger index. As a result, aggregation queries can be paral- nels of a pixel. If the value is smaller than 231 , we can pre- lelized easily. For each output bin (summed value), we can serve complete information with no precision loss. We set use a simple loop that accesses only the bins needed for that the highest-order bit in the alpha channel to 1 due to image summation. These computations can be run in parallel in a format constraints; we use the PNG file format for its loss- shared memory environment without any conflicts, resulting less compression and network portability. The PNG specifi- in faster query performance than a sequential scan. cation instructs that fully transparent pixels be assigned the same RGB values (zero) for better compression. Storing zero Input: c2 , c3 , R2 , R3 , T values in alpha channels thus results in data loss. Output: sum v at index i of d1 ’s projection v = 0;   A = 0x80 | ((0xFF000000 & v) >> 24) foreach j ∈ R2 do  R = (0x00FF0000 & v) >> 16  foreach k ∈ R3 do (2)   G = (0x0000FF00 & v) >> 8 v + = T [i ∗ c2 ∗ c3 + j ∗ c3 + k];  B = (0x000000FF & v) end The maximum supported texture size imposes a constraint end on the amount of data we can fit into a data tile. The maxi- Algorithm 1: Data tile roll-up for a projection index. mum texture size can vary by graphics card from 1, 0242 to 16, 3842 pixels. A texture of size 4, 0962 (a commonly sup- Consider a 3D data tile T with dimensions (d1 , d2 , d3 ), ported size) has 16 million pixels. With this size, the finest with respective bin counts (c1 , c2 , c3 ). If users brush a 2D possible resolution is 256 bins per dimension for a 3D data binned plot of d2 and d3 to select ranges R2 and R3 , we can tile and 64 bins per dimension for a 4D tile. Assuming 50 compute the summed value v at index i of the d1 projection bins per dimension, the average file size is 30KB for a 3D using Algorithm 1. With this simple roll-up procedure, we PNG tile and 150KB for a 4D tile. The 13 PNG-encoded can run the algorithm in parallel for all c1 indices. data tiles used in the Brightkite example require 352KB. We note that our summation scheme can be further opti- Applying more aggressive packing schemes by quantiz- mized by the use of summed area tables. If data tiles instead ing data tile values into fewer pixel channels is more space- store cumulative densities over index ranges, the summation efficient, but comes at a cost of precision loss. We have ex- for an output bin can be computed with a constant number perimented with packing each value into one of the RGBA of lookups. However, in practice we find that our simpler channels, quantizing the value to a single byte. Benchmarks scheme provides good performance, while the use of cumu- show per-bin root mean square errors ranging from 0.003 to lative densities exacerbates issues of arithmetic overflow. 0.18, where an error of 0.01 maps to a perceptual difference of one pixel in a 100-pixel-tall histogram. We leave more 6. Implementation Details efficient packing schemes that limit data loss to future work. We implement our visualization and interaction techniques in imMens, a browser-based system for interactive visual ex- 6.2. Parallel Query and Render via Shader Programs ploration of big data. Given a visualization definition, im- WebGL uses shader programs that run on the GPU: ver- Mens loads data tiles from a server and provides an interac- tex shaders transform 3D geometry, fragment shaders com- tive multiple view display of binned plots within standards- pute pixel colors. In imMens, we use a default vertex shader compliant web browsers. We use WebGL, a JavaScript vari- and implement query processing and rendering as fragment ant of the OpenGL ES 2.0 specification, to perform GPU- shaders. We use a two-stage process, illustrated in Figure 8. based computation and render visualizations to HTML5 can- vas elements. We also use Leaflet [Lea] to display map im- The query fragment shader reads from data tiles bound age tiles and D3 [BOH11] to render axes and labels. We now as textures and performs a roll-up (§5.4), providing the data describe our scheme for storing data tiles as image textures to visualize. The shader program computes the sum for a and present our querying and rendering pipeline. single output bin, and writes the result to an offscreen frame buffer object (FBO). This program is run in parallel for all bins in the resulting plot. With multiple plots, we need to 6.1. Storing Data Tiles as Image Files compute roll-ups for each. To do so, we dynamically change data tiles during the execution of the shader, and store all the By packing data tiles in an image format, we can directly aggregates in the same FBO. For queries spanning multiple bind them to the WebGL context as textures and take ad- data tiles (as in Figure 6), we bind the required data tiles and vantage of existing browser caching facilities. Packing data perform multiple roll-ups in a loop. tiles into images facilitates efficient storage and transfer of tiles and makes the data accessible for parallel processing The render fragment shader binds the FBO as a texture on the GPU. For each data tile integer value v, we apply the and renders the plots. For each pixel in a plot, the shader de- © 2013 The Author(s) © 2013 The Eurographics Association and Blackwell Publishing Ltd.

8. Zhicheng Liu, Biye Jiang & Jeffrey Heer / imMens: Real-time Visual Querying of Big Data Figure 8: A two-pass approach for parallel data querying and rendering using WebGL fragment shaders. termines which bin that pixel belongs to and assigns it an ap- propriate color. Consider the case of rendering a histogram. The query fragment shader first computes the necessary data and writes it as a 1D array in the FBO. To draw a histogram with an on-screen bin width of c pixels, the render fragment shader computes the corresponding bin index i for each pixel coordinate (x, y). If the histogram starts at x-coordinate 0, then i = x/c. By reading the data value from the FBO, the shader determines the height h of the bar at bin index i. Fi- nally the shader makes a binary choice to assign the pixel a color: if y > h, the color is left as background; otherwise the pixel is colored as part of the histogram bar. Our two-pass approach carries two advantages. First, us- Figure 9: Average time (ms) for imMens (blue) and Profiler ing the FBO we can eliminate redundant computation for (orange) to update frames during brushing & linking. pixels corresponding to the same data bin. Second, by ex- amining the FBO we can identify the maximum bin value. The FBO enables us to render bar heights and cell color val- Data Set Brightkite Flight Delays SPLOM ues accurately by normalizing with this maximum value. Size 4M 118M 1B Bins Month (12) Carrier (28) Dim. A (50) Day (31) Year (20) Dim. B (50) 7. Performance Benchmarks Hour (24) Day of Week (7) Dim. C (50) X (256) Dep. Delay (174) Dim. D (50) To evaluate the scalability of imMens, we measure the sys- Y (256) Arr. Delay (174) Dim. E (50) tem frame rate during brushing & linking. For a given visu- Data Tiles 13 4 10 alization configuration (consisting of both 1D and 2D plots), Time 17.76 ms 16.56 ms 20.47 ms we programmatically brush the bins in each 1D histogram. Brushing only the 1D histograms and not the 2D plots pro- Table 2: Benchmark results for three data sets. vides a conservative estimate of performance, as brushing 2D bins results in more selective queries and hence faster count (10, 20, 30, 40, 50), the number of dimensions (4, processing. For each brushed bin, we record the time taken 5), and the number of input records (10K, 100K, 1M, 10M, to both compute a roll-up query and re-render the display. 100M, 1B). Each generated data set contains five dimensions We average these update times over multiple runs. consisting of random samples drawn from normal distribu- We first conducted benchmarks using two real-world data tions. Three of these distributions are independent; the other sets: 4 million user checkins on Brightkite [CML11] and 118 two distributions are linearly derived from the first. We vi- million records regarding on-time performance of U.S. do- sualize the data as a scatter plot matrix (SPLOM) with uni- mestic flights [BTS]. We visualize the Brightkite data set variate histograms along the diagonal, and programmatically with a geographic heatmap and three temporal histograms brush the bins in each of these histograms. as shown in Figure 4. For the flight delay data set, we use a We ran the benchmarks in Google Chrome v.23.0.1271.95 similar visualization configuration, including a binned scat- on a quad-core 2.3 GHz MacBook Pro (OS X 10.8.2) ter plot of departure delay versus arrival delay and three his- with per-core 256K L2 caches, shared 6MB L3 cache and tograms showing the distributions of delayed flights by air- 8GB RAM. The test machine has a PCI Express NVIDIA line, year and day of week. GeForce GT 650M graphics card with 1024MB video RAM. To compare imMens to the existing state-of-the-art, we Table 2 summarizes the benchmark results, and Figure 9 replicate the benchmarks used for Profiler [KPP∗ 12] and plots the detailed results for the comparative benchmark compare the two systems. We generate a total of 60 test across all 60 test data sets. Due to memory limits, Profiler data sets by varying three parameters: the per-dimension bin can not visualize data with 10M or more records. imMens © 2013 The Author(s) © 2013 The Eurographics Association and Blackwell Publishing Ltd.

9. Zhicheng Liu, Biye Jiang & Jeffrey Heer / imMens: Real-time Visual Querying of Big Data sustains an average 50 fps across all conditions. The perfor- degradation is thus important when data tiles are yet to be mance is invariant to the original data set size. computed. One possibility is to adopt approaches such as BlinkDB [APM∗ 12] and online aggregation [HHW97] to rapidly provide approximate tiles on the fly based on data 8. Conclusion and Future Work samples. The resulting visualizations will have a coarser res- In this paper, we contribute methods for real-time visual olution and can be enhanced with more details after the full querying of big data. We first provide a review of data re- data tiles are computed. duction methods to support scalable visual summaries, and Another future step is to provide an interface for visu- examine the design space of binned plots for large multidi- alization construction. Currently imMens uses hand-coded mensional data. To enable real-time interaction, we integrate specifications. In contrast, Tableau’s drag-and-drop interface multivariate data tiles and parallel processing. [Tab] supports dynamic construction of multiscale visualiza- We implement these methods in imMens, a browser-based tions. We plan to combine a similar design with the visual- system using WebGL. We use WebGL because it is cur- ization and interaction facilities of imMens. rently the only standardized way to access GPU processing Finally, by focusing on 2-dimensional binned plots of in a web browser. Future browser-based GPGPU or parallel multi-dimensional data, we do not address scalability issues computing (multi-core) features will permit alternative ap- in higher-dimensional plots such as parallel coordinates or proaches. Our general approach of generating multivariate for complex data types such as networks. These remain as data tiles and leveraging parallel query processing is also di- interesting challenges for future research. rectly applicable in standard (non-web) desktop contexts. With a sustained performance of 50 frames-per-second Acknowledgments brushing and linking, imMens enables data analysts to in- This work was supported by the Intel Science and Technol- teractively examine summaries of billions of data records ogy Center for Big Data and the DARPA XDATA program. in real-time. To our knowledge, it is the first system to en- We thank Leilani Battle, Justin DeBrabant and Michael able real-time brushing with data sets this large. imMens is Stonebraker for their helpful insights and conversations. available as open source software at StanfordHCI/imMens. References One limitation of our approach is the lack of support [APM∗ 12] AGARWAL S., PANDA A., M OZAFARI B., M ADDEN for ad-hoc compound brushing of more than four dimen- S., S TOICA I.: BlinkDB: queries with bounded errors and bounded response times on very large data. arXiv:1203.5485 sions. For example, in the Brightkite visualizations in Fig- (Mar. 2012). 3, 9 ure 4, users may want to explore the geographic distribution [AS94] A HLBERG C., S HNEIDERMAN B.: Visual information of checkins across 24 hours in a particular day. To do so, seeking: Tight coupling of dynamic query filters with starfield they may select both a specific month and day, then perform displays. In Proceedings of CHI (1994), pp. 313–317. 1, 2 brushing and linking between the Hour histogram and the ge- [BBSBA08] BAGROW J. P., B OLLT E. M., S KUFCA J. D., B EN - ographic heatmap. Currently such compound queries require AVRAHAM D.: Portraits of complex networks. EPL (Europhysics computing 5-dimensional data tiles – an untenable increase Letters) 81, 6 (2008), 68004. 2, 3 in tile size. However, once queries become highly selective, [BCS96] B ECKER R. A., C LEVELAND W. S., S HYU M. J.: The the decrease in data size due to filtering may permit more re- visual design and control of trellis display. Journal of Computa- sponsive server-side querying and dynamic tile generation. tional and Graphical Statistics 5, 2 (1996), 123–155. 4 [BOH11] B OSTOCK M., O GIEVETSKY V., H EER J.: D3: Data- In future work, we plan to automatically optimize client- driven documents. IEEE TVCG 17, 12 (2011), 2301–2309. 7 side visualization specifications (including requested dimen- [BS06] B ERTINI E., S ANTUCCI G.: Give chance a chance: mod- sions and bin counts) based on available resources. For ex- eling density to enhance scatter plot quality through random data ample, the maximum size and number of textures supported sampling. Information Visualization 5, 2 (2006), 95–110. 2 by WebGL varies across machines and graphics cards. Given [BTS] Bureau of transportation statistics. a client’s resource constraints, reducing the number of bins 8 or dimensions may preserve quality of service. [CLNL87] C ARR D. B., L ITTLEFIELD R. J., N ICHOLSON A critical issue in providing a seamless user experience is W. L., L ITTLEFIELD J. S.: Scatterplot matrix techniques for large n. Journal of the American Statistical Association (1987), supporting natural transitions between levels of detail and vi- 424–436. 1, 2, 4 sualization configurations. Ideally all possible data tiles are [CML11] C HO E., M YERS S. A., L ESKOVEC J.: Friendship and pre-computed and stored on a server. The client might an- mobility: user movement in location-based social networks. In ticipate possible user interactions and prefetch data tiles to Proceedings of SIGKDD (2011), pp. 1082–1090. 3, 8 reduce latency [DBC∗ 13]. This strategy may not be realis- [CMN83] C ARD S. K., M ORAN T. P., N EWELL A.: The Psy- tic for high-dimensional data sets as preprocessing all pos- chology of Human-Computer Interaction. Psychology Press, sible data tiles can be computationally expensive. Graceful 1983. 5 © 2013 The Author(s) © 2013 The Eurographics Association and Blackwell Publishing Ltd.

10. Zhicheng Liu, Biye Jiang & Jeffrey Heer / imMens: Real-time Visual Querying of Big Data [CXGH08] C HAN S.-M., X IAO L., G ERTH J., H ANRAHAN P.: [Knu06] K NUTH D. E.: The art of computer programming, vol. 2. Maintaining interactivity while exploring massive time series. In Addison-Wesley, 2006. 6 Proceedings of VAST (Oct. 2008), pp. 59 –66. 2 [KPHH12] K ANDEL S., PAEPCKE A., H ELLERSTEIN J. M., [DBC∗ 13] D E B RABANT J., BATTLE L., C ETINTEMEL U., H EER J.: Enterprise data analysis and visualization: An inter- Z DONIK S., S TONEBRAKER M.: Techniques for visualizing view study. IEEE TVCG 18, 12 (2012), 2917–2926. 1 massive data sets. In New England Database Summit (Feb 2013). [KPP∗ 12] K ANDEL S., PARIKH R., PAEPCKE A., H ELLER - 2, 9 STEIN J. M., H EER J.: Profiler: Integrated statistical analysis [DRW03] D OSHI P. R., RUNDENSTEINER E. A., WARD M. O.: and visualization for data quality assessment. In Proceedings of Prefetching for visual data exploration. In Proceedings of the AVI (2012), pp. 547–554. 1, 2, 6, 8 Eighth International Conference on Database Systems for Ad- [Lea] Leaflet. 7 vanced Applications (2003), DASFAA ’03, pp. 195–203. 2 [Loh09] L OHR S. L.: Sampling: Design and Analysis, 2 ed. [DSLG∗ 12] DAS S ARMA A., L EE H., G ONZALEZ H., M ADHA - Duxbury Press, Dec. 2009. 2 VAN J., H ALEVY A.: Efficient spatial sampling of large ge- ographical tables. In Proceedings of SIGMOD (2012), ACM, [ME09] M C D ONNEL B., E LMQVIST N.: Towards utilizing pp. 193–204. 1, 2, 3 GPUs in information visualization: A model and implementa- tion of image-space operations. IEEE TVCG 15, 6 (Nov. 2009), [EF10] E LMQVIST N., F EKETE J. D.: Hierarchical aggregation 1105—1112. 2 for information visualization: Overview, techniques, and design guidelines. IEEE TVCG 16, 3 (2010), 439–454. 2, 3 [MTS91] M IHALISIN T., T IMLIN J., S CHWEGLER J.: Visualiz- ing multivariate functions, data, and distributions. IEEE Comput. [Fis07] F ISHER D.: Hotmap: Looking at geographic attention. Graph. Appl. 11, 3 (May 1991), 28–35. 1 IEEE TVCG 13, 6 (2007), 1184–1191. 1, 2, 5 [NH06] N OVOTNY M., H AUSER H.: Outlier-preserving focus+ [FP02] F EKETE J.-D., P LAISANT C.: Interactive information vi- context visualization in parallel coordinates. IEEE TVCG 12, 5 sualization of a million items. In Proceedings of InfoVis (2002), (2006), 893–900. 3 pp. 117–124. 1, 2 [PTMB09] P IRINGER H., T OMINSKI C., M UIGG P., B ERGER [FPDs12] F ISHER D., P OPOV I., D RUCKER S., SCHRAEFEL M .: W.: A multi-threading architecture to support interactive visual Trust me, i’m partially right: Incremental visualization lets ana- exploration. IEEE TVCG 15, 6 (Nov. 2009), 1113–1120. 2 lysts explore large datasets faster. In Proceedings of CHI (2012), pp. 1673–1682. 3 [PWR04] P ENG W., WARD M. O., RUNDENSTEINER E. A.: Clutter reduction in multi-dimensional data visualization using [GM99] G UPTA A., M UMICK I. S. (Eds.): Materialized views: dimension reordering. In Proceedings of InfoVis (2004), pp. 89– techniques, implementations, and applications. MIT Press, Cam- 96. 2 bridge, MA, USA, 1999. 1 [Raf05] R AFIEI D.: Effectively visualizing large networks [HB10a] H EER J., B OSTOCK M.: Crowdsourcing graphical per- through sampling. In Proceedings of VIS (2005), pp. 375–382. ception: using mechanical turk to assess visualization design. In 2 Proceedings of CHI (2010), pp. 203–212. 4 [RWC∗ 08] R ÜBEL O., W U K., C HILDS H., M EREDITH J., [HB10b] H EER J., B OSTOCK M.: Declarative language design G EDDES C. G., C ORMIER -M ICHEL E., A HERN S., W EBER for interactive visualization. IEEE TVCG 16, 6 (2010), 1149– G. H., M ESSMER P., H AGEN H., H AMANN B., B ETHEL E. W.: 1156. 2 High performance multivariate visual data exploration for ex- [HDS∗ 10] H AO M. C., DAYAL U., S HARMA R. K., K EIM tremely large data. In Proceedings of the 2008 ACM/IEEE con- D. A., JANETZKO H.: Visual analytics of large multi- ference on Supercomputing (2008), IEEE Press, pp. 51–62. 2 dimensional data using variable binned scatter plots. In Proceed- [SB09] S TONE M., BARTRAM L.: Alpha, contrast and the per- ings of SPIE (2010), Bibliothek der Universitat Konstanz. 2 ception of visual metadata. In Color Imaging Conf (2009). 4 [HHW97] H ELLERSTEIN J. M., H AAS P. J., WANG H. J.: On- [Sco79] S COTT D. W.: On optimal and data-based histograms. line aggregation. ACM SIGMOD Record 26, 2 (1997), 171–182. Biometrika 66, 3 (1979), 605–610. 4 3, 9 [Sco92] S COTT D. W.: Multivariate Density Estimation: Theory, [HS12] H EER J., S HNEIDERMAN B.: Interactive dynamics for Practice, and Visualization, 1 ed. Wiley, Aug. 1992. 4 visual analysis. Queue 10, 2 (2012), 30. 5 [Spo] TIBCO spotfire. 2 [JLJC05] J OHANSSON J., L JUNG P., J ERN M., C OOPER M.: Re- [Sto03] S TONE M.: A Field Guide to Digital Color, 1st ed. A K vealing structure within clustered parallel coordinates displays. Peters/CRC Press, Aug. 2003. 4 In Proceedings of InfoVis (Oct. 2005), pp. 125 – 132. 2 [Stu26] S TURGES H. A.: The choice of a class interval. Journal [JS98] J ERDING D. F., S TASKO J. T.: The information mural: A of the American Statistical Association 21, 153 (1926), 65–66. 4 technique for displaying and navigating large information spaces. IEEE TVCG 4, 3 (1998), 257–271. 1, 2 [Tab] Tableau software. 2, 9 [Kei00] K EIM D. A.: Designing pixel-oriented visualization tech- [TGC03] T RUTSCHL M., G RINSTEIN G., C VEK U.: Intelli- niques: Theory and applications. IEEE TVCG 6, 1 (2000), 59–78. gently resolving point occlusion. In Proceedings of InfoVis (Oct. 2 2003), pp. 131–136. 2 [KHD∗ 10] K EIM D. A., H AO M. C., DAYAL U., JANETZKO H., [UTH06] U NWIN A., T HEUS M., H OFMANN H.: Graphics of BAK P.: Generalized scatter plots. Information Visualization 9, Large Datasets: Visualizing a Million, 1 ed. Springer, July 2006. 4 (2010), 301–311. 2 1, 2, 4 [KMSH12] K AIRAM S., M AC L EAN D., S AVVA M., H EER J.: [ZBDS12] Z INSMAIER M., B RANDES U., D EUSSEN O., S TRO - GraphPrism: compact visualization of network structure. In Pro- BELT H.: Interactive level-of-detail rendering of large graphs. ceedings of AVI (2012), pp. 498–505. 2, 3 IEEE TVCG 18, 12 (Dec. 2012), 2486 –2495. 2 © 2013 The Author(s) © 2013 The Eurographics Association and Blackwell Publishing Ltd.