The BigDAWG Polystore System and Architecture

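Organizations often face the challenge of providing data management solutions for large, heterogeneous datasets that may have different underlying data and programming models. For example, a medical dataset may contain unstructured text, relational data, time-series waveforms and imagery. Trying to fit such datasets into a single data management system can have adverse effects on performance and efficiency. As part of the Intel Science and Technology Center on Big Data, we are developing a polystore system for such problems. BigDAWG (short for the Big Data Analytics Working Group) is a polystore system designed for complex problems that naturally span different processing or storage engines. BigDAWG provides an architecture that supports diverse database systems using different data models, and supports location transparency and semantic completeness through islands and a middleware that provides a uniform multi-island interface. Initial results from a prototype applied to a medical dataset validate the polystore concept. In this article, we describe polystore databases, the current BigDAWG architecture and its application to the MIMIC II medical dataset, initial performance results, and our future development plans.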

Vijay Gadepally∗†, Peinan Chen†, Jennie Duggan‡, Aaron Elmore§, Brandon Haynes‖, Jeremy Kepner∗†, Samuel Madden†, Tim Mattson¶, Michael Stonebraker†
∗ MIT Lincoln Laboratory  † MIT CSAIL  ‡ Northwestern University  § University of Chicago  ¶ Intel Corporation  ‖ University of Washington

The corresponding author, Vijay Gadepally, can be reached at vijayg [at] ll.mit.edu
978-1-5090-3525-0/16/$31.00 ©2016 IEEE

Abstract: Organizations are often faced with the challenge of providing data management solutions for large, heterogeneous datasets that may have different underlying data and programming models. For example, a medical dataset may have unstructured text, relational data, time-series waveforms and imagery. Trying to fit such datasets in a single data management system can have adverse performance and efficiency effects. As a part of the Intel Science and Technology Center on Big Data, we are developing a polystore system designed for such problems. BigDAWG (short for the Big Data Analytics Working Group) is a polystore system designed to work on complex problems that naturally span across different processing or storage engines. BigDAWG provides an architecture that supports diverse database systems working with different data models, support for the competing notions of location transparency and semantic completeness via islands, and a middleware that provides a uniform multi-island interface. Initial results from a prototype of the BigDAWG system applied to a medical dataset validate polystore concepts. In this article, we will describe polystore databases, the current BigDAWG architecture and its application on the MIMIC II medical dataset, initial performance results and our future development plans.

I. INTRODUCTION

Enterprises today encounter many types of databases, data, and storage models. Developing analytics and applications that work across these different modalities is often limited by the incompatibility of systems or the difficulty of creating new connectors and translators between each one. For example, consider the MIMIC II dataset [1], which contains deidentified health data collected from thousands of critical care patients in an Intensive Care Unit (ICU). This publicly available dataset (http://mimic.physionet.org/) contains structured data such as demographics and medications; unstructured text such as doctor and nurse reports; and time-series data of physiological signals such as vital signs and electrocardiogram (ECG). Each of these components of the dataset can be efficiently organized into database engines supporting different data models: for example, the structured data in a relational database, the text notes in a key-value or graph database and the time-series data in an array database. Analytics of the future will cross the boundaries of a single data modality, such as correlating information from a doctor’s note against the physiological measurements collected from a particular sensor. Further, the same dataset may be stored in different data engines and leveraged based on the data engine that provides the highest performance response to a particular query.

Such analytics on complex datasets call for the development of a new generation of federated databases that support seamless access to the different data models of database or storage engines. We refer to such a system as a polystore in order to distinguish it from traditional federated databases that largely supported access to multiple engines using the same data model.

As a part of the Intel Science and Technology Center (ISTC) on Big Data, we are developing the BigDAWG (short for Big Data Analytics Working Group) polystore system. The BigDAWG stack is designed to support multiple data models, real-time streaming analytics, visualization interfaces, and multiple databases. The current version of BigDAWG [2] shows significant promise and has been used to develop a series of applications for the MIMIC II dataset. The BigDAWG system supports multiple data stores; provides an abstraction of data and programming models through “islands”; provides a middleware and API that can be used for query planning, optimization and execution; and supports applications, visualization and clients. Initial results of applying the BigDAWG system to diverse data such as medical imagery or clinical records have shown the value of a polystore system in developing new solutions for complex data management.

The remainder of the article is organized as follows: Section II expands on the concept of polystore databases and the execution of polystore queries. Section III describes the current BigDAWG architecture and its application to the MIMIC II dataset. Section IV describes performance results on an initial BigDAWG implementation. Finally, we conclude and discuss future work in Section V.

II. POLYSTORE DATABASES

With the increased interest in developing storage and management solutions for disparate data sources, coupled with our belief that “one size does not fit all” [3], there is a renewed interest in developing database management systems (DBMSs) that can support multiple query languages and the complete functionality of underlying database systems. Prior work on federated databases such as Garlic [4], IBM DB2 [5] and others [6] has demonstrated the ability to provide a single interface to disparate DBMSs. Other related work in parallel databases [7] and computing [8], [9] has demonstrated the high performance that can be achieved by making use of replication, partitioning and horizontally scaled hardware.

Many of the federated database technologies concentrated on relational data. With the influx of different data sources such as text, imagery, and video, such relational data models may not support high performance ingest and query for these new data modalities. Further, supporting the types of analytics that users wish to perform (for example, a combination of convolution of time-series data, Gaussian filtering of imagery, topic modeling of text, etc.) is difficult within a single programming or data model.

Consider the simple performance curve of Figure 1, which describes an experiment where we performed two basic operations (counting the number of entries and extracting discrete entries) on a varying number of elements. As shown in the figure, for counting the number of entries, SciDB outperforms PostgreSQL by nearly an order of magnitude. We see the relative performance reversed in the case of extracting discrete entries.

Fig. 1: Time taken for various database operations in different database engines. The dashed lines correspond to a count operation in SciDB and PostgreSQL and the solid lines correspond to finding the number of discrete entries in SciDB and PostgreSQL. For the count operation, SciDB outperforms PostgreSQL whereas PostgreSQL outperforms SciDB for finding the number of discrete entries.

Many time-series, image or video storage systems are most efficient when using an array data model [10], which provides a natural organization and representation of data. Analytics on these data are often developed using linear algebraic operations such as matrix multiplication. In a simple experiment in which we performed matrix multiplication in PostgreSQL and SciDB, we observed nearly three orders of magnitude difference in performance time (for a 1000 × 1000 dense matrix multiplication, PostgreSQL takes approximately 166 minutes vs. 5 seconds in SciDB).

These results suggest that analytics in which one wishes to perform a combination of operations (for example, extracting the discrete entries in a dataset and using that result to perform a matrix multiplication operation) may benefit from performing part of the operation in PostgreSQL (extracting discrete entries) and the remaining part (matrix multiplication) in SciDB.

Extending the concept of federated and parallel databases, we propose a “polystore” database. Polystore databases can harness the relative strengths of underlying DBMSs. Unlike federated or parallel databases, polystore databases are designed to simultaneously work with disparate database engines and programming/data models while supporting the complete functionality of underlying DBMSs. In fact, a polystore solution may include federated and/or parallel databases as a part of the overall solution stack. In a polystore solution, different components of an overall dataset can be stored in the engine(s) that will best support high performance ingest, query and analysis. For example, a dataset with structured, text and time-series data may simultaneously leverage relational, key-value and array databases. Incoming queries may leverage one or more of the underlying systems based on the characteristics of the query. For example, performing a linear algebraic operation on time-series data may utilize just an array database; performing a join between time-series data and structured data may leverage array and relational databases respectively.

In order to support such expansive functionality, the BigDAWG polystore system (Figure 2) utilizes a number of features. “Islands” provide users with a number of programming and data model choices; “Shim” operations allow translation of one data model to another; and “Cast” operations allow for the migration of data from one engine or island to another. We go into greater depth of the BigDAWG architecture in Section III.

Fig. 2: The BigDAWG polystore architecture consists of four layers: engines, islands, middleware/API and applications.

III. BIGDAWG ARCHITECTURE

The BigDAWG architecture consists of four distinct layers as described in Figure 2: database and storage engines; islands; middleware and API; and applications. In this section, we discuss the current status of each of these layers as well as how they are used with the MIMIC II dataset.
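As a rough illustration of how the engine layer relates to the Section II example, the sketch below maps the data modalities a query touches to the engine families it would need. The mapping, the `Query` type and the `engines_for` function are hypothetical names introduced only for this illustration; they are not part of the BigDAWG API, and a real planner would also weigh measured performance rather than data model alone.

```python
from dataclasses import dataclass

# Hypothetical mapping from data modality to engine family, following the
# Section II example (structured -> relational, text -> key-value,
# time-series -> array). This is an assumption, not BigDAWG configuration.
ENGINE_FAMILY = {
    "structured": "relational",
    "text": "key-value",
    "time-series": "array",
}


@dataclass(frozen=True)
class Query:
    name: str
    modalities: frozenset  # data modalities the query touches


def engines_for(query: Query) -> list[str]:
    """Return the engine families a query would need, in sorted order."""
    return sorted({ENGINE_FAMILY[m] for m in query.modalities})


linear_algebra = Query("matrix multiply on waveforms", frozenset({"time-series"}))
join = Query("join waveforms with demographics",
             frozenset({"time-series", "structured"}))

print(engines_for(linear_algebra))  # ['array']
print(engines_for(join))            # ['array', 'relational']
```

The two sample queries mirror the text: a linear algebraic operation on time-series data needs only an array engine, while a join between time-series and structured data needs both array and relational engines.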

A. Database and Storage Engines

A key design feature of BigDAWG is the support of multiple database and storage engines. With the rapid increase in heterogeneous data and the proliferation of highly specialized, tuned and hardware accelerated database engines, it is important that BigDAWG support as many data models as possible. Further, many organizations already rely on legacy systems as a part of their overall solution. We believe that analytics of the future will depend on many distinct data sources that can be efficiently stored and processed only in disparate systems. BigDAWG is designed to address this need by leveraging many vertically-integrated data management systems.

For the MIMIC II dataset, we use the relational databases PostgreSQL and Myria [11] to store clinical data such as demographics and medications. BigDAWG uses the key-value store Apache Accumulo for freeform text data and to perform graph analytics [12]. For the historical waveform time-series data of various physiological signals, we use the array store SciDB [13]. Finally, for streaming time-series data, our application uses the streaming database S-Store [14].

B. Islands

The next layer of the BigDAWG stack is its islands. Islands allow users to trade off between semantic completeness (using the full power of an underlying database engine) and location transparency (the ability to access data without knowledge of the underlying engine). Each island has a data model, a query language or set of operators and one or more database engines for executing them. In the BigDAWG prototype, the user determines the scope of their query by specifying an island within which the query will be executed. Islands are a user-facing abstraction, and they are designed to reduce the challenges associated with incorporating a new database engine.

We currently support a number of islands. For example, the D4M island provides users with an associative array data model [15] to PostgreSQL, Accumulo, and SciDB. The Myria island exposes support for iteration over and efficient casting between the MyriaX, PostgreSQL and SciDB databases. We also support a number of degenerate islands that connect to a single database engine. These degenerate islands provide support for the full semantic power (programming and data model) of a connected database at the expense of location transparency.

C. BigDAWG Layer

The BigDAWG middleware consists of a number of components required to support the multiple islands, programming languages and query types that BigDAWG supports.

1) BigDAWG middleware: The BigDAWG middleware is responsible for receiving queries, query planning, determining efficient execution strategies, maintaining a history of previous queries, and maintaining a record of previous query performance. The architecture of the middleware is shown in Figure 3. The middleware has four components: the query planning module (planner) [16], the performance monitoring module (monitor) [17], the data migration module (migrator) [18] and the query execution module (executor) [19]. Given an incoming query, the planner parses the query into collections of objects and creates a set of possible query plan trees that also highlight the possible engines for each collection of objects. The planner then sends these trees to the monitor, which uses existing performance information to determine a tree with the best engine for each collection of objects (based on previous experience of a similar query). The tree is then passed to the executor, which determines the best method to combine the collections of objects and executes the query. The executor can use the migrator to move objects between engines and islands if required by the query plan.

Fig. 3: The BigDAWG middleware consists of four modules: planner, monitor, migrator and executor. A new query is first passed to the planner, which interacts with the monitor to develop a complete query plan. This is then passed to the executor, which leverages the migrator as needed to complete the query.

2) BigDAWG API: The BigDAWG interface provides a simple API to execute polystore queries. The API layer consists of server and client facing components. The server components incorporate the many possible islands, which connect to database engines via lightweight connectors referred to as shims. Shims essentially act as an adapter to go from the language of an island to the native language of an underlying database engine. In order to specify how a user is interacting with an island, a user specifies a scope in their query. The scope of a query allows an island to correctly interpret the syntax of the query and allows the island to select the correct shim that is needed to execute a part of the query. Thus, a cross-island query may involve multiple Scope operations. For example, let us suppose we have two tables A and B in a relational and array database, respectively. Suppose that we want to perform the cross-island query ARRAY(multiply(RELATIONAL(select * from A,...),B)), which takes all the data in table A and multiplies it with all the data in table B. In this case, the inner operation (RELATIONAL(...)) invokes the Relational scope and the outer operation invokes the Array scope. Moving the data between two engines can be done through the Cast operation. The Cast operation sends information about the translation between data models and moves the data as needed. In the example query, this may imply that the results of RELATIONAL(...), along with translational information about the resulting objects, are Cast to the engine where B resides.
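The nesting of Scope operations in the example query can be sketched with a small parenthesis-matching routine. This is only an illustration under stated assumptions: `SCOPES`, `scope_before` and `find_scopes` are hypothetical names, the query string is a simplified, fully balanced variant of the one in the text, and nothing here reflects how the BigDAWG planner actually parses queries.

```python
# Scope names taken from the example query; this tuple is an assumption
# made for the sketch, not a list of islands supported by BigDAWG.
SCOPES = ("ARRAY", "RELATIONAL")


def scope_before(query: str, i: int):
    """Return the scope name that ends right before position i, if any."""
    for name in SCOPES:
        j = i - len(name)
        if j < 0 or query[j:i] != name:
            continue
        # Reject matches that are only the tail of a longer identifier.
        if j > 0 and (query[j - 1].isalnum() or query[j - 1] == "_"):
            continue
        return name
    return None


def find_scopes(query: str):
    """Return (scope, body) pairs in the order their closing parentheses
    appear, which is innermost first for a single chain of nested scopes."""
    stack = []  # one entry per open parenthesis: (scope or None, body start)
    found = []
    for i, ch in enumerate(query):
        if ch == "(":
            stack.append((scope_before(query, i), i + 1))
        elif ch == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {i}")
            scope, start = stack.pop()
            if scope is not None:
                found.append((scope, query[start:i]))
    if stack:
        raise ValueError("unbalanced '(' in query")
    return found


# Simplified, fully balanced variant of the example query from the text.
query = "ARRAY(multiply(RELATIONAL(select * from A),B))"
scopes = find_scopes(query)
for scope, body in scopes:
    print(f"{scope}: {body}")

# For this single chain of nested scopes, each inner result has to be
# Cast into the scope that encloses it.
casts = [(inner, outer) for (inner, _), (outer, _) in zip(scopes, scopes[1:])]
print(casts)  # [('RELATIONAL', 'ARRAY')]
```

The inner `RELATIONAL(...)` body is found first, and the single inferred Cast (Relational to Array) corresponds to moving the result of the relational subquery to the engine where B resides.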

3) Polystore Queries: Efficient query execution is a key goal of the BigDAWG system. A key challenge is that the data being queried is likely to be distributed among two or more disparate data management systems. In order to support different islands, efficient data movement is also critical. Moreover, efficient execution may also depend on system parameters such as available resources or usage that are prone to change. To illustrate the mechanics of a polystore query, in this section we describe the simplest case where there is no replication, partitioned objects, expensive queries or attempts to move objects for load balancing. Given an incoming query, an execution plan for the query is based on whether the query is in a training or production phase.

The training phase is typically used for execution of queries that are new (either the query is new or the system has changed significantly since the last time a particular query was run) or are believed to have been poorly executed. In the simplest case, the training phase consists of queries that arrive with a “training” tag. In the training phase, we allow the query execution engine to generate a good query plan using any number of available resources. First, the query planner parses the query and assigns the scope of each piece of the query to a particular island. Pieces of the resulting subquery that are local to a particular storage engine are encapsulated into a container and given an identifying signature. For the remaining elements of the query (remainder), which correspond to cross-system predicates, we generate a signature by looking at the structure of the remainder, the objects being referenced and the constants in the query. If the remainder signature has been seen before, a query plan can be extracted. If not, the system decomposes the remainder to determine all possible query plans, which are then sent to the monitor.

To execute the query, the monitor feeds the queries to the executor, plus all of the containers, which are then passed to the appropriate underlying storage engine. For the cross-engine predicates, the executor decides how to perform each step. The executor runs each query, collects the total running time and other usage statistics and stores the information in the monitor database. This information can then be used to determine the best query plan in the production phase.

In the production phase, when a query is received it is first matched against the various signatures in the monitor database and the optimizer selects the closest one. The BigDAWG optimizer also compares the current usage statistics of the system against the usage statistics of the system when the training was performed. If there are large differences, the optimizer may select an alternate query plan that more closely resembles the current resources or system usage, or recommend that the user rerun the query in the training phase under the current usage. In cases where the signature of the incoming query does not match any existing signature, the optimizer may suggest the query run in training mode, or construct a list of plans as done in the training phase and have the monitor pick one at random. The remaining plans can then be run in the background of the system when it is underutilized. Over time, these plans are then added to the monitor database.

D. Applications and Visualizations

Polystore applications, visualizations and clients may need to interact with disparate database and storage engines. Through the BigDAWG API and middleware, these applications can use a single interface to any number of underlying systems. In order to minimize the impact to existing applications, the “island” interface allows users to develop their applications using the language(s) or data model(s) that most efficiently (or easily) represent the queries or analytics they are developing (or have already developed). For example, an application developed using SQL can leverage the relational island or a scientific application can leverage the array island. In both cases, the applications may talk to the same underlying data engines.

In our current implementation, BigDAWG supports a variety of visualization platforms such as Vega [20] and D3 [21]. Most recently, applying BigDAWG to the MIMIC II dataset allowed for the development of a number of polystore applications:

1) Browsing: This screen provides an interface to the full MIMIC II dataset, which is stored in different storage engines. This screen utilizes the open source tool ScalaR [22].
2) “Something interesting”: This application uses SeeDB [23] to highlight interesting trends and anomalies in the MIMIC II dataset.
3) “Text Analytics”: This application performs topic modeling of the unstructured doctor and nurse notes directly in a key-value store database using Graphulo [12] and correlates them with structured patient information stored in PostgreSQL.
4) “Heavy Analytics”: This application looks for hemodynamically similar patients in a dataset by comparing the signatures of historical ECG waveforms using Myria. We discuss this particular application in detail in Section IV-B.
5) “Streaming Analytics”: This application performs analytics on streaming time-series waveforms and can be used for extract-transform-load (ETL) via the data migrator into another database such as SciDB.

IV. BIGDAWG PERFORMANCE

The current reference implementation of the BigDAWG system satisfies two key performance goals: 1) the polystore architecture of Figure 2 can provide low overhead access to data in disparate engines and 2) polystore queries can outperform “one size fits all” solutions. In this section, we discuss performance results with respect to these two goals.

A. BigDAWG overhead

Providing low overhead access to data is an important element of the BigDAWG system. Low overhead ensures that the BigDAWG middleware and “island” architecture do not penalize clients or applications for using BigDAWG.
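One way to quantify middleware overhead is to compare the time a query takes through the middleware with the time it takes through the engine's native interface. The sketch below is a minimal illustration; the `overhead` function and the timings passed to it are made-up values for this example, not measurements from the paper.

```python
def overhead(native_ms: float, via_middleware_ms: float) -> tuple[float, float]:
    """Return (extra time in ms, extra time as a percentage of the total
    time observed when going through the middleware)."""
    if native_ms < 0 or via_middleware_ms <= 0:
        raise ValueError("timings must be positive")
    extra = via_middleware_ms - native_ms
    return extra, 100.0 * extra / via_middleware_ms


# Illustrative timings only (assumed, not taken from the paper's figures):
# the same fixed 10 ms cost added to a long query and to a short query.
long_extra, long_pct = overhead(native_ms=1800.0, via_middleware_ms=1810.0)
short_extra, short_pct = overhead(native_ms=20.0, via_middleware_ms=30.0)

print(long_extra, round(long_pct, 2))    # 10.0 0.55
print(short_extra, round(short_pct, 1))  # 10.0 33.3
```

With these assumed numbers, the same fixed cost is well under 1% of the long query but about a third of the short one, which is why a percentage figure should always be read alongside the absolute query time.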

Supporting low overhead queries is especially important for applications such as interactive analytics and visualizations [24]. In Figure 4, we show the overhead of executing queries to a single data engine via BigDAWG compared with the time taken for directly querying the database engine through its native interface. As we can observe, for most queries, the overhead incurred by using BigDAWG is a small percentage of the overall query time. There is a minimum overhead incurred, which may be a larger percentage for queries of shorter duration.

[Figure 4: bar chart titled “Overhead Incurred When Using BigDAWG For Common Database Queries”. Vertical axis: Time Taken (msec), 0 to 2000. Series: Overhead Incurred (ms); Query without BigDAWG (ms). Queries: Count (Postgres), Average (Postgres), Average (SciDB), Standard Deviation (SciDB), Count (SciDB), Distinct Values (SciDB).]

Fig. 4: Overhead of using BigDAWG to execute queries to PostgreSQL or SciDB. For most queries, the overhead is less than 1% of the overall execution time.

B. Polystore Analytic: Classifying Hemodynamic Deterioration

To demonstrate the performance advantages offered by a polystore in executing a complex analytical query, we replicated a process described by Saeed & Mark [25]. This process begins by identifying temporally-similar patterns in the physiologic measurements of patient data found in the MIMIC II dataset. These patterns are used as input to a classifier which identifies subsequent patients as being likely (or not) to experience hemodynamic deterioration.

Using the process described by Saeed & Mark, we first computed the Haar-basis transform [26] over the ECG waveforms of training patients. For each patient, we then binned the coefficients over each temporal scale and concatenated the resulting histograms into a single patient vector. We then normalized each patient vector by applying a term frequency-inverse document frequency (TF-IDF) computation. Finally, we classified a test patient by performing a k-nearest neighbor (k-NN) computation using these frequency-adjusted vectors.

We first executed this workflow on our polystore prototype under each of the Myria and SciDB degenerate islands. We then performed a multi-island execution designed to capture performance advantages that exist between each of these systems. This execution first computes the Haar-basis transform on SciDB and casts the intermediate coefficients to Myria, where the TF-IDF and k-NN computations are performed.

We trained a classifier under each configuration using 256-minute ECG vectors drawn from 600 patients present in the MIMIC II dataset and classified a single test patient. Each execution was performed on a cluster comprised of eight m4.large Amazon EC2 instances (https://aws.amazon.com/).

As illustrated in Figure 5, we found that performance under the hybrid configuration (32 seconds) exceeded performance under both the Myria and SciDB islands in isolation (77 and 240 seconds, respectively).

Our results highlight that substantial performance differences exist between systems when executing this complex analytical query. For example, the TF-IDF and k-NN computations are substantially faster in Myria than in SciDB, while wavelet transform performance under SciDB greatly exceeds that of Myria.

For this analytical query, the ability to capture these performance differences under a polystore yields a substantial performance benefit. More generally, our results support the notion that overall query performance may be improved by identifying and leveraging relative strengths of disparate database systems within a polystore, and that this improvement far exceeds the cost of inter-system data casts.

V. CONCLUSION AND FURTHER WORK

Future analytics will require access to disparate database management systems. Previously developed federated and parallel database engines provided a first step towards the solution but were largely limited to working with single data or programming models. The concept of polystore systems extends these concepts to support multiple query languages and disparate DBMSs. We described our architecture for such a polystore system, BigDAWG. A reference version of the BigDAWG architecture has been built and applied to the diverse MIMIC II medical dataset. Initial performance results validate that a polystore approach to data management can be applied without excessive overhead. Further, initial results on a polystore medical application reinforce the notion that we can achieve greater performance when using multiple storage engines that are optimized for particular operations and data models.

There are many areas of potential improvement of the BigDAWG system. For example, we are interested in developing more complex query planning and execution capabilities, increasing the number of supported islands and engines, and applying BigDAWG to a greater variety of datasets.

ACKNOWLEDGEMENT

This work was supported in part by the Intel Science and Technology Center (ISTC) for Big Data. The authors wish to thank our ISTC collaborators Kristin Tufte, Jeff Parkhurst, Stavros Papadopoulos, Nesime Tatbul, Magdalena Balazinska, Bill Howe, Jeffrey Heer, David Maier, Tim Kraska, Ugur Cetintemel, and Stan Zdonik. The authors also wish to thank the MIT Lincoln Laboratory Supercomputing Center for their help in maintaining the MIT testbed.

Fig. 5: Polystore analytic applied to medical dataset for 256-minute ECG vectors drawn from 600 patients. The polystore (Myria+SciDB) execution strategy outperforms a “one size” approach of using just Myria or SciDB.

REFERENCES

[1] M. Saeed, M. Villarroel, A. T. Reisner, G. Clifford, L.-W. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark, “Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public-access intensive care unit database,” Critical Care Medicine, vol. 39, no. 5, p. 952, 2011.
[2] A. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Cetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska et al., “A demonstration of the BigDAWG polystore system,” Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1908–1911, 2015.
[3] M. Stonebraker and U. Çetintemel, “‘One size fits all’: an idea whose time has come and gone,” in Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on. IEEE, 2005, pp. 2–11.
[4] M. J. Carey, L. M. Haas, P. M. Schwarz, M. Arya, W. Cody, R. Fagin, M. Flickner, A. W. Luniewski, W. Niblack, D. Petkovic et al., “Towards heterogeneous multimedia information systems: The Garlic approach,” in Research Issues in Data Engineering, 1995: Distributed Object Management, Proceedings. RIDE-DOM’95. Fifth International Workshop on. IEEE, 1995, pp. 124–131.
[5] P. Gassner, G. M. Lohman, K. B. Schiefer, and Y. Wang, “Query optimization in the IBM DB2 family,” IEEE Data Eng. Bull., vol. 16, no. 4, pp. 4–18, 1993.
[6] A. P. Sheth and J. A. Larson, “Federated database systems for managing distributed, heterogeneous, and autonomous databases,” ACM Computing Surveys (CSUR), vol. 22, no. 3, pp. 183–236, 1990.
[7] D. DeWitt and J. Gray, “Parallel database systems: the future of high performance database systems,” Communications of the ACM, vol. 35, no. 6, pp. 85–98, 1992.
[8] D. E. Hudak, N. Ludban, A. Krishnamurthy, V. Gadepally, S. Samsi, and J. Nehrbass, “A computational science IDE for HPC systems: design and applications,” International Journal of Parallel Programming, vol. 37, no. 1, pp. 91–105, 2009.
[9] S. Samsi, V. Gadepally, and A. Krishnamurthy, “MATLAB for signal processing on multiprocessors and multicores,” IEEE Signal Processing Magazine, vol. 27, no. 2, pp. 40–49, 2010.
[10] P. Cudré-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D. L. Wang, M. Balazinska, J. Becla et al., “A demonstration of SciDB: a science-oriented DBMS,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1534–1537, 2009.
[11] D. Halperin, V. Teixeira de Almeida, L. L. Choo, S. Chu, P. Koutris, D. Moritz, J. Ortiz, V. Ruamviboonsuk, J. Wang, A. Whitaker et al., “Demonstration of the Myria big data management service,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014, pp. 881–884.
[12] D. Hutchison, J. Kepner, V. Gadepally, and A. Fuchs, “Graphulo implementation of server-side sparse matrix multiply in the Accumulo database,” in High Performance Extreme Computing Conference (HPEC), 2015 IEEE. IEEE, 2015, pp. 1–7.
[13] M. Stonebraker, P. Brown, D. Zhang, and J. Becla, “SciDB: A database management system for applications with complex analytics,” Computing in Science & Engineering, vol. 15, no. 3, pp. 54–62, May 2013.
[14] U. Cetintemel, J. Du, T. Kraska, S. Madden, D. Maier, J. Meehan, A. Pavlo, M. Stonebraker, E. Sutherland, N. Tatbul et al., “S-Store: A streaming NewSQL system for big velocity applications,” Proceedings of the VLDB Endowment, vol. 7, no. 13, pp. 1633–1636, 2014.
[15] V. Gadepally, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, L. Edwards, M. Hubbell, P. Michaleas, J. Mullen, A. Prout, A. Rosa, C. Yee, and A. Reuther, “D4M: Bringing associative arrays to database engines,” in High Performance Extreme Computing Conference (HPEC), 2015 IEEE, Sept 2015, pp. 1–6.
[16] A. Gupta, V. Gadepally, and M. Stonebraker, “Cross-engine query execution in federated database systems,” in High Performance Extreme Computing Conference (HPEC), 2016 IEEE, Submitted.
[17] P. Chen, V. Gadepally, and M. Stonebraker, “The BigDAWG monitoring framework,” in High Performance Extreme Computing Conference (HPEC), 2016 IEEE, Submitted.
[18] A. Dziedzic, A. Elmore, and M. Stonebraker, “Data transformation and migration in polystores,” in High Performance Extreme Computing Conference (HPEC), 2016 IEEE, Submitted.
[19] S. Zuohao and J. Duggan, “BigDAWG polystore query optimization through semantic equivalences,” in High Performance Extreme Computing Conference (HPEC), 2016 IEEE, Submitted.
[20] A. Satyanarayan, R. Russell, J. Hoffswell, and J. Heer, “Reactive Vega: A streaming dataflow architecture for declarative interactive visualization,” Visualization and Computer Graphics, IEEE Transactions on, vol. 22, no. 1, pp. 659–668, 2016.
[21] M. Bostock, V. Ogievetsky, and J. Heer, “D3 data-driven documents,” Visualization and Computer Graphics, IEEE Transactions on, vol. 17, no. 12, pp. 2301–2309, 2011.
[22] L. Battle, M. Stonebraker, and R. Chang, “Dynamic reduction of query result sets for interactive visualization,” in Big Data, 2013 IEEE International Conference on. IEEE, 2013, pp. 1–8.
[23] M. Vartak, S. Madden, A. Parameswaran, and N. Polyzotis, “SeeDB: automatically generating query visualizations,” Proceedings of the VLDB Endowment, vol. 7, no. 13, pp. 1581–1584, 2014.
[24] A. Dziedzic, J. Duggan, A. J. Elmore, and V. Gadepally, “BigDAWG: a polystore for diverse interactive applications,” in Workshop on Data Systems for Interactive Analysis (DSIA) at IEEE VIS 2015, 2015.
[25] M. Saeed and R. G. Mark, “A novel method for the efficient retrieval of similar multiparameter physiologic time series using wavelet-based symbolic representations,” in AMIA, 2006.
[26] A. Haar, “Zur Theorie der orthogonalen Funktionensysteme,” Mathematische Annalen, vol. 69, no. 3, pp. 331–371, 1910.