The TileDB Array Data Storage Manager

Stavros Papadopoulos (Intel Labs & MIT), Kushal Datta (Intel Corporation), Samuel Madden (MIT), Timothy Mattson (Intel Labs)
{stavrosp, madden}@csail.mit.edu, {kushal.datta, timothy.g.mattson}@intel.com

ABSTRACT

We present a novel storage manager for multi-dimensional arrays that arise in scientific applications, which is part of a larger scientific data management system called TileDB. In contrast to existing solutions, TileDB is optimized for both dense and sparse arrays. Its key idea is to organize array elements into ordered collections called fragments. Each fragment is dense or sparse, and groups contiguous array elements into data tiles of fixed capacity. The organization into fragments turns random writes into sequential writes, and, coupled with a novel read algorithm, leads to very efficient reads. TileDB enables parallelization via multi-threading and multi-processing, offering thread-/process-safety and atomicity via lightweight locking. We show that TileDB delivers comparable performance to the HDF5 dense array storage manager, while providing much faster random writes. We also show that TileDB offers substantially faster reads and writes than the SciDB array database system with both dense and sparse arrays. Finally, we demonstrate that TileDB is considerably faster than adaptations of the Vertica relational column-store for dense array storage management, and at least as fast for the case of sparse arrays.

1. INTRODUCTION

Many scientific and engineering fields generate enormous amounts of data through measurement, simulation, and experimentation. Examples of data produced in such fields include astronomical images, DNA sequences, geo-locations, social relationships, and so on. All of these are naturally represented as multi-dimensional arrays, which can be either dense (when every array element has a value) or sparse (when the majority of the array elements are empty, i.e., zero or null). For instance, an astronomical image can be represented by a dense 2D array, where each element corresponds to a pixel. Geo-locations (i.e., points in a 2D or 3D coordinate space) can be represented by non-empty elements in a sparse 2D or 3D array that models the coordinate space.

Scientific array data can be very large, containing billions of non-null values that do not readily fit into the memory of a single or even multiple machines. As a result, many applications need to read and write both individual (random) elements and large sequential extents of these arrays to and from disk. Simply storing arrays as files forces application programmers to handle many issues, including array representation on disk (i.e., sparse vs. dense layouts), compression, parallel access, and performance. Alternatively, these issues can be handled by optimized, special-purpose array data storage management systems, which perform complex analytics on scientific data. Central to such systems are efficient data access primitives to read and write arrays. These primitives are the focus of this work.

1.1 Existing Array Management Systems

A number of other systems and libraries for managing and accessing arrays exist. HDF5 [16] is a well-known array data storage manager. It is a dense array format, coupled with a C library for performing the storage management tasks. Several scientific computing packages integrate HDF5 as the core storage manager (such as NetCDF-4 [9], h5py [6] and PyTables [13]). HDF5 groups array elements into regular hyper-rectangles, called chunks, which are stored on disk in binary format in a single large file. HDF5 suffers from two main shortcomings. First, it does not efficiently capture sparse arrays. A typical approach is to represent the denser regions of a sparse array as separate dense arrays, and store them in a (dense) HDF5 array of arrays. This requires enormous manual labor to identify dense regions and track them as they change. Second, HDF5 is optimized for in-place writes of large blocks. In-place updates result in poor performance when writing small blocks of elements that are randomly distributed, due to the expensive random disk accesses they incur. Parallel HDF5 (PHDF5) is a parallel version of HDF5 with some additional limitations: (i) it does not allow concurrent writes to compressed data, (ii) it does not support variable-length element values, and (iii) operation atomicity requires some coding effort from the user and imposes extra ordering semantics [4].
2 . Random writes are important in many applications. For in- called TileDB. TileDB is an open-source software system writ- stance, random writes occur with large OLAP cube applica- ten in C++ and maintained by Intel Labs and MIT. More infor- tions [1] and selective image processing (e.g., applying blur mation and code can be found at http://www.tiledb.org. and de-noising filters). In dense linear algebra, while tradi- TileDB is already in use by the Broad Institute, one of the tional algorithms [15] are designed around regular updates of largest genomics institutes in the world, for storing many Ter- large blocks, more recent algorithms based on DAG schedul- abytes of genomics data, which are modeled as a huge sparse ing [11] often result in small block updates irregularly dis- 2D array [2]. The focus of this paper is on TileDB’s stor- tributed throughout the matrix. When combined with Asyn- age module (we defer a discussion of TileDB’s distributed ar- chronous Many-Task runtime systems [3, 7] for extreme scale chitecture and complex analytics engine to follow-up work). systems, this irregularity requires an array storage manager op- Throughout the rest of the paper, we use TileDB to refer to the timized for random updates of small blocks. storage manager component of the larger TileDB system. Array-oriented databases implement their own storage man- TileDB exposes a C library for efficient writes and reads agers and, thus, they can arguably serve as the storage layer for to arrays using a novel dense and sparse on-disk representa- other scientific applications built on top (see [24] for a survey tion, while supporting important features such as compression, of these systems). SciDB [18] is a popular array database, parallelism, and more. Contrary to specialized software like which however also has significant storage management lim- GenericIO, TileDB handles arrays of arbitrary dimensionality itations. It is not truly designed for sparse arrays as it relies and schema and, thus, can be readily used in a wide range of on regular dimensional chunking (similar to HDF5). It also large-scale scientific applications. The key idea of TileDB is requires reading an entire chunk even if a small portion of it that it organizes array elements into ordered collections called is requested, and updating an entire chunk even when a sin- fragments. Fragments may overlap in the index space of the gle element changes, leading to poor performance in the case array and are used to represent batches of updates to the ar- of random writes. ArrayStore [25] propose optimizations for ray. They may also have either a dense or a sparse repre- the sparse case on top of SciDB, including grouping underuti- sentation. Dense fragments group their elements into regular- lized chunks into bigger ones to address imbalance. However, sized chunks in the index space, which we call data tiles. A the original chunks still remain the atomic units of processing, sparse fragment represents an element by explicitly storing its thus many of the above problems persist. indices in a specific, global order. Sparse elements are grouped Alternatively, relational databases (e.g., MonetDB[20] or into data tiles of fixed element capacity (in contrast to SciDB), Vertica [21]) have been used as the storage backend for array which balances I/O cost when accessing the array. 
management (e.g., in RAM [26] for dense and SRAM [19] for The concept of fragments dramatically improves write per- sparse arrays), storing non-empty elements as records and ex- formance as writes are performed in batches, and each batch plicitly encoding the element indices as extra table columns. (which may consist of many small, random writes) is written Subarray queries (used to retrieve any part of the array) are to a separate fragment sequentially. Most importantly, sparse performed via optimized SELECT WHERE SQL queries on fragments can be used to speed up random writes even in dense the array indices. This explicit index encoding leads to poor arrays. In contrast to HDF5’s in-place updates, TileDB batches performance in the case of dense arrays. Similar to RAM and random requests and sequentially writes many at a time as a SRAM, RasDaMan [17] stores array data as binary large ob- sparse fragment. This approach also enables efficient repre- jects (BLOBs) in PostgreSQL [12], and enables query process- sentation and update of variable-length values (such as strings). ing via a SQL-based array query language called RasQL. Like Turning random-writes into sequential appends is a classic RAM and SRAM, array storage and access has to go through technique in storage systems [23], but implementing it in an PostgreSQL, which is typically quite slow. Moreover, Ras- array storage manager is non-trivial. This is because multiple DaMan is designed mainly for dense arrays; its support for fragments may cover the same logical region of an array, and sparse arrays relies on the user manually decomposing the ar- some fragments may be sparse, which makes it hard for the ray space into irregular regions of approximately equal number read algorithm to find the most recent fragment that updated of non-empty elements, which is cumbersome (similar to the each returned element. TileDB implements an efficient read case of HDF5). algorithm that does this; it also avoids unnecessary tile reads A second concern with the database approach (for some when a portion of a fragment is totally covered by a newer users) is that databases run as a separate server that is accessed fragment. Finally, as the number of fragments grows and read via SQL queries over a ODBC/JDBC interface. In contrast, performance may degrade, TileDB employs a consolidation programmers of high-performance scientific and image pro- algorithm that merges multiple fragments into a single one. cessing applications who make heavy use of arrays usually Consolidation can be performed in the background while other want API-based access to directly write and read array ele- concurrent reads and writes are performed. ments; they either prefer libraries like HDF5 for dense data, TileDB supports parallelism through both multi-threading or develop their own special-purpose tools (such as Generi- (pthreads and OpenMP) and multi-processing (e.g., via forks cIO [5], designed to manage sparse 3D particles). and MPI). It provides thread-/process-safety with lightweight locking. Moreover, it guarantees operation atomicity, without 1.2 Our Approach imposing any additional ordering constraints (this is in con- In this paper we introduce the first array storage manager trast to parallel implementations of HDF5, like PHDF5 [10]). that is optimized for both dense and sparse arrays, and which The key contributions of this paper are the following: lies at the core of a new scientific data management system 350
3 . • We describe the TileDB storage manager designed to This is sufficient to capture many applications, as long as support both sparse and dense arrays. TileDB uses tiles each data item can be uniquely identified by a set of coordi- of fixed element capacity, making efficient use of the nates. For example, in an imaging application, each image can I/O subsystem. It also employs a fragment-based layout, be modeled by a 2D dense array, with each cell representing which makes all writes sequential, and implements pro- the RGB values for that pixel. Another application is Twitter, tocols for efficiently reading elements in the presence of where geo-tagged tweets can be stored in a 2D sparse array multiple fragments, as well as a consolidation algorithm whose cells are identified by float geographical coordinates, for compacting many fragments into one. with each tweet stored in a variable-length char attribute. Figure 1 shows two example arrays, one dense one sparse, • We describe how these features allow TileDB to provide which have two dimensions, rows and columns. Both di- desirable properties for parallel programming, such as mensions have domain [1,4]. The arrays have two attributes, thread- and process-safety, as well as atomicity of con- a1 of type int32 and a2 of type variable char. Each cell is current reads and writes to a single array. identified by its coordinates, i.e., a pair of rows, columns. In our examples, empty cells are shown in white, and non- • We conduct a thorough experimental evaluation. We empty cells in gray. In the dense array, every cell has an at- show that, for dense arrays, TileDB delivers comparable tribute tuple, e.g., cell (4,4) contains 15, pppp , whereas sev- performance to HDF5, while being orders of magnitude eral cells in the sparse array are empty, e.g., cell (4,1). faster on random element writes. We also demonstrate Dense array Sparse array that TileDB offers orders of magnitude faster reads and writes than SciDB in most settings for both dense and dimensions columns columns sparse arrays. In addition, TileDB is 2x-40x faster than 1 2 3 4 1 2 3 4 0 1 4 5 0 1 2 Vertica for dense arrays, and at least as fast as Vertica for 1 a bb e ff 1 a bb ccc 2 3 7 3 sparse arrays. Finally, we confirm the effectiveness of rows 2 ccc dddd 6 ggg hhhh rows 2 dddd our multi-fragment read and consolidation, and the fact 3 8 9 12 13 empty 3 4 6 7 i jj m nn e ggg hhhh that TileDB offers excellent scalability, both in terms of domain [1,4] 1o 11 14 15 cell 5 parallelism and when operating on large datasets. 4 kkk llll ooo pppp 4 ff tuple of cell (4,4) <a1 (int32), a2 (var char)> = <15, pppp> 2. TileDB OVERVIEW attributes Data Model. TileDB provides an array-oriented data model Figure 1: The logical array view that is similar to that of SciDB [18]. An array consists of di- mensions and attributes. Each dimension has a domain, and all Global cell order. Whenever multi-dimensional data is stored dimension domains collectively orient the logical space of the to disk or memory it must be laid out in some linear order, array. A combination of dimension domain values, referred since these storage media are single-dimensional. This choice to as coordinates, uniquely identifies an array element, called of ordering can significantly impact application performance, cell. A cell may either be empty (null) or contain a tuple of since it affects which cells are near each other in storage. In attribute values. 
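To make the data model concrete, the sketch below mirrors the two example arrays of Figure 1 as plain C++ structures. The type names (Dimension, Attribute, ArraySchema) are illustrative stand-ins and are not part of the TileDB C API.

```cpp
// Hypothetical schema description for the example arrays of Figure 1.
// These type names are illustrative only; they are not the TileDB C API.
#include <cstdint>
#include <string>
#include <vector>

struct Dimension {
  std::string name;
  int64_t domain_min, domain_max;     // e.g., rows and columns over [1,4]
};

struct Attribute {
  std::string name;
  std::string type;                   // e.g., "int32" or "var char"
};

struct ArraySchema {
  std::vector<Dimension> dimensions;  // orient the logical space of the array
  std::vector<Attribute> attributes;  // each non-empty cell stores one value per attribute
  bool dense;                         // dense: every cell has a tuple; sparse: most cells empty
};

int main() {
  ArraySchema example{
      {{"rows", 1, 4}, {"columns", 1, 4}},
      {{"a1", "int32"}, {"a2", "var char"}},
      /*dense=*/true};
  (void)example;
}
```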
Each attribute is either a primitive type (int, TileDB, we call the mapping from multiple dimensions to a float, or char), or a fixed or variable-sized vector of a linear order the global cell order. primitive type. For instance, consider an attribute represent- A desirable property is that cells that are accessed together ing complex numbers. Each cell may then store two float should be co-located on the disk and in memory, to minimize values for this attribute, one for the real part and the other for disk seeks, page reads, and cache misses. The best choice of the imaginary part. As another example, a string can be repre- global cell order is dictated by application-specific characteris- sented as a vector of characters. tics; for example, if an application reads data a row-at-a-time, Arrays in TileDB can be dense, where every cell contains an data should be laid out in rows. A columnar layout in this case attribute tuple, or sparse, where some cells may be empty; all will result in a massive number of additional page reads. TileDB operations can be performed on either type of array, TileDB offers various ways to define the global cell order but the choice of dense vs. sparse can significantly impact for an array, enabling the user to tailor the array storage to his application performance. Typically, an array is stored in sparse or her application for maximum performance. For dense ar- format if more than some threshold fraction of cells are empty, rays, the global cell order is specified in three steps. First, the which highly depends on the application. user decomposes the dimensional array domain into space tiles All dimension domains have the same type. For dense ar- by specifying a tile extent per dimension (e.g., 2×2 tiles). This rays, only int dimensions are supported; for sparse arrays effectively creates equi-sized hyper-rectangles (i.e., each con- float dimensions are also allowed. This is because TileDB taining the same number of cells) that cover the entire logical materializes the coordinates of the non-empty cells for sparse space. Second, the user determines the cell order within each arrays and, thus, can store them as real numbers (see Sec- space tile, which can be either row-major or column-major. tion 3). In other words, TileDB natively supports continuous Third, the user determines a tile order, which is also either multi-dimensional spaces, without the need for discretization. row-major or column-major. Figure 2 shows the global cell On the other hand, dense arrays have inherently discrete do- orders resulting from different choices in these three steps (the mains that are naturally represented as integers. space tiles are depicted in blue). 351
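As an illustration of how these three choices determine a linear order, the sketch below computes the rank of a cell in the global cell order for a 2D dense array with row-major tile order and row-major cell order (the middle layout of Figure 2). It assumes 0-based coordinates and a domain that divides evenly into space tiles; it is not TileDB source code.

```cpp
// Rank of cell (row, col) in the global cell order: first order by space tile
// (row-major tile order), then by position within the tile (row-major cell order).
#include <cassert>
#include <cstdint>

// ncols: number of columns in the array domain (assumed divisible by tile_cols).
int64_t global_rank(int64_t row, int64_t col,
                    int64_t tile_rows, int64_t tile_cols, int64_t ncols) {
  const int64_t tiles_per_row = ncols / tile_cols;
  const int64_t tile_idx = (row / tile_rows) * tiles_per_row + (col / tile_cols);
  const int64_t in_tile  = (row % tile_rows) * tile_cols + (col % tile_cols);
  return tile_idx * (tile_rows * tile_cols) + in_tile;  // position in the linear order
}

int main() {
  // Cell (2,3) of the 4x4 example (1-based), i.e., (1,2) 0-based, is the 7th
  // cell in the global order used in Figure 6, i.e., rank 6.
  assert(global_rank(1, 2, 2, 2, 4) == 6);
}
```

For a fixed-sized attribute, the file offset of a cell's value in a dense fragment is simply this rank multiplied by the size of the attribute type, which is why dense fragments need essentially no per-cell bookkeeping (Section 3).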
4 . space tile extents: 4x2 space tile extents: 2x2 space tile extents: 2x2 tile order: row-major tile order: row-major tile order: column-major cell order: row-major cell order: row-major cell order: row-major Fragments. A fragment is a timestamped snapshot of a batch of array updates, i.e., a collection of array modifications car- space tiles ried out via write operations and made visible at a particular time instant. For instance, the initial loading of the data into the array constitutes the first array fragment. If at a later time a set of cells is modified, then these cells constitute the second array fragment, and so on. In that sense, an array is comprised Figure 2: Global cell orders in dense arrays of a collection of array fragments, each of which can be re- garded as a separate array, whose collective logical overlap The notion of a global cell order also applies to sparse ar- composes the current logical view of the array. rays. However, creating sparse tiles is somewhat more com- A fragment can be either dense or sparse. Dense fragments plex because simply using tiles of fixed logical size could lead are used only with dense arrays, but sparse fragments may be to many empty tiles. Even if we suppressed storage of empty applied to both dense and sparse arrays. tiles, skew in many sparse datasets would create tiles of highly Figure 4 shows an example of an array with three fragments; varied capacity, leading to ineffective compression, bookkeep- the first two are dense and the third is sparse. Observe that the ing overheads, and some very small tiles where seek times rep- second fragment is dense within a hyper-rectangular subarray, resent a large fraction of access cost. Therefore, to address the whereas any cell outside this subarray is empty. The figure case of sparse tiles, we introduce the notion of data tiles. also illustrates the collective logical view of the array; the cells of the most recent fragments overwrite those of the older ones. Data tiles. A data tile is a group of non-empty cells. It is the atomic unit of compression and has a crucial role during reads Fragment #1 Fragment #2 Fragment #3 (dense) (dense) (sparse) (explained in Section 4.1). Similarly to a space tile, a data tile 1 2 3 4 1 2 3 4 1 2 3 4 is enclosed in logical space by a hyper-rectangle. For dense 1 0 1 4 5 1 1 a bb e ff arrays, each data tile has a one-to-one mapping to a space tile, 2 3 6 7 2 ccc dddd ggg hhhh 2 2 i.e., it encompasses the cells included in the space tile. 3 8 9 12 13 3 112 113 3 208 212 213 For the sparse case, TileDB instead allows the user to spec- i 1o jj 11 m nn 15 M 114 NN 115 u 211 x yy 14 ify a data tile capacity, and creates the data tiles such that they 4 kkk llll ooo pppp 4 OOO PPPP 4 wwww all have the same number of non-empty cells, equal to the ca- pacity. To implement this, assuming that the fixed capacity is Collective logical array view denoted by c, TileDB simply traverses the cells in the global # 1 1 2 3 4 nt me 0 1 4 5 cell order imposed by the space tiles and creates one data tile F rag t# 2 1 a bb e ff en 3 gm t# en 2 3 6 7 for every c non-empty cells. A data tile of a sparse array is Fra gm 2 Fra ccc dddd ggg hhhh 3 208 9 212 213 represented in the logical space by the tightest hyper-rectangle u jj x yy 1o 211 that encompasses the non-empty cells it groups, called the 4 114 kkk wwww OOO PPPP 115 minimum bounding rectangle (MBR). 
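The following sketch shows one way the data tiling of a sparse fragment can be formed: the non-empty cells, already sorted in the global cell order, are cut into groups of a fixed capacity and the MBR of each group is recorded. The structure and function names are illustrative, not taken from the TileDB code base.

```cpp
// Group non-empty cells (sorted in the global cell order) into data tiles of
// fixed capacity and record each tile's minimum bounding rectangle (MBR).
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

struct Coords { int64_t row, col; };

struct DataTile {
  std::vector<Coords> cells;        // up to `capacity` cells (last tile may hold fewer)
  std::array<int64_t, 4> mbr;       // {min_row, max_row, min_col, max_col}
};

std::vector<DataTile> make_data_tiles(const std::vector<Coords>& sorted_cells,
                                      size_t capacity) {
  std::vector<DataTile> tiles;
  for (size_t i = 0; i < sorted_cells.size(); i += capacity) {
    DataTile t;
    t.mbr = {INT64_MAX, INT64_MIN, INT64_MAX, INT64_MIN};
    size_t end = std::min(i + capacity, sorted_cells.size());
    for (size_t j = i; j < end; ++j) {
      const Coords& c = sorted_cells[j];
      t.cells.push_back(c);
      t.mbr[0] = std::min(t.mbr[0], c.row);  // expand the MBR as coordinates are seen
      t.mbr[1] = std::max(t.mbr[1], c.row);
      t.mbr[2] = std::min(t.mbr[2], c.col);
      t.mbr[3] = std::max(t.mbr[3], c.col);
    }
    tiles.push_back(t);
  }
  return tiles;
}

int main() {
  // The sparse array of Figure 1 has 8 non-empty cells; with capacity 2 this
  // yields 4 data tiles, as sketched in Figure 3 (the exact MBRs depend on the
  // chosen global cell order).
  std::vector<Coords> cells = {{1,1},{1,2},{1,4},{2,3},{3,1},{4,2},{3,3},{3,4}};
  auto tiles = make_data_tiles(cells, 2);
  (void)tiles;
}
```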
Figure 3 illustrates var- ious data tiles resulting from different global cell orders, as- Figure 4: Fragment examples suming that the tile capacity is 2. The space tiles are depicted in blue color and the (MBR of the) data tiles in black. Note that Fragments are the key concept that enables TileDB to per- data tiles in the sparse case may overlap, but each non-empty form rapid writes. If they cover relatively disjoint subareas cell corresponds to exactly one data tile. of the domain, or if their number is modest (e.g., hundreds to thousands as shown in our experiments in Section 6), then their presence does not significantly affect the read performance. space tile extents: 4x2 space tile extents: 2x2 space tile extents: 2x2 tile order: row-major cell order: row-major tile order: row-major cell order: row-major tile order: column-major cell order: row-major If numerous fragments are created and the read performance space tiles becomes unsatisfactory, TileDB employs an efficient consoli- dation mechanism that coalesces fragments into a single one. Consolidation can happen in parallel in the background, while reads and writes continue to access the array. The concept of data tiles fragments and their benefits are explained in Section 4. Array metadata. TileDB stores two types of metadata about Figure 3: Data tiles in sparse arrays an array: the array schema and the fragment bookkeeping. The former contains information about the definition of the Compression. TileDB employs tile-based compression. Ad- array, such as its name, the number, names and types of di- ditionally, it allows the user to select different compression mensions and attributes, the dimension domain, the space tile schemes on a per-attribute basis, as attributes are stored sepa- extents, data tile capacity, and the compression types. The lat- rately (as discussed in Section 3). The API supports defining ter summarizes information about the physical organization of different compression schemes for each attribute, but currently the stored array data in a fragment (explained in Section 3). only gzip is implemented. In the future, TileDB will be able to System architecture. The TileDB storage manager is exposed support more compression schemes, such as RLE, LZ, as well as a C API library to users. The system architecture is de- as user-defined schemes. scribed in Figure 5. The basic array operations are init, 352
5 .write, read, consolidate and finalize; these are space tile extents: 2x2 tile order: row-major thoroughly explained in Section 4. The storage manager keeps cell order: row-major Files 1 2 3 4 in-memory state for the open arrays, i.e., arrays that have been 0 1 4 5 (binary format) 1 a bb e initialized and not yet finalized. This state keeps the book- 2 3 6 ff 7 a1.tdb 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 2 ccc dddd ggg hhhh keeping metadata for each fragment of every open array in 3 8 9 12 13 a2.tdb 0 1 3 6 10 11 13 16 20 21 23 26 30 31 33 36 memory; this is shared between multiple threads that access i 1o jj 11 m 14 nn 15 a2_var.tdb a bb ccc dddd e ff ggg hhhh i jj kkk llll m … 4 the same array at the same time. Locks are used to mediate kkk llll ooo pppp access to the state (covered in detail in Section 5). Each thread has its own array object, which encapsulates a read state and Figure 6: Physical organization of dense fragments one fragment object with its own read and write state. except that they contain only non-empty cells. That is, the cell values follow the global cell order in the attribute files, Storage Manager Array and there is an extra file for variable-sized attributes storing C API Read State the starting offsets of the variable-sized cell values. Unlike init Locking Fragment dense arrays where the offset of each cell value can be di- write Open Arrays Bookkeeping rectly computed, the same is not possible in sparse arrays, read consolidate Read State since the exact distribution of the non-empty cells is unknown. Bookkeeping Write State finalize In order to locate the non-empty cells, TileDB stores an addi- ... tional file with the explicit coordinates of the non-empty cells ( coords.tdb in the figure), which are again serialized in Figure 5: The TileDB storage manager architecture the global cell order. Note that the coordinates follow the order of the dimensions, as specified in the array schema. 3. PHYSICAL ORGANIZATION space tile extents: 2x2 tile order: row-major An array in TileDB is physically stored as a directory in cell order: row-major Files 1 2 3 4 (binary format) the underlying file system. The array directory contains one 0 1 2 a1.tdb 0 1 2 3 4 5 6 7 1 a bb ccc sub-directory per array fragment, plus a file storing the array 3 0 1 3 6 10 11 13 16 2 a2.tdb schema in binary format. Every fragment directory contains dddd 4 a2_var.tdb one file per fixed-sized attribute, and two files per variable- 3 e 6 ggg 7 hhhh a bb ccc dddd e ff ggg hhhh 5 sized attribute. The attribute files store cell values in the global 4 ff __coords.tdb 1,1 1,2 1,4 2,3 3,1 4,2 3,3 3,4 cell order in a binary representation. Storing a file per attribute is a form of “vertical partitioning”, and allows efficient subset- Figure 7: Physical organization of sparse fragments ting of the attributes in read requests. Each fragment directory also contains a file that stores the bookkeeping metadata of the In order to achieve efficient reads, TileDB stores two types fragment in binary, compressed format. We explain the physi- of bookkeeping information about the data tiles of a sparse cal storage for dense and sparse arrays with examples below. array. Recall that a data tile is a group of non-empty cells Figure 6 shows the physical organization of the dense ar- of fixed capacity, associated with an MBR. TileDB stores the ray of Figure 1, assuming that it follows the global cell order MBRs of the data tiles in the bookkeeping metadata, which illustrated in the middle array of Figure 2. 
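As a concrete picture of the directory layout just described, the sketch below lists the files one would expect for a single fragment of the example array, using the attribute file names of Figure 6. The schema and bookkeeping file names are assumptions, since the text only states that such files exist; the helper itself is purely illustrative.

```cpp
// One directory per array, one sub-directory per fragment, one file per
// fixed-sized attribute and two files per variable-sized attribute.
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> fragment_files(const std::string& array_dir,
                                        const std::string& fragment_name) {
  const std::string frag = array_dir + "/" + fragment_name;
  return {
      array_dir + "/__array_schema.tdb",  // array schema in binary format (assumed name)
      frag + "/a1.tdb",                   // fixed-sized attribute: cell values in global order
      frag + "/a2.tdb",                   // variable-sized attribute: starting offsets
      frag + "/a2_var.tdb",               // variable-sized attribute: the values themselves
      frag + "/__book_keeping.tdb",       // per-fragment bookkeeping metadata (assumed name)
  };
}

int main() {
  for (const auto& f : fragment_files("my_array", "fragment_0"))
    std::cout << f << "\n";
}
```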
The cell values facilitate the search for non-empty cells during a read request. along the fixed-sized attribute a1 are stored in a file called In addition, it stores bounding coordinates, which are the first a1.tdb, always following the specified global cell order. At- and last cell of the data tile along in the global cell order. These tribute a2 is variable-sized and, thus, TileDB uses two files are important for reads as well, as explained in Section 4.1. to store its values. File a2 var.tdb stores the variable- A final remark on physical storage concerns compression. sized cell values (serialized in the global cell order), whereas As mentioned earlier, TileDB compresses each attribute data a2.tdb stores the starting offsets of each variable-sized cell tile separately, based on the compression type specified in the value in file a2 var.tdb. This enables TileDB to efficiently array schema for each attribute. In other words, for every at- locate the actual cell value in file a2 var.tdb during read tribute file, TileDB compresses the data corresponding to each requests. For instance, cell (2,3) appears 7th in the global cell tile separately prior to writing them into the file. Note that dif- order. Therefore, TileDB can efficiently look-up the 7th off- ferent tiles may be compressed into blocks of different size, set in file a2.tdb (since the offsets are fixed-sized), which even for fixed-sized attributes. TileDB maintains bookkeeping points to the beginning of the cell value ggg in a2 var.tdb. information in order to be able to properly locate and decom- Using this organization, TileDB does not need to maintain press each tile upon a read request. any special bookkeeping information. It only needs to record the subarray in which the dense fragment is constrained, plus 4. OPERATIONS some extra information in the case of compression. In this section, we describe the core functions of TileDB. Figure 7 illustrates the physical organization of the sparse array of Figure 1, assuming that it follows the global cell order 4.1 Read of the middle array of Figure 3. The attribute files of sparse Read returns the values of any subset of attributes inside arrays are organized in the same manner as those of dense, a user-supplied subarray. The result is sorted on the global 353
6 .cell order within the subarray. The user specifies the subarray instance, if the subarray covers the entire tile [1:2, 1:2] (i.e., and attributes in the init call. During initialization, TileDB the upper left tile) of our dense array examples, then a single loads the bookkeeping data of each array fragment from the range is created, [(1, 1), (2, 2)]. If the subarray covers only disk into main memory. This bookkeeping data is of negligible one row of this tile, say [1:1, 1:2], then the range [(1, 1), (1, 2)] size (a few KB) in the dense case, whereas in the sparse case it is created. However, if it covers a column, say [1:2, 1:1] (recall depends on the tile capacity. For typical tile capacities of 10K- that our cell order is row-major), then two ranges are created, 1M, the bookkeeping size is 4-6 orders of magnitude smaller namely [(1, 1), (1, 1)] and [(2, 1), (2, 1)]. This is because we than the total array size. cannot construct a single range such that (1, 1) and (2, 1) ap- The user calls read one or more times with allocated buffers pear adjacent in the global cell order. In other words, the al- that will store the results, one per fixed-length attribute, two gorithm operates on whole qualifying ranges of cells stored per variable-lengthed attribute, along with their sizes. TileDB contiguously on the disk, rather than individual cells, thus opti- populates the buffers with the results, following the global mizing performance. For sparse fragments, similar ranges are cell order within the specified subarray. For variable-sized created, but this time considering the bounding coordinates of attributes, the first buffer contains the starting cell offsets of each tile overlapping with the subarray. Note that these ranges the variable-sized cells in the second buffer. For variable- are sparse, i.e., they may cover also empty cells. length attributes and sparse arrays, the result size may be un- Next, the algortihm creates one tuple [sc, ec], fid for each predictable, or exceed the size of the available main memory. such range of fragment fid , and inserts them into a priority To address this issue, TileDB gracefully handles buffer over- queue pq. The comparator of pq gives precedence to tuples flow. If the read results exceed the size of some buffer, TileDB with smallest sc value, breaking ties by giving precedence to fills in as much data as it can into the buffer and returns. The the tuple with the largest fid . For each tuple, through fid we user can then consume the current results, and resume the pro- can know whether it comes from a dense or sparse fragment. cess by invoking read once again. TileDB maintains read Once a priority queue has been built for T , the algorithm state that captures the location where reading stopped. pops a tuple at a time from pq (we call this tuple popped). The The main challenge of reads is the presence of multiple frag- algorithm compares popped to the new top tuple, emitting new ments in the array. This is because the results must be returned result tuples for the second stage of the algorithm to consume in the global cell order, and read cannot simply search each and re-inserting some tuples into pq. Figure 8 illustrates the fragment individually; if a recent fragment overwrites a cell possible overlaps between popped and top tuples. 
Here a solid of an older fragment, no data should be retrieved from the old thick line corresponds to a range of a dense fragment, a dashed fragment for this cell, but the old fragment should be used for thin line to a range of a sparse fragment, and a double solid line adjacent cells. TileDB implements a read algorithm that al- to a range of either a dense or a sparse fragment (recall that a lows it to efficiently access all fragments, skipping data that dense array may consist of both dense and sparse fragments). does not qualify for the result. The algorithm is slightly differ- In case (i), since popped does not overlap with top, it is ent for dense and sparse arrays, explained in turn below. guaranteed not to overlap with any other range of any other fragment (by the definition of pq) and, thus, it is simply in- Dense arrays. The algorithm has two stages. The first com- serted in the result. In case (ii), only the disjoint portion of putes a sorted list of tuples of the form [sc, ec], fid ; [sc, ec] the popped range can be inserted in the result. The other part is a range of cells between start coordinates sc and end coor- must be re-inserted in pq, because the overlapping portion of dinates ec (inclusive), where sc precedes or is equal to ec in top may be from a more recent fragment (since the priority the global cell order, and fid is a fragment id. Each fragment queue is sorted by sc). This re-insertion will result in (iii) is assigned a unique fid based on its timestamp, with larger or (iv) triggering. In case (iii), popped is a range of a dense fid values corresponding to more recent fragments (ties are fragment that begins at the same sc value as top. This must broken arbitrarily based on the id of the thread or process that come from a more recent fragment than top, thus top can be created the fragment). The following rules hold: (i) all ranges discarded (the disjoint portion of top is re-inserted into pq.) must be disjoint, (ii) the ranges must be sorted in the global Note, however, that popped may still overlap with some other cell order, (iii) the ranges in the ordered list must contain all older fragment starting at a later sc value, hence it cannot be and only the actual, up-to-date result cells, and (iv) the cells emitted yet. Instead, the algorithm pops pq and processes the covered in each range must appear contiguously on the disk. new top versus this popped; this will generally result in case Given this list, the second stage retrieves the actual attribute (i), (ii) or (iii) being triggered. values from the respective fragment files, with a single file I/O Finally, case (iv) applies when popped is sparse. Recall into the user buffers for each range of each requested attribute. that initially the sparse ranges inserted into pq may also cover We focus the rest of this section on the first stage, as the sec- empty cells. Thus, the algorithm identifies the first two non- ond stage is simple (given sc and ec, it easy to calculate their empty cells in popped, which requires reading from the data offsets in the files via closed formulas). tile (these are shown as x’s in the figure). It then inserts the first The first stage proceeds iteratively, with each iteration oper- coordinate as a unary dense tuple in pq, to be handled by case ating on a single space tile. Space tiles are visited in the global (iii). Finally, it re-inserts the truncated sparse range starting at order. 
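The priority queue pq at the heart of this first stage can be sketched as follows: tuples are ordered by their start coordinates sc, with ties broken in favor of the most recent fragment (largest fid). Coordinates are reduced here to their rank in the global cell order to keep the comparator simple; this illustrates the ordering rule only, not the actual TileDB implementation.

```cpp
// Priority queue of <[sc, ec], fid> tuples for the first stage of the read algorithm.
#include <cstdint>
#include <queue>
#include <vector>

struct Range {
  int64_t sc;   // start coordinates, as a rank in the global cell order
  int64_t ec;   // end coordinates (inclusive)
  int64_t fid;  // fragment id; larger means more recent
};

struct RangeCmp {
  // std::priority_queue pops the "largest" element, so the comparison is
  // inverted: the tuple with the smallest sc (and, on ties, the largest fid)
  // must compare as the largest.
  bool operator()(const Range& a, const Range& b) const {
    if (a.sc != b.sc) return a.sc > b.sc;  // smaller sc pops first
    return a.fid < b.fid;                  // on ties, the more recent fragment pops first
  }
};

int main() {
  std::priority_queue<Range, std::vector<Range>, RangeCmp> pq;
  pq.push({0, 3, 1});  // range from an older fragment
  pq.push({0, 1, 2});  // newer fragment starting at the same cell: pops first
  pq.push({4, 7, 1});
  // pq.top() is now {0, 1, 2}; the algorithm compares it against the next top
  // tuple and emits or re-inserts ranges following cases (i)-(iv) of Figure 8.
}
```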
For each tile T , the algorithm considers each fragment the second coordinate into pq, as shown in the figure. overlapping T . It computes a set of disjoint coordinate ranges In summary, the algorithm performs a right sweep on the [sc, ec] that are in the query subarray within T , such that the global cell order, and inserts disjoint ranges into the ordered cells in these ranges are adjacent in the global cell order. For list, such that recent fragments always overwrite older ones. 354
7 . check against new top insert to result popped [ ] (i) popped (iii) top [ )[ ] memory is managed solely by the user. The write function top discard re-insert to pq simply appends the values from the buffers into the corre- global cell order global cell order re-insert to pq as dense sponding attribute files, writing them sequentially, and with- insert to result re-insert to pq re-insert to pq as sparse out requiring any additional internal buffering (i.e., it writes (ii) popped [ )[ ] (iv) popped x [x ] top top directly from the user buffers to the files). Note that the user global cell order global cell order may call write repeatedly with different buffer contents, in the event that memory is insufficient to hold all the data at Figure 8: Possible overlaps of popped and top tuples in pq once. Each write invocation results in a series of sequential writes to the same fragment, one write per attribute file. Due to this invariant, the emitted list abides by the rules we stated above and, thus, the algorithm is correct. Sparse fragments. There are two different modes for creat- Sparse arrays. The algorithm for sparse arrays is very sim- ing sparse fragments. In the first, similar to the dense case, ilar to that for dense. There are only two differences in the the user provides cell buffers to write. These buffers must first stage of computing the sorted cell range list. First, it- be sorted in the global cell order. There are three differences eration does not focus on space tiles (since they may be too with the dense case. First, the user provides values only for the large in sparse arrays). Instead, it focuses only on ranges that non-empty cells. Second, the user includes an extra buffer with start before the minimum end bounding coordinate of a data the coordinates of the non-empty cells. Third, TileDB main- tile in a fragment, and then proceeds by properly updating the tains some extra write state information for each created data next minimum end bounding coordinate. This keeps the queue tile (recall that a data tile may be different from a space tile). small in each iteration. Second, case (iii) of Figure 8 never Specifically, it counts the number of cells seen so far and, for arises, since a sparse array consists only of sparse fragments. every c cells (where c is the data tile capacity specified upon array creation), TileDB stores the minimum bounding rectan- TileDB supports various read I/O methods, such as POSIX gle (MBR) and the bounding coordinates (first and last coor- read, memory map (mmap), MPI-IO [14], even asynchronous dinates) of the data tile. Note that the MBR initially includes I/O, which can be configured easily even upon run-time. The the first coordinates in the data tile, and expands as more co- benefits of each method vary based on the application. ordinates are seen. For each data tile, TileDB stores its MBR The above read algorithm demonstrates the importance of and bounding coordinates into the bookkeeping structures of the global cell order, which maps a multi-dimensional prob- the fragment. lem into a single-dimensional one that can be handled more Sparse fragments are typically created when random up- intuitively and efficiently. Because the algorithm spends more dates arrive at the system, and it may be cumbersome for the time when the cell ranges overlap considerably across frag- user to sort the random cells along the global order. 
To han- ments, it is beneficial for the user to co-locate updates when dle this, TileDB also enables the user to provide unsorted cell possible across batch writes. If this is not possible and the buffers to write. TileDB internally sorts the buffers (using number of fragments becomes excessive, read performance multiple threads), and then proceeds as explained above for the may degrade. This is mainly because I/O is not performed sorted case. The only difference is that each write call in un- sequentially, since data retrievals from multiple fragment files sorted mode creates a separate fragment. This is very similar are interleaved. Moreover, storage space is wasted if there are to the sorted runs created in the traditional merge sort algo- many overwrites in both dense and sparse arrays, or numerous rithm. In Section 4.3 we explain that TileDB can effectively deletions in sparse arrays. TileDB offers a way to consolidate merge all the sorted fragments into a single one. multiple fragments into a single one, described in Section 4.3. Sparse fragments also support deletions. TileDB handles 4.2 Write deletions as insertions of empty cells (TileDB uses special constants for identifying values of “empty” cells). The user Write loads and updates data in TileDB arrays. Each write can then test if a retrieved cell is empty or not (using the spe- session writes cells sequentially in batches, creating a separate cial constants). If the number of deletions is high (e.g., many fragment. A write session begins when an array is initialized in more deletion entries appear in a subarray query than actual write mode (with init), comprises one or more write calls, results), the read performance will likely be impacted. In such and terminates when the array is finalized (with finalize). cases, the user can invoke consolidate (see Section 4.3), Depending on the way the array is initialized, the new frag- which merges the fragments purging the deletion entries. ment may be dense or sparse. A dense array may be updated Two final notes concern compression and variable-length with dense or sparse fragments, while a sparse array is only up- attributes, applicable to both dense and sparse fragments. If dated with sparse fragments. We explain the creation of dense compression is enabled for some attribute, TileDB buffers the and sparse fragments separately below. cell values corresponding to the same data tile for that at- Dense fragments. Upon array initialization, the user specifies tribute. When the tile fills up, TileDB compresses it internally the subarray region in which the dense fragment is constrained and appends it to the corresponding file. For variable-length (it can be the entire domain). Then, the user populates one attributes, write takes two buffers from the user; the first buffer per array attribute, storing the cell values respecting the stores the variable-length cell values, and the second stores global cell order. This means the user must be aware of the the starting offsets of each cell value in the first buffer. TileDB tiling, as well as the tile and cell order specified upon the ar- uses this information to create the extra attribute file that stores ray creation (HDF5 and SciDB operate similarly). The buffer the starting cell offsets, as explained in Section 3. 355
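To illustrate the two-buffer representation used for variable-sized attributes in both reads and writes, the sketch below serializes the a2 values of Figure 1 into a value buffer and a starting-offset buffer; the resulting offsets (0, 1, 3, 6, 10, ...) are exactly the contents of a2.tdb in Figure 6. The helper function is illustrative and not the TileDB API.

```cpp
// Build the value buffer and the starting-offset buffer for a variable-sized attribute.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

void build_var_buffers(const std::vector<std::string>& cells,
                       std::string& values, std::vector<uint64_t>& offsets) {
  values.clear();
  offsets.clear();
  for (const auto& cell : cells) {
    offsets.push_back(values.size());  // starting offset of this cell's value
    values += cell;                    // values serialized back to back, in the global cell order
  }
}

int main() {
  std::vector<std::string> a2 = {"a", "bb", "ccc", "dddd", "e", "ff", "ggg", "hhhh"};
  std::string values;
  std::vector<uint64_t> offsets;
  build_var_buffers(a2, values, offsets);
  for (uint64_t o : offsets) std::cout << o << " ";  // 0 1 3 6 10 11 13 16
  std::cout << "\n" << values << "\n";               // abbcccddddeffggghhhh
}
```

Because the offsets are fixed-sized, looking up the k-th offset is a direct array access, which is what makes the read-time lookup into the _var file cheap.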
8 .4.3 Consolidate it so that reads simply see the logical view of the array with- Consolidation takes a set of fragments as input and pro- out the new fragment. This is in contrast to PHDF5, which duces a single new output fragment. It is easy to implement in imposes ordering constraints to ensure atomicity. TileDB given the multi-fragment read algorithm discussed in Finally, consolidation can be performed in the background Section 4.1: it simply repeatedly performs a read on the en- in parallel with other reads and writes. Locking is required tire domain, providing buffers whose size is configurable de- only for a very brief period. Specifically, consolidation is per- pending on the available memory. After every read, the func- formed independently of reads and writes. The new fragment tion invokes a write command which writes the retrieved that is being created is not visible to reads before consolidation cells into the new fragment. Due to the way it handles buffer is completed. The only time when locking is required is after overflow, TileDB stops reading when the buffers are full. If the consolidation finishes, when the old fragments are deleted any of the read fragments are dense, the consolidated fragment and the new fragment becomes visible. TileDB enforces its is also dense. If all fragments are sparse, then the consolidated own file locking at this point. After all current reads release fragment is sparse. Since the cells are read in the global cell their shared lock, the consolidation function gets an exclusive order, the resulting fragment stores the cells in the global cell lock, deletes the old fragments, makes the new fragment vis- order as well. In addition, the consolidation process discards ible, and releases the lock. After that point, any future reads any deleted entries. Upon successful termination, the function see the consolidated fragment. deletes the old fragments. The new fragment is identified by a timestamp that is larger than that of all the old fragments. 6. EXPERIMENTS Consolidating many small fragments with a few large ones In this section we experimentally evaluate TileDB in a num- can be costly, since the total consolidation time is dominated ber of dense and sparse array settings. We focus on 3 competi- by reading and writing the large fragments. This suggests con- tors, 2D arrays, and fixed-length attributes due to space limi- solidation should be applied on fragments of approximately tations. However, TileDB is an active project; we will keep on equal size. Moreover, read performance degrades when many posting benchmark results and include future suggested com- data tiles of different fragments overlap in the logical space. parisons online at www.tiledb.org. Thus, it is important to consolidate those fragments before others. To address these issues, TileDB enables the user to Competitors. We compared TileDB against HDF5 [16] (for run consolidation on a specific subset of fragments. An ad- dense arrays), SciDB [18] (for dense and sparse arrays), and ditional benefit of selective consolidation applies specifically Vertica [21] (for dense and sparse arrays). For all parallel for the case where all the fragments have approximately the experiments, we used the parallel version of HDF5, namely same size. In particular, the user can implement a hierarchical PHDF5. 
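Returning to the consolidation procedure of Section 4.3, the sketch below makes its read-then-write loop concrete. The Reader and Writer types are hypothetical stand-ins for TileDB's read and write calls, with toy in-memory implementations so the control flow can be run end to end under a fixed buffer budget.

```cpp
// Consolidation loop: repeatedly read the full domain into a fixed-size buffer
// and append the retrieved cells to a single new fragment.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Reader {                                   // hypothetical stand-in for read()
  std::vector<int> cells{1, 2, 3, 4, 5, 6, 7};    // pretend: all result cells, in global order
  size_t pos = 0;
  // Fills buf with as many cells as fit; returns false once everything is consumed.
  bool read(std::vector<int>& buf) {
    size_t n = std::min(buf.size(), cells.size() - pos);
    std::copy(cells.begin() + pos, cells.begin() + pos + n, buf.begin());
    buf.resize(n);
    pos += n;
    return pos < cells.size();
  }
};

struct Writer {                                   // hypothetical stand-in for write()
  std::vector<int> new_fragment;
  void write(const std::vector<int>& buf) {
    new_fragment.insert(new_fragment.end(), buf.begin(), buf.end());  // sequential append
  }
};

void consolidate(Reader& r, Writer& w, size_t buffer_cells) {
  bool more = true;
  while (more) {
    std::vector<int> buf(buffer_cells);  // buffer size is configurable (memory budget)
    more = r.read(buf);                  // read stops when the buffer is full
    w.write(buf);                        // cells are written in the global cell order
  }
  // The old fragments are then deleted and the new fragment made visible under
  // a brief exclusive lock.
}

int main() {
  Reader r; Writer w;
  consolidate(r, w, /*buffer_cells=*/3);
  std::printf("consolidated %zu cells\n", w.new_fragment.size());  // 7
}
```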
SciDB and Vertica are complete database systems, consolidation mechanism similar to that employed in LSM- but we only focused on their storage management capabilities Trees [22], to amortize the total consolidation cost. on a single node, namely writes (load and update) and reads (various types of subarray queries), as well as parallel execu- tion. All our benchmarks can found at https://github.com/ Intel-HLS/TileDB/tree/benchmarks. 5. PARALLEL PROGRAMMING TileDB allows both concurrent reads and writes on single System configuration. We performed all our experiments on array, offers thread-/process-safety and atomic reads and writes. an Intel R x86 64 platform with a 2.3 GHz 36-core CPU and It also enables consolidation to be performed in the background 128 GB of RAM, running CentOS6. We utilized a 4 TB, 7200 without obstructing concurrent reads and writes. Fragments rpm Western Digital HDD and a 480 GB Intel R SSD both enable all of these features as writes happen to independent not RAID-ed and equipped with the ext4 file system. Because fragments, and partial fragments are not visible to readers. the HDD is I/O bound even with a single thread, we ran serial Concurrent writes are achieved by having each thread or experiments on the HDD. Because the SSD is so much faster, process create a separate fragment. No internal state is shared we ran our parallel experiments on SSDs. and, thus, no locking is necessary. Each thread/process creates We compared against SciDB v15.12, Vertica v7.02.0201, a fragment with a unique name, using its id and the current and HDF5 v1.10.0. TileDB is implemented in C++, and the time. Thus, there are no conflicts even at the file system level. source code is available online at www.tiledb.org. For Reads from multiple processes are independent and no lock- TileDB, SciDB and HDF5, we experimented both with and ing is required. Every process loads its own bookkeeping data without gzip compression, using compression level 6. Note from the disk, and maintains its own read state. For multi- that PHDF5 does not support parallel load or updates on com- threaded reads, TileDB maintains a single copy of the frag- pressed datasets. In addition, Vertica cannot be used without ment bookkeeping data, utilizing locking to protect the open compression; the default compression method is RLE, which arrays structure so that only one thread modifies the bookkeep- is currently not supported by TileDB. Therefore, we used Ver- ing data at a time. Reads themselves do not require locking. tica with gzip (compression level 6). Note that Vertica uses a Concurrent reads and writes can be arbitrarily mixed. Each fixed 64 KB compression block. We gave an unlimited cache fragment contains a special file in its sub directory that in- size to HDF5, SciDB and Vertica. TileDB does not use a dicates that this fragment should be visible. This file is not caching mechanism, but it rather relies on the OS caching. created until the fragment is finalized, so that fragments under Before every experiment we flush all caches, and after every creation are not visible to reads. Fragment-based writes make load/update we sync the files. Finally, note that Vertica by 356
9 .default uses all the available cores in the system. In order to The input data are saved as binary files. The cell values are vary the number of threads used in our parallel experiments, sorted in the global cell order. TileDB and HDF5 read the in- we simply shut down a subset of the cores at the OS level. put file in internal buffers, and then issue write commands Datasets. For dense arrays, we constructed synthetic 2D ar- that write cells in batches. In SciDB, we follow the recom- rays of variable sizes, with a single int32 attribute, since mended method that loads the chunks directly into the array the distribution of cell values in our experiments affects only without the expensive redimension process. compression (we perform writes and reads, not array computa- TileDB matches the performance of HDF5 (it takes only 60s tions). To make compression effective, we stored in each cell to load 4GB), and it is consistently more than an order of mag- (i, j) the value i ∗ #col + j, where #col is the number of nitude faster than SciDB. A dollar (’$’) symbol indicates that columns in the array (this led to a 2.9 compression ratio). The the system did not manage to terminate within the shown time. array domain type is int64. For the case of sparse arrays, we Note that we experimented also with larger tile sizes for the used datasets retrieved from the AIS database [8], which was case of 4 GB dataset (plot omitted), and found that all systems collected by the National Oceanic and Atmospheric Adminis- are unaffected by the tile size. However, the performance of tration by tracking ship vessels in the U.S. and international TileDB+Z, HDF5+Z and SciDB+Z deteriorates as we increase waters. We extracted attributes X (longitude), Y (latitude), the tile size; this is due to overheads of of gzip on large files. 104 SOG, COG, Heading, ROT, Status, VoyageID, and MMSI. We 4 TileDB TileDB+Z TileDB TileDB+Z used X,Y as the dimensions (i.e., we created a 2D array), and 3 HDF5 HDF5+Z $ $ HDF5 SciDB SciDB SciDB+Z the rest characteristics as the attributes. SciDB does not sup- 103 Time (x1000 s) SciDB+Z Time (s) port real dimension values, thus we converted X and Y into 2 int64, transforming the domain of (X,Y) to [0, 360M], [0, 102 180M]. For simplicity, we represented all attributes as int64. 1 The resulting array is very sparse and skewed; most ship loca- 101 tions appear around the coasts of the U.S., they become very 4GB 8GB Dataset size 16GB 1 2 4 # Instances 8 16 sparse as they move into the ocean, and are never on land. (a) vs. dataset size (HDD) (b) vs. # instances (SSD) Takeaways. Our evaluation reveals the following main re- sults: (i) TileDB is several orders of magnitude faster than Figure 9: Load performance of dense arrays HDF5 in the case of random element updates for dense arrays, and at least as efficient in other settings, offering up to 2x bet- Figure 9(b) assesses the cost of loading data in parallel, for ter performance on compressed data; (ii) TileDB outperforms the 4GB array. For SciDB, we spawn multiple worker in- SciDB in all settings, generally by several orders of magni- stances, whereas for TileDB and HDF5 we spawn multiple tude; (iii) TileDB is 2x-40x faster than Vertica in the dense MPI processes. Each instance writes an equal number of tiles case. It is also at least as fast as Vertica in the sparse case, to the array. For SciDB, we follow the recommended approach featuring up to more than 2x faster parallel reads; (iv) the per- that avoids redimension. 
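For reference, the synthetic dense datasets described above can be reproduced with a few lines of code: each cell (i, j) of the single int32 attribute holds the value i * #col + j, written out in binary. The sketch below emits a plain row-major dump with placeholder dimensions and file name; the actual load input is serialized in the global cell order of the target array, as described in Section 2.

```cpp
// Generate a binary dump of the synthetic dense array: cell (i, j) = i * #col + j.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const int64_t rows = 1000, cols = 1000;  // placeholders; the paper uses e.g. 50,000 x 20,000
  std::vector<int32_t> row_buf(cols);
  std::FILE* f = std::fopen("dense_input.bin", "wb");
  if (!f) return 1;
  for (int64_t i = 0; i < rows; ++i) {
    for (int64_t j = 0; j < cols; ++j)
      row_buf[j] = static_cast<int32_t>(i * cols + j);  // cell (i, j) = i * #col + j
    std::fwrite(row_buf.data(), sizeof(int32_t), row_buf.size(), f);
  }
  std::fclose(f);
}
```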
We activate as many CPU cores formance of the read algorithm in TileDB is robust up to a as the number of tested instances. We exclude HDF5+Z, since large number of fragments, whereas the consolidation mecha- PHDF5 does not allow parallel writes with compression. nism requires marginally higher time than that needed for the TileDB and HDF5 are unaffected by the level of parallelism, initial loading; (v) TileDB exhibits excellent scalability as the as they are I/O bound. Moreover, each process writes large dataset size and level of parallelism increase. data chunks at a time monopolizing the internal parallelism of the SSD and, thus, no extra savings are noticed with parallel 6.1 Dense Arrays processes. SciDB seems to scale because it is CPU bound. We compared versus HDF5, SciDB and Vertica. HDF5 and Moreover, the compressed versions of all systems scale nicely, SciDB are natural competitors for the dense case. On the other since they are CPU bound as well due to the CPU cost of gzip hand, there are multiple ways to use Vertica for dense array compression. The performance of TileDB and HDF5 match, management. In the following we focus mostly on HDF5 and whereas TileDB and TileDB+Z are between 2x and 7x faster SciDB, including a discussion on Vertica and a summary of than SciDB and SciDB+Z. our results at the end of the sub section. Update. Figure 10(a) shows the time to perform random ele- Load. Figure 9(a) shows the time to load a dense array into ment updates to the 4 GB array, as a function of the number TileDB, HDF5 and SciDB, as well as their gzip-compressed of updated elements. We generated random element coordi- versions, denoted TileDB+Z, HDF5+Z and SciDB+Z, respec- nates and stored them in a binary file. In HDF5 we batched tively. In this experiment, only one CPU core is activated the random updates using select elements, whereas in and each system runs on a single instance. We generated 2D SciDB we loaded the updates into a sparse array and then synthetic arrays with sizes 4 GB, 8 GB, and 16 GB (a scala- called insert with redimension. We exclude HDF5+Z bility experiment with much larger arrays is described later), as it does not support updates on compressed data. and dimension domains 50, 000 × 20, 000, 50, 000 × 40, 000, TileDB is up to more than 2 orders of magnitude faster than and 100, 000 × 40, 000, respectively. For all methods, we fix HDF5, with 100K updates running in less than 1s in TileDB the space tile extents (chunk extents in HDF5 and SciDB) to vs. more than 100s in HDF5, and more than 4 orders of magni- 2, 500 × 1, 000, which effectively creates tiles/chunks of size tude faster than SciDB. This is due to the sequential, fragment- 10 MB. The cell and tile order is row-major for all systems. based writes of TileDB, as opposed to the in-place updates 357
10 .in HDF5 and the chunk-based updates in SciDB. The perfor- Figure 11(b) plots the time as a function of the number of mance gap increases with the number of updated elements. sequential tiles read. This experiment shows that all methods 106 TileDB 106 TileDB scale linearly with the number of tiles, maintaining their per- 105 TileDB+Z HDF5 105 TileDB+Z HDF5 formance differences. SciDB SciDB 104 104 SciDB+Z SciDB+Z $ $ $ $ $ $$ $ 104 106 103 TileDB TileDB 103 Time (s) Time (s) TileDB+Z TileDB+Z 103 105 102 HDF5 HDF5 2 HDF5+Z 4 HDF5+Z 10 10 101 102 SciDB SciDB SciDB+Z SciDB+Z 101 103 100 101 Time (s) Time (s) 0 -1 102 10 10 100 101 10-1 10-2 -1 1K 10K 100K 1 2 4 8 16 10 100 # Updates # Instances 10-2 10-1 (a) vs. # updates (HDD) (b) vs. # instances (SSD) 10-3 10-2 Tile Par Col 1K 10 100 Subarray type # Tiles Figure 10: Random update performance of dense arrays (a) vs. subarray type (HDD) (b) vs. # tiles (HDD) Figure 10(b) illustrates the update cost on SSDs when vary- 10 7 TileDB 10 7 TileDB TileDB+Z TileDB+Z ing the number of instances and fixing the number of element 106 HDF5 HDF5+Z 106 HDF5 HDF5+Z updates to 100K. We split the updates equally into separate 105 SciDB 105 SciDB SciDB+Z SciDB+Z 4 4 Time (s) Time (s) binary files, one for each instance. TileDB and HDF5 are un- 10 $ $ $ $ 10 $$ $$ $$ $$ $$ 3 3 affected by the number of instances and, thus, all the previ- 10 10 2 2 10 10 ous observations hold. SciDB performance improves with the 101 101 number of instances. However, it is still up to more than 3 100 100 orders of magnitude slower than TileDB. 1K 10K 100K 1 2 4 8 16 # Elements # Instances We performed an additional experiment with large sequen- tial writes (e.g., whole tiles), not shown here. Note that in that (c) vs. # elements (HDD) (d) vs. # instances (SSD) experiment the performances of TileDB and HDF5 match, and Figure 11: Subarray performance for dense arrays both systems are substantially faster than SciDB (the obser- vations are very similar to the case of single-instance load). Figure 11(c) shows read performance when reading ran- Finally, we experimented with different tile/chunk sizes (plot dom elements instead of contiguous subarrays. For this ex- omitted due to space limitations). TileDB and HDF5 are prac- periment, we generated random coordinates for the elements, tically unaffected by this parameter (TileDB writes sequen- which we batched in HDF5 with the select elements tially to a new fragment, whereas HDF5 writes in-place, uti- API. In TileDB we used a similar API. In SciDB we used the lizing a B-tree for finding the chunks whose height grows log- cross between function, which performs a similar batch- arithmically with the number of chunks). However, SciDB’s ing of the random element reads. The cost in TileDB and performance deteriorates substantially with the chunk size, as HDF5 is the same. TileDB is up to 2 orders of magnitude for every update an entire chunk must be read from and written faster than SciDB. This is because TileDB and HDF5 read to the disk. Note that this is true in both SciDB and SciDB+Z. only the relevant cells to the query from the disk, whereas SciDB fetches entire tiles. Subarray. Figure 11(a) shows the read performance of all Figure 11(d) shows the performance of parallel random ele- systems (on a single instance), versus different subarray types, ment reads, as a function of the number of instances. 
Figure 11(b) plots the time as a function of the number of sequential tiles read. This experiment shows that all methods scale linearly with the number of tiles, maintaining their performance differences.

[Figure 11: Subarray performance for dense arrays. Panels: (a) time vs. subarray type (HDD), (b) time vs. # tiles (HDD), (c) time vs. # elements (HDD), (d) time vs. # instances (SSD).]

Figure 11(c) shows read performance when reading random elements instead of contiguous subarrays. For this experiment, we generated random coordinates for the elements, which we batched in HDF5 with the select_elements API. In TileDB we used a similar API. In SciDB we used the cross_between function, which performs a similar batching of the random element reads. The cost in TileDB and HDF5 is the same. TileDB is up to 2 orders of magnitude faster than SciDB. This is because TileDB and HDF5 read from the disk only the cells relevant to the query, whereas SciDB fetches entire tiles.

Figure 11(d) shows the performance of parallel random element reads, as a function of the number of instances. We fixed the number of elements to 100K, and divided them equally across the instances. In TileDB, we found that the Linux read call works better than mmap and, thus, we report the times using the former I/O method. The reason is that an mmap call is more expensive than a read call, which becomes noticeable on SSDs with sub-millisecond access times. TileDB matches the performance of HDF5, and both systems seem to scale with the number of instances. However, TileDB+Z scales better than HDF5+Z, becoming up to 7x faster than HDF5+Z. On the other hand, SciDB did not complete in a reasonable amount of time for any setting.
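The read-versus-mmap trade-off can be made concrete with a small POSIX sketch of fetching one data tile from an attribute file. The function names and file layout below are illustrative assumptions, not TileDB's actual I/O code:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Fetch 'len' bytes of a data tile starting at byte offset 'off' in an
     * attribute data file with a single pread: one system call, one copy. */
    static void *fetch_tile_pread(const char *path, off_t off, size_t len)
    {
      int fd = open(path, O_RDONLY);
      if (fd < 0) return NULL;
      void *buf = malloc(len);
      ssize_t n = pread(fd, buf, len, off);
      close(fd);
      if (n != (ssize_t)len) { free(buf); return NULL; }
      return buf;                       /* caller free()s */
    }

    /* The same fetch via mmap: the offset must be page-aligned and the
     * mapping must be set up and torn down, which costs more per call. */
    static void *fetch_tile_mmap(const char *path, off_t off, size_t len,
                                 void **map_base, size_t *map_len)
    {
      int fd = open(path, O_RDONLY);
      if (fd < 0) return NULL;
      long pg = sysconf(_SC_PAGESIZE);
      off_t aligned = off - (off % pg);
      *map_len  = len + (size_t)(off - aligned);
      *map_base = mmap(NULL, *map_len, PROT_READ, MAP_PRIVATE, fd, aligned);
      close(fd);                        /* the mapping survives the close */
      if (*map_base == MAP_FAILED) return NULL;
      return (char *)*map_base + (off - aligned);  /* caller munmap()s map_base */
    }

The mmap variant saves a copy, which is why it pays off for the compressed tiles of TileDB+Z, but the extra cost of setting up the mapping dominates for the small random reads on SSDs measured here.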
Effect of number of fragments and consolidation. In Figure 12(a) we show read time as a function of the number of fragments present in an array. We take the initial 4 GB array as the first dense fragment, and update it with a number of batches, where each batch contains 1000 randomly generated elements, creating a new sparse fragment. On the x-axis of the figure, 1 denotes the initial 4 GB dense fragment, 1+10, 1+100 and 1+1000 denote the initial fragment plus 10, 100 and 1000 sparse fragments, respectively, and 1C is 1+1000 after consolidation into a single fragment. We report the average time of 100 1K×1K subarray queries randomly positioned inside the array space. Even after inserting 100 fragments, the read performance deteriorates only by 7% (86 ms vs. 92 ms). After 1000 fragments, it becomes 2.8x worse. However, after consolidation, the read time becomes 84 ms again. Note that the memory consumption of TileDB is negligible, regardless of the number of existing fragments.

[Figure 12: Effect of # fragments in dense arrays. Panels: (a) subarray time (HDD), (b) consolidation time (HDD).]

Figure 12(b) depicts the consolidation time when varying the number of fragments. Consolidating 100 fragments into the 4 GB array takes the same time as the initial load, whereas merging 1000 fragments takes 3.4% longer. The memory consumption is 10 MB (a system parameter that determines the buffering size), regardless of the number of fragments being consolidated. As described in Section 4.3, the system can continue to process array operations while consolidation occurs.

Scalability. The last experiment for the dense case loads into TileDB and queries two large arrays with sizes 128 GB and 256 GB (we omit the plots due to space limitations), with tiles of size 2,500 × 1,000. The loading times were 1,815.78 s and 3,630.89 s, respectively, which indicates perfect scalability. We issued random 1K×1K subarray queries to both arrays (similar to those used in Figure 12(a)), and got times of 80 ms and 84 ms, respectively. Comparing these times to the 75 ms we observed in the case of the 6 GB array, we conclude that the read performance of TileDB is practically unaffected by the array size. Finally, the memory consumption upon loading in TileDB is negligible.

Vertica experiments. We conducted additional experiments with three different variants for implementing dense array management with Vertica, which offer different trade-offs: (i) we treated each dense element as a tuple, explicitly storing the X and Y element indices as table columns, and sorting first on X and then on Y (row-major order); (ii) similar to (i), but adding an extra column for the tile ID, and sorting on tile ID first, and then on X and Y (this imposes a global cell order identical to TileDB's); and (iii) we followed the RAM [26] approach and stored a cell ID per element instead of X and Y (i.e., a single extra column), where the cell IDs are calculated based on the TileDB global cell order. For all three alternatives, we tested compressing the extra columns with both GZIP and RLE. TileDB was consistently 2x-40x better in all settings versus all these solutions. The main reason is that Vertica performs operations on a cell basis, whereas TileDB operates on large batches of cells during reads and writes. Individual benchmarks are available in the TileDB GitHub repo.
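To make variants (ii) and (iii) concrete, the sketch below computes the extra tile-ID and cell-ID columns, assuming row-major tile and cell order, the 2,500 × 1,000 tile extents, and the 50,000 × 20,000 domain of the 4 GB dense array; the exact encodings used in the benchmark are an assumption here:

    #include <stdint.h>

    /* Tile ID and single cell ID for a 2D element (x, y). The constants
     * below are assumptions matching the dense experimental setup. */
    enum { TILE_ROWS = 2500, TILE_COLS = 1000, ARRAY_COLS = 20000 };
    enum { TILES_PER_ROW = ARRAY_COLS / TILE_COLS };   /* 20 tiles per tile row */

    static uint64_t tile_id(uint64_t x, uint64_t y)    /* variant (ii) column */
    {
      return (x / TILE_ROWS) * TILES_PER_ROW + (y / TILE_COLS);
    }

    static uint64_t cell_id(uint64_t x, uint64_t y)    /* variant (iii) column */
    {
      /* All cells of tile 0 precede those of tile 1, and so on; within a
       * tile, cells follow row-major order (i.e., a global cell order). */
      uint64_t in_tile = (x % TILE_ROWS) * TILE_COLS + (y % TILE_COLS);
      return tile_id(x, y) * (uint64_t)(TILE_ROWS * TILE_COLS) + in_tile;
    }

Sorting the table on (tile ID, X, Y), or on the cell ID alone, then arranges the tuples in the same global cell order that TileDB uses on disk.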
6.2 Sparse Arrays

We next focus on sparse arrays, comparing TileDB with Vertica+Z (gzip-compressed and following SRAM [19]) and SciDB on the AIS dataset. HDF5 is not optimized for sparse arrays, thus we omit it from these experiments.

Load. Figure 13(a) shows the load performance of the three systems. We created three datasets with sizes 6 GB, 12 GB and 24 GB by loading from the AIS dataset (originally in CSV format), after de-duplicating the coordinates. The dataset is organized based on the months of data collection; in TileDB, we called write for each month. In Vertica, we used the COPY statement (which bypasses the write-optimized store to write directly to the read-optimized store) to load each dataset into a relational table. Each tuple in this table corresponds to a (non-empty) array cell, where the X, Y coordinates are stored as separate database columns and constitute the composite key. Vertica sorts the data first on X and then on Y, effectively enforcing a row-major order on the array elements. In SciDB, we first loaded the data into a 1D array, and then invoked redimension followed by store. The tile extents for TileDB and SciDB were 10K×10K, and the data tile capacity for TileDB was 10K.

TileDB is once again more than an order of magnitude faster than SciDB. Moreover, TileDB+Z is always slightly faster than Vertica+Z. Note that we experimented with different tile extents (not shown here) and found that they do not considerably affect TileDB and SciDB.

[Figure 13: Load performance of sparse arrays. Panels: (a) time vs. dataset size (HDD), (b) time vs. # instances (SSD).]

Figure 13(b) plots the time to load the 6 GB dataset in parallel, versus the number of instances. Note that Vertica by default uses all the cores in the system; thus, when we vary the number of instances, we activate only as many CPU cores as the number of tested instances. The input files are evenly split by lines across the multiple processes in TileDB. In Vertica, the COPY command automatically utilizes the available cores to load from a single file. In SciDB, the evenly split files are saved inside the instance directories. TileDB, TileDB+Z and Vertica+Z scale nicely with the number of instances. The performance differences among the systems are similar to what was explained above for the serial case.
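The coordinate preprocessing mentioned above (row-major ordering and de-duplication) can be sketched in a few lines of C; this is only one simple way to do it, as the actual preprocessing pipeline is not described in the text:

    #include <stdint.h>
    #include <stdlib.h>

    /* A non-empty sparse cell: an AIS point mapped to array coordinates. */
    typedef struct { uint64_t x, y; } coord_t;

    /* Row-major comparator: first on x, then on y, i.e., the same order that
     * the (X, Y) composite sort key enforces on the Vertica table. */
    static int cmp_row_major(const void *a, const void *b)
    {
      const coord_t *p = a, *q = b;
      if (p->x != q->x) return p->x < q->x ? -1 : 1;
      if (p->y != q->y) return p->y < q->y ? -1 : 1;
      return 0;
    }

    /* Sort and drop duplicate coordinates in place; returns the new count. */
    static size_t sort_dedup(coord_t *c, size_t n)
    {
      if (n == 0) return 0;
      qsort(c, n, sizeof *c, cmp_row_major);
      size_t w = 1;
      for (size_t r = 1; r < n; ++r)
        if (cmp_row_major(&c[r], &c[w - 1]) != 0)
          c[w++] = c[r];
      return w;
    }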
Subarray. The next set of experiments evaluates the subarray performance. In this experiment, we identified two array regions, one dense (Los Angeles harbor) and one sparse (the middle of the Pacific ocean); henceforth, DQ (SQ) refers to the dense (sparse) domain area. In each of the following experiments, we selected a subarray with the target result size, and then derived from it 50 subarrays of similar size by shifting it using a random offset. Each query returns the X and Y cell coordinates that fall in the specified subarray. In TileDB, we called read once per subarray; in SciDB we batched all reads with the cross_between command; in Vertica we issued a SQL SELECT WHERE query. Once again, the results of Vertica and SciDB are written into temporary main-memory binary tables. In TileDB, we fix the data tile capacity to 10K.

Figures 14(a) and 14(b) show the subarray times for the dense and sparse areas, as a function of the result size (single instance on HDDs). TileDB is 1-2 orders of magnitude faster than SciDB, and at least as fast as Vertica in all settings. Note that because we cannot disable compression in Vertica, it is not fair to compare uncompressed TileDB to Vertica+Z.

[Figure 14: Subarray performance for sparse arrays. Panels: (a) DQ vs. result size (HDD), (b) SQ vs. result size (HDD), (c) DQ vs. # instances (SSD), (d) SQ vs. # instances (SSD).]

Figures 14(c) and 14(d) plot the subarray times for the dense and sparse areas, as a function of the number of instances (on SSD). In these two experiments, we evenly divided 320 random subarray queries (created inside the dense and sparse area, respectively) across the instances, and reported the total time. Each subarray returns 1M elements. TileDB+Z and Vertica+Z scale nicely with the number of instances and feature similar performance, whereas TileDB is more than an order of magnitude faster than SciDB in all settings.
Consolidate. We conducted a similar experiment to the dense case, drawing randomly the new cells from the AIS dataset. The observations are similar to the dense case (we omit the plot). The read time of TileDB deteriorates by only 18% after inserting 100 fragments, and by about 2x after 1000 fragments, but returns to normal after consolidation. Moreover, consolidation takes the same time as the original load (and, once again, can be performed in the background).

Acknowledgements. This work was partially supported under NSF grant IIS-1110370.

7. CONCLUSION

We presented the TileDB multi-dimensional array storage manager, which is optimized for both dense and sparse arrays. We described the basic concepts of TileDB, such as the organization of elements into tiles and fragments, its read and consolidation algorithms, and the effect of parallelism. We experimentally demonstrated that TileDB offers (i) orders of magnitude better performance than the HDF5 dense array manager for random element writes, while matching its read performance, (ii) better performance in all settings than the SciDB array database for both dense and sparse arrays, most of the time by several orders of magnitude, and (iii) an equivalent performance to the parallel Vertica columnar database on sparse arrays, while offering a programmer-friendly API-based interface similar to HDF5.

8. REFERENCES

[1] Apache Kylin. http://kylin.apache.org/.
[2] Broad Institute, Intel work together to develop tools to accelerate biomedical research. http://genomicinfo.broadinstitute.org/acton/media/13431/broad-intel-collaboration.
[3] Charm++. http://charm.cs.illinois.edu/research/charm.
[4] Enabling a Strict Consistency Semantics Model in Parallel HDF5. https://www.hdfgroup.org/HDF5/doc/Advanced/PHDF5FileConsistencySemantics/PHDF5FileConsistencySemantics.pdf.
[5] GenericIO. http://trac.alcf.anl.gov/projects/genericio.
[6] HDF5 for Python. http://www.h5py.org/.
[7] Legion Parallel System. http://legion.stanford.edu/.
[8] National Oceanic and Atmospheric Administration. Marine Cadastre. http://marinecadastre.gov/ais/.
[9] NetCDF. http://www.unidata.ucar.edu/software/netcdf.
[10] Parallel HDF5. https://www.hdfgroup.org/HDF5/PHDF5/.
[11] PLASMA. http://www.netlib.org/plasma/.
[12] PostgreSQL. http://www.postgresql.org/.
[13] PyTables. http://www.pytables.org/.
[14] ROMIO: A High-Performance, Portable MPI-IO Implementation. http://www.mcs.anl.gov/projects/romio/.
[15] ScaLAPACK. http://www.netlib.org/scalapack/.
[16] The HDF5 Format. http://www.hdfgroup.org/HDF5/.
[17] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The Multidimensional Database System RasDaMan. In SIGMOD, 1998.
[18] P. G. Brown. Overview of SciDB: Large Scale Array Storage, Processing and Analysis. In SIGMOD, 2010.
[19] R. Cornacchia, S. Héman, M. Zukowski, A. P. de Vries, and P. Boncz. Flexible and Efficient IR Using Array Databases. The VLDB Journal, 17(1):151-168, 2008.
[20] S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S. Mullender, and M. L. Kersten. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Engin. Bulletin, 35(1):40-45, 2012.
[21] A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, and C. Bear. The Vertica Analytic Database: C-store 7 Years Later. Proc. VLDB Endow., 5(12):1790-1801, 2012.
[22] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The Log-structured Merge-tree (LSM-tree). Acta Inf., 33(4):351-385, 1996.
[23] M. Rosenblum and J. K. Ousterhout. The Design and Implementation of a Log-structured File System. ACM TOCS, 10(1):26-52, Feb. 1992.
[24] F. Rusu and Y. Cheng. A Survey on Array Storage, Query Languages, and Systems. ArXiv e-prints, 2013.
[25] E. Soroush, M. Balazinska, and D. Wang. ArrayStore: A Storage Manager for Complex Parallel Array Processing. In SIGMOD, 2011.
[26] A. R. van Ballegooij. RAM: A Multidimensional Array DBMS. In EDBT Extended Database Technology Workshops, 2004.