Optimized Data Loading for a Multi-Terabyte Sky Survey Repositor

Advanced instruments in a variety of scientific domains are collecting massive amounts of data that must be post processed and organized to support research activities.Astronomers have been pioneers in the use of databases to host sky survey data. Increasing data volumes from more powerful telescopes pose enormous challenges to state-ofthe-art database systems and data-loading techniques.

1. Optimized Data Loading for a Multi-Terabyte Sky Survey Repository* † † †,‡ Y. Dora Cai , Ruth Aydt , Robert J. Brunner † National Center for Supercomputing Applications (NCSA) ‡ Department of Astronomy University of Illinois at Urbana-Champaign {ycai, aydt, rb}@ncsa.uiuc.edu Abstract FIRST and DENIS) have been built to house this data and to serve as valuable resources for astronomy researchers Advanced instruments in a variety of scientific domains are and the general public [13]. A repository for the Palomar- collecting massive amounts of data that must be post- Quest sky survey [8] is currently under construction at processed and organized to support research activities. NCSA. Astronomers have been pioneers in the use of databases to Several characteristics of sky survey data—such as rapid host sky survey data. Increasing data volumes from more data capture, massive data volume, and high data powerful telescopes pose enormous challenges to state-of- dimensionality—make data loading the first great challenge the-art database systems and data-loading techniques. in building advanced sky survey repositories. These In this paper we present SkyLoader, our novel framework characteristics are reflected in several demanding issues for data loading that is being used to populate a multi-table, that must be addressed when loading such data. First, data- multi-terabyte database repository for the Palomar-Quest loading speed must, at a minimum, keep up with data- sky survey. SkyLoader consists of an efficient algorithm acquisition speed. Second, it must be possible to populate for bulk loading, an effective data structure to support data multiple database tables from a single source file. Third, it integrity, optimized parallelism, and guidelines for system is often necessary to perform complex data transformations tuning. Performance studies show the positive effects of and computations during the loading process. Finally, these techniques, with load time for a 40-gigabyte data set automatic error recovery is required during the lengthy reduced from over 20 hours to less than 3 hours. data-loading process. Our framework offers a promising approach for loading In this paper we present SkyLoader, our optimized other large and complex scientific databases. framework for parallel bulk loading of a Palomar-Quest repository powered by an Oracle 10g relational database. 1. Introduction Our framework addresses all of the data-loading issues listed above through the development and application of the With the advent of computers, databases, data warehouses, following techniques: (1) an efficient algorithm to perform and World Wide Web technologies, astronomy research has bulk data loading, (2) an effective data structure to maintain been undergoing revolutionary changes. Advanced data- table relationships and allow proper error handling, (3) gathering technologies have collected tremendous amounts optimized parallelism to take full advantage of concurrent of digital sky survey data and many sky survey repositories loading processes, and (4) active database and system (e.g., SDSS, GALEX, 2MASS, GSC-2, DPOSS, ROSAT, tuning to achieve optimal data-loading performance. With systematic testing and refinement of the SkyLoader  framework we have significantly improved the data-loading *This work was supported in part by the National Science Foundation grants SCI 0525308, ACI-9619019, ACI-0332116 performance for the Palomar-Quest repository. Loading and by NASA grants NAG5-12578 and NAG5-12580. time for a 40-gigabyte data set was reduced from more than Permission to make digital or hard copies of all or part of this work for 20 hours to less than 3 hours on the same hardware and personal or classroom use is granted without fee provided that copies are operating system platform. not made or distributed for profit or commercial advantage, and that The issues outlined earlier are being faced not only by the copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, astronomy community, but also by other scientific requires prior specific permission and/or a fee. disciplines interested in building scalable databases to SC|05 November 12-18, 2005, Seattle, Washington, USA© 2005 ACM 1- house multi-terabyte archives of complex structured data. 59593-061-2/05/0011…$5.00 . 1 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

2. We firmly believe the experience gained in this study will 3. Data Loading Challenges and Approaches benefit other data repositories of massive scale. The large data-collection rates and volumes noted in the The remainder of the paper is organized as follows. Section previous section dictate the necessity for a fast data 2 presents a brief introduction to the Palomar-Quest sky repository loading process that is capable of keeping up survey. Section 3 describes data-loading challenges and over time with the speed of data acquisition. A number of approaches. Section 4 details our SkyLoader framework for factors contribute to the difficulty of achieving this goal. addressing these data-loading challenges. Section 5 describes our experimental platform and presents results Collected raw image data and computed catalog data are and analyses for a variety of performance studies. Section 6 usually archived in a mass storage system that is separate discusses our approach in comparison to related work, and from the database server. The catalog data that must be Section 7 presents conclusions and future directions. transferred from the mass storage system to load the database repository typically saturates the available network bandwidth, introducing the network as the first bottleneck 2. The Palomar-Quest Sky Survey to fast data loading. The Palomar-Quest sky survey is a collaborative endeavor Sky survey data encompasses information of many different between the California Institute of Technology, Yale types, from sky region specifications to the observed details University, Indiana University and the University of of tiny objects. This variety of information is interleaved in Illinois, being jointly led by Principal Investigators Charles the catalog data set that is generated when the raw image Baltay at Yale and S. George Djorgovski at Caltech. data is processed. During the data-loading process the Palomar-Quest is a multi-year, multi-wavelength synoptic complex catalog data must be parsed, the correct survey conducted at the Palomar-Quest Observatory located destination tables must be identified, and the data must be in north San Diego County, California. The survey camera loaded into multiple target tables in the repository. Loading consists of 112 Charge-Coupled Devices (CCDs) and can data into multiple tables is further complicated by the observe a third of the sky in a single night. In contrast to presence of multiple relationships among tables— traditional sky surveys, Palomar-Quest repeatedly scans the relationships that must be maintained by complying with the night sky. If we characterize traditional sky surveys as primary and foreign key constraints during the loading taking digital snapshots of the sky, the Palomar-Quest process. survey in contrast is taking digital movies [8]. The time element inherent in this survey allows astronomers to Additional operations are also performed during the data- statistically analyze the variable nature of our universe and loading process. These operations include transformations contributes to the volume, richness, and complexity of the to convert data types and change precision, validation to survey data. filter out errors and outliers, and calculation of values such as the Hierarchical Triangular Mesh ID (htmid) and sky The data-collection rate is 7.4 gigabytes/hour or coordinates to facilitate the science research [10] that the approximately 70 gigabytes/night, with a monthly average repository is built to enable. All such intensive operations of 12–15 nights of observing. Extrapolating, Palomar- place an additional burden on the loading process. Finally, Quest can collect approximately 1 terabyte of image data since data loading is typically a lengthy process, a per month, assuming ideal observing conditions. The mechanism of automatic recovery from errors is a basic image data captured by the telescope camera is further requirement. processed to produce catalog data totaling approximately 15 gigabytes/night. Since going into production in 2003, Each major relational database management system over 6 terabytes of raw image data have been archived at (RDBMS) vendor provides a utility to load data from a flat NCSA from which more than a terabyte of catalog data has file into a table. The Oracle system supports SQL*Loader, been derived. the MS/SQLServer system supports Data Transformation Services (DTS), and IBM DB2 supports a LOAD utility. Researchers from the Department of Astronomy and the However, these are proprietary tools that can only work National Center for Supercomputing Applications at the with the vendor’s own databases. Furthermore, they are University of Illinois at Urbana-Champaign have jointly primarily designed to quickly load data into a single designed and developed a data repository system powered database table without performing any data transformation. by an Oracle 10g relational database to archive, process, These data-loading services are not suitable for use with and distribute the Palomar-Quest sky survey catalog data to massive scale sky survey data. research collaborators. This paper focuses on the optimized loading of derived catalog data into the sky survey Several packaged data-loading tools are available on the repository. market, such as BMC Fast Import, FACT (CoSORT’s FAst extraCT), and DataSift. However, these data-loading tools are black boxes that generate programs which cannot be easily customized [1]. Some new bulk-loading techniques have been proposed [1, 4, 5, 9, 11]; however, all of these . 2 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

3. approaches are focused on bulk loading an index, such as simplify the diagram while still conveying the complexity B++-tree, Quad-tree and R-tree. Based on our experience of the model and inter-table relationships. and examination of the research literature, there is little Each table stores a unique aspect of the sky survey. For work on parallel bulk loading of huge amounts of data into example, metadata related to a night’s observation such as a multi-table database. telescope position, filters in use, and collection start time To meet the challenges in building the Palomar-Quest goes into the observations table. Metadata related to the repository, we have designed and implemented an CCDs such as CCD number and sky area covered goes into optimized framework, called SkyLoader, which consists of the table ccd_columns. Detailed information related to (1) an efficient algorithm to load data in bulk, (2) an observed objects goes into the objects table. effective data structure to maintain table relationships and handle errors, (3) optimized parallelism to take full advantage of concurrent loading processes, and (4) active database and system tuning to achieve optimal data-loading performance. Using this framework we can bulk load data in parallel, insert data into multiple database tables simultaneously without locking and constraints violations, and recover the loading process from errors. The SkyLoader framework has significantly improved the performance of data loading. We have been able to reduce the loading time for a 40-gigabyte data set from over 20 hours to less than 3 hours on the same hardware and operating system platform. Figure 1. Palomar-Quest Repository Data Model 4. The SkyLoader Framework In this section we present the design of our sky survey A primary key is defined in each table to force data repository and the details of our SkyLoader framework. uniqueness. Most tables have one or more foreign keys to maintain parent-child relationships. For example, a frame 4.1 Data Model and SkyLoader Tasks aperture is always related to a frame. The foreign key on the The raw images captured by the camera on the Palomar- table ccd_frame_apertures, which references the table Quest telescope are archived in NCSA’s Legato ccd_frames, enforces this constraint. The database table DiskXtender (UniTree) mass storage system. A program is sizes vary significantly. Some static metadata tables have run on the raw image data to extract catalog data, which less than 100 rows, while the objects table is expected to includes a wide range of information. Typically the catalog grow beyond a billion rows. data includes information on the telescope position, the sky Taking into account the data model and data characteristics region scanned, the parameters applied, the CCDs operated, of the Palomar-Quest repository, the SkyLoader framework the frames derived, and the objects captured. The catalog was designed to efficiently perform the following tasks information is first written to an ASCII file, which is saved using a parallel architecture: (1) read the data from the in the mass storage system and then uploaded to a catalog data files, (2) parse, validate, transform and repository database. The format of the catalog file varies compute data, (3) load data into the repository database and depending on the extraction program used. In most cases, distribute data to multiple tables, and (4) detect and recover different aspects of the catalog information are interleaved from errors in the data-loading process. in the file. For example, a row of frame information is followed by four rows of frame aperture information, and a 4.2 An Efficient and Scalable Bulk-Loading row of object information is followed by four rows of finger Algorithm information. Each row in the catalog data file usually has a tag or a keyword that can be used to determine the For massive volumes of sky survey data, it is crucial to destination table in the database. explore scalable data-loading techniques. The first such technique to explore is bulk loading. Bulk loading allows A commercial relational database, Oracle 10g, has been multiple insert operations to be packed into a single batch chosen to host the Palomar-Quest repository. The and performed with one database call, minimizing network repository database has been designed to store the catalog roundtrip traffic and disk I/O [17]. data and support data analysis. Figure 1 shows the data model for the database, which consists of 23 tables. Only It is straightforward to perform bulk loading to a single the table names and relationships are shown in Figure 1 to table, and most RDBMS system tools and some on-the- shelf software packages can accomplish this efficiently. However, it is nontrivial to bulk load multiple tables . 3 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

4. simultaneously due to the complicated relationships among model. Our approach does not cover circular parent-child the tables. If the data belonging to a child table is loaded relationships, as a good database design does not have before the corresponding parent keys, a foreign key circular dependencies between tables. constraint is violated. Our technique to avoid this problem is to first buffer the data into separate arrays designated for different tables, and then to follow the parent-child relationship order when performing the bulk inserts. The Input: a series of input data files Output: populated database tables parent table is loaded first, then the child table(s). This table-loading order is illustrated in Figure 2. int array-size /* the size of an array */ int batch-size /* the size of a batch; Parent Array Child Array Grandchild Array typically << array-size */ Procedure bulk_loading Step 1 (1) for each data file { (2) open the file Catalog Data Set (3) for each row { Step 2 Step 3 Step 4 (4) parse the row, do validation, transformation and Parent Table Child Table Grandchild Table computation, and buffer it in a designated array based on the destination table; (5) if (any array.size >= array-size) { (6) for each array ordered by parent-child relationship { (7) first_idx = 0; Loading must be in the order: Parent, Child, Grandchild (8) last_idx = array.size; (9) while (first_idx <= last_idx) Figure 2. Bulk Loading Order with Multiple Tables (10) first_idx = batch_row(array, destination_table, first_idx, last_idx) (11) } /* for each array */ Another difficulty in data loading is recoverability in a (12) } /* if reach array-size */ lengthy data-loading process. The catalog data set to be (13) } /* for each row */ loaded sometimes contains errors such as missing and/or (14) } /* for each data file */ invalid values. To make the loading process recoverable Function batch_row (array, destination_table, first_idx, from these errors, we use an array-index tracing technique last_idx) that can quickly detect errors, skip the problematic rows, (15) while (first_idx <= last_idx) { and resume the loading process immediately. (16) prepare SQL statement; For our SkyLoader framework we developed an efficient (17) add to batch; algorithm, bulk-loading, that enables bulk loading into (18) if (batch-size reached) { /* time to insert */ multiple tables. This algorithm not only speeds up data (19) insert batch into the destination table; (20) if (successful insert) { loading by a factor of 7 to 9, but also maintains the (21) first_idx += batch-size; relationships of multiple tables and enables the system to (22) } else { /* if an error occurred skip that row */ recover from errors during data loading. Our bulk-loading (23) skip_one_row; algorithm is presented in Figure 3. (24) return (the_next_index); The algorithm, bulk-loading, contains two user-tunable (25) } constants, array-size and batch-size, controlling the size of (26) } else if ( first_idx == last_idx) { /* array done */ (27) insert batch into the destination table; an array and the size of a batch, respectively. The procedure (28) if (successful insert) { bulk_loading (Line 4, in Figure 3) first parses a data row, (29) return (last_idx + 1 ); performs validation, transformation, and computation, and (30) } else { /* if an error occurred skip that row */ then buffers the data row into a designated array. This (31) skip_one_row; buffering step separates data into different arrays based on (32) return (the_next_index); the destination tables and is necessary to maintain the (33) } relationships between multiple tables and to facilitate error (34) } handling. We explain this step in more detail in the next (35) } /* while there are more rows to process */ section on our buffering data structure. When any data array reaches array-size (Line 5), the Figure 3. bulk-loading Algorithm batch_row procedure is called (Line 10) for each array based on the parent-child relationship. The array for the The function batch_row prepares the SQL statements (Line parent table is processed first, followed by the child tables. 16), adds the SQL command to a batch (Line 17) and This processing sequence depends entirely on the data makes a database call when batch-size is reached (Line 19). . 4 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

5. If no error is encountered (Line 21), the function loops and repacking the batch to continue each time that an error through the array and inserts all rows in batch into the is encountered. appropriate database table (Line 27). In the case of errors, Performance results demonstrating the benefits of bulk the function skips the error row, and returns the next array loading for our Palomar-Quest repository are shown in index to the calling procedure bulk_loading (Lines 23, 24 Section 5.1. The effects of the user-tunable constants and 31, 32). The bulk-loading procedure continues to make batch-size and array-size are presented in Sections 5.2 and calls to the function batch_row with a new index range 5.3 respectively. (Line 9 to Line 10) until all rows in the array are processed. Example 1: Loading a data set into two tables. Suppose 4.3 An Effective Data Structure to Buffer a sky survey repository has two tables: frames and objects. Data The table frames stores the frame information derived from an image and the table objects keeps the object information As discussed previously, the catalog data set used to load measured in each frame. A foreign key constraint between the Palomar-Quest repository contains rows of data frames and objects is enforced. Consider a case with 5 destined for multiple target tables in the database. This frames and 1000 objects interleaved in a data file. When interleaving of data for multiple target tables, combined packaging the SQL insert statements into a batch for bulk with the presence of multiple relationships among tables— loading, the rows with object information will reach batch- relationships that must be maintained by complying with the size first. However, if the object rows get inserted before primary and foreign key constraints during the loading the frame rows, the frames-objects foreign key constraint process—makes bulk loading especially challenging. To will be violated immediately. Using our bulk-loading manage the interleaved data and complex table algorithm, a batch-size of 40, and an array-size of 1000, relationships, and to facilitate quick recovery when an error this data set can be loaded efficiently and correctly through is detected during the data-loading process, we have the following steps: designed an effective data structure, array-set, in our framework. Step 1. Read data in and buffer the frame data into array1 The array-set data structure consists of a dynamically and the object data into array2. maintained set of two-dimensional arrays, each associated Step 2. When either array reaches array-size, 1000 in this with a destination table in the database. One dimension of example, bulk loading is triggered. In this example, array2 each array corresponds to table rows, and the other to table will fill up first. Despite that, the bulk loading proceeds by attributes. Arrays are cached in memory and used in the following the parent-child relationship order, meaning the bulk loading process as described in the previous section. rows in array1 are processed before array2. The number of arrays in the array-set at a given time during Step 3. If no error occurs, a single call to the function data loading depends on the degree to which the data in the batch_row will initiate bulk loads to insert all rows in an catalog data set is interleaved. As the input catalog data set array into a database table. Suppose row 45 in array2 has is processed, the framework creates a new array in array-set an error. Recall, we are using a batch-size of 40. The whenever it reads an input row targeted for a database table function batch_row inserts rows 1 to 40 in the first batch, for which no array is currently maintained. When any of inserts rows 41 to 44 in the second batch, skips row 45 the arrays in array-set are fully populated, bulk loading where the error occurs, and returns to the calling procedure, occurs. At the end of the bulk-loading cycle, the arrays in bulk_loading. Since the array was not completely loaded, array-set are destroyed and their memory released. The bulk_loading calls batch_row again for array2 with a new framework resumes reading the input catalog data and starting index, and loading proceeds with rows 46 to 85, creates new arrays as required to buffer the incoming table rows 86 to 125, and so on until all remaining rows in rows. array2 have been inserted. To reiterate and expand on the motivations for the array-set The bulk-loading algorithm has been implemented using structure, the Palomar-Quest catalog data set contains the JDBC™ core API. Let N denote the total number of various levels of information for a sky survey, and that rows in the data set. In the best case, that is when the data information is interleaved in a single data set with the set is error-free, the algorithm will generate N/batch-size relationships between levels embedded in the file. If data is database calls and result in N/batch-size database I/Os. In bulk-loaded by simply following the order of the data rows the worst case, for example primary key violations on every and starting bulk loads into various tables when a threshold row caused by repeatedly loading duplicate rows, bulk is hit, a foreign key constraint may be violated because the loading will deteriorate to a series of singleton insert referencing data may be loaded before the referenced data. operations which make N database calls and perform N In order to load the catalog data items into different database I/Os. This behavior results from the algorithm destination tables and retain the proper relationships, we breaking up the problematic batch, skipping the error row, use array-set to buffer the data and execute the bulk loading in the order of parent-child sequence. . 5 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

6. Error handling in bulk data loading is difficult. In the array-set data structure. We may also explore the use of JDBC core API, when an error is encountered during a bulk an overall “memory high water mark” that would trigger load, the remaining data in the batch is ignored. bulk loading and memory reclamation whenever the Furthermore, after the batch has been dispatched to the aggregate memory used by the cached arrays reached that database server, it is impossible to reapply it [12]. Since it size. is not unusual for sky survey data to have missing and/or invalid values due to the complexity of the collection and 4.4 Exploration of Optimized Parallelism processing pipeline, stringent data checking is performed by The Palomar-Quest survey collects a tremendous amount of the database to guard against hidden corruption, and errors data, with each observation generating 28 image data sets are detected during bulk loads fairly often. Quickly and each image data set containing the data collected by 4 recovering from an error in a single row and continuing to CCDs. The raw image data is processed to derive the insert data in the batch following that row is critical. catalog data, which is also organized in 28 files. The 28 Failure to recover properly could result in the loss of a huge catalog data files can be processed independently, and we amount of information and in an incomplete catalog currently load them in parallel from Radium, an NCSA repository. Failure to recover quickly will negatively impact Condor [16] cluster, to a centralized Oracle database overall loading time. powered by an 8-processor SGI Altix server that is one of The use of the array-set data structure allows us to solve NCSA’s TeraGrid resources. Parallelism enables multiple these problems. Buffering data in an array enables random processors to work simultaneously with the database server access of any data element. A row in a batch always maps and substantially improves the data-loading performance back to the source array. By detecting the error row in a [2]. batch during bulk inserts, our algorithm can quickly identify The optimal degree of parallelism, which we will refer to as the corresponding row in the source array, skip the error p, varies depending on the system resources and the running row, repack the batch, and continue the bulk data-loading applications. In an ideal environment with our 8-processor process from the row following the error. In addition to database server and well-matched Condor nodes and performing this runtime recovery, the data loading program network connectivity, we would expect 8 parallel loading saves the error rows in a separate database table that can be processes to fully utilize all CPUs on the database server. reviewed later by the domain scientists. The scientists can However, warning messages from the Oracle performance decide on a row-by-row basis to discard an error row or to monitor during our tests indicated that parallelism at this repair and reload it. level caused locking problems. All RDBMSs have a limit The tunable parameter array-size is one of the factors that on the supported number of concurrent transactions, and the effects bulk loading performance. A large array-set may complex nature of database locking makes it difficult to consume too much memory on the client machine and cause pinpoint in advance exactly when lock contention will excessive memory paging. This slowdown on the client become an issue for a particular workload. Even without where the loading process is running is reflected in the locking issues, the performance gain in data loading is degraded loading performance on the database server. On usually not proportional to the degree of parallelism. the other hand, an array-size value that is too small may Parallel processing introduces some system overhead that increase the overhead for array initialization and limits the performance benefit to less than perfect speedup. population. In our framework, we adjusted array-size In cases such as the Palomar-Quest sky survey repository based on the system resources and data characteristics to where data loading is a critical and ongoing activity, it is achieve optimal performance. Results of the performance worth the time to conduct careful experiments to determine studies that we performed to select an optimal array-size p, the optimal degree of parallelism. Methodical value are presented in Section 5.3. experimentation—even when the detailed database system Our current SkyLoader framework uses a single array-size implementation is unknown—can help identify the best user-tunable constant to control the number of rows in all possible degree of parallelism. In our framework, we have memory-resident arrays used to cache table data prior to parallelized the data loading according to the number of bulk loading. Since the systems that we are using to run the processors available, the underlying data characteristics, client data-loading processes have generous memory and the results of our performance studies. Performance configurations, our primary consideration was to implement study results are shown in Section 5.4. a solution quickly rather than to carefully minimize the When scheduling the loading of the 28 catalog data files space needed by the array-set data structure. We plan to associated with an observation, we recognize that these files revisit this implementation and make use of a configuration vary in size and, consequently, in loading time. Because of file to support arrays with variable number of rows. By this variation, we adopt a master/worker scheduling model understanding the structure of our catalog data set and the where p Condor worker nodes are employed in parallel, interleave pattern of the rows there, we can make more each loading one catalog data file, until no unloaded catalog intelligent memory-management decisions regarding the data files remain. Running the condor_submit_dag program . 6 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

7. with –maxjobs p specified implements the desired schedule, 4.5.3 Reduce I/O Contention and the overall runtime is shorter than it would be if the Bulk data loading is typically I/O bound. To reduce I/O data sets were divided evenly among the Condor nodes. contention, we distributed the database (1) data files and This load-balancing methodology also helps minimize the temporary files, (2) indices, and (3) logs onto three separate overall data-loading time when one or more data files have RAID devices. a higher-than-average frequency of errors that slow the loading process. 4.5.4 Presort Data In our framework, the data files are sorted by the primary 4.5 Active Database and System Tuning keys of the tables prior to data loading. This sorting is done Many factors impact the performance of data loading. In as a byproduct of the processing that extracts the catalog our SkyLoader framework, we performed active database data from raw images. Through improvement of the and system tuning to achieve the best possible clustering factor on the disk data, this presorting procedure configuration. Such performance tuning is crucial to reduces disk I/O contention, especially if Index Organized achieve the fast loading of massive data volumes required Table (in Oracle) or Clustered Index (in MS SQLServer) for our sky survey repository. We believe our experience features are used [7]. will benefit others who are faced with loading large quantities of scientific data from various disciplines into 4.5.5 Manage Memory Allocation relational database systems. In our experiments, we discovered that allocating a smaller database data cache actually improves the data-loading 4.5.1 Delay Index Building performance. Since a database writer needs to scan the Since the Palomar-Quest sky survey is a multi-year entire data cache when writing new data from data cache to continuous effort, the survey repository must serve two disk, the reduced data cache size minimizes the work that purposes at the same time. First, it must be a warehouse to the database writer has to do each time [7]. This reduced store incrementally loaded data. Second, it must act as a cache configuration should be adjusted after the intensive query engine to support scientific research. For optimal data-loading phase is complete because a larger data cache query performance, it is necessary to create indices on usually performs better for user queries. database tables. However, indices usually make data loading slower because every insert requires an update of 5. Performance Analyses all index entries [6]. In this section we report and analyze various aspects of our Our tests showed that locking tends to happen more performance studies on parallel bulk loading of data into frequently at a lower degree of data-loading parallelism the Palomar-Quest sky survey repository. when indices are present. Based on our experiments, presented in Section 5.5, the impact of indices on data All experiments were conducted in a client-server loading varies depending on the type and size of the index environment. All data-loading programs (clients) were and on the pattern of the index keys. Because of these implemented using Java™ and communicate with the findings, we dropped most secondary indices to speed up database server using JDBC. The loading programs were the data loading. Once the catch-up phase of loading is initiated from multiple nodes of an NCSA Condor cluster. complete and load time is not as critical, these secondary All nodes in the cluster are dual CPU (1.5 GHz Intel indices will be rebuilt and kept current as subsequent data is Pentium III processor), 1 gigabyte RAM, Linux servers. collected and loaded. Recognizing the need to balance load An Oracle 10g database runs on an 8-processor (1.3 GHz time and query performance, some very selective indices Intel Itanium 2), 64-bit SGI Altix machine running SGI that are crucial to the scientific research queries, such as the Linux Propack 3 with 12 gigabytes of memory. The Altix index on htmid, have been maintained during the intensive has a single Gigabit Ethernet interface and two Qlogic data loading phase. FibreChannel host-bus-adapter cards. The Qlogic cards are used to access the disk environment via a storage area 4.5.2 Reduce Frequency of Transaction Commits network (SAN). The disk environment is comprised of A commit command in data loading permanently writes the three separate Data Direct 8500 disk controllers, each loaded data to the database. The RDBMS must perform a hosting 10 terabytes of RAIDed SATA disks for a total of considerable amount of processing when a transaction 30 terabytes of storage. 2-gigabit FibreChannel commits [7], but infrequent commits can lead to large redo components are used throughout the environment. and undo logs, and lengthen the time needed to recover the All tests were performed using the same data model and database in the event of a hardware failure. In our load identical sky survey catalog data extracted from a framework, we chose to execute commits very infrequently single night’s observation. All constraints, including during the loading of catalog data, resulting in a significant primary key constraints, foreign key constraints, unique performance increase. constraints, and check constraints were maintained in the . 7 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

8. data loading process. Tests were performed on an empty The optimal batch size for the testing data set lies in the database unless otherwise noted. The time reported is the range between 40 and 50. The optimal batch size varies runtime of the data-loading process as measured on the depending on the patterns of the catalog data file and the database server. underlying data model. Even with this variation, experimenting with a variety of batch sizes and choosing 5.1 Bulk Loading vs. Non-Bulk Loading one that is close to optimal for a typical data file can Figure 4 shows the runtime of bulk loading versus non-bulk improve performance markedly over a random choice. loading when a single loading process is used. For the bulk-loading process, the bulk_loading algorithm described 5.3 Array Size earlier was used with a batch-size of 40. For the non-bulk Figure 6 shows the runtime of the bulk_loading algorithm loading case, a series of individual SQL insert statements with respect to the array size. The array-set resides in the were issued. The runtime of both approaches is memory of the client host. A change to the array-size proportional to the input data size. As shown in Figure 4, parameter effects the memory hit ratio and the paging rate the bulk_loading algorithm is much faster than the on the client host, resulting in runtime variations on the individual inserts. However, although the batch-size was database server. Increasing the size of the array allows 40 we see a speedup of only 7 to 9 (not 40). This more data to reside in memory and can speed up the data discrepancy indicates that bulk loading incurs some loading. However, on our system configuration, this benefit overhead. The batch-size of 40 was determined to be was lost when array-size was increased beyond 1000. This optimal through a series of performance tests using loss occurred because the high paging rate at those settings different values for batch-size. deteriorated the data-loading performance. These findings indicate that the array-size should be adjusted based on the client system configuration, the characteristics of the data Bulk vs. Non-Bulk model, and the interleave factor of the data being loaded. 18000 The latter two combine to determine the incremental 16000 amount of memory used when the array size is increased. Runtime (Seconds) 14000 Bulk 12000 Non-Bulk 10000 8000 6000 Effect of Array Size 4000 (loading a 200 MB data set) 2000 270 0 Runtime (Seconds) 200 400 600 800 1000 1200 260 Size of Data loaded into the Database tables (MB) 250 Figure 4. Runtime of Bulk and Non-Bulk Loading 240 230 5.2 Batch Size 250 500 750 1000 1250 1500 Array Size Figure 5 shows the runtime of the bulk_loading algorithm with respect to the batch size used for the bulk loads. Figure 6. Effect of Array Size on Runtime Initially, increasing the batch size decreases the loading time. However, the benefit lessens as the batch size 5.4 Parallel Data Loading continues to increase. Figure 7 shows the loading throughput for varying numbers of concurrent bulk-loading processes running on separate Effect of Batch Size nodes of the Condor cluster. A batch-size of 40 was used (loading a 200 MB data set) for these bulk loads. The throughput goes up almost linearly 350 when six or fewer degrees of parallelism are used. Since Runtime (Seconds) the database server has eight processors, one might expect 300 the data-loading performance should be optimal when eight 250 loading processes are run concurrently. Unfortunately, our performance tests have shown this is not the case. The 200 data-loading throughput peaked at a parallel degree of 6–7 10 20 30 40 50 60 in our studies. Benefits decreased after that, and, very Batch Size infrequently even 6 parallel loads caused stalls and dramatic degradation in the overall throughput. Figure 5. Effect of Batch Size on Runtime As the degree of parallelism increased, we observed escalating occurrences of database locks, indicating we . 8 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

9. were hitting the RDBMS limit on the number of concurrent user queries. With ongoing database tuning, we have been transactions. The relationship between the observed able to maintain the single-integer attribute index without database locks and the database concurrent transaction limit incurring increased load-time overhead as the database size is not an intuitive one. Our results indicate that various has grown. We have delayed the composite index creation degrees of parallelism should be tested in the environment until the intensive data-loading phase is complete. for typical data loading tasks to maximize the power of parallel processing while avoiding database lock Effect of Indices contention. Because of the infrequent but very long stalls 1400 No Indices Runtime (Seconds) we observed when running with an “optimal” parallel 1200 Index on 1 integer attribute 1000 degree of 6, we chose to run 5 concurrent loading processes Index on 3 float attributes 800 in our production SkyLoader framework. 600 400 200 Effect of Parallelism 0 4 200 400 600 800 1000 1200 Throughput (MB/Second) 3.5 Data Size (MB) 3 2.5 2 Figure 8. Effect of Indices on Runtime 1.5 1 5.6 Database Size 0.5 0 In our final performance experiment, we explored the effect 1 2 3 4 5 6 7 8 of database size on bulk data-loading runtime. With some Number of Parallel Data Loading Processes loading techniques, the time required to load the same amount of data increases as the size of the database grows. Figure 7. Effect of Parallelism on Throughput This characteristic, while tolerable with relatively small and fairly static databases, is very problematic with databases 5.5 Attribute Indices that are large and growing. The results shown in Figure 9 Figure 8 shows the impact of various indices on the runtime indicate that with our SkyLoader framework the database of bulk data loading. We have experimented with three size has no significant impact on data-loading time when scenarios. In the first, no indices were created. In the indices are disabled. Loading time for a 200 megabyte second, one index was built and maintained on a large dataset remains constant as the size of the database grew integer attribute. In the final scenario, one index was built from 50 gigabytes to 300 gigabytes. The Palomar-Quest and maintained on three float attributes. All experiments database size currently exceeds 1.5 terabytes and we have were carried out on an empty database, with six different not seen a decrease in loading speed even at this scale. data set sizes bulk-loaded by a single process. Our experiments confirm that indices do slow down the data-loading process, but to varying degrees. The single- Effect of Database Size (loading a 200 MB data set) integer attribute index had an almost undetectable average performance impact of 1.5% over the six data set sizes. In 400 350 Runtime (Seconds) contrast, the composite index built from three float 300 attributes causes a significant performance degradation 250 averaging 8.5%. 200 150 In these tests, as the size of the data sets increased, the 100 performance degradation attributable to the indices tended 50 to increase as well. When parallel loads are being done, 0 maintenance of indices introduces more concurrent 50 100 150 200 250 300 transactions (and locks) into the bulk-load process. Database Size (GB) Furthermore, in our early prototype environment as the size of the Palomar-Quest repository grew, the overhead that the Figure 9. Effect of Database Size on Runtime indices placed on the data-loading performance became increasingly worrisome. 6. Discussion Based on the results of these studies and our observations Similar to other database repositories, a considerable with the larger database, we elected to maintain only the amount of the effort that has gone into building the single-integer attribute index during data loading to support Palomar-Quest sky survey repository has been focused on . 9 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

10. data loading. Much time and energy has been invested in and proper error handling during the loading process, exploring new approaches to improve the data-loading support for optimized parallelism, and guidelines for performance, and in performance studies to identify optimal database and system tuning. Our framework has taken parameter settings for our configuration and data model. In advantage of high-performance computing and parallel- this section we compare and contrast our data-loading processing resources, and has made the building of a approach to that of the Sloan Digital Sky Survey. terabyte-plus repository a reality. With this framework, we One of the most successful sky survey repositories, the have decreased the loading time for a 40-gigabyte data set Sloan Digital Sky Survey (SDSS), utilizes a framework [13, from over 20 hours to less than 3 hours, running on the 14, 15] to perform most of the data-loading steps in same hardware architecture and operating system. parallel, on a cluster of SQL Server nodes, using distributed Looking ahead, we are interested in collaborating with transactions. The SDSS data loading is a two-phase load: other scientists to apply our data-loading framework to their the data is first loaded into Task databases, not exceeding large scientific database problems. While some of the 20-30 gigabytes each. Then the data is fully validated framework is specific to the Palomar-Quest repository, we before being published to its final destination in the Publish believe that much of it is directly applicable to other fields. database. In the SDSS framework, the catalog data is Although our efforts to date have involved loading disk- converted to comma-separated-value ASCII files before the based data sets, we anticipate that with some modification two-phase loading begins. The data in each comma- our framework will also be able to load streaming data in separated-value file is associated with a single database near-real-time. The loading of near-continuous streaming table. The data is bulk loaded into MS/SQLServer using data presents more stringent rate and reliability Data Transformation Services (DTS). By carefully ordering requirements, and we are anxious to explore applications in the loading sequence of the ASCII files, the table this domain. It is our hope and expectation that the relationships in the database are maintained. knowledge gained and optimization techniques developed Our framework differs significantly from the SDSS in working with the Palomar-Quest project will allow us to approach. In the SkyLoader framework, all data-loading quickly adapt the code to other operating environments and tasks, including data validation, transformation, data models. computation and insertion, are performed in a single pass. As part of the effort to make our framework more portable Since our approach does not split the data into multiple data and tunable, we plan to revise the array-set data structure to files based on the destination database tables, nor does it take advantage of the memory-saving configuration options require an intermediate temporary database be loaded prior discussed in Section 4.3. The use of configuration files to to loading the permanent database, we believe our approach control array-set initialization will not only lower client can be more efficient. That said, due to the incompatibility memory requirements, but also make the framework more of these two repositories, we are unable to conduct a direct adaptable for use with data sets other than the Palomar- performance comparison to test this hypothesis. Quest sky survey. Another difference between the SDSS data-loading process In addition, we plan to explore database-hosting and that of the SkyLoader framework is that SDSS relies on architectures and Oracle RAC technology to see how they a proprietary tool, DTS, while our framework is scale for databases of the Palomar-Quest magnitude and implemented using Java. The use of Java makes our complexity. For many projects, the option of hosting a framework platform-independent, portable and extensible. production database on a cluster configuration that can be scaled up as the data size and usage increases is an 7. Conclusions and Future Work attractive one—provided performance and stability are not sacrificed. The Palomar-Quest sky survey is a multi-year project to collect, archive, process, and distribute survey data for Finally, we will continue our collaboration on the Palomar- research collaborations. The repository being built at Quest catalog repository. With the data-loading phase NCSA to hold catalog data for the Palomar-Quest survey under control, we will turn our attention toward tuning the currently exceeds 1.5 terabytes in size, and is expected to database to meet the needs of the scientists who will be ultimately exceed 5 terabytes. The first significant challenge submitting queries through web interfaces, as well as this project faced was to load the catalog data into the programmatically from scientific codes executing on high- repository database in a timely fashion. Parallel bulk end compute resources such as those provided by the loading with array buffering has proven to be a viable TeraGrid. Our ultimate goal is to enable new scientific approach to address this challenge. discoveries through the effective coupling of compute and data technologies. We have proposed and implemented our SkyLoader framework to realize this data-loading goal. Our framework consists of an efficient algorithm for bulk loading, an effective data structure to support data integrity . 10 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE

11. Acknowledgements Andrews, A. Szalay, R. Brunner, J. Musser. “Palomar- QUEST: A case study in designing sky surveys in the VO We would like to thank Michael Remijan, Adam Rengstorf, era”. Astronomical Data Analysis Software and Systems Nicholas Waggoner, and Brian Wilhite, members of the XIII, ASP Conference Series. Vol. 314, 2004. NCSA Laboratory for Cosmological Data Mining, for their work in parallel data loading using the NCSA Condor [9] S. Leutenegger, D. Nicol. “Efficient Bulk-Loading of cluster. We would also like to thank Michelle Butler, Chad Gridfiles”. IEEE Transactions on Knowledge and Data Kerner, and Chris Cribbs, members of the NCSA Storage Engineering, 9(3):410-420, 1997. Enabling Technologies Group, for their assistance with [10] W. O’Mullane, A.J. Banday, K.M. Gorski, P. Kuntz, tuning the SGI Altix system and disk environment. Finally, A.S.Szalay. “Splitting the Sky – HTM and HEALPix”, we would like to thank the members of the Palomar-Quest Mining the Sky, Banday et al ed. Springer, 2000, p639-648. collaboration for their dedication and hard work that has [11] A. Papadopoulos, Y. Manolopoulos. “Parallel bulk- produced the rich dataset we have leveraged in the work loading of spatial data”. In Parallel Computing described in this paper. 29(10):1419-1444, Oct. 2003. [12] G. Reese. Database Programming with JDBC and References Java. O’Reilly. 2nd Edition, Aug. 2000. [1] S. Amer-yahia and S. Cluet. “A Declarative Approach [13] A. Szalay, P. Kunszt, A. Thakar, J. Gray, R. Brunner. to Optimize Bulk Loading into Databases”. ACM “Designing and Mining Multi-Terabyte Astronomy Transactions on Database Systems, Vol. 29, Issue 2, June Archives: The Sloan Digital Sky Survey”. In Proc. 2004. SIGMOD, Austin, TX, May 2000. [2] T. Barclay, R. Barnes, J. Gray, P. Sundaresan. “Loading [14] A. Szalay, J. Gray, A. Thakar, P. Kunszt, T. Malik, J. Databases Using Dataflow Parallelism”. SIGMOD Raddick, C. Stoughton, J. vandenBerg. “The SDSS RECORD, 23(4), Dec. 1994 SkyServer-Public Access to the Sloan Digital Sky Server [3] T. Barclay, J. Gray, D. Slutz. “Microsoft TerraServer: A Data”. Microsoft Technical Report. MSR-TR-2001-104, Spatial Data Warehouse”. In Proc. SIGMOD, Austin, TX, Nov 2001. May 2000. [15] A. Szalay, J. Gray, A. Thakar, B. Boroski, R. Gal, N. [4] J. Berchen, B. Seeger. “An Evaluation of Generic Bulk Li, P. Kunszt, T. Malik, W. O’Mullane, M. Nieto- Loading Techniques”. In Proc. 27th VLDB Conference, Santisteban, J. Raddick, C. Stoughton, J. vandenBerg. “The Rome, Italy, 2001. SDSS DR1 SkyServer, Public Access to a Terabyte of Astronomical Data”. http://cas.sdss.org/dr3/en/skyserver. [5] C. Bohm and H. Kriegel. “Efficient Bulk Loading of Large High-Dimensional Indexes”. In Proc. Int. Conf. Data [16] D. Thain, T. Tannenbaum, M. Livny. "Condor and Warehousing and Knowledge Discovery (DaWak), 1999. the Grid". in Grid Computing: Making The Global Infrastructure a Reality. Fran Berman, Anthony J.G. Hey, [6] D. Burleson. “Hypercharge Oracle data load speed”. Geoffrey Fox, editors. John Wiley, 2003. http://www.dba-oracle.com/oracle_tips_load_speed.htm [17] J. Wiener, J. Naughton. “Bulk Loading into an OODB: [7] D. Burleson. “Hypercharging Oracle Data Loading”. A Performance Study”. In Proc. 20th VLDB Conference. http://www.orafaq.com/articles/archives/000020.htm Santiago, Chile, pp. 120–131, 1994. [8] M. Graham, R. Williams, S. Djorgovski, A. Mahabal, C. Baltay, D. Rabinowitz, A. Bauer, J. Snyder, N. Morgan, P. . 11 Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC’05) 1-59593-061-2/05 $20.00 © 2005 IEEE