A Language for Manipulating Arrays
展开查看详情
1. A Language for Manipulating Arrays Arunprasad P. Marathe and Kenneth Salem Department of Computer Science University of Waterloo Waterloo, Ontario N2L 3G1 Canada fapmarath,kmsalemg@uwaterloo.ca which removes the part of an image that lies outside of a speci ed clip region. Expressions such as Abstract select g(f(x; threshold); clipregion) This paper describes the Array Manipulation from <relation> Language (AML), an algebra for multidimen where <condition> sional array data. AML is generic, in the sense that it can be customized to support a wide can be used to retrieve clipped, thresholded versions variety of domainspeci c operations on ar of image attribute x from the speci ed relation. rays. AML expressions can be treated declara Ideally, nonrelational expressions such as the one tively and subjected to rewrite optimizations. appearing in the select clause above would be To illustrate this, several rewrite rules that treated declaratively and optimized. For example, exploit the structural properties of the AML f(g(x; clipregion); threshold) generates the same re operations are presented. Some techniques sult as the original expression. The latter form may for e cient evaluation of AML expressions are be less costly to evaluate, since only the clip region, also discussed. rather than the entire image, needs to be thresholded. Currently, most objectrelational systems do not per 1 Introduction form such optimizations, although there is certainly It has become widely recognized that database sys interest in doing so 10, 12]. Such optimizations are tems should support nontraditional data types, such important because objects may be large, and their as sequences, images, and video. Objectrelational methods may be expensive to evaluate. In fact, the cost of a nonrelational subexpression in a relational database systems currently support such data through query may easily dominate the cost of evaluating the userde ned data types and their associated methods. query. These methods can be applied to selected data, or can To optimize such expressions, they must be written be used in selection or join conditions. For example, in some language. In this paper, we propose a sim suppose that 16bit grayscale images have been de ple language for multidimensional array data, called ned as a database type and that two methods are de the Array Manipulation Language (AML). AML is an ned for this type: f is a thresholding function which algebra in the sense that the relational algebra is an replaces each pixel value above a speci ed threshold algebra. Its operators operate on arrays and generate value with the threshold, and g is a clipping function arrays. Permission to copy without fee all or part of this material is Arrays are an important class of data. Obviously, granted provided that the copies are not made or distributed for raster images are twodimensional arrays, and can be direct commercial advantage, the VLDB copyright notice and manipulated by the AML operators. Arrays of three the title of the publication and its date appear, and notice is or more dimensions are also very commonly found in given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee scienti c data sets. For example, multispectral satel and/or special permission from the Endowment. lite images can be treated as arrays with two spatial Proceedings of the 23rd VLDB Conference and one spectral dimension. Video data can also be Athens, Greece, 1997 thought of in terms of multidimensional arrays. One Page 1
2.indication of the importance of array data in the scien ti c community is the proliferation of lebased data nighttime array management packages, such as CDF 9], NetCDF 11] and HDF 14], that support array data. These le based packages arose to ll a datamanagement vac A daily array uum that existed because of the inability of older database management systems to handle bulky array data. daytime array C In this paper, we de ne an array data model and a small set of AML operators based on this model. One of the AML operators is apply, which applies a user B de ned function to an array in a particular way. AML D is generic and customizable in the sense that its ap dim. 1 ply operator can work with any userde ned function. dim. 0 daily lowres array Because there are so many possible array operations, many of which are domainspeci c, any general pur Figure 1: Arrays in an Example Database pose array language should have some facility for ex tension or customization. Thus, this is an important ture array is de ned by taking the average of the day feature of AML. time and nighttime temperature at each point. We We also show that AML expressions can be treated also de ne a reducedresolution version of the daily declaratively, and subjected to rewritebased optimiza array, obtained by dividing the daily array into non tions. To illustrate the possibilities, we de ne several overlapping 4 4 chunks, and replacing each chunk useful rewrite rules. These rules are very general, in with a single value, the average of the values within the sense that they exploit only the structure of ar the chunk. rays, and \structural" properties of the AML opera Using the AML operations to be de ned in Sec tors, including apply. For example, if y = f(x), we tion 4, a daily array (C) can be de ned in terms of a utilize the knowledge that a certain part of array y nighttime array (A) and a daytime array (B) as is computed using data from certain part of array x, but we do not care about the computation itself; the C = apply (merge2 (A; B; 10); f; (1; 1; 2)) rules treat it as a black box. The advantage of this approach is that a single rule can apply to any func where f is a userde ned function that maps two ar tion with the same \structural" properties as f. Of ray values to a single average value and (1; 1; 2) is a course, this does not preclude rewrite rules that uti \shape" that helps determine how f is to be applied. lize knowledge of the computation being performed, The \10" is a bit pattern that indicates that the merge but such rules are speci c to a particular function or is to be performed by interleaving one slab of A fol class of functions. lowed by one slab of B. In e ect, merge operator In Section 6 we discuss some of the issues that arise takes the nighttime and daytime arrays and stacks in evaluating AML expressions. These issues include them one atop the other (in dimension 2, as indicated pipelining of AML operators, limiting memory usage, by the subscript of merge) to produce a single three and reducing the costs associated with materializing dimensional array. The apply operator then applies f intermediate results. These are important issues be to each 1 1 2 subarray to produce the average tem cause arrays may be very large. perature values. Each such value becomes one element of the resulting daily array C. Similarly, the lowresolution array (D) can be de 2 An Illustrative Example ned in terms of C as The simple array database illustrated in Fig. 1 will D = tiled apply (C; g; (4; 4)) be used as a running example throughout the remain der of the paper. The database includes several types where g is a function that will be used to map 4 4 sub of twodimensional arrays describing air temperature. arrays of C to a single value, namely the average of the The dimensions of these arrays can be thought of as sixteen values in the 4 4 array. The tiled apply op longitude (dimension zero), and latitude (dimension eration breaks C into nonoverlapping 4 4 subarrays one). There are two arrays per day, one describing and applies g to each to produce one of the values in nighttime temperatures, the other describing daytime D. The tiled apply operation is actually de ned as temperatures. a special case of the more general AML apply opera For each day/night pair of arrays, a daily tempera tion. Page 2
3. The primary purpose of this simple example is to illustrate the behavior of the AML operations. In gen a subarray of a slab along eral, an almost limitless variety of array transforma A at x dimension 0 (1) tions can be imagined. For example, we might have chosen a more sophisticated lossy compression tech nique, such as JPEG 15], with which to de ne the re duced resolution version of the daily array. The AML array A apply operator makes it easy to do this. x [1] AML describes logical relationships among arrays. Fig. 1 can be seen as a sort of schema. In particu lar, the daily array can be seen as a view de ned in terms of the daytime and nighttime arrays. The view dim. 1 x [0] dim. 0 de nition is the AML expression, given above, which maps the daytime and nighttime arrays to the daily Figure 2: Subarrays and Slabs array. Similarly, the lowresolution array is a view of the daily array. We have said nothing, at least at this De nition 3.2 The dimensionality of array A, writ point, about the physical representations of these ar ten dim(A), is the smallest i such that A~ j] = 1 for all rays. It may be that the daily array is physically stored j i. If there is no such i, then dim(A) is unde ned. by laying out its values in a le in rowmajor order, or it may be stored in some more compact, compressed form. Alternatively, it may not be materialized at all, QDe1 nition ~A i]. 3.3 The size of array A, written jAj, is i=0 as it can be derived when necessary from the corre sponding daytime and nighttime arrays. We will restrict ourselves to arrays of nite size. How ever, it will sometimes be convenient for us to think of 3 Data Model and Terminology arrays as having in nite length in all dimensions. For Throughout this paper we will use a vector arrow, as this purpose, we de ne A ~x] = NULL for all points in ~x, to denote in nite vectors of integers. The usual ~x that are not in A, where NULL is some value not notation ~x i] will refer to the element with index i. found in DA Expressions involving operations on vectors, such as An array having a length of zero in one or more di ~z = b~x=~yc refer to elementbyelement application of mensions is called a null array. Such arrays have zero the operation, i.e., ~z i] = b~x i]=~y i]c. size and since there are no points in a null array, it An array has a shape and a domain. We will con is considered to have the value NULL at every point. sider arrays to have an in nite number of dimensions, We will also need a notion of subarray. A subarray numbered from zero. Each array dimension is indexed is simply an array that is wholly contained within an by the nonnegative integers, i.e., indexing starts at other, as shown in Fig. 2. We will identify the position zero. A shape is an in nite vector of nonnegative of the subarray within the containing array by the po integers which de nes the array's length in each di sition of its smallest point, as shown in the gure. mension. When it is necessary to write a particu De nition 3.4 Let A and B be arrays, and let ~x be a lar shape, the shape's elements will be parenthesized. vector in A. Array B is called a subarray of A at ~x i All elements not listed explicitly are assumed to be DB = DA , and for every point ~y in B , B ~y] = A ~x +~y ]. ones. Thus, the shapes (1; 1; 2) and (4; 4) that were used in the examples in Section 2 denote the in nite Finally, we de ne informally the notion of a slab of an vectors (1; 1; 2; 1; 1;1; ) and (4; 4; 1; 1; 1; ), respec array along dimension i. A slab is simply a slice of unit tively. The domain of an array is a set of possible width through an array along the speci ed dimension. values, one of which is present at each indexed point This is also illustrated in Fig. 2. within the array. De nition 3.1 An array A consists of a shape A~, a 4 The Array Manipulation Language domain DA , and a mapping MA . The ith element of The Array Manipulation Language (AML) consists of A~ represents the length of the array in dimension i. A three operators which manipulate arrays. Each oper vector ~x is de ned to be in array A i 0 x i] < A~ i] ator takes one or more arrays as arguments and pro for all i 0. The mapping, MA maps each vector duces an array as result. subsample is a unary opera ~x in A to an element of the array's domain, DA . We tor which can delete data, i.e., the size of the result of will use the standard array notation A ~x] to denote the subsampling an array A is never larger than A. merge domain value to which ~x is mapped. is a binary operator which combines two arrays de ned Page 3
4.over the same domain. apply applies a function to an array, in a manner to be described below, to produce B = SUB 0(A, 10) a new array. 02 22 42 Neither subsample nor merge changes the values 01 21 41 found in its operands, i.e., every value found in the 00 20 40 result of these operations can be found in an operand. array A B = SUB 1(A, 10) The third operator, apply, may generate new values 02 12 22 32 42 52 02 12 22 32 42 52 as a result of applying the function. 01 11 21 31 41 51 00 10 20 30 40 50 4.1 An Introduction to Bit Patterns 00 10 20 30 40 50 B = SUB 0(A, 000011) All of the AML operators take bit patterns as param 42 52 eters. A bit pattern is an in nite binary vector. As for 41 51 dimension 1 other vectors, indexing of bit patterns starts at zero. 40 50 The ith element of a pattern P is denoted by P~ i]. dimension 0 When the context makes it clear that P~ is a pattern, Figure 3: Examples of the subsample Operation we will drop the explicit vector notation and simply write P or P i]. where A is an array, P is pattern, and i is the dimen We will be interested only in those patterns that sion number. consist of an in nite number of repetitions of some The subsample operator divides A into slabs along nite vector, and we will use that nite vector to rep dimension i, and then keeps or discards slabs based on resent the entire pattern. For example, we may write the pattern P. If P k] = 1, then slab k is kept and P = 1010 to mean P = 1010101010 . Note that included in B, otherwise it is not. The slabs that are there is more than one nite representation of any kept are concatenated to produce the result B. Several such bit pattern. For example, Q = 10 represents the applications of the subsample operator are illustrated same pattern as P. We will sometimes use a regular in Fig. 3. expressionlike notation to describe patterns. For ex Formally, if B = subi (A; P), then B is de ned as ample 0i 1j 0k , for positive integers i; j and k, repre follows: sents a pattern in which j 1's are sandwiched between i 0's on the left and k 0's on the right. The bitwise DB = D A complement of a pattern P, obtained by replacing P's ones with zeros and vice versa, will be written P. if A~ i] > 0, then B~ i] = count(P; A~ i] ? 1), else There are two pattern functions, index and count, B~ i] = 0. that we will make heavy use of. for all j 0 except j = i, B~ j] = A~ j] De nition 4.1 If P is a bit pattern (P 6= 0) and k a positive integer, index(P; k) is the index of the kth 1 for all points ~x in B, B : : :;~x i ? 1];~x i];~x i + in P . By de nition, if k = 0 or P = 0, index(P; k) = 1]; : : :] = A : : :;~x i ? 1]; index(P;~x i] + 1);~x i + 0. 1]; : : :]. De nition 4.2 If P is a bit pattern and k a non Note that subsampling a null array results in a null negative integer, count(P; k) is the number of ones in array, regardless of the dimension number or the sub the rst k + 1 positions of P , i.e., from P 0] to P k], sampling pattern. Also, if P = 0, then B~ i] = 0 and inclusive. B is a null array. Both functions are monotonically nondecreasing 4.3 The merge Operator in k. It should be obvious that for any k 1, count(P; index(P; k)) = k, unless P = 0. The merge operator takes two arrays, a dimension number, a pattern, and a default value as parameters. 4.2 The subsample Operator It merges the two arrays to produce its result. As was done for subsample, the dimension number will The subsample operator takes an array, a dimension normally be written as a subscript, as in number and a pattern as parameters and produces an array. The dimension number will normally be written C = mergei (A; B; P; ) as a subscript and subsample will be abbreviated as sub, as in where A and B are arrays, P is the pattern, and B = subi (A; P) is the default value. The explicit reference to will Page 4
5. C = MERGE0 (A, B, 110, δ ) DC = DA fNULLg 0 a01 a11 b01 a21 a31 b11 C~ 0 i] = max(index(P; A~ i]); index(P; B~ i])) + 1 for all j 0 except j = i, C~ 0 j] = max(A~ j]; B~ j]) a00 a10 b00 a20 a30 b10 array A a01 a11 a21 a31 C = MERGE0 (A, B, 1100101, δ ) a00 a10 a20 a30 a01 a11 b01 b11 a21 δ a31 for all points ~x in C 0: a00 a10 b00 b10 a20 δ a30 { if P ~x i]] = 1, then C 0 : : :;~x i ? 1];~x i];~x i + array B C = MERGE1 (A, B, 110, δ ) 1]; : : :] = A : : :;~x i ? 1]; count(P;~x i]) ? b01 b11 δ δ 1;~x i + 1]; : : :], { otherwise C 0 : : :;~x i ? 1];~x i];~x i + 1]; : : :] = b01 b11 b00 b10 δ δ δ δ δ δ δ δ B : : :;~x i ? 1]; count(P;~x i]) ? 1;~x i + 1]; : : :] b00 b10 δ δ We then obtain C by removing any NULL values in side of C 0: DC = DC ? fNULLg; for all i 0, C~ i] = dimension 1 a01 a11 a21 a31 0 dimension 0 a00 a10 a20 a30 C~ 0 i]; and for all points ~x in C, if C 0 ~x] = NULL then C ~x] = , otherwise C ~x] = C 0 ~x]. Figure 4: Examples of the merge Operation 4.4 The apply Operation The apply operator applies a function to an array to often be dropped if the default is not important. The produce a new array. In its most general form, it is operation is de ned only if DA = DB and 2 DA . written as Conceptually, merge divides both A and B into slabs along dimension i. The result is produced by B = apply (A; f; D~f ; R~f ; P0; P1; : : :; Pd?1) interleaving slabs from the two arrays according to the where f is the function to be applied, A is the array pattern P. Each one in P corresponds to a slab from to apply it to, D~f and R~f are shapes, the Pi's are the rst array (A), and each zero to a slab from the second (B). For example, if P = 1001, then along the patterns, and d = dim(A). The parameters D~f and ith dimension, one slab from array A, two slabs from R~f are called the domain shape and the range shape. array B and then a slab from array A are taken and We will often use a special case of apply, written concatenated in that order. This process is repeated until all slabs from both A and B have been used. B = apply (A; f; D~f ; R~f ) (Recall that P = 1001 denotes the in nite pattern for which we assume that Pi = 1 for all 0 i < d. In 100110011001 .) addition, either the range shape or both shapes may be Fig. 4 illustrates the merge operation. The exam left unspeci ed when apply is written. These shapes ples show that the default value may be used for two default to (1; 1; 1; ) if they are not speci ed. reasons. One is that the slabs from one array may be A simple way to de ne apply is to insist that f exhausted while slabs remain in the other. This is the map from arrays of A's shape and domain to arrays of case in the second example in Fig. 4. Another reason B's shape and domain. The operator would then sim is an array shape mismatch in some dimension other ply compute B = f(A). However, many common ar than the merge dimension. In case of such a mismatch, ray functions have some structural locality: the value the shorter array is expanded using default values un found at a particular point in B depends only on the til its length matches that of the longer array. This is values at certain points in A, not on the values at all illustrated in the third example in Fig. 4. points in A. For example, if f is a smoothing func It is convenient to formally de ne merge in two tion that maps each point in A to the average of that steps. The rst generates an array C 0 by interleaving point and its neighbors, then to determine the value slabs from A and B, as described above. Because of at some point in B, we need only look at that point shape mismatches between A and B, however, or be and its neighbors in A. Such information can be very cause of the particular pattern P, some values in C 0 valuable for optimizing the execution of an expression may be NULL. The second step eliminates this prob involving the array operators. lem by transforming any such NULL values to the The apply operation is de ned so that this kind of default value . The result of this nal step is indeed structural relationship can be made explicit when it an array, and is the result of the merge operation. exists. The apply operator requires that f be de ned The intermediate array, C 0, is de ned as follows: to map subarrays of A of shape D~f to subarrays of Page 5
6. for all ~x in B, B ~x] = f(A; ~y ) ~x MOD R~f ], where ~y i] = index(Pi ; b~x i]=R~f i]c + 1) for all 0 i < array A 3 dim(A) If D~f i] > A~ i] for some i 0, then the de nition 2 dimension 1 1 0 f(A, (2,2)) above implies that B will be a null array. f(A, (1,2)) 0 1 2 dimension 0 3 4 4.5 Additional Operations f(A, (1,0)) In this section, we show a few useful special cases of 1 f(A, (2,0)) the AML operators, and give them names. concat: The concat operator concatenates two arrays along some dimension. Concatenation can be 0 0 1 2 3 B = APPLY(A, f, (2,2), (2,1), 0110, 10) de ned using merge as follows: D Rf P P concati (A; B; ) mergei (A; B; 1A~ i] 0B~ i] ; ) f 0 1 Since merge requires A and B to have a common do Figure 5: An Illustration of the apply Operation main, so does concat. Note that if A and B have length mismatches in dimensions other than i, the ar B of shape R~f . We will use the notation f(A;~x) to ray with the shorter length will be extended using the refer to the result of applying f to the subarray of A default value . of shape D~f at ~x. Thus, f(A;~x) is an array of shape clip: clip clips an image along a speci ed dimen R~f . sion, keeping only those slabs within a clip region de The apply operator applies f to certain subarrays ned by parameters x and y, where 0 x y A~ i]. of A, and concatenates the results to generate B. This It can be implemented using subsample as follows: process is illustrated in Fig. 5. The pattern Pi can be thought of as selecting slabs in dimension i, with the clipi (A; x; y) subi(A; 0x 1y?x0A~ i]?y ) selected slabs corresponding to the ones in the pattern. tiled apply: Often, we will wish to apply a func The function f is applied at a point ~x only if that point tion to all nonoverlapping subarrays of a particular falls in selected slabs in all d dimensions of the array, shape. In the example in Section 2, this is the case i.e., only if Pi ~x i]] = 1 for all 0 i < dim(A). In the when the lowresolution daily array (D) is being com gure, the patterns select two slabs in each dimension, puted from the daily array (C). Since this type of leading to a total of four applications of the function function application is quite common, we can de ne f. Several features of the application of f should be the tiled apply operator to support it. Assuming noted. First, while the selected subarrays may over that A has dimensionality d, the de nition is as fol lap in A, the results of applying the function do not lows: overlap in the resulting array B. Second, the arrange ment of resulting subarrays in B preserves the spatial tiled apply (A; f; D~f ; R~f ) apply (A; f; D~f ; R~f ; P) arrangement of the selected subarrays in A. Finally, the subarrays to which f is applied must be entirely where P denotes the patterns contained within A. In the example in Fig. 5, this 10D~f 0]?1 ; 10D~f 1]?1 ; ; 10D~f d?1]?1 . means that even if the point ~x = (3; 3) was selected by the patterns, f(A;~x) would not be evaluated, since 4.6 More on Patterns and Shapes that subarray lies partially outside of A. If B = apply (A; f; D~f ; R~f ; P0; : : :; Pdim(A)?1), and We allow patterns and shapes appearing in AML ex f is a function that maps from arrays of shape D~f over pressions to be de ned in terms of the array arguments of their AML operators. As an example, consider the domain DA to arrays of shape R~f over domain Drange, expression apply (A; f; (A~ 0]; 1)), in which f is applied then B is formally de ned as follows: to each row of A. Aliases (as in SQL) can be used in DB = Drange AML expressions when necessary to de ne names for unnamed intermediate arrays. In the AML expression for all i 0, apply (sub1 (B; P) A; f; (A~ 0]; 1)) the alias A is used { if A~ i] < D~f i] or Pi = 0, then B~ i] = 0 to refer to the result of the inner sub operation so that the apply's shape argument can be de ned. The { otherwise B~ i] = count(Pi; A~ i]?D~f i]) R~f i] scope of such an alias is the AML operator in which Page 6
7.it is de ned. In the case of the apply operator, it and B can be removed before they are merged, poten is also possible to refer to the domain shape and the tially speeding up the merging step. 2 range shape in the operator's patterns. An example of this can be seen in the de nition of the tiled apply Example 5.3 operation in Section 4.5. An interesting situation arises in the following exam The shape of the result of an AML operation can ple. Rewriting subi (mergei (A; B; 0100); 100010) us always be determined (without actually evaluating the ing Theorem 5.2 yields R = 0; S = 100110010 and operator) if the shapes of its array arguments are T = 0. So an equivalent form for the expression is known. By induction we can show that the shape of the result of an arbitrary AML expression can be de mergei(subi (A; 0); subi(B; 100110010); 0): termined once the shapes of its \terminal" arrays are known. This property is useful when AML expressions Since merge with a pattern of 0 results in its second are being evaluated, since it implies that the space re argument, the above expression can be transformed to quired to implement an AML operation can be deter subi(B; 100110010). From the original expression, it mined in advance. is not immediately apparent that the whole of array A gets subsampled out but the equivalent expression 5 Rewrite Rules makes this obvious. 2 In many cases it will be possible to rewrite AML ex The ability to push subsampling through function pressions into one or more equivalent forms. Often, application is also potentially very valuable. To sim one form will have a more e cient implementation plify our presentation, we consider a restricted version than another, so rewriting is useful for query optimiza of a rewrite rule that accomplishes this. tion. In this section, we present several rewrite rules Theorem 5.3 If R~f i] = D~f i] = 1, Pi = 1, and for AML expressions. Many such rules are possible d = dim(A), then and this presentation is not intended to be compre subi(apply (A; f; D~f ; R~f ; P0; : : :; Pd?1); Q) = hensive. Instead, we hope to demonstrate by examples ~ ~ apply (subi (A; Q); f; Df ; Rf ; P0; : : :; Pd?1). that useful rewrite rules do exist. The rst rule shows that two successive subsam Example 5.4 ples along the same dimension can be combined into a single subsample. Recall from Section 2 that the daily temperature array C was de ned as Theorem 5.1 subi(subi(A; P); Q) = subi(A; R) where R j] = P j] ^ Q count(P; j) ? 1] for all j 0. C = apply (merge2(A; B; 10); f; (1; 1; 2)): Example 5.1 where D~f = (1; 1; 2) and R~f defaults to (1; 1; 1; ) since it is not speci ed. Suppose we want to subsample Applying the above rewrite rule to the ex the array C in dimension 0 using the pattern P = 0616. pression sub0 (sub0 (A; 1000); 10), we get R = That is, we would like to evaluate 1000000010000000 . So the expression gets simpli ed to sub0(A; 10000000). 2 sub0 (apply (merge2(A; B; 10); f; (1; 1; 2)); P) The next rule shows that we can push sub through merge. Heuristically, this should be bene cial be Using Theorem 5.3, this can be rewritten as cause the merge operation will be able to operate on smaller subsampled images. apply (sub0(merge2(A; B; 10); P); f; (1; 1;2)) Theorem 5.2 subi(mergei(A; B; P); Q) = We can optimize further by pushing sub0 inside of mergei (subi (A; R); subi (B; S); T) merge2 . This is trivial, since they operate in di erent where for all j 0, R j] = Q index(P; j + 1)], S j] = dimensions. This gives us Q index(P; j + 1)], and T j] = P index(Q; j + 1)]. apply (merge2(sub0 (A; P); sub0 (B; P); 10); f; (1; 1;2)) Example 5.2 which indicates that to retrieve parts of the daily tem Applying the above theorem to the expression perature array, we need only retrieve parts of the day sub0(merge0 (A; B; 10); 101) yields R = 110, S = 011 time and nighttime arrays, as expected. 2 and T = 1100. So the transformed expression is There are situations in which the result of an apply merge0 (sub0 (A; 110); sub0 (B; 011); 1100). From pat operation is being subsampled, but we cannot push the terns R and S we see that about onethird of arrays A sub through the apply. This often happens when the Page 7
8.function is being applied to overlapping subarrays. Consider the following AML expression: array Step 1 generates B = subi(apply (A; f; (2; 2); (2; 2)); 110010) output here in which f maps 2 2 subarrays from A to 2 2 sub synchronization arrays in the result. Note that if ~x is a point in slab Step 2 takes boundary input from here 1 (i.e., the second slab) of dimension i of A, the result of evaluating f at ~x will be completely discarded by Figure 6: Multidimensional Synchronization for Ar the sub that follows apply. The results of such eval rays. uations form slabs 2 and 3 in the resulting array, and both P 2] and P 3] in the subsample pattern are zero. In fact, because the subsampling pattern is an in nite query execution plan might generate n such interme repetition of 110010, the result of evaluating f at any diate results. This may be very time consuming, even ~x with ~x i] MOD 3 = 1 will be discarded. Clearly, if the steps are implemented entirely in memory. the function f should not be evaluated at such points. The rst of these problems can be addressed by al These evaluations cannot be avoided by moving the lowing steps to execute concurrently. This requires sub before the apply however, since all of A is needed some mechanism for synchronizing access to arrays to generate parts of B that are kept. that are simultaneously being produced by one step Although we cannot always push sub through and consumed by a subsequent step. Often, this is apply, we may be able to push sub into apply. For accomplished by treating the data passed from one the special case in which R~f has unit size, the following step to another as a linear stream. A stream can be rule shows this. thought of as having a boundary point which indicates Theorem 5.4 If jR~f j = 1 and d = dim(A), then how much of the stream data has actually been gen erated by the rst step. The subsequent step is forced subi(apply (A; f; D~f ; R~f ; P0; ; Pi; ; Pd?1); S) = to wait if it has consumed all of the stream data up to apply (A; f; D~f ; R~f ; Q0; ; Qi; ; Qd?1), the boundary. A direct application of this idea to mul where Qj = Pj for all j 6= i, and Qi k] = Pi k] ^ tidimensional arrays would require that steps agree on S count(Pi ; k) ? 1] for all k 0. how an array is to be \linearized" to form a stream. An alternative is to generalize the synchronization bound Example 5.5 ary to accommodate multidimensional arrays. This is sub1(apply (A; f; D~f ; 11; 101100; 110); 011) illustrated for twodimensional arrays in Fig. 6. This approach divides each array into two regions. As the gets transformed to apply (A; f; D~f ; 11; 001100; 110) rst step runs, the region accessible to the second step according to this rewrite rule. 2 grows until it covers the entire array. To address the second problem, we can choose an 6 Query Evaluation array representation that permits steps to avoid creat Query processing involves the generation of a query ing new copies of large arrays. In particular, we may execution plan for a given AML expression.1 An n be able to allow a step to simply modify its input array operator AML expression can be executed in n sequen and then use the modi ed array as its output. Fig. 7 tial steps in which each step generates an intermediate illustrates an array representation that permits this result which is used as input by a subsequent step. for subsample and merge steps. This array repre This straightforward approach has several potential sentation has several features. One is a vector of valid disadvantages. First, it does not allow for pipelining of bits per array dimension. These bits can be cleared to steps. It should be possible for a step to begin execu indicate that a particular slab of data is invalid, i.e., tion even if its input has only been partially generated. that it should be ignored by any subsequent steps that Second, it may result in the generation of many large use the array. This provides an easy way to implement intermediate results. For example, consider steps that a subsample operation, since valid bits can simply be implement operations such as subi(A; 1111111110) or cleared according to the positions of the zeros in the mergei (A; B; 10). If the arrays A and B are large, so subsample pattern. Of course, the disadvantage of too will be the output of these operations. An nstep this approach is that the size of the array represen tation is not actually reduced by subsampling. This 1 In fact, we may generate several candidate execution plans, suggests that this mechanism should be used to im and then choose a good one using execution cost estimates. Here, we will focus on some of the issues involved in execution plement subsample steps that have low selectivity, plan generation. i.e., steps whose subsample patterns have a high ra Page 8
9. the userde ned expressions themselves. A notable re pointers to cent exception is PREDAT OR 12], which treats user array blocks de ned expressions declaratively, and passes them to an optimizer that can handle them. Special purpose image database systems also handle valid bits array blocks array data, at least in two dimensions. 2] is a survey for dim. 1 of work in this area. These systems focus on selec tion and retrieval of images, or parts of images, based on image content. AML does not directly support re trieval based on image content. However, it can be valid bits for dim. 0 used in conjunction with contentbased indexing and Figure 7: An Array Representation Permitting Fast retrieval techniques. subsample and merge There have been several other proposals for query languages for arrays, including 6] and 4]. Both of these are based on calculi which can be used to express tio of ones to zeros. More selective subsamples can arrayrelated operations, as well as nonarray opera be implemented by generating a new, smaller array tions. We will brie y describe the Array Query Lan representation. Note that invalid slabs will actually guage (AQL) de ned in 6]. AQL is based on a calcu be removed from the array representation by the rst lus which provides four arrayrelated primitives: two \downstream" step that actually generates a new ar are used to create arrays, one performs subscripting, ray. i.e., it extracts a value from an array, and one deter The array representation also incorporates indirec mines the shape of an array. Using these very lowlevel tion. The array data is divided into blocks, and constructs (plus such things as conditionals and arith a multidimensional array of pointers refers to these metic operations), higher level operations can be con blocks. Indirection allows some merge steps to be structed. For example, operations similar to the sub implemented without copying array data. For exam sample, merge, and apply operations de ned here ple, consider the merge operation used to de ne the can be expressed in terms of those primitives. Opti daily temperature array (C) in Fig. 1. It concatenates mization of AQL expressions is performed at the level the daytime and nighttime arrays. This can be imple of the primitive operations after replacing higherlevel mented by generating a new, larger pointer array, and operations with their de nitions. Implementations of then setting the pointers to point to the existing blocks each of the primitive operations are then used to eval of the two arrays. Clearly, the existence of such a fast uate the optimized queries. This is a very powerful implementation depends on the shape of the blocks of and exible approach. For example, if new higher the arrays to be merged, and on the merge pattern. level operations are added, they are expressed using This creates an interesting opportunity for optimiza the calculus. They can then be optimized, i.e., there tion, since whenever a new array copy is created, we is no need to generate new rewrite rules \manually" can choose the parameters of its representation, e.g., for the new operations. the size and shape of the blocks. Neither proposal suggests any particular set of high In combination with multidimensional synchroniza level operations for arrays. Rather, they show how tion, indirection can also help reduce the memory re such operations can be de ned and optimized. It quirements of a query execution plan. This is because is not clear how e ective such optimizations will be. the space used by individual array blocks can be re Whether an optimizer will nd appropriate rewrite leased as soon as that block is no longer needed. In rules, and how quickly it will nd them, remain open many cases, only a small part of a large array will need questions. The e ciency of execution of query plans to be represented at any time. consisting of many small, primitive operations is also a potential concern. 7 Related Work A variety of database systems now provide support 8 Summary and Conclusion for userde ned data types such as arrays. These in We have described the Array Manipulation Language, clude commercial systems 10, 1] and research systems an algebra for arrays. AML can be used as a query such as Postgres, Paradise, and others 7, 13, 3, 12]. language for array data, and as a view de nition lan As noted in the introduction, these systems may opti guage, to de ne new arrays in terms of existing ones. mize relational expressions in which userde ned func AML's apply operator can be customized to support tions appear. However, they generally do not optimize a wide variety of userde ned array functions. AML Page 9
10.expressions can also be optimized. Optimizations can 5] L. M. Haas et al. An optimizer for heteroge exploit the structural properties of the AML opera neous systems with nonstandard data and search tions. capabilities. Bulletin of the IEEE Computer So In 8], Maier and Vance claim that a reasonable al ciety Technical Committee on Data Engineering, gebra for ordered types, such as arrays, should have a 19(4):37{44, December 1996. small number of operators, should encapsulate a sig 6] L. Libkin et al. A Query Language for Multidi ni cant fraction of the control structures used for le processing, should possess nontrivial transformations mensional Arrays: Design, Implementation, and useful for query optimization, and should admit to rea Optimization Techniques. In Proc. of the SIG sonably e cient implementation over large arrays. We MOD Conference, pages 228{239, 1996. claim that AML has at least the rst, third, and fourth 7] V. Linnemann et al. Design and Implementation of these properties. The second property, expressive of an Extensible Database Management System ness, is more di cult to pin down as it is very domain Supporting User De ned Data Types and Func dependent. However, we note that AML is at least tions. In Proc. of the 14th VLDB Conference, expressive enough to mimic the widely used lebased pages 294{305, 1988. data management packages, such as NetCDF, which support multidimensional arrays. 8] D. Maier and B. Vance. A call to order. In Proc. Array data will be most useful in conjunction with of the ACM SIGACTSIGMODSIGART Sympo other types of data. In particular, we may wish to sium on Principles of Database Systems, pages associate various sorts of metadata with each array 1{16, 1993. to facilitate the selection of individual arrays from a 9] National Space Science Data Center, Greenbelt, set. Thus, AML will be most useful if it can be imple Maryland. CDF User's Guide, October 1996. Ver mented as part of a system capable of integrating data sion 2.6. of various types. One promising approach is o ered by PREDAT OR 12], which views the world as an 10] M. A. Olson et al. Query Processing in a Parallel integrated collection of data types, each of which sup ObjectRelational Database System. Bulletin of ports a declarative, optimizable query language (such the IEEE Computer Society Technical Committee as AML). A similar, but more loosely coupled, ap on Data Engineering, 19(4):3{10, December 1996. proach is taken by systems like Garlic 5] that attempt to federate a collection of independent and heteroge 11] R. Rew et al. NetCDF User's Guide. Unidata neous data repositories. Extensible objectrelational Program Center, Boulder, Colorado, February systems, such as the Informix Universal Server 10], 1996. Version 2.4. may also serve as useful platforms for implementation 12] P. Seshadri et al. EADTs:turbocharging com of AML. We are currently considering these implemen plex data. Bulletin of the IEEE Computer So tation alternatives. ciety Technical Committee on Data Engineering, 19(4):11{18, December 1996. References 13] M. Stonebraker et al. The implementation of 1] F. Bancilhon and G. Ferran. ODMG93: The POSTGRES. IEEE Transactions on Knowledge object database standard. Bulletin of the IEEE and Data Engineering, 2(1):125{142, 1990. Computer Society Technical Committee on Data Engineering, 17(4):3{14, December 1994. 14] University of Illinois at UrbanaChampaign. NCSA HDF Calling Interfaces and Utilities, 3.1 2] S.K. Chang and A. Hsu. Image information edition, July 1990. systems: Where do we go from here? IEEE 15] G. K. Wallace. The JPEG still picture com Transactions on Knowledge and Data Engineer pression standard. Communications of the ACM, ing, 4(5):431{442, October 1992. 34(4):30{44, April 1991. 3] D. J. DeWitt et al. Client{Server Paradise. In Proc. of the 20th VLDB Conference, pages 558{ 569, 1994. 4] L. Fegaras and D. Maier. Towards and E ective Calculus for Object Query Languages. In Proc. of the SIGMOD Conference, pages 47{58, 1995. Page 10

WiredTiger BTree and WiredTiger InMemory
空白

18/12  SASI, Cassandra on the full text search ride  Apache Bi
中国Cassandra技术社区

19_08  CassandraTokenManagement
中国Cassandra技术社区

16_07  cassandrasummit2016runningcassandraonapachemesosacrossm
中国Cassandra技术社区

19_10  strapdata_Elassandra_cassandra_es
中国Cassandra技术社区