# A Language for Manipulating Arrays

## 展开查看详情

1. A Language for Manipulating Arrays Arunprasad P. Marathe and Kenneth Salem Department of Computer Science University of Waterloo Waterloo, Ontario N2L 3G1 Canada fapmarath,kmsalemg@uwaterloo.ca which removes the part of an image that lies outside of a speci ed clip region. Expressions such as Abstract select g(f(x; threshold); clipregion) This paper describes the Array Manipulation from <relation> Language (AML), an algebra for multidimen- where <condition> sional array data. AML is generic, in the sense that it can be customized to support a wide can be used to retrieve clipped, thresholded versions variety of domain-speci c operations on ar- of image attribute x from the speci ed relation. rays. AML expressions can be treated declara- Ideally, non-relational expressions such as the one tively and subjected to rewrite optimizations. appearing in the select clause above would be To illustrate this, several rewrite rules that treated declaratively and optimized. For example, exploit the structural properties of the AML f(g(x; clipregion); threshold) generates the same re- operations are presented. Some techniques sult as the original expression. The latter form may for e cient evaluation of AML expressions are be less costly to evaluate, since only the clip region, also discussed. rather than the entire image, needs to be thresholded. Currently, most object-relational systems do not per- 1 Introduction form such optimizations, although there is certainly It has become widely recognized that database sys- interest in doing so 10, 12]. Such optimizations are tems should support non-traditional data types, such important because objects may be large, and their as sequences, images, and video. Object-relational methods may be expensive to evaluate. In fact, the cost of a non-relational subexpression in a relational database systems currently support such data through query may easily dominate the cost of evaluating the user-de ned data types and their associated methods. query. These methods can be applied to selected data, or can To optimize such expressions, they must be written be used in selection or join conditions. For example, in some language. In this paper, we propose a sim- suppose that 16-bit gray-scale images have been de- ple language for multidimensional array data, called ned as a database type and that two methods are de- the Array Manipulation Language (AML). AML is an ned for this type: f is a thresholding function which algebra in the sense that the relational algebra is an replaces each pixel value above a speci ed threshold algebra. Its operators operate on arrays and generate value with the threshold, and g is a clipping function arrays. Permission to copy without fee all or part of this material is Arrays are an important class of data. Obviously, granted provided that the copies are not made or distributed for raster images are two-dimensional arrays, and can be direct commercial advantage, the VLDB copyright notice and manipulated by the AML operators. Arrays of three the title of the publication and its date appear, and notice is or more dimensions are also very commonly found in given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee scienti c data sets. For example, multi-spectral satel- and/or special permission from the Endowment. lite images can be treated as arrays with two spatial Proceedings of the 23rd VLDB Conference and one spectral dimension. Video data can also be Athens, Greece, 1997 thought of in terms of multi-dimensional arrays. One Page 1

2.indication of the importance of array data in the scien- ti c community is the proliferation of le-based data nighttime array management packages, such as CDF 9], NetCDF 11] and HDF 14], that support array data. These le- based packages arose to ll a data-management vac- A daily array uum that existed because of the inability of older database management systems to handle bulky array data. daytime array C In this paper, we de ne an array data model and a small set of AML operators based on this model. One of the AML operators is apply, which applies a user- B de ned function to an array in a particular way. AML D is generic and customizable in the sense that its ap- dim. 1 ply operator can work with any user-de ned function. dim. 0 daily low-res array Because there are so many possible array operations, many of which are domain-speci c, any general pur- Figure 1: Arrays in an Example Database pose array language should have some facility for ex- tension or customization. Thus, this is an important ture array is de ned by taking the average of the day- feature of AML. time and nighttime temperature at each point. We We also show that AML expressions can be treated also de ne a reduced-resolution version of the daily declaratively, and subjected to rewrite-based optimiza- array, obtained by dividing the daily array into non- tions. To illustrate the possibilities, we de ne several overlapping 4 4 chunks, and replacing each chunk useful rewrite rules. These rules are very general, in with a single value, the average of the values within the sense that they exploit only the structure of ar- the chunk. rays, and \structural" properties of the AML opera- Using the AML operations to be de ned in Sec- tors, including apply. For example, if y = f(x), we tion 4, a daily array (C) can be de ned in terms of a utilize the knowledge that a certain part of array y nighttime array (A) and a daytime array (B) as is computed using data from certain part of array x, but we do not care about the computation itself; the C = apply (merge2 (A; B; 10); f; (1; 1; 2)) rules treat it as a black box. The advantage of this approach is that a single rule can apply to any func- where f is a user-de ned function that maps two ar- tion with the same \structural" properties as f. Of ray values to a single average value and (1; 1; 2) is a course, this does not preclude rewrite rules that uti- \shape" that helps determine how f is to be applied. lize knowledge of the computation being performed, The \10" is a bit pattern that indicates that the merge but such rules are speci c to a particular function or is to be performed by interleaving one slab of A fol- class of functions. lowed by one slab of B. In e ect, merge operator In Section 6 we discuss some of the issues that arise takes the nighttime and daytime arrays and stacks in evaluating AML expressions. These issues include them one atop the other (in dimension 2, as indicated pipelining of AML operators, limiting memory usage, by the subscript of merge) to produce a single three- and reducing the costs associated with materializing dimensional array. The apply operator then applies f intermediate results. These are important issues be- to each 1 1 2 sub-array to produce the average tem- cause arrays may be very large. perature values. Each such value becomes one element of the resulting daily array C. Similarly, the low-resolution array (D) can be de- 2 An Illustrative Example ned in terms of C as The simple array database illustrated in Fig. 1 will D = tiled apply (C; g; (4; 4)) be used as a running example throughout the remain- der of the paper. The database includes several types where g is a function that will be used to map 4 4 sub- of two-dimensional arrays describing air temperature. arrays of C to a single value, namely the average of the The dimensions of these arrays can be thought of as sixteen values in the 4 4 array. The tiled apply op- longitude (dimension zero), and latitude (dimension eration breaks C into non-overlapping 4 4 sub-arrays one). There are two arrays per day, one describing and applies g to each to produce one of the values in nighttime temperatures, the other describing daytime D. The tiled apply operation is actually de ned as temperatures. a special case of the more general AML apply opera- For each day/night pair of arrays, a daily tempera- tion. Page 2

3. The primary purpose of this simple example is to illustrate the behavior of the AML operations. In gen- a subarray of a slab along eral, an almost limitless variety of array transforma- A at x dimension 0 (1) tions can be imagined. For example, we might have chosen a more sophisticated lossy compression tech- nique, such as JPEG 15], with which to de ne the re- duced resolution version of the daily array. The AML array A apply operator makes it easy to do this. x [1] AML describes logical relationships among arrays. Fig. 1 can be seen as a sort of schema. In particu- lar, the daily array can be seen as a view de ned in terms of the daytime and nighttime arrays. The view dim. 1 x [0] dim. 0 de nition is the AML expression, given above, which maps the daytime and nighttime arrays to the daily Figure 2: Sub-arrays and Slabs array. Similarly, the low-resolution array is a view of the daily array. We have said nothing, at least at this De nition 3.2 The dimensionality of array A, writ- point, about the physical representations of these ar- ten dim(A), is the smallest i such that A~ j] = 1 for all rays. It may be that the daily array is physically stored j i. If there is no such i, then dim(A) is unde ned. by laying out its values in a le in row-major order, or it may be stored in some more compact, compressed form. Alternatively, it may not be materialized at all, QDe1 nition ~A i]. 3.3 The size of array A, written jAj, is i=0 as it can be derived when necessary from the corre- sponding daytime and nighttime arrays. We will restrict ourselves to arrays of nite size. How- ever, it will sometimes be convenient for us to think of 3 Data Model and Terminology arrays as having in nite length in all dimensions. For Throughout this paper we will use a vector arrow, as this purpose, we de ne A ~x] = NULL for all points in ~x, to denote in nite vectors of integers. The usual ~x that are not in A, where NULL is some value not notation ~x i] will refer to the element with index i. found in DA Expressions involving operations on vectors, such as An array having a length of zero in one or more di- ~z = b~x=~yc refer to element-by-element application of mensions is called a null array. Such arrays have zero the operation, i.e., ~z i] = b~x i]=~y i]c. size and since there are no points in a null array, it An array has a shape and a domain. We will con- is considered to have the value NULL at every point. sider arrays to have an in nite number of dimensions, We will also need a notion of sub-array. A sub-array numbered from zero. Each array dimension is indexed is simply an array that is wholly contained within an- by the non-negative integers, i.e., indexing starts at other, as shown in Fig. 2. We will identify the position zero. A shape is an in nite vector of non-negative of the sub-array within the containing array by the po- integers which de nes the array's length in each di- sition of its smallest point, as shown in the gure. mension. When it is necessary to write a particu- De nition 3.4 Let A and B be arrays, and let ~x be a lar shape, the shape's elements will be parenthesized. vector in A. Array B is called a subarray of A at ~x i All elements not listed explicitly are assumed to be DB = DA , and for every point ~y in B , B ~y] = A ~x +~y ]. ones. Thus, the shapes (1; 1; 2) and (4; 4) that were used in the examples in Section 2 denote the in nite Finally, we de ne informally the notion of a slab of an vectors (1; 1; 2; 1; 1;1; ) and (4; 4; 1; 1; 1; ), respec- array along dimension i. A slab is simply a slice of unit tively. The domain of an array is a set of possible width through an array along the speci ed dimension. values, one of which is present at each indexed point This is also illustrated in Fig. 2. within the array. De nition 3.1 An array A consists of a shape A~, a 4 The Array Manipulation Language domain DA , and a mapping MA . The ith element of The Array Manipulation Language (AML) consists of A~ represents the length of the array in dimension i. A three operators which manipulate arrays. Each oper- vector ~x is de ned to be in array A i 0 x i] < A~ i] ator takes one or more arrays as arguments and pro- for all i 0. The mapping, MA maps each vector duces an array as result. subsample is a unary opera- ~x in A to an element of the array's domain, DA . We tor which can delete data, i.e., the size of the result of will use the standard array notation A ~x] to denote the subsampling an array A is never larger than A. merge domain value to which ~x is mapped. is a binary operator which combines two arrays de ned Page 3

4.over the same domain. apply applies a function to an array, in a manner to be described below, to produce B = SUB 0(A, 10) a new array. 02 22 42 Neither subsample nor merge changes the values 01 21 41 found in its operands, i.e., every value found in the 00 20 40 result of these operations can be found in an operand. array A B = SUB 1(A, 10) The third operator, apply, may generate new values 02 12 22 32 42 52 02 12 22 32 42 52 as a result of applying the function. 01 11 21 31 41 51 00 10 20 30 40 50 4.1 An Introduction to Bit Patterns 00 10 20 30 40 50 B = SUB 0(A, 000011) All of the AML operators take bit patterns as param- 42 52 eters. A bit pattern is an in nite binary vector. As for 41 51 dimension 1 other vectors, indexing of bit patterns starts at zero. 40 50 The ith element of a pattern P is denoted by P~ i]. dimension 0 When the context makes it clear that P~ is a pattern, Figure 3: Examples of the subsample Operation we will drop the explicit vector notation and simply write P or P i]. where A is an array, P is pattern, and i is the dimen- We will be interested only in those patterns that sion number. consist of an in nite number of repetitions of some The subsample operator divides A into slabs along nite vector, and we will use that nite vector to rep- dimension i, and then keeps or discards slabs based on resent the entire pattern. For example, we may write the pattern P. If P k] = 1, then slab k is kept and P = 1010 to mean P = 1010101010 . Note that included in B, otherwise it is not. The slabs that are there is more than one nite representation of any kept are concatenated to produce the result B. Several such bit pattern. For example, Q = 10 represents the applications of the subsample operator are illustrated same pattern as P. We will sometimes use a regular- in Fig. 3. expression-like notation to describe patterns. For ex- Formally, if B = subi (A; P), then B is de ned as ample 0i 1j 0k , for positive integers i; j and k, repre- follows: sents a pattern in which j 1's are sandwiched between i 0's on the left and k 0's on the right. The bit-wise DB = D A complement of a pattern P, obtained by replacing P's ones with zeros and vice versa, will be written P. if A~ i] > 0, then B~ i] = count(P; A~ i] ? 1), else There are two pattern functions, index and count, B~ i] = 0. that we will make heavy use of. for all j 0 except j = i, B~ j] = A~ j] De nition 4.1 If P is a bit pattern (P 6= 0) and k a positive integer, index(P; k) is the index of the k-th 1 for all points ~x in B, B : : :;~x i ? 1];~x i];~x i + in P . By de nition, if k = 0 or P = 0, index(P; k) = 1]; : : :] = A : : :;~x i ? 1]; index(P;~x i] + 1);~x i + 0. 1]; : : :]. De nition 4.2 If P is a bit pattern and k a non- Note that subsampling a null array results in a null negative integer, count(P; k) is the number of ones in array, regardless of the dimension number or the sub- the rst k + 1 positions of P , i.e., from P 0] to P k], sampling pattern. Also, if P = 0, then B~ i] = 0 and inclusive. B is a null array. Both functions are monotonically non-decreasing 4.3 The merge Operator in k. It should be obvious that for any k 1, count(P; index(P; k)) = k, unless P = 0. The merge operator takes two arrays, a dimension number, a pattern, and a default value as parameters. 4.2 The subsample Operator It merges the two arrays to produce its result. As was done for subsample, the dimension number will The subsample operator takes an array, a dimension normally be written as a subscript, as in number and a pattern as parameters and produces an array. The dimension number will normally be written C = mergei (A; B; P; ) as a subscript and subsample will be abbreviated as sub, as in where A and B are arrays, P is the pattern, and B = subi (A; P) is the default value. The explicit reference to will Page 4

5. C = MERGE0 (A, B, 110, δ ) DC = DA fNULLg 0 a01 a11 b01 a21 a31 b11 C~ 0 i] = max(index(P; A~ i]); index(P; B~ i])) + 1 for all j 0 except j = i, C~ 0 j] = max(A~ j]; B~ j]) a00 a10 b00 a20 a30 b10 array A a01 a11 a21 a31 C = MERGE0 (A, B, 1100101, δ ) a00 a10 a20 a30 a01 a11 b01 b11 a21 δ a31 for all points ~x in C 0: a00 a10 b00 b10 a20 δ a30 { if P ~x i]] = 1, then C 0 : : :;~x i ? 1];~x i];~x i + array B C = MERGE1 (A, B, 110, δ ) 1]; : : :] = A : : :;~x i ? 1]; count(P;~x i]) ? b01 b11 δ δ 1;~x i + 1]; : : :], { otherwise C 0 : : :;~x i ? 1];~x i];~x i + 1]; : : :] = b01 b11 b00 b10 δ δ δ δ δ δ δ δ B : : :;~x i ? 1]; count(P;~x i]) ? 1;~x i + 1]; : : :] b00 b10 δ δ We then obtain C by removing any NULL values in- side of C 0: DC = DC ? fNULLg; for all i 0, C~ i] = dimension 1 a01 a11 a21 a31 0 dimension 0 a00 a10 a20 a30 C~ 0 i]; and for all points ~x in C, if C 0 ~x] = NULL then C ~x] = , otherwise C ~x] = C 0 ~x]. Figure 4: Examples of the merge Operation 4.4 The apply Operation The apply operator applies a function to an array to often be dropped if the default is not important. The produce a new array. In its most general form, it is operation is de ned only if DA = DB and 2 DA . written as Conceptually, merge divides both A and B into slabs along dimension i. The result is produced by B = apply (A; f; D~f ; R~f ; P0; P1; : : :; Pd?1) interleaving slabs from the two arrays according to the where f is the function to be applied, A is the array pattern P. Each one in P corresponds to a slab from to apply it to, D~f and R~f are shapes, the Pi's are the rst array (A), and each zero to a slab from the second (B). For example, if P = 1001, then along the patterns, and d = dim(A). The parameters D~f and i-th dimension, one slab from array A, two slabs from R~f are called the domain shape and the range shape. array B and then a slab from array A are taken and We will often use a special case of apply, written concatenated in that order. This process is repeated until all slabs from both A and B have been used. B = apply (A; f; D~f ; R~f ) (Recall that P = 1001 denotes the in nite pattern for which we assume that Pi = 1 for all 0 i < d. In 100110011001 .) addition, either the range shape or both shapes may be Fig. 4 illustrates the merge operation. The exam- left unspeci ed when apply is written. These shapes ples show that the default value may be used for two default to (1; 1; 1; ) if they are not speci ed. reasons. One is that the slabs from one array may be A simple way to de ne apply is to insist that f exhausted while slabs remain in the other. This is the map from arrays of A's shape and domain to arrays of case in the second example in Fig. 4. Another reason B's shape and domain. The operator would then sim- is an array shape mismatch in some dimension other ply compute B = f(A). However, many common ar- than the merge dimension. In case of such a mismatch, ray functions have some structural locality: the value the shorter array is expanded using default values un- found at a particular point in B depends only on the til its length matches that of the longer array. This is values at certain points in A, not on the values at all illustrated in the third example in Fig. 4. points in A. For example, if f is a smoothing func- It is convenient to formally de ne merge in two tion that maps each point in A to the average of that steps. The rst generates an array C 0 by interleaving point and its neighbors, then to determine the value slabs from A and B, as described above. Because of at some point in B, we need only look at that point shape mismatches between A and B, however, or be- and its neighbors in A. Such information can be very cause of the particular pattern P, some values in C 0 valuable for optimizing the execution of an expression may be NULL. The second step eliminates this prob- involving the array operators. lem by transforming any such NULL values to the The apply operation is de ned so that this kind of default value . The result of this nal step is indeed structural relationship can be made explicit when it an array, and is the result of the merge operation. exists. The apply operator requires that f be de ned The intermediate array, C 0, is de ned as follows: to map sub-arrays of A of shape D~f to sub-arrays of Page 5

6. for all ~x in B, B ~x] = f(A; ~y ) ~x MOD R~f ], where ~y i] = index(Pi ; b~x i]=R~f i]c + 1) for all 0 i < array A 3 dim(A) If D~f i] > A~ i] for some i 0, then the de nition 2 dimension 1 1 0 f(A, (2,2)) above implies that B will be a null array. f(A, (1,2)) 0 1 2 dimension 0 3 4 4.5 Additional Operations f(A, (1,0)) In this section, we show a few useful special cases of 1 f(A, (2,0)) the AML operators, and give them names. concat: The concat operator concatenates two arrays along some dimension. Concatenation can be 0 0 1 2 3 B = APPLY(A, f, (2,2), (2,1), 0110, 10) de ned using merge as follows: D Rf P P concati (A; B; ) mergei (A; B; 1A~ i] 0B~ i] ; ) f 0 1 Since merge requires A and B to have a common do- Figure 5: An Illustration of the apply Operation main, so does concat. Note that if A and B have length mismatches in dimensions other than i, the ar- B of shape R~f . We will use the notation f(A;~x) to ray with the shorter length will be extended using the refer to the result of applying f to the sub-array of A default value . of shape D~f at ~x. Thus, f(A;~x) is an array of shape clip: clip clips an image along a speci ed dimen- R~f . sion, keeping only those slabs within a clip region de- The apply operator applies f to certain sub-arrays ned by parameters x and y, where 0 x y A~ i]. of A, and concatenates the results to generate B. This It can be implemented using subsample as follows: process is illustrated in Fig. 5. The pattern Pi can be thought of as selecting slabs in dimension i, with the clipi (A; x; y) subi(A; 0x 1y?x0A~ i]?y ) selected slabs corresponding to the ones in the pattern. tiled apply: Often, we will wish to apply a func- The function f is applied at a point ~x only if that point tion to all non-overlapping sub-arrays of a particular falls in selected slabs in all d dimensions of the array, shape. In the example in Section 2, this is the case i.e., only if Pi ~x i]] = 1 for all 0 i < dim(A). In the when the low-resolution daily array (D) is being com- gure, the patterns select two slabs in each dimension, puted from the daily array (C). Since this type of leading to a total of four applications of the function function application is quite common, we can de ne f. Several features of the application of f should be the tiled apply operator to support it. Assuming noted. First, while the selected sub-arrays may over- that A has dimensionality d, the de nition is as fol- lap in A, the results of applying the function do not lows: overlap in the resulting array B. Second, the arrange- ment of resulting sub-arrays in B preserves the spatial tiled apply (A; f; D~f ; R~f ) apply (A; f; D~f ; R~f ; P) arrangement of the selected sub-arrays in A. Finally, the sub-arrays to which f is applied must be entirely where P denotes the patterns contained within A. In the example in Fig. 5, this 10D~f 0]?1 ; 10D~f 1]?1 ; ; 10D~f d?1]?1 . means that even if the point ~x = (3; 3) was selected by the patterns, f(A;~x) would not be evaluated, since 4.6 More on Patterns and Shapes that subarray lies partially outside of A. If B = apply (A; f; D~f ; R~f ; P0; : : :; Pdim(A)?1), and We allow patterns and shapes appearing in AML ex- f is a function that maps from arrays of shape D~f over pressions to be de ned in terms of the array arguments of their AML operators. As an example, consider the domain DA to arrays of shape R~f over domain Drange, expression apply (A; f; (A~ 0]; 1)), in which f is applied then B is formally de ned as follows: to each row of A. Aliases (as in SQL) can be used in DB = Drange AML expressions when necessary to de ne names for unnamed intermediate arrays. In the AML expression for all i 0, apply (sub1 (B; P) A; f; (A~ 0]; 1)) the alias A is used { if A~ i] < D~f i] or Pi = 0, then B~ i] = 0 to refer to the result of the inner sub operation so that the apply's shape argument can be de ned. The { otherwise B~ i] = count(Pi; A~ i]?D~f i]) R~f i] scope of such an alias is the AML operator in which Page 6

7.it is de ned. In the case of the apply operator, it and B can be removed before they are merged, poten- is also possible to refer to the domain shape and the tially speeding up the merging step. 2 range shape in the operator's patterns. An example of this can be seen in the de nition of the tiled apply Example 5.3 operation in Section 4.5. An interesting situation arises in the following exam- The shape of the result of an AML operation can ple. Rewriting subi (mergei (A; B; 0100); 100010) us- always be determined (without actually evaluating the ing Theorem 5.2 yields R = 0; S = 100110010 and operator) if the shapes of its array arguments are T = 0. So an equivalent form for the expression is known. By induction we can show that the shape of the result of an arbitrary AML expression can be de- mergei(subi (A; 0); subi(B; 100110010); 0): termined once the shapes of its \terminal" arrays are known. This property is useful when AML expressions Since merge with a pattern of 0 results in its second are being evaluated, since it implies that the space re- argument, the above expression can be transformed to quired to implement an AML operation can be deter- subi(B; 100110010). From the original expression, it mined in advance. is not immediately apparent that the whole of array A gets subsampled out but the equivalent expression 5 Rewrite Rules makes this obvious. 2 In many cases it will be possible to rewrite AML ex- The ability to push subsampling through function pressions into one or more equivalent forms. Often, application is also potentially very valuable. To sim- one form will have a more e cient implementation plify our presentation, we consider a restricted version than another, so rewriting is useful for query optimiza- of a rewrite rule that accomplishes this. tion. In this section, we present several rewrite rules Theorem 5.3 If R~f i] = D~f i] = 1, Pi = 1, and for AML expressions. Many such rules are possible d = dim(A), then and this presentation is not intended to be compre- subi(apply (A; f; D~f ; R~f ; P0; : : :; Pd?1); Q) = hensive. Instead, we hope to demonstrate by examples ~ ~ apply (subi (A; Q); f; Df ; Rf ; P0; : : :; Pd?1). that useful rewrite rules do exist. The rst rule shows that two successive subsam- Example 5.4 ples along the same dimension can be combined into a single subsample. Recall from Section 2 that the daily temperature array C was de ned as Theorem 5.1 subi(subi(A; P); Q) = subi(A; R) where R j] = P j] ^ Q count(P; j) ? 1] for all j 0. C = apply (merge2(A; B; 10); f; (1; 1; 2)): Example 5.1 where D~f = (1; 1; 2) and R~f defaults to (1; 1; 1; ) since it is not speci ed. Suppose we want to subsample Applying the above rewrite rule to the ex- the array C in dimension 0 using the pattern P = 0616. pression sub0 (sub0 (A; 1000); 10), we get R = That is, we would like to evaluate 1000000010000000 . So the expression gets simpli- ed to sub0(A; 10000000). 2 sub0 (apply (merge2(A; B; 10); f; (1; 1; 2)); P) The next rule shows that we can push sub through merge. Heuristically, this should be bene cial be- Using Theorem 5.3, this can be rewritten as cause the merge operation will be able to operate on smaller subsampled images. apply (sub0(merge2(A; B; 10); P); f; (1; 1;2)) Theorem 5.2 subi(mergei(A; B; P); Q) = We can optimize further by pushing sub0 inside of mergei (subi (A; R); subi (B; S); T) merge2 . This is trivial, since they operate in di erent where for all j 0, R j] = Q index(P; j + 1)], S j] = dimensions. This gives us Q index(P; j + 1)], and T j] = P index(Q; j + 1)]. apply (merge2(sub0 (A; P); sub0 (B; P); 10); f; (1; 1;2)) Example 5.2 which indicates that to retrieve parts of the daily tem- Applying the above theorem to the expression perature array, we need only retrieve parts of the day- sub0(merge0 (A; B; 10); 101) yields R = 110, S = 011 time and nighttime arrays, as expected. 2 and T = 1100. So the transformed expression is There are situations in which the result of an apply merge0 (sub0 (A; 110); sub0 (B; 011); 1100). From pat- operation is being subsampled, but we cannot push the terns R and S we see that about one-third of arrays A sub through the apply. This often happens when the Page 7

8.function is being applied to overlapping sub-arrays. Consider the following AML expression: array Step 1 generates B = subi(apply (A; f; (2; 2); (2; 2)); 110010) output here in which f maps 2 2 sub-arrays from A to 2 2 sub- synchronization arrays in the result. Note that if ~x is a point in slab Step 2 takes boundary input from here 1 (i.e., the second slab) of dimension i of A, the result of evaluating f at ~x will be completely discarded by Figure 6: Multidimensional Synchronization for Ar- the sub that follows apply. The results of such eval- rays. uations form slabs 2 and 3 in the resulting array, and both P 2] and P 3] in the subsample pattern are zero. In fact, because the subsampling pattern is an in nite query execution plan might generate n such interme- repetition of 110010, the result of evaluating f at any diate results. This may be very time consuming, even ~x with ~x i] MOD 3 = 1 will be discarded. Clearly, if the steps are implemented entirely in memory. the function f should not be evaluated at such points. The rst of these problems can be addressed by al- These evaluations cannot be avoided by moving the lowing steps to execute concurrently. This requires sub before the apply however, since all of A is needed some mechanism for synchronizing access to arrays to generate parts of B that are kept. that are simultaneously being produced by one step Although we cannot always push sub through and consumed by a subsequent step. Often, this is apply, we may be able to push sub into apply. For accomplished by treating the data passed from one the special case in which R~f has unit size, the following step to another as a linear stream. A stream can be rule shows this. thought of as having a boundary point which indicates Theorem 5.4 If jR~f j = 1 and d = dim(A), then how much of the stream data has actually been gen- erated by the rst step. The subsequent step is forced subi(apply (A; f; D~f ; R~f ; P0; ; Pi; ; Pd?1); S) = to wait if it has consumed all of the stream data up to apply (A; f; D~f ; R~f ; Q0; ; Qi; ; Qd?1), the boundary. A direct application of this idea to mul- where Qj = Pj for all j 6= i, and Qi k] = Pi k] ^ tidimensional arrays would require that steps agree on S count(Pi ; k) ? 1] for all k 0. how an array is to be \linearized" to form a stream. An alternative is to generalize the synchronization bound- Example 5.5 ary to accommodate multidimensional arrays. This is sub1(apply (A; f; D~f ; 11; 101100; 110); 011) illustrated for two-dimensional arrays in Fig. 6. This approach divides each array into two regions. As the gets transformed to apply (A; f; D~f ; 11; 001100; 110) rst step runs, the region accessible to the second step according to this rewrite rule. 2 grows until it covers the entire array. To address the second problem, we can choose an 6 Query Evaluation array representation that permits steps to avoid creat- Query processing involves the generation of a query ing new copies of large arrays. In particular, we may execution plan for a given AML expression.1 An n- be able to allow a step to simply modify its input array operator AML expression can be executed in n sequen- and then use the modi ed array as its output. Fig. 7 tial steps in which each step generates an intermediate illustrates an array representation that permits this result which is used as input by a subsequent step. for subsample and merge steps. This array repre- This straightforward approach has several potential sentation has several features. One is a vector of valid disadvantages. First, it does not allow for pipelining of bits per array dimension. These bits can be cleared to steps. It should be possible for a step to begin execu- indicate that a particular slab of data is invalid, i.e., tion even if its input has only been partially generated. that it should be ignored by any subsequent steps that Second, it may result in the generation of many large use the array. This provides an easy way to implement intermediate results. For example, consider steps that a subsample operation, since valid bits can simply be implement operations such as subi(A; 1111111110) or cleared according to the positions of the zeros in the mergei (A; B; 10). If the arrays A and B are large, so subsample pattern. Of course, the disadvantage of too will be the output of these operations. An n-step this approach is that the size of the array represen- tation is not actually reduced by subsampling. This 1 In fact, we may generate several candidate execution plans, suggests that this mechanism should be used to im- and then choose a good one using execution cost estimates. Here, we will focus on some of the issues involved in execution plement subsample steps that have low selectivity, plan generation. i.e., steps whose subsample patterns have a high ra- Page 8

9. the user-de ned expressions themselves. A notable re- pointers to cent exception is PREDAT OR 12], which treats user- array blocks de ned expressions declaratively, and passes them to an optimizer that can handle them. Special purpose image database systems also handle valid bits array blocks array data, at least in two dimensions. 2] is a survey for dim. 1 of work in this area. These systems focus on selec- tion and retrieval of images, or parts of images, based on image content. AML does not directly support re- trieval based on image content. However, it can be valid bits for dim. 0 used in conjunction with content-based indexing and Figure 7: An Array Representation Permitting Fast retrieval techniques. subsample and merge There have been several other proposals for query languages for arrays, including 6] and 4]. Both of these are based on calculi which can be used to express tio of ones to zeros. More selective subsamples can array-related operations, as well as non-array opera- be implemented by generating a new, smaller array tions. We will brie y describe the Array Query Lan- representation. Note that invalid slabs will actually guage (AQL) de ned in 6]. AQL is based on a calcu- be removed from the array representation by the rst lus which provides four array-related primitives: two \downstream" step that actually generates a new ar- are used to create arrays, one performs subscripting, ray. i.e., it extracts a value from an array, and one deter- The array representation also incorporates indirec- mines the shape of an array. Using these very low-level tion. The array data is divided into blocks, and constructs (plus such things as conditionals and arith- a multidimensional array of pointers refers to these metic operations), higher level operations can be con- blocks. Indirection allows some merge steps to be structed. For example, operations similar to the sub- implemented without copying array data. For exam- sample, merge, and apply operations de ned here ple, consider the merge operation used to de ne the can be expressed in terms of those primitives. Opti- daily temperature array (C) in Fig. 1. It concatenates mization of AQL expressions is performed at the level the daytime and nighttime arrays. This can be imple- of the primitive operations after replacing higher-level mented by generating a new, larger pointer array, and operations with their de nitions. Implementations of then setting the pointers to point to the existing blocks each of the primitive operations are then used to eval- of the two arrays. Clearly, the existence of such a fast uate the optimized queries. This is a very powerful implementation depends on the shape of the blocks of and exible approach. For example, if new higher- the arrays to be merged, and on the merge pattern. level operations are added, they are expressed using This creates an interesting opportunity for optimiza- the calculus. They can then be optimized, i.e., there tion, since whenever a new array copy is created, we is no need to generate new rewrite rules \manually" can choose the parameters of its representation, e.g., for the new operations. the size and shape of the blocks. Neither proposal suggests any particular set of high- In combination with multidimensional synchroniza- level operations for arrays. Rather, they show how tion, indirection can also help reduce the memory re- such operations can be de ned and optimized. It quirements of a query execution plan. This is because is not clear how e ective such optimizations will be. the space used by individual array blocks can be re- Whether an optimizer will nd appropriate rewrite leased as soon as that block is no longer needed. In rules, and how quickly it will nd them, remain open many cases, only a small part of a large array will need questions. The e ciency of execution of query plans to be represented at any time. consisting of many small, primitive operations is also a potential concern. 7 Related Work A variety of database systems now provide support 8 Summary and Conclusion for user-de ned data types such as arrays. These in- We have described the Array Manipulation Language, clude commercial systems 10, 1] and research systems an algebra for arrays. AML can be used as a query such as Postgres, Paradise, and others 7, 13, 3, 12]. language for array data, and as a view de nition lan- As noted in the introduction, these systems may opti- guage, to de ne new arrays in terms of existing ones. mize relational expressions in which user-de ned func- AML's apply operator can be customized to support tions appear. However, they generally do not optimize a wide variety of user-de ned array functions. AML Page 9

10.expressions can also be optimized. Optimizations can 5] L. M. Haas et al. An optimizer for heteroge- exploit the structural properties of the AML opera- neous systems with nonstandard data and search tions. capabilities. Bulletin of the IEEE Computer So- In 8], Maier and Vance claim that a reasonable al- ciety Technical Committee on Data Engineering, gebra for ordered types, such as arrays, should have a 19(4):37{44, December 1996. small number of operators, should encapsulate a sig- 6] L. Libkin et al. A Query Language for Multidi- ni cant fraction of the control structures used for le processing, should possess non-trivial transformations mensional Arrays: Design, Implementation, and useful for query optimization, and should admit to rea- Optimization Techniques. In Proc. of the SIG- sonably e cient implementation over large arrays. We MOD Conference, pages 228{239, 1996. claim that AML has at least the rst, third, and fourth 7] V. Linnemann et al. Design and Implementation of these properties. The second property, expressive- of an Extensible Database Management System ness, is more di cult to pin down as it is very domain Supporting User De ned Data Types and Func- dependent. However, we note that AML is at least tions. In Proc. of the 14th VLDB Conference, expressive enough to mimic the widely used le-based pages 294{305, 1988. data management packages, such as NetCDF, which support multidimensional arrays. 8] D. Maier and B. Vance. A call to order. In Proc. Array data will be most useful in conjunction with of the ACM SIGACT-SIGMOD-SIGART Sympo- other types of data. In particular, we may wish to sium on Principles of Database Systems, pages associate various sorts of metadata with each array 1{16, 1993. to facilitate the selection of individual arrays from a 9] National Space Science Data Center, Greenbelt, set. Thus, AML will be most useful if it can be imple- Maryland. CDF User's Guide, October 1996. Ver- mented as part of a system capable of integrating data sion 2.6. of various types. One promising approach is o ered by PREDAT OR 12], which views the world as an 10] M. A. Olson et al. Query Processing in a Parallel integrated collection of data types, each of which sup- Object-Relational Database System. Bulletin of ports a declarative, optimizable query language (such the IEEE Computer Society Technical Committee as AML). A similar, but more loosely coupled, ap- on Data Engineering, 19(4):3{10, December 1996. proach is taken by systems like Garlic 5] that attempt to federate a collection of independent and heteroge- 11] R. Rew et al. NetCDF User's Guide. Unidata neous data repositories. Extensible object-relational Program Center, Boulder, Colorado, February systems, such as the Informix Universal Server 10], 1996. Version 2.4. may also serve as useful platforms for implementation 12] P. Seshadri et al. E-ADTs:turbo-charging com- of AML. We are currently considering these implemen- plex data. Bulletin of the IEEE Computer So- tation alternatives. ciety Technical Committee on Data Engineering, 19(4):11{18, December 1996. References 13] M. Stonebraker et al. The implementation of 1] F. Bancilhon and G. Ferran. ODMG-93: The POSTGRES. IEEE Transactions on Knowledge object database standard. Bulletin of the IEEE and Data Engineering, 2(1):125{142, 1990. Computer Society Technical Committee on Data Engineering, 17(4):3{14, December 1994. 14] University of Illinois at Urbana-Champaign. NCSA HDF Calling Interfaces and Utilities, 3.1 2] S.-K. Chang and A. Hsu. Image information edition, July 1990. systems: Where do we go from here? IEEE 15] G. K. Wallace. The JPEG still picture com- Transactions on Knowledge and Data Engineer- pression standard. Communications of the ACM, ing, 4(5):431{442, October 1992. 34(4):30{44, April 1991. 3] D. J. DeWitt et al. Client{Server Paradise. In Proc. of the 20th VLDB Conference, pages 558{ 569, 1994. 4] L. Fegaras and D. Maier. Towards and E ective Calculus for Object Query Languages. In Proc. of the SIGMOD Conference, pages 47{58, 1995. Page 10