Fast Subsequence Matching in Time-Series Databases

We present an efficient indexing method to locate ldimeneional subsequences within a collection of sequences,such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each datasequence into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readilyindexed using traditional spatial access methods, like theR*-tree [9]. In more deteil, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently.

1. Fast Subsequence Matching in Time-Series Databases Clu-istos F~outsos* M. llanganat~an~ Yanais Maaolopoulo$ Department of Computer Science and Institute for Systems Research (ISR) University of Maryland at College Park emaik christos, ranga, manolopo~cs. umd. edu Abstract benefit from such a facility. Specific applications include We present an efficient indexing method to locate 1- the following: dimeneional subsequences witbin a collection of sequences, ● financial, marketing and production time series, such such that the subsequences match a given (query) pattern as stock prices, sales numbers etc. In such databases, within a specified tolerance. The idea is to map each data typical queries would be ‘jind companies whose stock sequence into a small set of multidimensional rectangles prices move similarly’, or ‘find other companies that in feature space. Then, these rectangles can be readily have similar sales patterns with our company’, or indexed using traditional spatial access methods, like the R*-tree [9]. In more deteil, we use a sliding window ‘jind cases in the past that resemble last month’s over the data sequence and extract its features; the result sales pattern of our product) is a trail in feature space. We propose an efficient and ● scientific databases, with time series of sensor effective algorithm to divide such trails into sub-trails, which are subsequently represented by their Minimum Bounding data. For example, in weather data [11], geological, Rectangles (MBRs). We also examine queries of varying environmental, astrophysics [30] databases, etc., we lengths, and we show how to handle each case efficiently. want to ask queries of the form, e.g., ‘jind past days We implemented our method and carried out experiments in which the solar magnetic wind showed patterns on synthetic and real data (stock price movements). We similar to today’s pattern’ to help in predictions of compared the method to sequential scanning, which is the the earth’s magnetic field [30]. only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 Searching for similar patterns in such databases is times. essential, because it helps in predictions, hypothesis testing and, in general, in ‘data mining’ [1, 3, 4] and rule discovery. 1 Introduction For the rest of the paper, we shall use the following The problem we focus on is the design of fast searching notational conventions: If S and Q are two sequences, methods that will search a database with time-series of then: real numbers, to locate subsequences that match a query subsequence, exactly or approximately. Historical, ● Len(S) denotes the length of S temporal [29] and spatio-temporal [5] databaeea will ● S[i : j] denotes the subsequence that includes entries in positions i through j *This research was partially funded by the Institute for Systems Research (ISR), by the National Science Foundation ● S[i] denotes the i-th entry of sequence S under Grants IRI-9205273 and IRI-S958546 (PYI), with matching funds from EMPRESS Software Inc. and Thinking Machines Inc. ● D(S, Q) denotes the distance of the two (equal t Also with IBM Federal Systems Company, Gaitheraburg, MD 20879. lerigth) sequences S and Q. $On sabbatical leave from the Department of Informatics, Aristotle University, Thessaloniki, Greece 54006. Similarity queries can been clsaaified into two cate- gories: Permission to cc y without fee all or part of this material Is . Whole Matching. Given a collection of N data granted providJ that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the sequences of real numbers S1, Sa, . . . . SN and a query title of the publication and its date appear, and notice is given sequence Q, we want to find those data aequencee that copying is by permission of the Association of Computing that are within distance e from Q. Notice that data Machinery. To copy otherwise, or to republish, requires a fee anct/orspecific permission. and query sequences must have the same length. SIGMOD 94- 5/94 Minneapolis, Minnesota, USA (EJ 1994 ACM 0-89791-839-5/94/0005..$3.50 419

2.● Subsequence Matching. Given N data sequences the contributions of the present paper, giving some Sl,sz,..., SN of arbitrary lengths, a query sequence extensions of this technique and outlining some open Q and a tolerance e, we want to identify the data problems. sequences Si (1 < i < N) that cent ain matching subsequences (i.e. subsequences with distance < c 2 Background from Q). Report those data sequences, along with To the best of the authors’ knowledge, this is the first the correct offsets within the data sequences that best work that examines indexing methods for approximate match the query sequence. (We assume that we are subsequence matching in time-series databases. The given a function D(S, Q), which gives the distance of following work is related, in different respects: the sequences S and Q. For example, V() can be the Euclidean distance.) indexing in text [13] and DNA databases [6]. Text and DNA strings can be viewed as l-dimensional The case of ‘whole match’ queries can be handled as sequences; however, they consist of discrete symbols follows [2]: A distance-preserving transform, such as w opposed to continuous numbers, which makes a the Discrete Fourier transform (DFT), can be used difference when we do the feature extraction. to extract ~ features from sequences (eg., the first ~ DFT coefficients), thus mapping them into points ‘whole matching’ approximate queries on time- in the f-dimensional feature space. Subsequently, sequences [2] or on color images [14, 21] or even on any spatial access method (such as R*-trees) can be 3-d MRI brain scans [8]. In all these methods, the used to search for range fapproximate queries. This idea is to use f feature extraction functions to map approach exploits the assumption that data sequences a whole sequence or image into a point in the (d- and query sequences SH have the same length. Here, imensional) feature space [18]; then, spatial access we generalize the problem and present a method to methods may be used to search for similar sequences answer approximate-match queries for subsequences of or images. The resulting index that contains points arbitrary length Len(Q) . The ideal method should in feature space is called F – iradez [2]. fulfill the following requirements: The F-index works as follows: Given N sequences, all ● it should be fast. Sequential scanning and distance of the same length n, we apply the n-point Discrete calculation at each and every possible offset will be Fourier Transform (DFT) and we keep the first few too slow for large databases. coefficients. Let’s assume that we keep f numbers - thus, each sequence is mapped into a point in an f- ● it should be ‘correct’. In other words, it should dimensional space. These points are organized in an return all the qualifying subsequences, without R*-tree, for faster searching. In the typical query, the missing any (i.e., no ‘false dismissals’). Notice user specifies a query sequence Q (of length n again) and that ‘false alarms’ are acceptable, since they can be a tolerance c, requesting all the data sequences that are discarded easily through a post-processing step. within distance e from Q. To resolve this query, (a) we ● the proposed method should require a small space apply the n-point DFT on the sequence Q, we keep the overhead. f features, thus mapping Q into a f-dimensional point ~t in feature space; (b) we use the F-index to retrieve ✠ the method should be dynamic. It should be easy to all the points within distance c from qf; (c) we discard insert and delete sequences, as well as to append new the false alarms (see more explanations in Lemma 1), measurements at the end of a given data sequence. and we return the rest to the user. Here, we generalize the ‘F-index’ method, which was ● the method should handle data sequences of varying designed to handle ‘whole matching’ queries, Our goal length, as well as queries of varying length. is to handle subsequence queries, by mapping data se- The remainder of the paper is organized as follows, quences into a few rectangles in feature space. Since we Section 2 gives some background material on past rely on spatial access methods as the eventual indexing related work, on spatial access methods and on the mechanism, we mention that several multidimensional Discrete Fourier Transform. Section 3 focuses on indexing methods have been proposed, forming three subsequence matching; we propose a new indexing classes: R*-trees [9] and the rest of the R-tree family mechanism and we show how off-th~shelf spatial access [15, 17, 28]; linear quadtrees [26, 24]; and grid-files [22]. methods (and specifically the R*-tree) can be used. To guarantee that the ‘F-index’ method above does Section 4 discusses performance results obtained from not result in any false dismissals, the dist ante in feature experiments on real and synthetic data, which show space should match or underestimate the distance the effectiveness of our method. Section 5 summarizes between two objects. Mathematically, let 01 and 02 be 420

3.two objects (e.g., same-length sequences) with distance where j is the imaginary unit j = H. The signal Z fUIICtiOII Dobject () (e.g., the Euclidean distance) and can be recovered by the inverse transform: F(O1), 17(02) be their feature vectors (e.g., their n-1 first few Fourier coefficients), with distance function Zi=l/@~xFeXp(jzZF~/~) iDO, l,. ..,l–l ~;y~e,() (e.g., the Euclidean distance, again). Then F=o (6) Lemma 1 To guarantee no false dismissals for range X~ is a complex number (with the exception of X., queries, the feature extraction function F() should which is a real, if the signal E is real). The energy E(Z) satisfy the following formula: of a sequence Z is defined as the sum of energies (squares of the amplitude [Zi 1) at every point of the sequence: Djeawre(F(@), F(02)) < ~object(@> 02) (1) Proofl Let Q be the query object, O be a qualifying object, and c be the tolerance. We want to prove that if the object O qualifies for the query, then it will be retrieved when we issue a range query on the feature A fundamental observation for the correctness of our space. That is, we want to prove that method is Parseval’s theorem [23], which states that the DFT preserves the energy of ~ signal: ~object(Q~ 0) < ~ ~ D~eature (F(Q), F(0)) < e (2) Theorem (Parseval). Let X be the Discrete Fourier However, this is obvious, since Transform of the sequence &. Then we have: n-1 Djeata~.(l’(Q)j F(o))< Dobj~~t(Q, 0) < ‘ (s) ~ lzi12= ~ IXF12 (8) D i=O F=o Following [2], we use the Euclidean distance as the Since the DFT is a linear transformation [23], Parseval’s distance function between two sequences, that is, the theorem implies that the DFT also preserves the sum of squared differences. Formally, for two sequences Euclidean distance between two signals F and @ S and Q of the same length 1, we define their distance D(S, Q) as 9(5, y) = qi, i) (9) 1/2 D(S, Q) As an example s ( ~(S[~ i=l of feature - Q[i])2 extraction ) function F() (4) we where ~ respectively. and ~ are Fourier We keep the first few (2-3) coefficients of the DFT as transforms of & and J the features, following the recommendation in [2]. This choose the Discrete Fourier llansform (DFT), for two ‘truncation’ results in under-estimating the distance of reasons: (a) it has been used successfully for ‘whole two sequences (because we ignore positive terms from matching’ [2] and (b) it provides a good, intuitive equation 4) and thus it introduces no false dismissals, example to make the presentation more clear. It should according to Lemma 1. be noted that our method is independent of the specific The truncation will introduce only false alarms, feature extraction function F(), as long as F() satisfies which, for practical sequences, we expect to be few. The the condition of Lemma 1 (Eq. 1). If the distance among reason is that most real sequences fall in the class of objects (data sequences) is the Euclidean distance, the ‘colored noise’, which has a skewed energy spectrum, condition of Lemma 1 is satisfied, by any orthonormal of the form O(F(- b]). This implies that the first few transform, such as, the Discrete Cosine transform coefficients contain most of the energy. Thus, the (DCT) [31], the wavelet transform [25] etc. Next, we first few coefficients give good estimates of the actual give the definition and some properties of the DFT distance of the two sequences. For b = 1, we have the transformation. pink noise, which, according to Birkhoff’s theory [2~ The n-point Discrete Fourier !fkansform [16, 23] of models signals like musical scores and other works of asignal~= [~i]l i = 0, ...,1-1 is defined to be a art. For b = 2, we have the brown noise (also known . sequence X of n complex numbers XF, F = 0, ..., n – 1, as random walk or brownian walk) which models stock given by movements and exchange rates (eg., [10, 20]). For b >2 n-1 we have the black noise whose spectrum is even more XF = l/fi~zi exp(-j2~Fi/n) F = 0,1,.. .,n– 1 skewed than the spectrum of brown noise; black noise i=O models successfully signals like the water level of a river (5) as it varies over time [20]. 421

4. application. For example, in stock price databases, Symbols Definitions. analysts are interested in weekly or monthly patterns N Number of data sequences. because shorter patterns are susceptible to noise [12]. Si The i-th data sequence (1 < i < N). Notice that we never lose the ability to answer shorter Len(S) Length of sequence S. than w queries, because we can always resort to S[k] Thek-the entry of sequence S. sequential scanning. S[i : j] Subsequence of S, including entries in Generalizing the reasoning of the method for ‘whole matching’, we use a sliding window of size w and place positions i through j. it at every possible position (offset), on every data Q A query sequence. sequence. For each such placement of the window, w Minimum query sequence length. we extract the features of the subsequence inside the V(Q, S) (Euclidean) distance between sequences window. Thus, a data sequence of length Len(S) is Q and S of equal length. mapped to a trail in feature space, consisting of Tolerance (max. acceptable distance). Len(S) -w+l points: one point for each possible offset ; Number of featurea. of the sliding window. Figure l(a) gives an example me Marginal cost of a point. of trails. Assume that we have two sequences, S1 and Sz (not shown in the figure), and that we keep the first ~=2 features (eg, the amplitude of the first and second Table 1: Summary of Symbols and Definitions coefficient of the w-point DFT). When the window of length w is placed at offset=O on S1, we obtain the first point of the trail Cl; as the window slidee over S1, we This concludes our discussion on prior work, which obtain the rest of the points of the trail Cl. The trail concentrated on ‘whole match’ queries. Next, we C’z is derived by S2 in the same manner. describe in detail how we handie the requests for Figure 4 gives an example of trails, using a real time- matching subsequences. series (stock-price movements). One straightforward way to index these trails would 3 Proposed Method be to keep track of the individual points of each trail, Here, we examine the problem of subsequence matching. storing them in a spatial access method. We call Specifically, the problem is defined as follows: this method ‘1-naive’ method, where ‘I’ stands for ‘Index’ (as opposed to sequential scanning). When ● We are given a collection of N sequences of real presented with a query of length w and tolerance c, numbers S1, Sz, SN, each one of potentially different we could extract the features of the query and search length. the spatial access method for a range query with radius ● The user specifies query subsequence Q of length e; the retrieved points would correspond to promising Len(Q) (which may vary) and the tolerance q subsequences; after discarding the false alarms (by that is, the maximum acceptable dis-similarity (= retrieving all those subsequences and calculating their distance). actual distance from the query) we would have the desired answer set, Notice that the method will not ● We want to find quickly all the sequences Si ( 1< miss any qualifying subsequence, because it satisfies the i < N), along wit-h the correct offsets k, such’ that condition of Lemma 1. the subsequence S~[k : k+ Len(Q) – 1] matches the However, storing the individual points of the trail in query sequence: D(Q, Si[k : k + Len(Q) – 1]) < E. an R*-tree is inefficient, both in terms of space as well The brute-force solution is to examine sequentially every as search speed. The reason is that, almost every point possible subsequence of the data sequences for a match. in a data sequence will correspond to a point in the We shall refer to this method by ‘SequentialScan’ ~-dimensional feature space, leading to an index with method. Next, we describe a method that uses a a 1:f increase in storage requirements. Moreover, the small space overhead, to achieve order of magnitudes search performance will also suffer because the R-tree savings over the ‘SequentialScan’ method. The main will become tall and slow. As we shall see in the section symbols used through the paper and their definitions with the experiments, the ‘I-naive’ method ended up are summarized in Table 1. being almost twice as slow as the ‘Sequentia/Scan’ . Thus, we want to improve the ‘l-naive’ method, by 3.1 Sketch of the approach - ‘ST-index’ making the representation of the trails more compact. Without loss of generality, we assume that the minimum The solution we propose exploits the fact that query length is w , where w (> 1) depends on the successive points of the trail will probably be similar, 422

5. El trails. Notice that it is possible that MBRs belonging to the same trail may overlap, ss C’2 illustrates. Thus, we propose to map a data sequence into a set of rectangles in feature space. This yields significant improvements with respect to space, as well as with respect to response time, as we shall see in section 4. Each MBR corresponds to a whole sub-trail, that is, points in feature space that correspond to successive positioning of the sliding window on the data sequences. For each such MBR we have to store t,ta,t, t,~d which are the offsets of the first and last C2 such positioning; ● a unique identifier for the data sequence ( and PI A MBR1 ● the extent of the MBR in each dimension m (~l~ow,~l~ig~, ~210w,J’2~ig~, . . .). These MBRs can be subsequently stored in a spatial access method. We have used R*-trees [9], in which case these MBRs are recursively grouped into parent MBRs, grandparent MBR,s etc. Figure l(b) shows how the eight leaf-level MBRs of Figure l(a) will be grouped to form two MBRs at the next higher level, assuming a fanout of 4 (i.e. at moat 4 items per non-leaf node). Note that the higher-level MBRs may contain leaf-level MBRs from different data sequences. For example, in Figure l(b) we remark how the left-side MBR1 contains a part of the south-east curve C2. Figure 2 shows the structure of a leaf node and a non-leaf node. Notice that the non-leaf nodes do not need to carry information about sequence-id’s or offsets (t,ta,~ and trod). Figure 1: Example of (a) dividing trails into sub-trails and MBRs, and (b) grouping of MBRs in larger ones. ..... Fl_miq Fl_max .. . .. Level F2_n@ llmix above leaves because the contents of the sliding window in nearby 1 offsets will be similar, We propose to divide the trail of a given data sequence into sub-trails and represent each Sequmce_id of them with its minimum bounding (hyper)-rectangle .. ... T.- T_end 0,.,. Leaf level (MBR). Thus, instead of storing thousands of points of Fl_min, Fl_max a given trail, we shall store only a few MBR.s. More F2.* F2_max important, at the same time we still guarantee ‘no false dismissals’: when a query arrives, we shall retrieve all Figure 2: Index node layout for the last two levels. the MBRs that intersect the query region; thus, we shall retrieve all the qualifying sub-trails, plus some false This completes the discussion of the structure of our alarms (sub-trails that do not intersect the query region, proposed index, We shall refer to it by ‘ST-index’ , for while their MBR does). ‘Sub-Trail index’. There are two questions that we have Figure l(a) gives an illustration of the proposed to answer, to complete the description of our method. approach. Two trails are drawn; the first curve, labeled Cl (in the north-west side), haa been divided into three ● Insertions: when a new data sequence is inserted, sub-trails (and MBRs), whereas the second one, labeled what is a good way to divide its trail in feature space C’2 (in the south-east side), has been divided in five sub- into sub-trails. 423

6. ● Queries: How to handle queries, and especially the ones that are longer than w. Method Description ‘SequentialScan’ Sequential scan of the whole These are the topics of the next two subsections, database. respectively. ‘I-naive’ Search using an ‘ST-index’ with 1 point per sub-trail. Fz ‘I-$xed’ Search using an ‘ST-indez’ with a fixed number of points per El ;5 sub-trail. :4 ‘I-adaptive’ Search using an ‘ST-indez’ with 1 P; P6 a variable number of points per sub-trail. ik P8 * P2 Table 2: Summary of searching methods and descrip- P9* PI * n tions — sub-trail length is 3 (i.e., = @). The resulting MBRs F1- (Figure 3(a)) are not as good as the MBR.s shown in m Figure 3(b). We collectively refer to all the above heuristics as the ‘I-fixed’ method, because they use QP3 $4 <5 an index, the with some fixed-length ‘I-naive’ method is a special method, when the sub-trail length Thus we are looking sub-trails. for a method Clearly, csse of the is set to 1. that will ‘I-fixed’ be able to adapt to the distribution of the points of the trail. We propose to group points into sub-trails by means of an ‘adaptive’ heuristic, which is based on a greedy c1PI l% algorithm. The algorithm tries to estimate uses a cost function, the number which of disk accesses for each of the options. The resulting indexing method will F] - be called ‘I-adaptive’. This is the lsst of the four alternatives we have introduced. Table 2 lists all of Figure 3: Packing points using (a) a fixed heuristic (sub- them, along with a brief description for each method. trail size = 3), and (b) an adaptive heuristic. To complete the description of the ‘I-adaptive ‘method, we have to define a cost function and the concept of marginal cost of a point. In [19] we developed a for- 3.2 Insertion - Methods to divide trails into mula which, given the sides ~ = (Ll, L2, . . . Ln ) of the sub-trails n-dimensional MBR of a node in an R-tre~, estimates the average number of disk accesses DA(L) that this As we saw before, each data sequence is mapped into node will contribute for the average range query: a ‘trail’ in feature space. Then the question arising is: how should we optimally divide a trail in feature space into sub-trails and eventually MBRs, so that the DA(Z) = fi(L~ + 0.5) (lo) number of disk accesses is minimized? A first idea i=l would be to pack points in sub-trails according to a The formula sssumes that the address space hss been pre-determined, fixed number (e.g., 50). However, there normalized to the unit hyper-cube ( [0, l)n). We use is no justifiable way to decide the optimal value of the expected number of disk accesses DA() as the cost this constant. Another idea would be to use a simple function. The marginal cost (me) of a point is defined as th of the stored sequence for this sub- follows: Consider a sub-trail of k points with an MBR of trail size (e.g. Len(S) ). However, both heuristics sizes Ll, ..., Ln. Then the marginal cost of each point may lead to poor results. Figure 3 illustrates the in this sub-trail is problem of having a pre-determined sub-trail size. It shows a trail with 9 points, and it assumes that the mc = DA(~)/k (11) 424

7.That is, we divide the cost of this MBR equally among ‘D(S, Q)<C ~ 9(S[i:j], Q[i:j])<c (l<i<j<i) the contained points. The algorithm is then as follows: (12) Proofi Since D() is the Euclidean distance, we have /* Algorithm Divide-to-Subtralls */ Assign the first point of the trail in a (trivial) sub-trail (13) k=l FOR each successive point IF it increases the marginal cost of the Since current sub-trail TliEIl start another sub-trail ~(S[k] - Q[k])2 < ~(S[k] - ~k])z (14) ELSE include it in the current sub-trail k=i k=] 3.3 Searching-Queries longer thanw we have In the previous subsection we discussed how to insert a new data sequence in the ‘ST-index’ , using an D(S[i : j], Q[i : j]) = f’(S[k] – Q[k])2 s c (15) ‘adaptive’ heuristic. Here reexamine howto search for k=i subsequences that match the query sequence Q within which completes the proof. ❑ tolerance c. If the query is the shortest allowable Using th~ ‘PrefizS;arch’ method, the query region in (Len(Q) = w), the searching algorithm is relatively feature space is a sphere of radius e, and therefore, it straightforward: has volume proportional to cf. Next, we show how to Algorithm ’SearchJ3hort’ reduce the volume of the query region and subsequently, the query sequence Q is mapped to a point qj in the number of false alarms. Without loss of generality, feature space; the query corresponds to aspherein we assume that Len(Q) is an integral multiple of w; feature space with centerqf andradiuse; if this is not the case, we use Lemma 2 and keep the longest prefix that is a multiple of w. we retrieve thesub-trails whose MBRs intersect the Len(Q) = p w (16) query region using our index; Then, we propose to split the longer query into p then, we examine the corresponding subsequences of pieces of length w each, process each sub-query and the data sequences, to discard the false alarms. merge the results of the sub-queries. This approach Notice that the retrieved MBRs of sub-trails is a takes full advantage of our ‘ST-index’ . Moreover, superset of the sub-trails we should actually retrieve; as we show, the tolerance specified for each sub-query can be reduced to c/@. The final result is that if a sub-trail intersects the query region, its MBR will definitely do so (while the reverse is not necessarily the total query volume in feature space is reduced. true). Thus themethod introduces no false dismissals. The following Lemma establishes the correctness of the proposed method. Consider two sequences Q and S of Handling longer queries (Len(Q) > w) is more the same length Len(Q) = Len(S) = p *w. Consider complicated. Thereason is that the ‘ST-index’’knows’ their p disjoint subsequences q~ = Q[i* w+ 1: (i+ 1) * w] only about subsequences of length w. A straightforward ands~ = S[i*w+l: (i+l)*w], where O~ i<p–1. approach would be to select a subsequence (e.g., the prefix) of Q of length w, and use our ‘ST-indez’ to Lemma 3 If Q and S agree within tolerance e then search for data subsequences that match the prefix of Q at least one of the pairs (si, qi) of corresponding sub- within tolerance ~. We call this method ‘PrejixSearch’. sequences agree within tolerance e~fi: This will clearly return a superset of the qualifying subsequences: a subsequence Tthat is within tolerance 9(Q, S) < c ~ V:~: (D(q~, S~) < c/@ (17) c of the query sequence Q (Len(Q) = Len(T)) will where V indicates disjunction. have all its (sub) subsequences withb tolerance < e from Proof. By contradiction: If all the pairs of subs~ the corresponding subsequence of Q. In general we can quences have distance > ej@, then, by adding all these prove the following lemma: distances, the distance of Q and S will be > c, which is a contradiction. More formally, since for i = O, ..., p – 1 Lemma 2 If two sequences S and Q with the same length 1 agree within tolerance c, then any pair (S[i : j], (i+l).w Q[i : j]) of corresponding subsequences agree within the (18) ‘Z(qi, ‘i)= ~ (WUI - ‘iLd)2 same tolerance. j=i*w+l 425

8.we have that was a real number, having a size of 4 bytes. Figure 4(a) shows a sample of 250 points of such a stock price Vi (D(q~, s~) > C/@) + (19) sequence. The system is written in ‘C’ under AIX, on an IBM RS/6000. Based on the results of [2], we used (i+l)*w ( only the first 3 frequencies of the DFT; thus our feature Vi ~ (!?ihl - ‘iti])’ > e2/P * (20) space has f=6 dimensions (real and imaginary parts of j=i*w+l ) each retained DFT coefficient). Figure 4(b) illustrates p*w a trail in feature space: it is a 2-dimensional ‘phase’ (21) plot of the amplitude of the O-th DFT coefficient vs. X(QUI - shl)’ >p.t2/p= ,2 j=l the amplitude of the l-st DFT coefficient. Figure 4(c) or similarly plots the amplitudes of the l-st and 2-rid DFT D(Q, S) > t (22) coefficients. For both figures the window size w was 512 points. The smooth evolution of both curves justifies which contradicts the hsmothesis. ❑ our method to group successive points of the feature The searching algori~hm that uses Lemma 3 will be space in MBRs. An R*-tree [9] was used to store the called ‘MultiPiece’ search. It works as follows: MBRs of each sub-trail in feature space. Algorithm ‘Search_Long ( ‘MultiPiece’ )’ We carried out experiments to measure the perfor- ● the query sequence Q is broken in p sub-queries mance of the most promising method: ‘I-adaptive’. which correspond to p spheres in feature space with For each setting, we asked queries with variable se- radius c/~; lectivities, and we measured the wall-clock time on a dedicated machine. More specifically, query se- ● we use our ‘ST-indez’ to retrieve the sub-trails quences were generated by taking random offsets into whose MBRs intersect at least one the sub-query the data sequence and obtaining subsequences of length regions; Len(Q) from those offsets. For each such query se- quence, a tolerance ~ was specified and the query was ● then, we examine the corresponding subsequences of run with that tolerance. This method was followed in the data sequences, to discard the false alarms. order to eliminate any bias in the results that may be Next, we compare the two search algorithms caused by the index structure, which is not uniform at ( ‘PrefixSearch’ and ‘MultiPiece’ ) with respect to the the leaf level. Unless otherwise specified, in all the ex- volume they require in feature space. The volume of an periments we used w = 512 and Len(Q) = w. f-dimensional sphere of radius c is given by We carried out four groups of experiments, as follows: K ~f (23) (a) Comparison of the the proposed ‘I-adaptive ‘method against the ‘I-naive’ method (the method that has where K is a constant (K = n for a 2-dimensional space, sub-trails, each one consisting of one point). etc). This is exactly the volume of the ‘R-@Mearch’ algorithm. The ‘MultiPiece’ algorithm yields p spheres, (b) Experiments to compare the response time of our each of volume proportional to method ( ‘1-adaptive’) with sequential scanning for queries of length w. K(e/@f (24) (c) Experiments with queries longer than w. for a total volume of (d) Experiments with a larger, synthetic data set, to K*p*#/@=K*Ef/#2-1 (25) see whether the superiority of our method holds for other dat asets. This means that the proposed ‘iWultiPiece ’ search method is likely to produce fewer false alarms, and Comparison with the ‘I-naive’ method. Figure therefore better response time than the ‘PrejixSearch’ 5 illustrates the index size as a function of the length method, whenever we have j > 2 features. of the sub-trails for the three alternatives ( ‘1-naive’ , ‘I-fixed’ and ‘I-adaptive ‘). Our method requires only 4 Experiments 5Kb, while the ‘I-naive’ method requires %24 MB. The We implemented the ‘ST-index’ method using the ‘I-fixed’ method gives varying results, according to the ‘adaptive’ heuristic ss described in section 3, and we length of its sub-trails. ran experiments on a stock prices database of 329,000 The large size of the index for the ‘I-naive’ method points, obtained from sf i. sant af e. edu. Each point hurts its search performance as well: in our experiments, 426

9. ‘&4 0.1 OM 2.006 0.06 265 I 2s45 0114 2.64 Om 2.SW 11,~ ~ o~ 0.12 0.1s a14 a16 0.16 0.17 0.18 0.10 a2 021 0.0SS0.00 O.WO0.070.075 o,M 0.0$ 50.020.0850,1 O,lM A@2mbafO+iID~~ An@u2n d M DITmdRc+eti (b) (c) Figure 4: (a) Sample stock-price sequence; (b) its trail in the feature space of the O-th and l-st DFT coefficients and (c) the equivalent trail of the l-st and 2-rid DFT coefficients ‘“+08 ~ ❑ I-neive 1e+07 100 , ● j g : ● ● “. . ..” .“ ● *-$,*.:. ● ● * ● ● .”● 10 , **8*” ● .;”” ..* ● ●. ● ●* ●*: ● ● loco” ● .’.:r~} 1000 < 1 10 100 1000 1000o ie-06 1945 O.owl 0.001 0.01 0.1 Average cub-trail length Query Selectivity Figure 5: Index space vs. the average sub-trail length Figure 6: Relative wall clock time VS, selectivity in log- (log-log scales) log scale (Len(Q) =w=512 points). the ‘I-naive’ method was approximately two times Response time - longer queries. Next we examine slower than sequentially scanning the entire database. the relative performance of the two methods for queries that are longer than w. As explained in the previous Response time - ‘Short’ queries. We start examin- section, in these cases we have to split the query and ing our method’s response time using the shortest ac- merge the results ( WultiPiece’ method). As illustrated ceptable queries, that is, queries of length equal to the in Figure 7, again the proposed ‘I-adaptive ‘method window size (Len(Q) = w). We used 512 points for outperforms the sequential scanning, from 2 to 40 times. Len(Q) and w. Figure 6 gives the relative response Synthetic data. In Figure 8 we examine our tech- time of the sequential scanning method (Z) vs. our in- nique’s performance against a time-series database con- dex assisted method (T,, where r stands for ‘R-tree’), sisting of 500,000 points produced by a random-walk by counting the actual wall-clock time for each method. The horizontal axis is the selectivity; both axes are in method. These points were generated with a starting value of 1.5, whereas the step increment on each step logarithmic scales. The query selectivity varies up to was + .001. Again we remark that our method outper- 10% which is fairly large in comparison with our 329,000 forms sequential scanning from 100 to 10 times approx- points time-series database. We see that our method imately for selectivitiea up to 10%. achieves from 3 up to 100 times better response time for selectivities in the range from 10-4 to 10%. 5 Conclusions We carried out similar experiments for Len(Q) = w = 256, 128 etc., and we observed similar behavior and We have presented the detailed design of a method that similar savings. Thus, we omit those plots for brevity. efficiently handles approximate (and exact) queries for Our conclusion is that our method consistently achieves subsequence matching on a stored collection of data large savings over the sequential scanning. sequences. This method generalizes the work in [2], 427

10. I mo ● it is provably ‘correct’, that is, it never misses Relative Measured Wall Clock Time ● 1 qualifying subsequences (Lemmaa 1-3) Notice that the proposed method can be used with any set of feature-extraction functions (in addition to DFT), I as well as with any spatial access method that handles rectangles. Future work includes the extension of this method for 2-dimensional gray-scale images, and in general for n- dimensional vector-fields (such as 2-d color images to answer queries by image content as in [21], or 3-d MRI brain scans to help detect patterns of brain activation I I L-oe 1s-05 O.0001 0.01 0.1 as discussed in [7] etc.) Query Saieetivity References Figure 7: Relative wall clock time vs. selectivity in a [1] R. Agrawal, T. Imielinski, and A. Swami. Database log-log scale (w=128, Len(Q) =512 points). mining: a performance perspective. IEEE tins. lWO \ on Knowledge and Data Engineen”ng, 5(6):914–925, -1 Relative Measured Well Clock Time ● 1993. [2] Rakesh Agrawal, Christos Faloutsos, and Arun Swami. Efficient similarity search in sequence databases. In FODO Conference, Evanston, Illi- nois, October 1993. also available through anonymous ftp, from olym- [3] Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, and Arun Swami. An interval classifier I I for database mining applications. VLDB Conf. Los 10-05 o.omi 0.01 0.1 Proc., pages 560-573, August 1992. Query Seledivily [4] Rakesh Agrawal, Tomasz Imielinski, and Arun Figure 8: Relative wall clock time vs. selectivity for Swami. Mining association rules between sets of random walk data in a log-log scale (Len(Q) =w=512 items in large databases. Proc. ACM SIGMOD, points). pages 207-216, May 1993. [5] K.K. A1-Taha, R.T. Snodgrasa, and M.D. Soo. which examined the ‘whole matching’ case (i.e., all Bibliography on spatiotemporal databases. ACM queries and all the sequences in the time-series database SIGMOD Record, 22(1):59-67, March 1993. had to have the same length). The idea in the proposed method is to map a data sequence into a set of boxes in [6] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, feature space; subsequently, these can be stored in any and D .J, Lipman. A basic local alignment search spatial access method, such as the R*-tree. tool. Journal of Molecular Biology, 1990. The main contribution is that we have designed in detail the fimt, to our knowledge, indexing method for [7] Manish Arya, William Cody, Christos I?aloutsos, Joel Richardson, and Arthur Toga. Qbism: a pro- subsequence matching. The method has the following totype 3-d medical image database system. IEEE desirable features: Data Engineering Bulletin, 16(1):38–42, March ● it achieves orders of magnitude savings over the se- 1993, quential scanning, as it was showed by our experi- [8] Manish Arya, William Cody, Christos Faloutsos, ments on real data, Joel Richardson, and Arthur Toga. Qbism: Ex- ● it requires small space overhead, tending a dbms to support 3d medical images. Tenth Int. C’onf. on Data Engineering (ICDE), ● it is dynamic, and February 1994. (to appear). 428

11. [9] N. Beckmann, H.-P. Kriegel, R. Schneider, and Also available as IBM Research Report RJ 9203 B. Seeger. The r*-tree: an efficient and robust (81511), Feb. 1, 1993, Computer Science. access method for points and rectangles. ACM SIGMOD, pages 322-331, May 1990. [22] J. Nievergelt, H. Hinterberger, and K.C. Sevcik. The grid file: an adaptable, symmetric multikeyfile [10] Christopher Chatfield. The Analysis of Time structure. ACM TODS, 9(1):38–71, March 1984. Series: an Introduction. Chapman and Hall, [23] Alan Victor Oppenheim and Ronald W. Schafer. London & New York, 1984. Third Edition. Digital SignaJ Processing. Prentice-Hall, Engl& [11] Mathematical Committee on Physical and NSF En- wood Cliffs, N. J., 1975. gineering Sciences. Grand Challenges: High Per- [24] J. Orenstein. Spatial query processing in an object- formance Computing and Communications. Na- oriented database system. Proc. ACM SIGMOD, tional Science Foundation, 1992. The FY 1992 U.S. pages 326–336, May 1986. Research and Development Program. [25] Mary Beth Ruskai, Gregory Beylkin, Ronald Coif- [12] Robert D. Edwards and John Magee. Technical man, Ingrid Daubechies, Stephane Mallat, Yves Analysis of Stock llends. John Magee, Springfield, Meyer, and Louise Raphael. Wavelets and Their Massachusetts, 1966. 5th Edition, second printing. Applications. Jones and Bartlett Publishers, [13] C, Faloutsos. Access methods for text. ACM Boston, MA, 1992. Computing Surveys, 17(1):49-74, March 1985. [26] H. Samet. The Design and Analysis of Spatial Data [14] C. Faloutsos, W. Equitz, M. Flickner, W. Niblack, Structures. Addison-Wesley, 1989. D. Petkovic, and R. Barber. Efficient and effective [27] Manfred Schroeder. Fractals, Chaos, Power Laws: querying by image content. Journal of Ini?elligent Minutes From an Injinite Paradise, W,H. Freeman Information Systems, 1993. (to appear). and Company, New York, 1991. [15] A. Guttman. R-trees: a dynamic index structure [28] T, Sellis, N. Roussopoulos, and C. Faloutsos. The for spatial searching. Proc, ACM SIGMOD, pages r+ tree: a dynamic index foi multi-dimensional 47-57, June 1984. objects. In Proc. l%h International Conference on [16] Richard Wesley Hamming. Digital Filters. VLDB, pages 507-518, England,, September 1987. Prentice-Hall Signal Processing Series, Englewood also available aa SRC-TR-87-32, UMIACS-TR-87- Cliffs, N.J., 1977. 3, CS-TR-1795. [17] H. V. Jagadish. Spatial search with polyhedra. [29] R. Stare and Richard Snodgrass. A bibliography Proc. Sixth IEEE Int’1 Conf. on Data Engineering, on temporal databases. IEEE Bulletin on Data February 1990. Engineering, 11(4), December 1988. [18] H,V. Jagadish. A retrieval technique for similar [30] Dimitris Vassiliadis. The input-state space ap- shapes. Proc. ACM SIGMOD Conf., pages 208- proach to the prediction of auroral geomagnetic ac- 217, May 1991. tivity from solar wind variables. Int. Workshop on Applications of Arts’’cial Intelligence in Solar Ter- [19] Ibrahim Kamel and Christos Faloutsos. On packing restrial Physics, September 1993. r-trees. Second Ini. Conf on Information and Knowledge Management (CIKM), November 1993. [31] Gregory K. Wallace. The jpeg still picture compres- sion standard. CA CM, 34(4):31-44, April 1991. [20] B. Mandelbrot. Fractal Geomety of Nature. W.H. Freeman, New York, 1977. [21] Wayne Niblack, Ron Barber, Will Equitz, Myron Flickner, Eduardo Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel Taubin. The qbic project: Querying images by content using color, texture and shape. SPIE 1993 Intl. Symposium on Electronic Imaging: Science and Technology, Conf. 1908, Storage and Retrieval for Image and Video Databases, February 1993. 429