Querying and Mining of Time Series Data: ExperimentalComparison

The last decade has witnessed a tremendous growths of interests in applications that deal with querying and mining of time series data. Numerous representation methods for reduction and similarity measures geared towards time series have been introduced. Each individual work introducing a particular method has made specific claims and, aside from the occasional theoretical justi-fications, provided quantitative experimental observations.However, for the most part, the comparative aspects of these experiments were too narrowly focused on demonstrating the benefits of the proposed methods over some of the previously introduced ones. In order to provide a comprehensive validation, we conducted an extensive set of time series experiments re-implementing 8 different

1. Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures ∗ † ‡ Hui Ding§ Goce Trajcevski§ Peter Scheuermann§ Xiaoyue Wang¶ Eamonn Keogh¶ §hdi117, goce, peters@eecs.northwestern.edu ¶xwang, eamonn@cs.ucr.edu Northwestern University University of California, Riverside Evanston, IL 60208 Riverside, CA 92517 ABSTRACT based services etc. As a consequence, in the last decade The last decade has witnessed a tremendous growths of in- there has been a dramatically increasing amount of interest terests in applications that deal with querying and min- in querying and mining such data which, in turn, resulted ing of time series data. Numerous representation methods in a large amount of work introducing new methodologies for dimensionality reduction and similarity measures geared for indexing, classification, clustering and approximation of towards time series have been introduced. Each individ- time series [13, 17, 22]. ual work introducing a particular method has made spe- Two key aspects for achieving effectiveness and efficiency cific claims and, aside from the occasional theoretical justi- when managing time series data are representation methods fications, provided quantitative experimental observations. and similarity measures. Time series are essentially high di- However, for the most part, the comparative aspects of these mensional data [17] and directly dealing with such data in experiments were too narrowly focused on demonstrating its raw format is very expensive in terms of processing and the benefits of the proposed methods over some of the previ- storage cost. It is thus highly desirable to develop repre- ously introduced ones. In order to provide a comprehensive sentation techniques that can reduce the dimensionality of validation, we conducted an extensive set of time series ex- time series, while still preserving the fundamental charac- periments re-implementing 8 different representation meth- teristics of a particular data set. In addition, unlike canoni- ods and 9 similarity measures and their variants, and testing cal data types, e.g., nominal or ordinal variables, where the their effectiveness on 38 time series data sets from a wide distance definition is straightforward, the distance between variety of application domains. In this paper, we give an time series needs to be carefully defined in order to reflect overview of these different techniques and present our com- the underlying (dis)similarity of such data. This is partic- parative experimental findings regarding their effectiveness. ularly desirable for similarity-based retrieval, classification, Our experiments have provided both a unified validation of clustering and other mining procedures of time series [17]. some of the existing achievements, and in some cases, sug- Many techniques have been proposed in the literature for gested that certain claims in the literature may be unduly representing time series with reduced dimensionality, such as optimistic. Discrete Fourier Transformation (DFT) [13], Single Value Decomposition (SVD) [13], Discrete Cosine Transformation (DCT) [29], Discrete Wavelet Transformation (DWT) [33], 1. INTRODUCTION Piecewise Aggregate Approximation (PAA) [24], Adaptive Today time series data are being generated at an unprece- Piecewise Constant Approximation (APCA) [23], Chebyshev dented speed from almost every application domain, e.g., polynomials (CHEB) [6], Symbolic Aggregate approXima- daily fluctuations of stock market, traces of dynamic pro- tion (SAX) [30], Indexable Piecewise Linear Approximation cesses and scientific experiments, medical and biological ex- (IPLA) [11] and etc. In conjunction with this, there are over perimental observations, various readings obtained from sen- a dozen distance measures for similarity of time series data in sor networks, position updates of moving objects in location- the literature, e.g., Euclidean distance (ED) [13], Dynamic ∗Research supported by the Northrop Grumman Corp. con- Time Warping (DTW) [5, 26], distance based on Longest tract: P.O.8200082518 Common Subsequence (LCSS) [39],Edit Distance with Real †Research supported by the NSF grant IIS − 0325144/003 Penalty (ERP) [8], Edit Distance on Real sequence (EDR) [9], ‡Research supported by NSF Career Award 0237918 DISSIM [14], Sequence Weighted Alignment model (Swale) [31], Spatial Assembling Distance (SpADe) [12] and similar- Permission to make digital or hard copies of portions of this work for ity search based on Threshold Queries (TQuEST) [4]. Many personal or classroom use is granted without fee provided that copies of these work and some of their extensions have been widely Permission are not made to copy without fee or distributed forallprofit or part orof this material commercial is grantedand advantage provided cited in the literature and applied to facilitate query pro- that that the copies copies bearare notnotice this madeand or distributed for direct the full citation commercial on the first page.advantage, the VLDB copyright notice and the title cessing and data mining of time series data. Copyright for components of this worof the publication k owned by othersand its date than VLDB appear, However, with the multitude of competitive techniques, and notice ismust Endowment givenbethat copying is by permission of the Very Large Data honored. Base Endowment. To copy otherwise, we believe that there is a strong need to compare what Abstracting with c redit is permitted. Toor to republish, copy otherwise,totopost on servers republish, or to post on servers or to redistribute to lists requires prior specificfrom the to redistribute to lists, requires a fee and/or special permission might have been omitted in the comparisons. Every newly- publisher, ACM. introduced representation method or distance measure has permission and/or a fee. Request permission to republish from: VLDB ‘08, August 24-30, 2008, Auckland, New Zealand Publications Dept., ACM, Inc. Fax +1 (212)869-0481 or claimed a particular superiority. However, it has been demon- Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00. permissions@acm.org. PVLDB '08, August 23-28, 2008, Auckland, New Zealand Copyright 2008 VLDB Endowment, ACM 978-1-60558-306-8/08/08 1542

2.strated that some empirical evaluations have been inade- a classification of the major techniques arranged in a hier- quate [25] and, worse yet, some of the claims are even con- archy. tradictory. For example, one paper claims “wavelets out- perform the DFT” [34], another claims “DFT filtering per- • Data Adaptive formance is superior to DWT” [19] and yet another claims Piecewise Polynomials o   Interpolation* “DFT-based and DWT-based techniques yield comparable re-   Regression sults” [41]. Clearly these claims cannot all be true. The risk o Adaptive Piecewise Constant Approximation* is that this may not only confuse newcomers and practition- o Singular Value Decomposition* Symbolic ers of the field, but also cause a waste of time and research o   Natural Language efforts due to assumptions based on incomplete or incorrect   Strings • Non-Lower Bounding claims. • SAX* Motivated by these observations, we have conducted the • Clipped Data* most extensive set of time series experiments to-date, re- o Trees • Non-Data Adaptive evaluating the state-of-the-art representation methods and o Wavelets* similarity measures for time series that appeared in high o Random Mappings quality conferences and journals. Specifically, we have re- o Spectral   DFT* implemented 8 different representation methods for time se-   DCT* ries, and compared their pruning power over various time   Chebyshev Polynomials* series data sets. We have re-implemented 9 different simi- o Piecewise Aggregate Approximation* larity measures and their variants, and compared their ef- fectiveness using 38 real world data sets from highly diverse Figure 1: A Hierarchy of Representation Methods application domains. All our source code implementations The representations annotated with an asterisk (∗) in Fig- and the data sets are publicly available on our website [1]. ure 1 have the very desirable property of allowing lower The rest of this paper is organized as follows. Section 2 bounding. That is to say, we can define a distance mea- reviews the concept of time series, and gives an overview surement on the reduced-size (i.e., compressed) representa- of the definitions of different representation techniques and tions that is guaranteed to be less than or equal to the true similarity measures investigated in this work. Section 3 and distance measured on the raw data. It is this lower bound- Section 4 present the main contribution of this work – the ing property that allows using representations to index the results of the extensive experimental evaluations of differ- data with a guarantee of no false negatives [13]. The list ent representation methods and similarity measures, respec- of representations considered in this study includes (in ap- tively. Section 5 concludes the paper and discusses possible proximate order of introduction) DFT, DCT, DWT, PAA, future extensions of the work. APCA, SAX, CHEB and IPLA. The only lower bounding omissions from our experiments below are the eigenvalue 2. PRELIMINARIES analysis techniques such as SVD and PCA [29]. While such Typically, most of the existing work on time series assume techniques give optimal linear dimensionality reduction, we that time is discrete. For simplicity and without any loss believe they are untenable for large data sets. For example, of generality, we make the same assumption here. Formally, while [38] notes that they can transform 70000 time series in a time series data is defined as a sequence of pairs T = under 10 minutes, the assumption is that the data is memory [(p1 , t1 ), (p2 , t2 ), ..., (pi , ti ), ..., (pn , tn )] (t1 < t2 < ... < ti < resident. However, transforming out-of-core (disk resident) ... < tn ), where each pi is a data point in a d-dimensional data sets using these methods becomes unfeasible. Note that data space, and each ti is the time stamp at which pi occurs. the available literature seems to agree with us on this point. If the sampling rates of two time series are the same, one For (at least) DFT, DWT and PAA, there are more than can omit the time stamps and consider them as sequences a dozen projects each that use the representations to index of d-dimensional data points. Such a sequence is called the more than 100000 objects for query-by-humming [18, 45], raw representation of the time series. In reality however, Mo-Cap indexing [7] etc. However we are unaware of any sampling rates of time series may be different. Furthermore, similarly scaled projects that use SVD. some data points of time series may be dampened by noise or even completely missing, which poses additional challenges 2.2 Similarity Measures for Time Series to the processing of such data. For a given time series, its In this section, we review the 9 similarity measures eval- number of data points n is called the length. The portion of a uated in this work, summarized in Figure 2. Given two time series between two points pi and pj (inclusive) is called time series T1 and T2 , a similarity function Dist calcu- a segment and is denoted as sij . In particular, a segment lates the distance between the two time series, denoted by between two consecutive points is called a line segment. Dist(T1 , T2 ). In the following we will refer to distance mea- In the following subsections, we briefly review the repre- sures that compare the i−th point of one time series to the sentation methods and similarity measures studied in this i−th point of another as lock-step measures (e.g., Euclidean work. We observe that this is not meant to be a complete distance and the other Lp norms), and distance measures survey for the respective field and is only intended to pro- that allow comparison of one-to-many points (e.g., DTW) vide the readers with a necessary background for following and one-to-many/one-to-none points (e.g., LCSS) as elastic our experimental evaluations. measures. The most straightforward similarity measure for time se- 2.1 Representation Methods for Time Series ries, is the Euclidean Distance [13] and its variants, based There are many time series representations proposed to on the common Lp -norms [43]. In particular, in this work support similarity search and data mining. Figure 1 shows we used L1 (Manhattan), L2 (Euclidean) and L∞ (Maxi- 1543

3. • Lock-step Measure along the temporal dimension, using a warping threshold δ. o Lp-norms A lower-bounding measure and indexing technique for LCSS ✁ L1-norm (Manhattan Distance) ✁ L2-norm (Euclidean Distance) are introduced in [40]. ✁ Linf-norm EDR [9] is another similarity measure based on the edit o DISSIM distance. Similar to LCSS, EDR also uses a threshold pa- • Elastic Measure o Dynamic Time Warping (DTW) rameter ε, except its role is to quantify the distance between o Edit distance based measure a pair of points to 0 or 1. Unlike LCSS, EDR assigns penal- ✁ Longest Common SubSequence (LCSS) ties to the gaps between two matched segments according ✁ Edit Sequence on Real Sequence (EDR) ✁ Swale to the lengths of the gaps. ✁ Edit Distance with Real Penalty (ERP) The ERP distance [8] attempts to combine the merits of • Threshold-based Measure DTW and EDR, by using a constant reference point for com- o Threshold query based similarity search (TQuEST) puting the distance between gaps of two time series. Essen- • Pattern-based Measure o Spatial Assembling Distance (SpADe) tially, if the distance between two points is too large, ERP simply uses the distance value between one of those point and the reference point. Figure 2: A Summary of Similarity Measures Recently, a new approach for computing the edit distance based similarity measures was proposed in [31]. Whereas traditional tabular dynamic programming was used for com- mum) norms (c.f. [43]). In the sequel, the terms Euclidean puting DTW, LCSS, EDR and ERP, a matching threshold distance and L2 norm will be used interchangeably. Besides is used to divide the data space into grid cells and, subse- being relatively straightforward and intuitive, Euclidean dis- quently, matching points are found by hashing. The similar- tance and its variants have several other advantages. The ity model Swale is proposed that rewards matching points complexity of evaluating these measures is linear, and they and penalizes gaps. In addition to the matching threshold ε, are easy to implement and indexable with any access method Swale requires the tuning of two parameters: the matching and, in addition, are parameter-free. Furthermore, as we reward weight r and the gap penalty weight p. will present, the Euclidean distance is surprisingly compet- The TQuEST distance [4] introduced a rather novel ap- itive with other more complex approaches, especially if the proach to computing the similarity measure between time se- size of the training set/database is relatively large. However, ries. The idea is that, given a threshold parameter τ , a time since the mapping between the points of two time series is series is transformed into a sequence of so-called threshold- fixed, these distance measures are very sensitive to noise and crossing time intervals, where the points within each time misalignments in time, and are unable to handle local time interval have a value greater than a given τ . Each time in- shifting, i.e., similar segments that are out of phase. terval is then treated as a point in a two dimensional space, The DISSIM distance [14] aims at computing the similar- where the starting time and ending time constitute the two ity of time series with different sampling rates. However, dimensions. The similarity between two time series is then the original similarity function is numerically too difficult defined as the Minkowski sum of the two sequences of time to compute, and the authors proposed an approximated dis- interval points. tance with a formula for computing the error bound. Finally, SpADe [12] is a pattern-based similarity measure Inspired by the need to handle time warping in similar- for time series. The algorithm finds out matching segments ity computation, Berndt and Clifford [5] introduced DTW, within the entire time series, called patterns, by allowing a classic speech recognition tool, to the data mining com- shifting and scaling in both the temporal and amplitude munity, in order to allow a time series to be “stretched” dimensions. The problem of computing similarity value be- or “compressed” to provide a better match with another tween time series is then transformed to the one of finding time series. Several lower bounding measures have been in- the most similar set of matching patterns. One disadvantage troduced to speed up similarity search using DTW [21, 26, of SpADe is that it requires tuning a number of parameters, 27, 44], and it has been shown that the amortized cost for such as the temporal scale factor, amplitude scale factor, computing DTW on large data sets is linear [21, 26]. The pattern length, sliding step size etc. original DTW distance is also parameter free, however, as has been reported in [26, 40] enforcing a temporal constraint δ on the warping window size of DTW not only improves its 3. COMPARISON OF TIME SERIES REP- computation efficiency, but also improves its accuracy for measuring time series similarity, as extended warping may RESENTATIONS introduce pathological matchings between two time series We compare all the major time series representations, in- and distort the true similarity. The constraint warping is cluding SAX, DFT, DWT, DCT, PAA, CHEB, APCA and also utilized for developing the lower-bounding distance [26] IPLA. Note that any of these representations can be used to as well as for indexing time series based on DTW [40]. index the Euclidean Distance, the Dynamic Time Warping, Another group of similarity measures for time series are and at least some of the other elastic measures. While var- developed based on the concept of the edit distance for ious subsets of these representations have been compared strings. The best known such distance is the LCSS distance, before, this is the first time they are all been compared utilizing the longest common subsequence model [3, 39]. To together. An obvious question to consider is what metric adapt the concept of matching characters in the settings of should should be used for comparison. We believe that wall time series, a threshold parameter ε was introduced, stat- clock time is a poor choice, because it may be open to im- ing that two points from two time series are considered to plementation bias [25]. Instead, we believe that using the match if their distance is less than ε. The work reported tightness of lower bounds (TLB) is a very meaningful mea- in [39] also considered constraining the matching of points sure [24], and this also appears to be the current consensus 1544

4.in the literature [6, 8, 11, 21–23, 26, 35, 40]. We note that all the representation methods studied in this paper allow lower SAX, DCT, ACPA, DFT, PAA/DWT, CHEB, IPLA bounding. 0.8 T LB = LowerBoundDist(T, S)/T rueEuclideanDist(T, S) 0.6 0.4 0.2 1920 where T and S are the two time series. The advantage of 0 1440 960 480 TLB is two-fold: foetal_ecg (excerpt) 8 10 1. It is a completely implementation-free measure, inde- 4 6 1 20 40 pendent of hardware and software choices, and is there- fore completely reproducible. 2. The TLB allows a very accurate prediction of index- Figure 4: The tightness of lower bounds for various ing performance. time series representations on a relatively bursty data If the value of TLB is zero, then any indexing technique is set (see inset) condemned to retrieving every time series from the disk. If the value of TLB is one, then after some trivial processing in main memory, we could simply retrieve one object from disk and guarantee that we had the true nearest neighbor. Note that the speedup obtained is generally non-linear in TLB, that is to say, if one representation has a lower bound that is twice as large as another, we can usually expect a much greater than two-fold decrease in disk accesses. We randomly sampled T and S (with replacement) 1000 times Figure 5: The tightness of lower bounds for various for each combination of parameters. We vary the time series time series representations on a periodic data set of length {480, 960, 1440, 1920} and the number of coefficients tide levels per time series available to the dimensionality reduction ap- proach {4, 6, 8, 10} (each coefficient takes 4 bytes). For SAX, is very little to choose between representations in terms of we hard coded the cardinality to 256. Figure 3 shows the pruning power. result of one such experiment with an ECG data set. 4. COMPARISON OF TIME SERIES SIMI- LARITY MEASURES In this section we present our experimental evaluation on the accuracy of different similarity measures. 4.1 The Effect of Data Set Size on Accuracy and Speed We first discuss an extremely important finding which, in Figure 3: The tightness of lower bounds for various some circumstances makes some of the previous findings on time series representations on an ECG data set efficiency, and the subsequent findings on accuracy, moot. The result of this experiment may at first be surprising, it This finding has been noted before [35], but does not seem shows that there is very little difference between represen- to be appreciated by the database community. tations, in spite of claims to the contrary in the literature. For an elastic distance measure, both the accuracy of clas- However, we believe that most of these claims are due to sification (or precision/recall of similarity search), and the some errors or bias in the experiments. For example, it amortized speed, depend critically on the size of the data was recently claimed that DFT is much worse than all the set. In particular, as data sets get larger, the amortized other approaches [11], however it appears that the complex speed of elastic measures approaches that of lock-step mea- conjugate property of DFT was not exploited. As another sures, however the accuracy/precision of lock-step measures example, it was claimed “it only takes 4 to 6 Chebyshev co- approaches that of the elastic measures. This observation efficients to deliver the same pruning power produced by 20 has significant implications for much of the research in the APCA coefficients” [6], however this claim has since been literature. Many papers claim something like “I have shown withdrawn by the authors [2]. Of course there is some vari- on these 80 time series that my elastic approach is faster ability and difference depending on data set. For example, than DTW and more accurate that Euclidean distance, so if on highly periodic data set the spectral methods are better, you want to index a million time series, use my method”. and on bursty data sets APCA can be significantly better, However our observation suggests that even if the method as shown in Figure 4. is faster than DTW, the speed difference will decrease for In contrast, in Figure 5 we can see that highly periodic larger data sets. Furthermore, for large data sets, the differ- data can slightly favor the spectral representations (DCT, ences in accuracy/precision will also diminish or disappear. DFT, CHEB) over the polynomial representations (SAX, To demonstrate our claim we conducted experiments on two APCA, DWT/PAA, IPLA). highly warped data sets that are often used to highlight the However it is worth noting that the differences presented superiority of elastic measures, Two-Patterns and CBF. Be- in these figures are the most extreme cases found in a search cause these are synthetic data sets we have the luxury of over 80 diverse data sets from the publicly available UCR creating as many instances as we like, using the data gen- Time Series Data Mining Archive [20]. In general, there eration algorithms proposed in the original papers [15, 16]. 1545

5.However, it is critical to note that the same effect can be that need to be tuned for each similarity measure. We then seen on all the data sets considered in this work. For each present the results of our experiments and discuss several problem we created 10000 test time series, and increasingly interesting findings. large training data sets of size 50, 100, 200, . . ., 6400. We measured the classification accuracy of 1NN for the various 4.2.1 Accuracy Evaluation Framework data sets (explained in more detail in Section 4.2.1), using Accuracy evaluation answers one of the most important both Euclidean distance and DTW with 10% warping win- questions about a similarity measure: why is this a good dow, Figure 6 shows the results. measure for describing the (dis)similarity between time se- ries? Surprisingly, we found that accuracy evaluation is usually insufficient in existing literature: it has been either based on subjective evaluation, e.g., [4, 9], or using clus- tering with small data sets which are not statistically sig- nificant, e.g., [31, 40]. In this work, we use an objective evaluation method recently proposed [25]. The idea is to use a one nearest neighbor (1NN) classifier [17, 32] on la- belled data to evaluate the efficacy of the distance measure used. Specifically, each time series has a correct class la- bel, and the classifier tries to predict the label as that of its nearest neighbor in the training set. There are several ad- vantages with this approach. First, it is well known that the underlying distance metric is critical to the performance of 1NN classifier [32], hence, the accuracy of the 1NN classifier Figure 6: The error rate for 1-Nearest Neighbor Clas- directly reflects the effectiveness of the similarity measure. sification for increasingly large instantiations of two Second, the 1NN classifier is straightforward to implement classic time series benchmarks and is parameter free, which makes it easy for anyone to Note that for small data sets, DTW is significantly more reproduce our results. Third, it has been proved that the accurate than Euclidean distance in both cases. However, error ratio of 1NN classifier is at most twice the Bayes er- for CBF, by the time we have a mere 400 time series in our ror ratio [36]. Finally, we note that while there have been training set, there is no statistically significant difference. attempts to classify time series with decision trees, neural For Two-Patterns it takes longer for Euclidean Distance to networks, Bayesian networks, supporting vector machines converge to DTW’s accuracy, nevertheless, by the time we etc., the best published results (by a large margin) come have seen a few thousand objects there is no statistically from simple nearest neighbor methods [42]. significant difference. This experiment can also be used to demonstrate our Algorithm 1 Time Series Classification with 1NN Classifier claim that amortized speed of a (lower-boundable) elastic method approaches that of Euclidean distance. Recall that Input: Labelled time series data set T, similarity measure Euclidean distance has a time complexity of O(n) and that operator SimDist, number of crosses k a single DTW calculation has a time complexity of O(nw), Output: Average 1NN classification error ratio and stan- where w is the warping window size. However for similarity dard deviation search or 1NN classification, the amortized complexity of 1: Randomly divide T into k stratified subsets T1 , . . . , Tk DTW is O((P · n) + (1 − P ) · nw), where P is the fraction of 2: Initialize an array ratios[k] DTW calculations pruned by a linear time lower bound such 3: for Each subset Ti of T do as LB Keogh. A similar result can be achieved for LCSS as 4: if SimDist requires parameter tuning then well. In the Two-Pattern experiments above, when classi- 5: Randomly split Ti into two equal size stratified sub- fying with only 50 objects, P = 0.1, so we are forced to do sets Ti1 and Ti2 many full DTW calculations. However, by the time we have 6: Use Ti1 for parameter tuning, by performing a 6400 objects, we empirically find out that P = 0.9696, so leave-one-out classification with 1NN classifier about 97% of the objects are disposed of in the same time 7: Set the parameters to values that yields the mini- as takes to do a Euclidean distance calculation. To ground mum error ratio from the leave-one-out tuning pro- this into concrete numbers, it takes less that one second to cess find the nearest neighbor to a query in the database of 6400 8: Use Ti as the training set, T − Ti as the testing set Two-Pat time series, on our off-the-shelf desktop, even if we 9: ratio[i] ← the classification error ratio with 1NN clas- use the pessimistically wide warping window. This time is sifier for just sequential search with a lower bound, no attempt 10: return Average and standard deviation of ratios[k] was made to index the data. To summarize, many of the claims over who has the fastest To evaluate the effectiveness of each similarity measure, or most accurate distance measure have been biased by the we use a cross-validation algorithm as described in Algo- lack of tests on very (or even slightly) large data sets. rithm 1, based on the approach suggested in [37]. We first use a stratified random split to divide the input data set into 4.2 Accuracy of Similarity Measures k subsets for the subsequent classification (line 1), to min- In this section, we evaluate the accuracy of the similar- imize the impact of skewed class distribution. The number ity measures introduced in Section 2. We first explain the of cross validations k is dependent on the data sets and we methodology of our evaluation, as well as the parameters explain shortly how we choose the proper value for k. We 1546

6.then carry out the cross validation, using one subset at a ries n. An additional parameter for LCSS, which is also time for the training set of the 1NN classifier, and the rest used in EDR and Swale, is the matching threshold ε. We k − 1 subsets as the testing set (lines 3 − 9). If the similarity search for the optimal threshold starting from 0.02 · Stdv up measure SimDist requires parameter tuning, we divide the to Stdv, where Stdv is the standard deviation of the data training set into two equal size stratified subsets, and use set. Swale has another two parameters, the matching reward one of the subset for parameter tuning (lines 4 − 7). We weight and the gap penalty weight. We fix the matching re- perform an exhaustive search for all the possible (combina- ward weight to 50 and search for the optimal penalty weight tions of) value(s) of the similarity parameter, and conduct from 0 to 50, as suggested by the authors. Although the a leave-one-out classification test with a 1NN classifier. We warping window size can also be constrained for EDR, ERP record the error ratios of the leave-one-out test, and use and Swale, due to limited computing resource and time, we the parameter values that yield the minimum error ratio. only consider full matching for these distance measures in Finally, we report the average error ratio of the 1NN classi- our current experiments. For TQuEST, we search for the fication over the k cross validations, as well as the standard optimal querying threshold from Avg − Stdv to Avg + Stdv, deviation (line 10). where Avg is the average of the time series data set. For Algorithm 1 requires that we provide an input k for the SpADe, we tune four parameters based on the original im- number of cross validations. In our experiments, we need to plementation and use the parameter tuning strategy, i.e. take into consideration the impact of training data set size search range, step size, as suggested by the authors. In discussed in Section 4.1. Therefore, our selection of k for Table 1, plength is the length of the patterns, ascale and each data set tries to strike a balance between the following tscale are the maximum amplitude and temporal scale dif- factors: ferences allowed respectively, and slidestep is the minimum 1. The training set size should be selected to enable dis- temporal difference between two patterns. criminativity, i.e., one can tell the performance difference between different distance measures. 4.3 Analysis of Classification Accuracy 2. The number of items in the training set should be In order to provide a comprehensive evaluation, we per- large enough to represent each class. This is especially form the experiments on 38 diverse time series data sets, important when the distance measure need parameter provided by the UCR Time Series repository [20], which tuning. make up somewhere between 90% and 100% of all publicly 3. The number of cross validations should be between available, labelled time series data sets in the world. For sev- 5 − 20 in order to minimize bias and variation, as recom- eral years everyone in the data mining/database community mended in [28]. has been invited to contribute data sets to this archive, and The actual number of splits is empirically selected such that 100% of the donated data sets have been archived. This en- the training error for 1NN Euclidean distance (which we use sures that the collection represents the interest of the data as a comparison reference) is not perfect, but significantly mining/database community, and not just one group. All better than the default rate. the data sets have been normalized to have a maximum Several of the similarity measures that we investigated scale of 1.0 and all the time series are z-normalized. The require the setting of one or more parameters. The proper entire simulation was conducted on a computing cluster in values for these parameters are key to the effectiveness of the the Northwestern University, with 20 multi-core worksta- measure. However, most of the time only empirical values tions running for over a month. The results are presented are provided for each parameter in isolation. In our experi- in Table 2. Due to limited space, we only show the average ments, we perform an exhaustive search for all the possible error ratio of the similarity measures on each data set. More values of the parameters, as described in Table 1. detailed results, such as the standard deviation of the cross validations, are hosted on our web site [1]. Parameter Min Value Max Value Step Size To provide a more intuitive illustration of the performance DTW.δ 1 25% · n 1 LCSS.δ 1 25% · n 1 of the similarity measures compared in Table 2, we now use LCSS.ε 0.02 · Stdv Stdv 0.02 · Stdv scatter plots to conduct pair-wise comparisons. In a scatter EDR.ε 0.02 · Stdv Stdv 0.02 · Stdv plot, the error ratios of the two similarity measures under Swale.ε 0.02 · Stdv Stdv 0.02 · Stdv comparison are used as the x and y coordinates of a dot, Swale.reward 50 50 - where each dot represent a particular data set. Where a Swale.penalty 0 reward 1 scatter plot has the label “A vs B”, a dot above line indicates TQuEST.τ Avg − Stdv Avg + Stdv 0.02 · Stdv that A is more accurate than B (since these are error ratios). SpADe.plength 8 64 8 The further a dot is from the line, the greater the margin SpADe.ascale 0 4 1 SpADe.tscale 0 4 1 of accuracy improvement. The more dots on one side of SpADe.slidestep plength/32 plength/8 plength/32 the line indicates that the worse one similarity measure is compared to the other . First, we compare the different variances of Lp -norms. Table 1: Parameter Tuning for Similarity Measures Figure 7 shows that the Euclidean distance and the Man- hattan distance have a very close performance, while both For DTW and LCSS measures, a common optional pa- largely outperforms the L∞ -norm. This is as expected since rameter is the window size δ that constrains the temporal the L∞ -norm uses the maximum distance between two sets warping, as suggested in [40]. In our experiments we con- of time series points, and is more sensitive to noise [17]. sider both the version of distance measures without warping Next we illustrate the performance of DTW against Eu- and with warping. For the latter case, we search for the best clidean. Figure 8 (a) shows that full DTW is clearly supe- warping window size up to 25% of the length of the time se- rior over Euclidean on the data sets we tested. Figure 8 (b) 1547

7. (a) Euclidean vs L norm (b) Euclidean vs L norm 1 ∞ 1 1 (a) Euclidean vs TQuEST (b) Full DTW vs TQuEST 1 1 0.8 0.8 0.8 0.8 L norm L norm 0.6 0.6 TQuEST TQuEST 0.6 0.6 ∞ 1 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0 0 0 0.5 1 0 0.5 1 Euclidean Euclidean 0 0 0 0.5 1 0 0.5 1 Euclidean Full DTW Figure 7: Accuracy of Various Lp -norms, above the line Euclidean outperforms L1 norm/L∞ norm Figure 10: Accuracy of TQuEST, above the line Eu- (a) Euclidean vs Full DTW (b) Constrained DTW vs Full DTW 1 1 clidean/Full DTW outperforms TQuEST 0.8 0.8 (a) Euclidean vs Full LCSS (b) Full DTW vs Full LCSS 1 1 Full DTW Full DTW 0.6 0.6 0.8 0.8 0.4 0.4 Full LCSS Full LCSS 0.6 0.6 0.2 0.2 0.4 0.4 0 0 0 0.5 1 0 0.5 1 Euclidean Constrained DTW 0.2 0.2 0 0 0 0.5 1 0 0.5 1 Figure 8: Accuracy of DTW, above the line Eu- Euclidean Full DTW clidean/Constrained DTW outperforms Full DTW Figure 11: Accuracy of LCSS, above the line Eu- shows that the effectiveness of constrained DTW is the same clidean/Full DTW outperforms Full LCSS (or even slightly better) than that of full DTW. This means that we could generally use the constrained DTW instead (a) Euclidean vs EDR (b) Full DTW vs EDR 1 1 of DTW, to reduce the time for computing the distance and to utilize proposed lower bounding techniques [26]. 0.8 0.8 Unless otherwise stated, in the following we compare the 0.6 0.6 EDR EDR rest of the similarity measures against Euclidean distance and full DTW, since Euclidean distance is the fastest and 0.4 0.4 most straightforward measure, and DTW is the oldest elastic 0.2 0.2 measure. 0 0 0 0.5 1 0 0.5 1 Euclidean Full DTW (a) Euclidean vs DISSIM (b) Full DTW vs DISSIM 1 1 0.8 0.8 Figure 12: Accuracy of EDR, above the line Eu- clidean/Full DTW outperforms EDR DISSIM DISSIM 0.6 0.6 (a) Euclidean vs ERP (b) Full DTW vs ERP 0.4 0.4 1 1 0.2 0.2 0.8 0.8 0 0 0.6 0.6 0 0.5 1 0 0.5 1 ERP ERP Euclidean Full DTW 0.4 0.4 0.2 0.2 Figure 9: Accuracy of DISSIM, above the line Eu- clidean/Full DTW outperforms DISSIM 0 0 0 0.5 1 0 0.5 1 Euclidean Full DTW The performance of DISSIM against that of Euclidean and DTW are shown in Figure 9. It can be observed that the accuracy of DISSIM is slightly better than Euclidean Figure 13: Accuracy of ERP, above the line Eu- distance, however, it is apparently inferior to DTW. clidean/Full DTW outperforms ERP The performance of TQuEST against that of Euclidean and DTW are shown in Figure 10. On most of the data Figure 13 respectively. All three distances outperform Eu- sets, TQuEST is worse than Euclidean and DTW distances. clidean distance by a large percentage. On the other hand, While the outcome of this experiment cannot account for while it is commonly believed that these edit distance based the usefulness of TQuEST, it indicates that there is a need similarity measures are superior to DTW [8, 10, 12], our ex- to investigate the characteristics of the data set for which periments show that this is generally not the case. Only TQuEST is a favorable measure. EDR is potentially slightly better than full DTW, whereas The performance of LCSS, EDR and ERP against that of the performance of LCSS and ERP are very close to DTW. Euclidean and DTW are shown in Figure 11, Figure 12 and Even for EDR, a more formal analysis using two-tailed, 1548

8. Full LCSS vs Constrained LCSS 1 Given the small tuning data sets, it is very difficult to pick the right values. However, we note again that the outcome 0.8 of this experiment cannot account for the utility of SpADe. Constrained LCSS For example, one major contribution of SpADe is to detect 0.6 interesting patterns online for stream data. 0.4 In summary, we found through experiments that there is no clear evidence that there exists one similarity measure 0.2 that is superior to others in the literature, in terms of accu- racy. While some similarity measures are more effective on 0 0 0.2 0.4 0.6 Full LCSS 0.8 1 certain data sets, they are usually inferior on some others data sets. This does not mean that the time series commu- Figure 14: Accuracy of Constrained LCSS, above the nity should settle with the existing similarity measures, – line Full LCSS outperforms Constrained LCSS quite the contrary. However, we believe that more caution need to be exercised to avoid making the kind of mistakes paired t-test is required to reach any statistically signifi- we illustrate in the Appendix. cant conclusion [37]. We also studied the performance of constrained LCSS, as shown in Figure 14. It can be ob- 5. CONCLUSION & FUTURE WORK served that the constrained version of LCSS is even slightly In this paper, we conducted an extensive experimental better the unconstrained one, while it also reduces the com- consolidation on the state-of-the-art representation meth- putation cost and gives rise to an efficient lower-bounding ods and similarity measures for time series data. We re- measure [40]. implemented and evaluated 8 different dimension-reduction representation methods, as well as 9 different similarity mea- (a) Euclidean vs Swale (b) Full LCSS vs Swale sures and their variants. Our experiments are carried on 38 1 1 diverse time series data sets from various application do- 0.8 0.8 mains. Based on the experimental results we obtained, we make the following conclusions. 0.6 0.6 Swale Swale 1. The tightness of lower bounding, thus the pruning 0.4 0.4 power, thus the indexing effectiveness of the different 0.2 0.2 representation methods for time series data have, for the most part, very little difference on various data sets. 0 0 0 0.5 1 0 0.5 1 2. For time series classification, as the size of the train- Euclidean Full LCSS ing set increases, the accuracy of elastic measures con- verge with that of Euclidean distance. However, on small Figure 15: Accuracy of Swale, above the line Eu- data sets, elastic measures, e.g., DTW, LCSS, EDR and clidean/Full LCSS outperforms Swale ERP etc. can be significantly more accurate than Eu- Next, we compare the performance of Swale against that clidean distance and other lock-step measures, e.g., L∞ - of Euclidean distance and LCSS, as Swale aims at improving norm, DISSIM. the effectiveness of LCSS and EDR. The results are shown 3. Constraining the warping window size for elastic mea- in Figure 15, and suggests that Swale is clearly superior to sures, such as DTW and LCSS, can reduce the compu- Euclidean distance, and yields an almost identical accuracy tation cost and enable effective lower-bounding, while as LCSS. yielding the same or even better accuracy. 4. The accuracy of edit distance based similarity mea- (a) Euclidean vs SpADe (b) Full DTW vs SpADe sures, such as LCSS, EDR and ERP are very close to that 1 1 of DTW, a 40 year old technique. In our experiments, 0.8 0.8 only EDR is potentially slightly better than DTW. 5. The accuracy of several novel types of similarity mea- SpADe SpADe 0.6 0.6 sures, such as TQuEST and SpADe, are on general infe- 0.4 0.4 rior to elastic measures. 0.2 0.2 6. If a similarity measure is not accurate enough for the 0 0 task, getting more training data really helps. 0 0.5 1 0 0.5 1 Euclidean Full DTW 7. If getting more data is not possible, then trying the other measures might help, however, extreme care must be taken to avoid overfitting. Figure 16: Accuracy of SpADe, above the line Eu- As an additional comment, but not something that can be clidean/Full DTW outperforms SpADe conclusively validated from our experiments, we would like Finally, we compare the performance of SpADe against to bring an observation which, we hope, may steer some in- that of Euclidean distance and DTW. The results are shown teresting directions of future work. Namely, when pair-wise in Figure 16. In general, the accuracy of SpADe is close to comparison is done among the methods, in few instances we that of Euclidean but is inferior to DTW distance, although have one method that has worse accuracy than the other in on some data sets SpADe outperforms the other two. We majority of the data sets, but in the ones that it is better, it believe one of the biggest challenges for SpADe is that it does so by a large margin. Could it be due to some intrinsic has a large number of parameters that need to be tuned. properties of the data set? If so, could it be that those prop- 1549

9.erties may have a critical impact on which distance measure SIGMOD Conference, 2005. should be applied? We believe that in the near future, the ¨ [10] L. Chen, M. T. Ozsu, and V. Oria. Using multi-scale research community will generate some important results histograms to answer pattern existence and shape along these lines. match queries. In SSDBM, 2005. As an immediate extension, we plan to conduct more rig- [11] Q. Chen, L. Chen, X. Lian, Y. Liu, and J. X. Yu. orous statistical analysis on the experimental results we ob- Indexable PLA for Efficient Similarity Search. In tained. We will also extend our evaluation on the accuracy VLDB, 2007. of the similarity measures to more realistic settings, by al- [12] Y. Chen, M. A. Nascimento, B. C. Ooi, and A. K. H. lowing missing points in the time series and adding noise to Tung. SpADe: On Shape-based Pattern Detection in the data. Another extension is to validate the effectiveness Streaming Time Series. In ICDE, 2007. of some of the existing techniques in expediting similarity [13] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. search using the respective distance measures. Fast Subsequence Matching in Time-Series Databases. In SIGMOD Conference, 1994. 6. ACKNOWLEDGEMENT & DISCLAIMER [14] E. Frentzos, K. Gratsias, and Y. Theodoridis. We would like to sincerely thank all the donors of the data Index-based most similar trajectory search. In ICDE, sets, without which this work would not have been possible. 2007. We would also like to thank Lei Chen, Jignesh Patel, Beng [15] P. Geurts. Pattern Extraction for Time Series Chin Ooi, Yannis Theodoridis and their respective team Classification. In PKDD, 2001. members for generously providing the source code to their [16] P. Geurts. Contributions to Decision Tree Induction: similarity measures as references and for providing timely bias/variance tradeoff and time series classification. help on this project. We would like to thank Michalis Vla- PhD thesis, University of Lige, Belgium, May 2002. chos, Themis Palpanas, Chotirat (Ann) Ratanamahatana [17] Jiawei Han and Micheline Kamber. Data Mining: and Dragomir Yankov for useful comments and discussions. Concepts and Techniques. Morgan Kaufmann Needless to say, any errors herein are the responsibility of the Publishers, CA, 2005. authors only. We would also like to thank Gokhan Memik [18] I. Karydis, A. Nanopoulos, A. N. Papadopoulos, and and Robert Dick at Northwestern University for providing Y. Manolopoulos. Evaluation of similarity searching the computing resources for our experiments. methods for music data in P2P networks. IJBIDM, We have made the best effort to faithfully re-implement 1(2), 2005. all the algorithms, and evaluate them in a fair manner. The [19] K. Kawagoe and T. Ueda. A Similarity Search purpose of this project is to provide a consolidation of exist- Method of Time Series Data with Combination of ing works on querying and mining time series, and to serve Fourier and Wavelet Transforms. In TIME, 2002. as a starting point for providing references and benchmarks for future research on time series. We welcome all kinds [20] E. Keogh, X. Xi, L. Wei, and C. Ratanamahatana. of comments on our source code and the data sets of this The UCR Time Series dataset. In project [1], and suggestions on other experimental evalua- http://www.cs.ucr.edu/∼eamonn/time series data/, tions. 2006. [21] E. J. Keogh. Exact Indexing of Dynamic Time Warping. In VLDB, 2002. 7. REFERENCES [22] E. J. Keogh. A Decade of Progress in Indexing and [1] Additional Experiment Results for Representation and Mining Large Time Series Databases. In VLDB, 2006. Similarity Measures of Time Series. http://www.ece.northwestern.edu/∼hdi117/tsim.htm. [23] E. J. Keogh, K. Chakrabarti, S. Mehrotra, and M. J. Pazzani. Locally Adaptive Dimensionality Reduction [2] R.T.Ng (2006), Note of Caution. for Indexing Large Time Series Databases. In http://www.cs.ubc.ca/∼rng/psdepository/cheby SIGMOD Conference, 2001. Report2.pdf. [24] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and [3] H. Andr´e-J¨ onsson and D. Z. Badal. Using signature S. Mehrotra. Dimensionality Reduction for Fast files for querying time-series data. In PKDD, 1997. Similarity Search in Large Time Series Databases. [4] J. Aßfalg, H.-P. Kriegel, P. Kr¨oger, P. Kunath, Knowl. Inf. Syst., 3(3), 2001. A. Pryakhin, and M. Renz. Similarity search on time [25] E. J. Keogh and S. Kasetty. On the Need for Time series based on threshold queries. In EDBT, 2006. Series Data Mining Benchmarks: A Survey and [5] D. J. Berndt and J. Clifford. Using dynamic time Empirical Demonstration. Data Min. Knowl. Discov., warping to find patterns in time series. In KDD 7(4), 2003. Workshop, 1994. [26] E. J. Keogh and C. A. Ratanamahatana. Exact [6] Y. Cai and R. T. Ng. Indexing spatio-temporal indexing of dynamic time warping. Knowl. Inf. Syst., trajectories with chebyshev polynomials. In SIGMOD 7(3), 2005. Conference, 2004. [27] S.-W. Kim, S. Park, and W. W. Chu. An Index-Based [7] M. Cardle. Automated motion editing. In Technical Approach for Similarity Search Supporting Time Report, Computer Laboratory, University of Warping in Large Sequence Databases. In ICDE, 2001. Cambridge, UK, 2004. [28] R. Kohavi. A study of cross-validation and bootstrap [8] L. Chen and R. T. Ng. On the marriage of lp-norms for accuracy estimation and model selection. In and edit distance. In VLDB, 2004. IJCAI, 1995. ¨ [9] L. Chen, M. T. Ozsu, and V. Oria. Robust and fast [29] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently similarity search for moving object trajectories. In 1550

10. supporting ad hoc queries in large datasets of time which has a single parameter that we adjusted to get the sequences. In SIGMOD Conference, 1997. best performance. Figure 17 compares the results of our [30] J. Lin, E. J. Keogh, L. Wei, and S. Lonardi. algorithm with Euclidean distance. Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov., 15(2), 2007. [31] M. D. Morse and J. M. Patel. An efficient and accurate method for evaluating time series similarity. In SIGMOD Conference, 2007. [32] Pang-Ning Tan and Michael Steinbach and Vipin Kumar. Introduction to Data Mining. Addison-Wesley, Reading, MA, 2005. [33] K. pong Chan and A. W.-C. Fu. Efficient Time Series Matching by Wavelets. In ICDE, 1999. [34] I. Popivanov and R. J. Miller. Similarity Search Over Time-Series Data Using Wavelets. In ICDE, 2002. [35] C. A. Ratanamahatana and E. J. Keogh. Three myths about dynamic time warping data mining. In SDM, Figure 17: The classification accuracy of ANA com- 2005. pared to Euclidean distance [36] Richard O. Duda and Peter E. Hart. Pattern As we can see, the ANA algorithm is consistently bet- Classification and Scene Analysis. John Wiley & Sons, ter than Euclidean distance, often significantly so. Further- 1973. more, ANA is as fast as Euclidean distance, is indexable [37] S. Salzberg. On Comparing Classifiers: Pitfalls to and only has a single parameter. Given this, can a paper Avoid and a Recommended Approach. Data Min. on ANA be published in a good conference or journal? It Knowl. Discov., 1(3), 1997. is time to explain how ANA works. We downloaded the [38] M. Steinbach, P.-N. Tan, V. Kumar, S. A. Klooster, mitochondrial DNA of a monkey, Macaca mulatta. We con- and C. Potter. Discovery of climate indices using verted the DNA to a string of integers, with A (Adenine) clustering. In KDD, 2003. = 0, C (Cytosine) = 1, G (Guanine) = 2 and T (Thymine) = 3. So the DNA string GAT CA . . . becomes 2, 0, 3, 1, 0, . . .. [39] M. Vlachos, D. Gunopulos, and G. Kollios. Given that we have a string of 16564 integers, we can use Discovering similar multidimensional trajectories. In the first n integers as weighs when calculating the weights ICDE, 2002. of the Euclidean distance between our time series of length [40] M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and n. So ANA is nothing more than the weighed Euclidean E. J. Keogh. Indexing Multidimensional Time-Series. distance, weighed by monkey DNA. More concretely, if we VLDB J., 15(1), 2006. have a string S: S = 3, 0, 1, 2, 0, 2, 3, 0, 1, . . . and some time [41] Y.-L. Wu, D. Agrawal, and A. E. Abbadi. A series, say of length 4, then the weight vector W with p = 1 Comparison of DFT and DWT based Similarity is {3, 0, 1, 2}, and the ANA distance is simply: Search in Time-Series Databases. In CIKM, 2000. [42] X. Xi, E. J. Keogh, C. R. Shelton, L. Wei, and C. A. 4 Ratanamahatana. Fast time series classification using AN A(A, B, W ) = (Ai − Bi )2 × Wi . numerosity reduction. In ICML, 2006. i=1 [43] B.-K. Yi and C. Faloutsos. Fast Time Sequence After we test the algorithm, if we are not satisfied with the Indexing for Arbitrary Lp Norms. In VLDB, 2000. result, we simply shift the first location in the string, so that [44] B.-K. Yi, H. V. Jagadish, and C. Faloutsos. Efficient we are using locations 2 to n + 1 of the weight string. We retrieval of similar time sequences under time warping. continue shifting till the string is exhausted and reported In ICDE. IEEE Computer Society, 1998. the best result in Figure 17. At this point the reader will [45] Y. Zhu and D. Shasha. Warping Indexes with hopefully say “but that’s not fair, you cannot change the Envelope Transforms for Query by Humming. In parameter after seeing the results, and report the best re- SIGMOD Conference, 2003. sults”. However, we believe this effect may explain many of the apparent improvements over DTW, and in some cases, APPENDIX the authors explicitly acknowledge this [10]. Researchers are adjusting the parameters after seeing the results on the test A SOBERING EXPERIMENT set. In summary, based on the experiments conduced in this Our comparative experiments have shown that while elas- paper and all the reproducible fair experiments in the liter- tic measures are, in general, better than lock-step measures, ature, there is no evidence of any distance measure that is there is little difference between the various elastic measures. systematically better than DTW in general. Furthermore, This result explicitly contradicts many papers that claim to there is at best very scant evidence that there is any dis- have a distance measure that is better than DTW, the orig- tance measure that is systematically better than DTW in inal and simplest elastic measure. How are we to reconcile on particular problems (say, just ECG data, or just noisy this two conflicting claims? We believe the following demon- data). stration will shed some light on the issue. We classified 20 Finally, we are in a position to explain what ANA stand of the data sets hosted at the UCR archive, using the sug- for. It is an acronym for Arbitrarily Naive Algorithm. gested two-fold splits that were establish several years ago. We used a distance measure called ANA (explained below) 1551

11. Data Set # of crosses ED L1 -norm L∞ -norm DISSIM TQuEST DTW DTW (c)a EDR ERP LCSS LCSS (c) Swale Spade 50words 5 0.407 0.379 0.555 0.378 0.526 0.375 0.291 0.271 0.341 0.298 0.279 0.281 0.341 Adiac 5 0.464 0.495 0.428 0.497 0.718 0.465 0.446 0.457 0.436 0.434 0.418 0.408 0.438 Beef 2 0.4 0.55 0.583 0.55 0.683 0.433 0.583 0.4 0.567 0.402 0.517 0.384 0.5 Car 2 0.275 0.3 0.3 0.217 0.267 0.333 0.258 0.371 0.167 0.208 0.35 0.233 0.25 CBF 16 0.087 0.041 0.534 0.049 0.171 0.003 0.006 0.013 0 0.017 0.015 0.013 0.044 chlorineconcentration 9 0.349 0.374 0.325 0.368 0.44 0.38 0.348 0.388 0.376 0.374 0.368 0.374 0.439 cinc ECG torso 30 0.051 0.044 0.18 0.046 0.084 0.165 0.006 0.011 0.145 0.057 0.023 0.057 0.148 Coffee 2 0.193 0.246 0.087 0.196 0.427 0.191 0.252 0.16 0.213 0.213 0.237 0.27 0.185 diatomsizereduction 10 0.022 0.033 0.019 0.026 0.161 0.015 0.026 0.019 0.026 0.045 0.084 0.028 0.016 ECG200 5 0.162 0.182 0.175 0.16 0.266 0.221 0.153 0.211 0.213 0.171 0.126 0.17 0.256 ECGFiveDays 26 0.118 0.107 0.235 0.103 0.181 0.154 0.122 0.111 0.127 0.232 0.187 0.29 0.265 FaceFour 5 0.149 0.144 0.421 0.172 0.144 0.064 0.164 0.045 0.042 0.144 0.046 0.134 0.25 FacesUCR 11 0.225 0.192 0.401 0.205 0.289 0.06 0.079 0.05 0.028 0.046 0.046 0.03 0.315 fish 5 0.319 0.293 0.314 0.311 0.496 0.329 0.261 0.107 0.216 0.067 0.16 0.171 0.15 Gun Point 5 0.146 0.092 0.186 0.084 0.175 0.14 0.055 0.079 0.161 0.098 0.065 0.066 0.007 Haptics 5 0.619 0.634 0.632 0.64 0.669 0.622 0.593 0.466 0.601 0.631 0.58 0.581 0.736 InlineSkate 6 0.665 0.646 0.715 0.65 0.757 0.557 0.603 0.531 0.483 0.517 0.525 0.533 0.643 ItalyPowerDemand 8 0.04 0.047 0.044 0.043 0.089 0.067 0.055 0.075 0.05 0.1 0.076 0.082 0.233 Lighting2 5 0.341 0.251 0.389 0.261 0.444 0.204 0.32 0.088 0.19 0.199 0.108 0.16 0.272 Lighting7 2 0.377 0.286 0.566 0.3 0.503 0.252 0.202 0.093 0.287 0.282 0.116 0.279 0.557 1552 MALLAT 20 0.032 0.041 0.079 0.042 0.094 0.038 0.04 0.08 0.033 0.088 0.091 0.09 0.167 MedicalImages 5 0.319 0.322 0.36 0.329 0.451 0.286 0.281 0.36 0.309 0.349 0.357 0.348 0.434 Motes 24 0.11 0.082 0.24 0.08 0.211 0.09 0.118 0.095 0.106 0.064 0.077 0.073 0.103 OliveOil 2 0.15 0.236 0.167 0.216 0.298 0.1 0.118 0.062 0.132 0.135 0.055 0.097 0.207 OSULeaf 5 0.448 0.488 0.52 0.474 0.571 0.401 0.424 0.115 0.365 0.359 0.281 0.403 0.212 plane 6 0.051 0.037 0.033 0.042 0.038 0.001 0.032 0.001 0.01 0.016 0.062 0.023 0.006 SonyAIBORobotSurface 16 0.081 0.076 0.106 0.088 0.135 0.077 0.074 0.084 0.07 0.228 0.155 0.205 0.195 SonyAIBORobotSurfaceII 12 0.094 0.084 0.135 0.071 0.186 0.08 0.083 0.092 0.062 0.238 0.089 0.281 0.322 StarLightCurves 9 0.142 0.143 0.151 0.142 0.13 0.089 0.086 0.107 0.125 0.118 0.124 0.12 0.142 SwedishLeaf 5 0.295 0.286 0.357 0.299 0.347 0.256 0.221 0.145 0.164 0.147 0.148 0.14 0.254 Symbols 30 0.088 0.098 0.152 0.093 0.078 0.049 0.096 0.02 0.059 0.053 0.055 0.058 0.018 synthetic control 5 0.142 0.146 0.227 0.158 0.64 0.019 0.014 0.118 0.035 0.06 0.075 0.06 0.15 Trace 5 0.368 0.279 0.445 0.286 0.158 0.016 0.075 0.15 0.084 0.118 0.142 0.108 0 TwoLeadECG 25 0.129 0.154 0.151 0.163 0.266 0.033 0.07 0.065 0.071 0.146 0.154 0.149 0.017 TwoPatterns 5 0.095 0.039 0.797 0.036 0.747 0 0 0.001 0.01 0 0 0 0.052 wafer 7 0.005 0.004 0.021 0.005 0.014 0.015 0.005 0.002 0.006 0.004 0.004 0.004 0.018 WordsSynonyms 5 0.393 0.374 0.53 0.375 0.529 0.371 0.315 0.295 0.346 0.294 0.28 0.274 0.322 yoga 11 0.16 0.161 0.181 0.167 0.216 0.151 0.151 0.112 0.133 0.109 0.134 0.43 0.13 Table 2: Error Ratio of Different Similarity Measures on 1NN Classifier a DTW (c) denotes DTW with constrained warping window, same for LCSS.